Linear discriminant analysis (LDA) for dimensionality reduction while maximizing class separability
Dimensionality reduction can be achieved using various techniques. Eleven such techniques have already been discussed in my popular article, 11 Dimensionality reduction techniques you should know in 2021.
There, you can learn the meaning of technical terms such as dimensionality and dimensionality reduction.
In short, dimensionality refers to the number of features (variables) in the dataset. The process of reducing the features in the dataset is called dimensionality reduction.
Linear discriminant analysis (hereafter, LDA) is a popular linear dimensionality reduction technique that can find a linear combination of input features in a lower dimensional space while maximizing class separability.
Class separability simply means that we keep the classes as far apart as possible while keeping the data points within each class as close together as possible.
The better the separation between classes, the easier it is to draw decision boundaries that separate (discriminate) the groups of data points.
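To make class separability concrete, here is a small illustrative sketch (using synthetic two-class data, not the Wine data) of Fisher's criterion, the quantity LDA maximizes: the between-class scatter divided by the within-class scatter along a projection direction w.
import numpy as np
rng = np.random.default_rng(0)
X0 = rng.normal(loc=[0, 0], scale=0.5, size=(50, 2))  # class 0
X1 = rng.normal(loc=[3, 1], scale=0.5, size=(50, 2))  # class 1
m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
S_b = np.outer(m0 - m1, m0 - m1)  # between-class scatter
S_w = np.cov(X0.T) * (len(X0) - 1) + np.cov(X1.T) * (len(X1) - 1)  # within-class scatter
w = np.linalg.solve(S_w, m0 - m1)  # the LDA direction in the two-class case
fisher_ratio = (w @ S_b @ w) / (w @ S_w @ w)
print(fisher_ratio)  # large values mean well-separated, compact classes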
LDA is often used with classification datasets that have class labels. It can serve both as a binary / multi-class classification (supervised learning) algorithm and as a dimensionality reduction algorithm.
However, class labels are needed even when LDA is used for dimensionality reduction. Therefore, LDA performs supervised dimensionality reduction.
The fitted LDA model can be used for both classification and dimensionality reduction as follows.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis().fit(X, y) # fitted LDA model
- lda.predict(X): Performs multi-class classification. This will assign new data points to the available classes.
- lda.transform(X): Performs dimensionality reduction on classification datasets while maximizing the separation of classes. This finds a linear combination of the input features in a lower-dimensional space. When lda is used in this way, it acts as a data preprocessing step whose output is used as the input of another classification algorithm such as a support vector machine or logistic regression (see the sketch below).
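Here is a minimal sketch of that preprocessing pattern using a Scikit-learn pipeline. The synthetic dataset and the choice of logistic regression as the downstream classifier are only for illustration.
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
# A synthetic 3-class dataset with 10 features (illustrative only)
X, y = make_classification(n_samples=200, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
# LDA reduces the features to at most n_classes - 1 = 2 components,
# then logistic regression is trained on the reduced data
pipe = make_pipeline(LinearDiscriminantAnalysis(n_components=2),
                     LogisticRegression(max_iter=1000))
print(cross_val_score(pipe, X, y, cv=5).mean())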
PCA is the most popular dimensionality reduction technique. Both PCA and LDA are considered linear dimensionality reduction techniques as they find a linear combination of input features in the data.
However, there are notable differences between the two algorithms.
- PCA performs dimensionality reduction by maximizing the variance of the data. Therefore, in most cases, feature standardization is necessary before applying PCA (see the exceptions here).
- LDA performs dimensionality reduction by maximizing the class separability of classification datasets. Therefore, feature standardization is optional here (we will verify this shortly).
- PCA does not require class labels. So, it can be used with classification, regression and even with unlabeled data!
- LDA requires class labels. So, it is used with classification datasets.
- PCA finds a set of uncorrelated features in a lower dimensional space. Therefore, PCA automatically removes multicollinearity in the data (learn more here).
- As explained earlier, LDA can be used both for classification and for (supervised) dimensionality reduction. PCA can only be used for unsupervised dimensionality reduction.
- The maximum number of components that PCA can find is equal to the number of input features (the original dimensionality) of the dataset. We often prefer to find a much smaller number of components that still captures as much of the variance in the original data as possible.
- The maximum number of components that LDA can find is equal to the number of classes in the classification dataset minus one. For example, if there are only 3 classes in the dataset, LDA can find at most 2 components (see the sketch after this list).
- LDA is more effective than PCA for classification datasets because LDA reduces the dimensionality of the data by maximizing class separability. It is easier to draw decision boundaries for data with maximum class separability.
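To make the last two points concrete, here is a quick sketch using the Wine data that we will load below; the printed shapes show the maximum number of components each algorithm can keep.
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
X, y = load_wine(return_X_y=True)  # 13 features, 3 classes
# PCA with n_components=None keeps as many components as there are features
print(PCA(n_components=None).fit_transform(X).shape)  # (178, 13)
# LDA with n_components=None keeps at most n_classes - 1 components
print(LinearDiscriminantAnalysis(n_components=None).fit_transform(X, y).shape)  # (178, 2)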
Today, in this article, we will visually prove that LDA is more effective than PCA for dimensionality reduction in classification datasets by using the Wine classification dataset. Then, we will move forward by discussing how the output of the LDA model can be used as the input of a classification algorithm such as a support vector machine or logistic regression!
We will use the Wine classification dataset to perform PCA and LDA. Here is some important information about the Wine dataset.
- Dataset source: You can download the original dataset here.
- Dataset license: This dataset is available under the CC BY 4.0 (Creative Commons Attribution 4.0) license.
- Owners: Forina, M. et al, PARVUS — An Extendible Package for Data Exploration, Classification and Correlation. Institute of Pharmaceutical and Food Analysis and Technologies, Via Brigata Salerno, 16147 Genoa, Italy.
- Donor: Stefan Aeberhard
- Citation: Lichman, M. (2013). UCI Machine Learning Repository [https://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.
The Wine dataset comes preloaded with Scikit-learn. It can be loaded by calling the load_wine() function as follows.
from sklearn.datasets import load_wine
wine = load_wine()
X = wine.data
y = wine.target
print("Wine dataset size:", X.shape)
The Wine dataset has 178 instances (data points). Its original dimensionality is 13 because it has 13 input features (variables). Moreover, the data points are divided into three classes that represent three wine categories.
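If you want to inspect the class distribution and the feature names yourself, a quick check like the following works.
import numpy as np
# Class names and the number of data points in each class
print(wine.target_names)
print(np.bincount(y))
# The names of the 13 input features
print(wine.feature_names)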
We will apply PCA to the Wine data to achieve the following things.
- To build a PCA model which can be used to compare with the LDA model that will be created in the next steps.
- To visualize high dimensional (13-dim) Wine data in a 2D scatterplot using the first two principal components. You may already know that PCA is extremely useful for data visualization.
Feature standardization
Before applying PCA to the Wine data, we need to do feature standardization to get all features into the same scale.
from sklearn.preprocessing import StandardScaler
X_scaled = StandardScaler().fit_transform(X)
The scaled features are stored in the X_scaled variable, which will be the input for the pca.fit_transform() method.
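As a quick sanity check, after standardization every feature should have (approximately) zero mean and unit standard deviation.
import numpy as np
# Each column of X_scaled now has mean ~0 and standard deviation ~1
print(np.allclose(X_scaled.mean(axis=0), 0))
print(np.allclose(X_scaled.std(axis=0), 1))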
Running PCA
We will apply PCA to the Wine data by calling the Scikit-learn PCA() function. The number of components to keep (specified in n_components) is set to two since we are interested in a 2D visualization of the Wine data, which needs only two components.
from sklearn.decomposition import PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
The transformed (reduced) data is stored in the X_pca variable, which contains a two-dimensional representation of the original Wine data.
Making the scatterplot
Now, we will make the scatterplot using the data stored in the X_pca variable.
import matplotlib.pyplot as plt
plt.figure(figsize=[7, 5])
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, s=25, cmap='plasma')
plt.title('PCA for wine data with 2 components')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.savefig("PCA.png")
The data appears to be linearly separable with linear decision boundaries (i.e. straight lines). However, some data points would be misclassified. Classes have not been well separated since PCA doesn’t maximize class separability.
When applying PCA to the Wine data by keeping only two components, we lost a significant amount of variance in the data.
exp_var = sum(pca.explained_variance_ratio_ * 100)
print('Variance explained:', exp_var)
Only about 55.4% of the variance was captured by our PCA model with two components. That amount of variance is not enough to accurately represent the original data.
Let’s find the optimal number of principal components for the Wine data by creating the following plot. The value should be greater than 2 but less than 13 (number of input features).
import numpy as np
pca = PCA(n_components=None)
X_pca = pca.fit_transform(X_scaled)
exp_var = pca.explained_variance_ratio_ * 100
cum_exp_var = np.cumsum(exp_var)
plt.bar(range(1, 14), exp_var, align='center',
label='Individual explained variance')
plt.step(range(1, 14), cum_exp_var, where='mid',
label='Cumulative explained variance', color='red')
plt.ylabel('Explained variance percentage')
plt.xlabel('Principal component index')
plt.xticks(ticks=list(range(1, 14)))
plt.legend(loc='best')
plt.tight_layout()
plt.savefig("Barplot_PCA.png")
This type of plot is called the cumulative explained variance plot and is extremely useful to find the optimal number of principal components when applying PCA.
The first six or seven components capture about 85–90% of the variance in the data, so they would accurately represent the original Wine data. But for a 2D visualization, we have to use only two components, even though they do not capture much of the variance in the data.
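If you prefer to pick the number of components programmatically, you can read it off the cumulative explained variance. A small sketch with an (arbitrary) 85% threshold:
# Smallest number of components whose cumulative explained variance reaches 85%
threshold = 85
n_opt = int(np.argmax(cum_exp_var >= threshold)) + 1
print(n_opt)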
Now, we will apply LDA to the Wine data and compare the LDA model with the previous PCA model.
Feature standardization
Feature standardization is not needed for LDA as it does not have any effect on the performance of the LDA model.
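This is the verification promised earlier. Fitting LDA on the raw features and on the standardized features should give the same explained variance ratios and the same predicted classes (up to possible sign flips in the transformed coordinates), so standardization is indeed optional. A minimal check:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
# Fit LDA once on the raw features and once on the standardized features
lda_raw = LinearDiscriminantAnalysis(n_components=2).fit(X, y)
lda_std = LinearDiscriminantAnalysis(n_components=2).fit(X_scaled, y)
# Both should print True
print(np.allclose(lda_raw.explained_variance_ratio_, lda_std.explained_variance_ratio_))
print((lda_raw.predict(X) == lda_std.predict(X_scaled)).all())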
Running LDA
We will apply LDA to the Wine data by calling the Scikit-learn LinearDiscriminantAnalysis() function. The number of components to keep (specified in n_components) is set to two since we are interested in a 2D visualization of the Wine data, which needs only two components.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X_scaled, y)
Note that for LDA, the class labels (y) are also needed in the fit_transform() method.
The transformed (reduced) data is stored in the X_lda variable, which contains a two-dimensional representation of the original Wine data.
Making the scatterplot
Now, we will make the scatterplot using the data stored in the X_lda variable.
import matplotlib.pyplot as plt
plt.figure(figsize=[7, 5])
plt.scatter(X_lda[:, 0], X_lda[:, 1], c=y, s=25, cmap='plasma')
plt.title('LDA for wine data with 2 components')
plt.xlabel('Component 1')
plt.ylabel('Component 2')
plt.savefig("LDA.png")
Now, the classes have been clearly separated since LDA maximizes class separability in addition to reducing dimensionality. Hardly any data points will be misclassified when drawing linear decision boundaries.
The maximum number of components that LDA can keep for Wine data is also two because there are only three classes in the data. So, these two components should capture all the variance in the data.
Let’s verify this numerically and visually!
exp_var = sum(lda.explained_variance_ratio_ * 100)
print('Variance explained:', exp_var)
All variance in the original Wine data was captured by our LDA model with two components. So, these components will fully represent the original data.
Let’s create the cumulative explained variance plot for the LDA model.
import numpy as np
lda = LinearDiscriminantAnalysis(n_components=None)
X_lda = lda.fit_transform(X_scaled, y)
exp_var = lda.explained_variance_ratio_ * 100
cum_exp_var = np.cumsum(exp_var)
plt.bar(range(1, 3), exp_var, align='center',
label='Individual explained variance')
plt.step(range(1, 3), cum_exp_var, where='mid',
label='Cumulative explained variance', color='red')
plt.ylabel('Explained variance percentage')
plt.xlabel('Component index')
plt.xticks(ticks=[1, 2])
plt.legend(loc='best')
plt.tight_layout()
plt.savefig("Barplot_LDA.png")
The first two components capture all the variance in the data. So, they fully represent the original Wine data.
Our LDA model has the following benefits.
- Reducing the dimensionality (number of features) in the data
- Visualizing high-dimensional data in a 2D plot
- Maximizing class separability
Only LDA can maximize class separability while reducing the dimensionality of the data. So, LDA is ideal for reducing dimensionality before running another classification algorithm such as a support vector machine (SVM) or logistic regression.
This can be visualized as follows.
The LDA model takes the high-dimensional (13-dim) Wine data (X) as its input and reduces its dimensionality while maximizing class separability. The transformed (2-dim) data, X_lda, is used as the input of the SVM model along with the class labels, y. Since the Wine data has 3 classes, the SVM performs multi-class classification using the One-vs-Rest (‘ovr’) strategy, which draws a decision boundary for each class against all the other classes.
We use the ‘linear’ kernel as the kernel of the support vector machine algorithm because the data appears to be linearly separable with linear decision boundaries (i.e. straight lines).
from sklearn.svm import SVC
svc = SVC(kernel='linear', decision_function_shape='ovr')
svc.fit(X_lda, y)
Classes can be clearly separated with linear decision boundaries; only one data point was misclassified. Compare this with the PCA output that we obtained earlier, where many points would be misclassified if we drew linear decision boundaries.
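To put numbers on this comparison, the following sketch refits both two-component models and prints the training accuracy of the same linear SVM on each representation; the LDA-based score should be noticeably higher, in line with the scatterplots.
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC
# Two-dimensional representations of the Wine data from PCA and LDA
X_pca_2d = PCA(n_components=2).fit_transform(X_scaled)
X_lda_2d = LinearDiscriminantAnalysis(n_components=2).fit_transform(X_scaled, y)
# Training accuracy of the same linear SVM on each representation
print(SVC(kernel='linear').fit(X_pca_2d, y).score(X_pca_2d, y))
print(SVC(kernel='linear').fit(X_lda_2d, y).score(X_lda_2d, y))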
So, LDA is more effective than PCA for dimensionality reduction in classification datasets because LDA maximizes class separability while reducing the dimensionality of the data.
Kindly note that the code for drawing SVM decision boundaries (hyperplanes) is not included in the above code as it requires a thorough understanding of how SVMs work behind the scenes, which is beyond the scope of this article.
Note: The term ‘hyperplane’ is the correct way to refer to the decision boundary when considering high-dimensional data. In a two-dimensional space, a hyperplane is just a straight line; in three dimensions, it is a flat plane. In general, a hyperplane has one dimension fewer than the space it lives in.
The most important hyperparameter in both the PCA and LDA algorithms is n_components, with which we specify the number of components that PCA or LDA should find.
The guidelines for selecting the best number of components for PCA have already been discussed in my article, How to Select the Best Number of Principal Components for the Dataset.
The same guidelines are also valid for LDA.
- If the sole purpose of applying LDA is data visualization, you should keep 2 (for 2D plots) or 3 (for 3D plots) components, since we can only visualize data in 2D and 3D plots.
- As I explained earlier, the cumulative explained variance plot is extremely useful to choose the right number of components.
- The maximum number of components that LDA can find is equal to the number of classes minus one in the classification dataset.