
LDA Is More Effective than PCA for Dimensionality Reduction in Classification Datasets | by Rukshan Pramoditha | Dec, 2022



Linear discriminant analysis (LDA) for dimensionality reduction while maximizing class separability


Dimensionality reduction can be achieved using various techniques. Eleven such techniques have already been discussed in my popular article, 11 Dimensionality reduction techniques you should know in 2021.

There, you can learn in detail what technical terms such as dimensionality and dimensionality reduction actually mean.

In short, dimensionality refers to the number of features (variables) in the dataset. The process of reducing the features in the dataset is called dimensionality reduction.

Linear discriminant analysis (hereafter, LDA) is a popular linear dimensionality reduction technique that can find a linear combination of input features in a lower dimensional space while maximizing class separability.

Class separability simply means keeping the classes as far apart as possible while keeping the data points within each class as close together as possible.

The better the separation of the classes, the easier it is to draw decision boundaries between them to separate (discriminate) groups of data points.
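One standard way to make this precise (not spelled out in this article, but the textbook formulation of LDA) is Fisher's criterion: LDA looks for projection directions w that maximize the ratio of between-class scatter to within-class scatter,

$$J(w) = \frac{w^{\top} S_B\, w}{w^{\top} S_W\, w},$$

where S_B is the between-class scatter matrix and S_W is the within-class scatter matrix. A large J(w) means the projected class means are far apart relative to the spread of the points within each class.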

LDA is often used with classification datasets that have class labels. It can be used both as a binary / multi-class classification (supervised learning) algorithm and as a dimensionality reduction technique.

However, class labels are needed for LDA when it is used for dimensionality reduction. Therefore, LDA performs supervised dimensionality reduction.

The fitted LDA model can be used for both classification and dimensionality reduction as follows.

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda = LinearDiscriminantAnalysis().fit(X, y) # fitted LDA model

  • lda.predict(X): Performs multi-class classification. This will assign new data points to the available classes.
  • lda.transform(X): Performs dimensionality reduction in classification datasets while maximizing the separation of classes. This will find a linear combination of input features in a lower dimensional space. When lda is used in this way, it acts as a data preprocessing step in which its output is used as the input of another classification algorithm such as a support vector machine or logistic regression!
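To make the two roles concrete, here is a minimal sketch that continues from the fitted lda object above (the logistic regression step is only an illustrative choice of downstream classifier).

from sklearn.linear_model import LogisticRegression

# Classification: one predicted class label per data point
y_pred = lda.predict(X)        # shape: (n_samples,)

# Dimensionality reduction: projected data with at most (n_classes - 1) columns
X_reduced = lda.transform(X)   # shape: (n_samples, n_components)

# The reduced data can then feed another classifier, e.g. logistic regression
clf = LogisticRegression(max_iter=1000).fit(X_reduced, y)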

PCA is the most popular dimensionality reduction technique. Both PCA and LDA are considered linear dimensionality reduction techniques as they find a linear combination of input features in the data.

However, there are notable differences between the two algorithms.

  • PCA performs dimensionality reduction by maximizing the variance of the data. Therefore, in most cases, feature standardization is necessary before applying PCA (see the exceptions here).
  • LDA performs dimensionality reduction by maximizing the class separability of classification datasets. Therefore, feature standardization is optional here (we will verify this shortly).
  • PCA does not require class labels. So, it can be used with classification, regression and even with unlabeled data!
  • LDA requires class labels. So, it is used with classification datasets.
  • PCA finds a set of uncorrelated features in a lower dimensional space. Therefore, PCA automatically removes multicollinearity in the data (learn more here).
  • As explained earlier, LDA can be used for both classification and dimensionality reduction, and in both roles it uses the class labels, so it is supervised. PCA, in contrast, can only perform unsupervised dimensionality reduction.
  • The maximum number of components that PCA can find is equal to the number of input features (original dimensionality) of the dataset! We often prefer to find a considerably low number of components that captures as much of the variance in the original data as possible.
  • The maximum number of components that LDA can find is equal to the number of classes minus one in the classification dataset. For example, if there are only 3 classes in the dataset, LDA can find at most 2 components (see the short sketch after this list).
  • LDA is more effective than PCA for classification datasets because LDA reduces the dimensionality of the data by maximizing class separability. It is easier to draw decision boundaries for data with maximum class separability.
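The last two points can be illustrated with a small synthetic dataset (a minimal sketch; the dataset and variable names are only for this example).

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# A toy classification dataset with 10 features and 3 classes
X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     n_informative=5, n_classes=3,
                                     random_state=0)

# PCA can keep up to n_features components (here, 10)
print(PCA().fit(X_demo).n_components_)

# LDA can keep at most n_classes - 1 components (here, 2)
print(LinearDiscriminantAnalysis().fit(X_demo, y_demo).explained_variance_ratio_.shape)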

Today, in this article, we will visually demonstrate that LDA is more effective than PCA for dimensionality reduction in classification datasets by using the Wine classification dataset. Then, we will discuss how the output of the LDA model can be used as the input of a classification algorithm such as a support vector machine or logistic regression!

We will use the Wine classification dataset to perform both PCA and LDA. Here is the important information about the dataset.

  • Dataset source: You can download the original dataset here.
  • Dataset license: This dataset is available under the CC BY 4.0 (Creative Commons Attribution 4.0) license.
  • Owners: Forina, M. et al, PARVUS — An Extendible Package for Data Exploration, Classification and Correlation. Institute of Pharmaceutical and Food Analysis and Technologies, Via Brigata Salerno, 16147 Genoa, Italy.
  • Donor: Stefan Aeberhard
  • Citation: Lichman, M. (2013). UCI Machine Learning Repository [https://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

The Wine dataset comes preloaded with Scikit-learn. It can be loaded by calling the load_wine() function as follows.

from sklearn.datasets import load_wine

wine = load_wine()
X = wine.data
y = wine.target

print("Wine dataset size:", X.shape)

Output: Wine dataset size: (178, 13)

The Wine dataset has 178 instances (data points). Its original dimensionality is 13 because it has 13 input features (variables). Moreover, the data points are divided into three separate classes, each representing a wine category.
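As a quick sanity check (a small sketch using the y variable loaded above), we can count how many data points fall into each class; the three classes contain 59, 71 and 48 instances respectively.

import numpy as np

# Number of data points in each of the three wine classes
print(np.bincount(y))   # [59 71 48]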

We will apply PCA to the Wine data to achieve the following things.

  • To build a PCA model which can be used to compare with the LDA model that will be created in the next steps.
  • To visualize high dimensional (13-dim) Wine data in a 2D scatterplot using the first two principal components. You may already know that PCA is extremely useful for data visualization.

Feature standardization

Before applying PCA to the Wine data, we need to do feature standardization to get all features into the same scale.

from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(X)

The scaled features are stored in the X_scaled variable, which will be the input for the pca.fit_transform() method.

Running PCA

We will apply PCA to the Wine data by calling the Scikit-learn PCA() function. The number of components (specified in n_components) that we want to keep is strictly limited to two since we are interested in 2D visualization of Wine data, which needs only two components!

from sklearn.decomposition import PCA

pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

The transformed (reduced) data is stored in the X_pca variable which contains two-dimensional data that may accurately represent the original Wine data.

Making the scatterplot

Now, we will make the scatterplot using the data stored in the X_pca variable.

import matplotlib.pyplot as plt
plt.figure(figsize=[7, 5])

plt.scatter(X_pca[:, 0], X_pca[:, 1], c=y, s=25, cmap='plasma')
plt.title('PCA for wine data with 2 components')
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.savefig("PCA.png")

2D scatterplot for Wine data with PCA (Image by author)

The data appears to be linearly separable with linear decision boundaries (i.e. straight lines). However, some data points would be misclassified. Classes have not been well separated since PCA doesn’t maximize class separability.

When applying PCA to the Wine data by keeping only two components, we lost a significant amount of variance in the data.

exp_var = sum(pca.explained_variance_ratio_ * 100)
print('Variance explained:', exp_var)
Output: Variance explained: ~55.4

Only about 55.4% of the variance was captured by our PCA model with two components. That amount of variance is not enough to accurately represent the original data.

Let’s find the optimal number of principal components for the Wine data by creating the following plot. The value should be greater than 2 but less than 13 (number of input features).

import numpy as np

pca = PCA(n_components=None)
X_pca = pca.fit_transform(X_scaled)

exp_var = pca.explained_variance_ratio_ * 100
cum_exp_var = np.cumsum(exp_var)

plt.bar(range(1, 14), exp_var, align='center',
        label='Individual explained variance')

plt.step(range(1, 14), cum_exp_var, where='mid',
         label='Cumulative explained variance', color='red')

plt.ylabel('Explained variance percentage')
plt.xlabel('Principal component index')
plt.xticks(ticks=list(range(1, 14)))
plt.legend(loc='best')
plt.tight_layout()

plt.savefig("Barplot_PCA.png")

Cumulative explained variance plot for PCA on the Wine data (Image by author)

This type of plot is called the cumulative explained variance plot and is extremely useful to find the optimal number of principal components when applying PCA.

The first six or seven components capture about 85–90% of the variance in the data. So, they will accurately represent the original Wine data. But, for a 2D visualization, we strictly want to use only two components even though they don’t capture much of the variance in the data.
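If we were not restricted to a 2D plot, Scikit-learn could pick this number for us: passing a float between 0 and 1 as n_components keeps just enough components to reach that fraction of the variance. A minimal sketch, assuming the X_scaled variable from above:

# Keep the smallest number of components that explains at least 90% of the variance
pca_90 = PCA(n_components=0.90)
X_pca_90 = pca_90.fit_transform(X_scaled)

print(pca_90.n_components_)                         # number of components selected
print(sum(pca_90.explained_variance_ratio_) * 100)  # cumulative variance captured (%)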

Now, we will apply LDA to the Wine data and compare the LDA model with the previous PCA model.

Feature standardization

Feature standardization is not needed for LDA as it does not have any effect on the performance of the LDA model.
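We can check this claim directly (a small sketch, assuming the X, X_scaled and y variables defined earlier): fitting LDA on the raw features and on the standardized features should give the same predictions and, up to numerical precision, the same explained variance ratios.

import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda_raw = LinearDiscriminantAnalysis(n_components=2).fit(X, y)
lda_std = LinearDiscriminantAnalysis(n_components=2).fit(X_scaled, y)

# Same class predictions and same explained variance, with or without scaling
print((lda_raw.predict(X) == lda_std.predict(X_scaled)).all())
print(np.allclose(lda_raw.explained_variance_ratio_, lda_std.explained_variance_ratio_))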

Running LDA

We will apply LDA to the Wine data by calling the Scikit-learn LinearDiscriminantAnalysis() function. The number of components (specified in n_components) that we want to keep is strictly limited to two since we are interested in 2D visualization of Wine data, which needs only two components!

from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

lda = LinearDiscriminantAnalysis(n_components=2)
X_lda = lda.fit_transform(X_scaled, y)

Note that for LDA, class labels (y) are also needed in the fit_transform() method.

The transformed (reduced) data is stored in the X_lda variable which contains two-dimensional data that may accurately represent the original Wine data.

Making the scatterplot

Now, we will make the scatterplot using the data stored in the X_lda variable.

import matplotlib.pyplot as plt
plt.figure(figsize=[7, 5])

plt.scatter(X_lda[:, 0], X_lda[:, 1], c=y, s=25, cmap='plasma')
plt.title('LDA for wine data with 2 components')
plt.xlabel('Component 1')
plt.ylabel('Component 2')
plt.savefig("LDA.png")

2D scatterplot for Wine data with LDA (Image by author)

Now, the classes have been clearly separated since LDA maximizes class separability in addition to reducing dimensionality. Hardly any data points will be misclassified when drawing linear decision boundaries.

The maximum number of components that LDA can keep for Wine data is also two because there are only three classes in the data. So, these two components should capture all the variance in the data.

Let’s verify this numerically and visually!

exp_var = sum(lda.explained_variance_ratio_ * 100)
print('Variance explained:', exp_var)
Output: Variance explained: 100.0

All variance in the original Wine data was captured by our LDA model with two components. So, these components will fully represent the original data.

Let’s create the cumulative explained variance plot for the LDA model.

import numpy as np

lda = LinearDiscriminantAnalysis(n_components=None)
lda.fit(X_scaled, y)  # fit only; here we need the explained variance, not the transformed data

exp_var = lda.explained_variance_ratio_ * 100
cum_exp_var = np.cumsum(exp_var)

plt.bar(range(1, 3), exp_var, align='center',
        label='Individual explained variance')

plt.step(range(1, 3), cum_exp_var, where='mid',
         label='Cumulative explained variance', color='red')

plt.ylabel('Explained variance percentage')
plt.xlabel('Component index')
plt.xticks(ticks=[1, 2])
plt.legend(loc='best')
plt.tight_layout()

plt.savefig("Barplot_LDA.png")

Cumulative explained variance plot for LDA on the Wine data (Image by author)

The first two components capture all the variance in the data. So, they fully represent the original Wine data.

Our LDA model has the following benefits.

  • Reducing the dimensionality (number of features) in the data
  • Visualizing high-dimensional data in a 2D plot
  • Maximizing class separability

Only LDA can maximize class separability while reducing the dimensionality of the data. So, LDA is ideal for reducing dimensionality before running another classification algorithm such as a support vector machine (SVM) or logistic regression.

This can be visualized as follows.

Machine learning pipeline for SVM with LDA (Image by author)

The LDA model takes the high-dimensional (13-dim) Wine data (X) as its input and reduces its dimensionality while maximizing class separability. The transformed (2-dim) data, X_lda, is then used as the input of the SVM model along with the class labels, y. The SVM performs multi-class classification (the Wine data has 3 classes) using the One-vs-Rest (‘ovr’) strategy, which draws a decision boundary for each class against all the other classes.

We use the ‘linear’ kernel as the kernel of the support vector machine algorithm because the data appears to be linearly separable with linear decision boundaries (i.e. straight lines).

from sklearn.svm import SVC

svc = SVC(kernel='linear', decision_function_shape='ovr')
svc.fit(X_lda, y)
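The two steps can also be chained in a single Scikit-learn pipeline, which makes it easy to evaluate the LDA + SVM combination with cross-validation. This is not part of the original code, just a minimal sketch using the X and y variables loaded earlier.

from sklearn.pipeline import Pipeline
from sklearn.model_selection import cross_val_score
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVC

lda_svm = Pipeline([
    ('lda', LinearDiscriminantAnalysis(n_components=2)),           # supervised dimensionality reduction
    ('svc', SVC(kernel='linear', decision_function_shape='ovr')),  # linear multi-class SVM
])

# 5-fold cross-validated accuracy of the combined model
scores = cross_val_score(lda_svm, X, y, cv=5)
print('Mean CV accuracy:', scores.mean().round(3))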

SVM with LDA for maximum class separability (Image by author)

Classes can be clearly separated with linear decision boundaries. Only one data point was misclassified. Compare this with the PCA output that we obtained earlier, where many points would be misclassified if we drew linear decision boundaries.

So, LDA is more effective than PCA for dimensionality reduction in classification datasets because LDA maximizes class separability while reducing the dimensionality of the data.

Kindly note that the code for drawing SVM decision boundaries (hyperplanes) is not included in the above code as it requires a thorough understanding of how SVMs work behind the scenes, which is beyond the scope of this article.

Note: The term ‘hyperplane’ is the correct way to refer to the decision boundary when considering high-dimensional data. In a two-dimensional space, a hyperplane is just a straight line; in three dimensions, it is a plane; and in higher dimensions it generalizes accordingly.

The most important hyperparameter in both the PCA and LDA algorithms is n_components, with which we specify the number of components that PCA or LDA should find.

The guidelines for selecting the best number of components for PCA have already been discussed in my article, How to Select the Best Number of Principal Components for the Dataset.

The same guidelines are also valid for LDA.

  • If the sole purpose of applying LDA is data visualization, you should keep 2 (for 2D plots) or 3 (for 3D plots) components, since we cannot visualize plots in more than three dimensions.
  • As I explained earlier, the cumulative explained variance plot is extremely useful to choose the right number of components.
  • The maximum number of components that LDA can find is equal to the number of classes minus one in the classification dataset.


