Working with high-dimensional data can be difficult in machine learning. Dimensionality reduction techniques help by simplifying complicated data while retaining as much of the important structure as possible. In this article we examine one such method, Isomap, which is available in the popular machine learning library scikit-learn (sklearn). We explain the key terms, demystify the idea, and walk newcomers through using Isomap to reduce dimensionality.
Isomap is a non-linear dimensionality reduction method whose goal is to preserve the data’s intrinsic geometry. It rests on the idea that the meaningful distance between two points in a high-dimensional space is the length of the shortest path between them along a manifold that approximates the data, rather than the straight-line distance. Isomap can support high-dimensional data visualization, clustering, classification, and other machine learning tasks.
In this article, we will learn about the concept and implementation of Isomap using the scikit-learn library in Python. We will also see some examples of how Isomap can be applied to different datasets.
Isomap stands for Isometric Mapping: it tries to map data points from a high-dimensional space to a low-dimensional space in such a way that the geodesic distances between the points (distances measured along the data manifold) are preserved as closely as possible. Isomap is a type of manifold learning, a branch of machine learning that deals with finding the underlying structure of high-dimensional data.
The main steps of Isomap are:
- Build a neighborhood graph from the data points, linking each point either to all points within a specified radius or to its k nearest neighbors. Each edge is weighted by the Euclidean distance between the two points it connects.
- Compute the shortest path distance between every pair of points on the graph, using an algorithm such as Dijkstra’s or Floyd-Warshall. This distance is called the geodesic distance, and it approximates the true distance on the manifold.
- Apply classical multidimensional scaling (MDS) to the geodesic distance matrix to obtain a low-dimensional embedding that preserves the pairwise distances as closely as possible.
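The three steps above can be sketched by hand with NumPy, SciPy, and scikit-learn's `kneighbors_graph`. This is only an illustration under stated assumptions: the function name `isomap_sketch` and the demo data are hypothetical, and scikit-learn's own `Isomap` class is implemented differently (it uses kernel PCA internally), so treat this as a reading aid rather than the library's code.

```python
import numpy as np
from scipy.sparse.csgraph import shortest_path
from sklearn.datasets import make_s_curve
from sklearn.neighbors import kneighbors_graph


def isomap_sketch(X, n_neighbors=10, n_components=2):
    """Illustrative three-step Isomap; not scikit-learn's implementation."""
    # Step 1: k-nearest-neighbor graph with Euclidean edge weights.
    graph = kneighbors_graph(X, n_neighbors=n_neighbors, mode="distance")

    # Step 2: geodesic distances approximated by shortest paths through
    # the graph (Dijkstra's algorithm, edges treated as undirected).
    geodesic = shortest_path(graph, method="D", directed=False)
    if np.isinf(geodesic).any():
        raise ValueError("Graph is disconnected; try a larger n_neighbors.")

    # Step 3: classical MDS. Double-center the squared distances, then
    # keep the top eigenvectors scaled by the square root of their eigenvalues.
    n = geodesic.shape[0]
    centering = np.eye(n) - np.ones((n, n)) / n
    gram = -0.5 * centering @ (geodesic ** 2) @ centering
    eigvals, eigvecs = np.linalg.eigh(gram)
    top = np.argsort(eigvals)[::-1][:n_components]
    return eigvecs[:, top] * np.sqrt(np.maximum(eigvals[top], 0.0))


# Hypothetical demo data: 500 points sampled from an S-shaped surface.
X_demo, _ = make_s_curve(n_samples=500, random_state=0)
embedding = isomap_sketch(X_demo, n_neighbors=12, n_components=2)
print(embedding.shape)  # (500, 2)
```

The dense eigendecomposition keeps the sketch short but scales poorly with the number of points, which is one reason to prefer the library class on real data.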
Isomap is a nonlinear dimensionality reduction technique. Its primary objective is to unfold the data while preserving its intrinsic geometric structure in a lower-dimensional space. Unlike linear techniques such as Principal Component Analysis (PCA), Isomap can capture nonlinear relationships, which makes it especially useful for datasets with intricate structures.
- Dimensionality Reduction: The process of reducing the number of features (dimensions) in a dataset while retaining essential information.
- Isometric Mapping (Isomap): A technique that aims to maintain the pairwise geodesic distances between all data points, preserving the underlying geometry of the data.
- Geodesic Distance: The length of the shortest path between two points along a curved space such as a manifold, as opposed to the straight-line distance through the surrounding space.
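To make the difference between Euclidean and geodesic distance concrete, here is a small sketch on a hypothetical example: points along a half circle of radius 1. The straight line between the two endpoints has length 2, while the path along the curve has length close to pi.

```python
import numpy as np

# Points sampled along a half circle of radius 1 (a 1-D manifold in 2-D).
angles = np.linspace(0.0, np.pi, 200)
arc = np.column_stack([np.cos(angles), np.sin(angles)])

# Euclidean distance: straight line between the two endpoints.
euclidean = np.linalg.norm(arc[0] - arc[-1])

# Geodesic distance (approximated): sum of the small hops along the curve.
hops = np.linalg.norm(np.diff(arc, axis=0), axis=1)
geodesic = hops.sum()

print(f"Euclidean: {euclidean:.3f}")  # 2.000
print(f"Geodesic:  {geodesic:.3f}")   # 3.142, approximately pi
```

Isomap approximates the second quantity through a neighborhood graph, since the true manifold is not known in advance.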
To implement Isomap in Python, we can use the scikit-learn library, which provides a class called Isomap that performs the above steps. The Isomap class has several parameters that we can tune, such as the number of neighbors, the radius, the number of components, the eigenvalue solver, and the metric. We can also access the attributes of the Isomap object, such as the embedding vectors, the kernel PCA object, the nearest neighbors object, the distance matrix, and the number of features.
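As a quick sketch of those parameters and attributes, assuming a reasonably recent scikit-learn release, the snippet below fits an `Isomap` object on hypothetical demo data and inspects what it stores. The variable names `X_demo` and `iso_demo` are illustrative only.

```python
from sklearn.datasets import make_s_curve
from sklearn.manifold import Isomap

# Hypothetical demo data: 300 points with 3 features each.
X_demo, _ = make_s_curve(n_samples=300, random_state=0)

iso_demo = Isomap(
    n_neighbors=10,        # size of each point's neighborhood
    n_components=2,        # dimensions of the embedding
    eigen_solver="auto",   # let scikit-learn choose the solver
    metric="minkowski",    # with p=2 this is the Euclidean distance
    p=2,
)
iso_demo.fit(X_demo)

print(iso_demo.embedding_.shape)            # (300, 2)
print(iso_demo.dist_matrix_.shape)          # (300, 300)
print(iso_demo.n_features_in_)              # 3
print(type(iso_demo.kernel_pca_).__name__)  # KernelPCA
print(type(iso_demo.nbrs_).__name__)        # NearestNeighbors
print(iso_demo.reconstruction_error())      # value depends on data and settings
```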
Here is an example of how to use the Isomap class.
In this example, we will use a public dataset that contains the measurements of 150 iris flowers of three different species. This dataset has four dimensions, but it can be reduced to a lower-dimensional space that separates the different classes. We will use Isomap to reduce the dimensionality of the data and visualize the result.
We will need NumPy and Matplotlib, plus the pandas library for data manipulation and the seaborn library for data visualization. We can import them as follows:
# Import the libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.datasets import load_iris
from sklearn.manifold import Isomap
from mpl_toolkits.mplot3d import Axes3D
We can use the load_iris function from the sklearn.datasets module to load the iris dataset. This function returns a dictionary-like object that contains the data, the target, the feature names, and the target names. We can convert the data and the target into data frames for easier manipulation.
# Load the dataset
iris = load_iris()
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.DataFrame(iris.target, columns=['species'])
We can use the Seaborn library to plot the original data in a pair plot. A pair plot shows the pairwise relationships between the variables in a dataset, as well as the distribution of each variable along the diagonal. We can use the hue parameter to color the points by the species.
# Plot the original data
sns.pairplot(pd.concat([X, y], axis=1), hue='species')
plt.show()
The resulting pair plot shows that the data points cluster by species and that some variables are more discriminative than others. For example, petal length and petal width separate the three classes well, while sepal length and sepal width overlap more.
We can use the Isomap class imported above to apply the technique to the data. We specify the number of components and the number of neighbors as parameters, then call the fit_transform method to fit the model and transform the data in one step.
# Apply Isomap
iso = Isomap(n_components=2, n_neighbors=10)
X_iso = iso.fit_transform(X)
We can use the Seaborn library again to plot the transformed data in a scatter plot. We can use the same hue parameter as before to show the correspondence between the original and the transformed data. We can also set the title and the labels of the axes.
# Plot the transformed data
plt.figure(figsize=(8, 6))
sns.scatterplot(x=X_iso[:, 0], y=X_iso[:, 1], hue=y['species'], palette='Set1')
plt.title('Isomap embedding')
plt.xlabel('Component 1')
plt.ylabel('Component 2')
plt.show()
In the resulting scatter plot, Isomap has reduced the data to a 2D plane while largely preserving the separation between the classes. This suggests that the embedding retains the structure that distinguishes the species.
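A visual impression can be backed up with a number. The sketch below uses scikit-learn's `trustworthiness` function, which scores between 0 and 1 how well local neighborhoods in the original space are kept in the embedding. It is only one possible check, and the choice of 10 neighbors for the score is an assumption made here to match the Isomap setting.

```python
from sklearn.datasets import load_iris
from sklearn.manifold import Isomap, trustworthiness

# Recompute the 2-D embedding of the iris measurements.
X_check = load_iris().data
X_check_2d = Isomap(n_components=2, n_neighbors=10).fit_transform(X_check)

# Values close to 1 mean that points that are neighbors in the embedding
# were also neighbors in the original four-dimensional space.
score = trustworthiness(X_check, X_check_2d, n_neighbors=10)
print(f"Trustworthiness: {score:.3f}")
```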
import numpy as np
from sklearn.datasets import make_s_curve
from sklearn.manifold import Isomap
import matplotlib.pyplot as plt
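We create a synthetic S-curve dataset to demonstrate Isomap on data with a non-linear structure. The second return value, color, holds each sample’s position along the curve and is used to color the plots.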
X, color = make_s_curve(n_samples=1000, random_state=42)
Visualization helps us understand the structure of the original synthetic dataset.
fig = plt.figure(figsize=(8, 8))
ax = fig.add_subplot(111, projection='3d')
ax.scatter(X[:, 0], X[:, 1], X[:, 2], c=color, cmap=plt.get_cmap("viridis", 12))
ax.set_title("Original Synthetic Dataset")
plt.show()
We initialize an Isomap instance with parameters, specifying the number of neighbors and the desired number of dimensions in the embedded space.
n_neighbors = 20
n_components = 2
iso = Isomap(n_neighbors=n_neighbors, n_components=n_components)
Applying Isomap fits the model to the synthetic data, finding a lower-dimensional representation.
X_projected = iso.fit_transform(X)
Visualization of the reduced-dimensional representation helps us observe the transformation achieved by Isomap.
plt.figure(figsize=(8, 8))
plt.scatter(X_projected[:, 0], X_projected[:, 1], c=color, cmap=plt.get_cmap("viridis", 12))
plt.title("Isomap Projection of Synthetic Dataset")
plt.show()
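One way to check numerically that the S-curve has been unrolled is to compare the first Isomap component with the position of each sample along the curve, which `make_s_curve` returns as its second value. The sketch below repeats the fit with the same settings and reports the absolute correlation. The variable names are hypothetical, and the sign of an Isomap component is arbitrary, which is why the absolute value is used.

```python
import numpy as np
from sklearn.datasets import make_s_curve
from sklearn.manifold import Isomap

# Same data and settings as the example above.
X_s, position = make_s_curve(n_samples=1000, random_state=42)
X_s_2d = Isomap(n_neighbors=20, n_components=2).fit_transform(X_s)

# Correlate the first embedding axis with the position along the curve.
corr = np.corrcoef(X_s_2d[:, 0], position)[0, 1]
print(f"|correlation| with curve position: {abs(corr):.3f}")
```

A value near 1 indicates that the first embedding axis follows the curve from one end to the other, which is what an unrolled S-curve should look like.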
Let’s use the “Labeled Faces in the Wild” (LFW) dataset, which consists of labeled images of faces. We can directly load it from the Scikit-learn dataset module using the fetch_lfw_people function.
# Example: Isomap on a public dataset (Labeled Faces in the Wild)
# Step 1: Import necessary libraries
from sklearn.datasets import fetch_lfw_people
from sklearn.manifold import Isomap
import matplotlib.pyplot as plt

# Step 2: Load LFW dataset (downloaded on first use)
lfw_people = fetch_lfw_people(min_faces_per_person=70, resize=0.4)
data = lfw_people.data
target = lfw_people.target

# Step 3: Visualize a few original faces
fig, axes = plt.subplots(2, 5, figsize=(12, 5), subplot_kw={'xticks':[], 'yticks':[]})
for i, ax in enumerate(axes.flat):
    # With resize=0.4, each flattened row is a 50 x 37 pixel image
    ax.imshow(data[i].reshape(50, 37), cmap='gray')
    ax.set_title(f"Person: {lfw_people.target_names[target[i]]}")
plt.show()

# Step 4: Create an Isomap instance
n_neighbors = 30
n_components = 2
iso = Isomap(n_neighbors=n_neighbors, n_components=n_components)

# Step 5: Fit and transform the Faces data
data_projected = iso.fit_transform(data)

# Step 6: Visualize the reduced-dimensional representation
plt.figure(figsize=(10, 8))
scatter = plt.scatter(data_projected[:, 0], data_projected[:, 1], c=target, cmap='viridis', s=10)
plt.colorbar(scatter, label='Person Label')
plt.title("Isomap Projection of Labeled Faces in the Wild Dataset")
plt.xlabel("Isomap Component 1")
plt.ylabel("Isomap Component 2")
plt.show()
The reduced-dimensional representation of the LFW dataset was obtained through Isomap, with colors representing different person labels.
Isomap is a powerful tool for dimensionality reduction, particularly suitable for datasets with intricate structures. By understanding its concept, terminology, and implementation steps, beginners can enhance their skills in handling high-dimensional data effectively. Experimenting with Isomap on various datasets will provide valuable insights into its capabilities and applications in real-world scenarios.