Recently, I finished one of Mark Manson's bestsellers, “The Subtle Art of Not Giving a F*ck”. One of the key takeaways from the book that applies to data science and modern ML built on big data is the idea of focusing on “what really matters”. Unlike in our lives, where figuring out what really matters gets complicated, in data science this means prioritizing the variables and metrics that truly impact business outcomes, rather than getting bogged down analyzing every possible data point and shipping an obnoxious kitchen-sink model to production.
The other day at work I got a request from a program team to onboard 30 new metrics they had uncovered from a new tool. The team wanted those metrics added as input features to the existing sale price prediction model, which was already working fine but had a validation MAPE of ~15%. The expectation was that the newly onboarded features would improve the model and reduce the error. So what could go wrong here?
As a data scientist, you have probably come across the term “Curse of Dimensionality”. The expression was coined by Richard E. Bellman when considering problems in dynamic programming. The common theme of these problems is that as the dimensionality increases, the volume of the space grows so fast that the available data become sparse.
The curse of dimensionality is a phenomenon that arises when analyzing and organizing data in high-dimensional spaces and that does not occur in low-dimensional settings such as the three-dimensional physical space of everyday experience.
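A quick way to feel this sparsity is to watch what happens to pairwise distances as the number of dimensions grows. The snippet below is a minimal sketch on purely synthetic data (a toy example of mine, not from any real dataset): as the dimensionality rises, the nearest and the farthest points end up almost equally far away.

```python
import numpy as np
from scipy.spatial.distance import pdist

# As the number of dimensions grows, the gap between the nearest and the
# farthest point shrinks relative to the distances themselves, so
# "closeness" loses meaning. Purely synthetic data, for illustration only.
rng = np.random.default_rng(42)

for d in [2, 10, 100, 1000]:
    X = rng.random((500, d))        # 500 random points in d dimensions
    dist = pdist(X)                 # all pairwise Euclidean distances
    contrast = (dist.max() - dist.min()) / dist.min()
    print(f"d={d:4d}  relative contrast = {contrast:.1f}")
# The printed contrast collapses as d increases: in high dimensions the
# available data effectively become sparse.
```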
So how do we fix this?
One way to deal with the “Curse of Dimensionality” is to perform dimensionality reduction.
There are several methods for dimensionality reduction, including PCA, LDA, autoencoders, UMAP, and t-SNE.
For this article, we will focus on t-SNE!
t-SNE is a nonlinear dimensionality reduction technique for embedding high-dimensional data for visualization in a low-dimensional space of two or three dimensions. Specifically, it models each high-dimensional object by a two- or three-dimensional point in such a way that similar objects are modeled by nearby points and dissimilar objects are modeled by distant points with high probability.
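Before going further, here is a minimal sketch of what using t-SNE typically looks like in practice, using scikit-learn's TSNE on its bundled digits dataset (the dataset and parameter values are simply illustrative choices of mine):

```python
import matplotlib.pyplot as plt
from sklearn.datasets import load_digits
from sklearn.manifold import TSNE

# 8x8 digit images give 64-dimensional feature vectors.
X, y = load_digits(return_X_y=True)

# Embed the 64-D points into 2-D; similar digits should land near each other.
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

plt.scatter(emb[:, 0], emb[:, 1], c=y, cmap="tab10", s=5)
plt.title("t-SNE embedding of the digits dataset")
plt.show()
```

On this dataset the ten digit classes usually separate into visibly distinct clusters, even with default-ish settings.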
This algorithm has picked up a lot of traction recently. I first came across it in one of the AWS SageMaker lab videos, where the instructor was reducing the dimensions of the results obtained after training a BlazingText model over reviews in the MovieLens dataset. The “t-distributed Stochastic Neighbor Embedding (t-SNE)” algorithm has become one of the most used and insightful techniques for exploratory analysis of high-dimensional data. Used to interpret deep neural network outputs in tools such as the TensorFlow Embedding Projector and TensorBoard, a powerful feature of t-SNE is that it reveals clusters of high-dimensional data points at different scales while requiring only minimal tuning of its parameters.
Figure credit: real-time evolution of the t-SNE embedding for the complete MNIST dataset with the TensorFlow Embedding Projector. The dataset contains images of 60,000 handwritten digits.
Despite these advantages, the computational complexity of the tSNE algorithm limits its application to relatively small datasets. While several evolutions of tSNE have been developed to address this issue (mainly focusing on the scalability of the similarity computations between data points), they have so far not been enough to provide a truly interactive experience when visualizing the evolution of the tSNE embedding for large datasets.
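For what it's worth, common libraries already ship some of these speed-ups. scikit-learn's TSNE, for instance, defaults to the Barnes-Hut approximation, which scales roughly as O(N log N) instead of the exact O(N²) computation. A quick sketch of the two modes (parameter values are illustrative):

```python
from sklearn.manifold import TSNE

# Exact t-SNE evaluates all N^2 pairwise terms; fine for small datasets.
tsne_exact = TSNE(n_components=2, method="exact")

# The Barnes-Hut approximation (scikit-learn's default) trades a little
# accuracy for roughly O(N log N) scaling; `angle` controls that trade-off.
tsne_bh = TSNE(n_components=2, method="barnes_hut", angle=0.5)
```

Even so, as noted above, these approximations still fall short of a truly interactive experience on very large datasets.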
So here are the steps that take place under the hood when the t-SNE algorithm is executed (a from-scratch sketch of these steps follows the list):
- Compute pairwise similarities: The first step of t-SNE is to compute pairwise similarities between all data points in the high-dimensional space. The most commonly used similarity measure is the Gaussian kernel, which computes the similarity between two data points based on their Euclidean distance.
- Compute similarity probabilities: The next step is to compute similarity probabilities between all pairs of data points in the high-dimensional space. The similarity probabilities are computed by normalizing the pairwise similarities such that they sum up to one for each data point.
- Compute similarity probabilities in the low-dimensional space: t-SNE then computes similarity probabilities for all pairs of data points in the low-dimensional space. Instead of a Gaussian, these are computed with a Student's t-distribution kernel with one degree of freedom (a Cauchy distribution); its heavier tails allow dissimilar points to sit far apart and are what give t-SNE its name.
- Minimize the Kullback-Leibler divergence: The goal of t-SNE is to minimize the Kullback-Leibler divergence between the similarity probabilities in the high-dimensional space and those in the low-dimensional space. This is done by iteratively adjusting the positions of the points in the low-dimensional space until the two sets of similarity probabilities are as close as possible. Here is a fun fact: minimizing the KL divergence is equivalent to minimizing a negative log-likelihood, i.e., maximizing the likelihood of the observed high-dimensional similarities under the low-dimensional model.
- Gradient descent: t-SNE uses gradient descent to minimize the Kullback-Leibler divergence. At each iteration, the algorithm computes the gradient of the divergence with respect to the positions of the data points in the low-dimensional space and updates their positions accordingly. The algorithm continues to iterate until the Kullback-Leibler divergence reaches a minimum or until a stopping criterion is met.
- Hyperparameter tuning: t-SNE has several hyperparameters that need to be tuned, including the perplexity (which controls the balance between local and global structure), the learning rate (which controls the step size in the gradient descent), and the number of iterations (which controls the length of the optimization process). The hyperparameters need to be carefully chosen to ensure that the resulting visualization captures the underlying structure of the data.
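To make these steps concrete, below is a deliberately simplified, from-scratch sketch of the core loop: Gaussian affinities with a perplexity-matched bandwidth, Student-t affinities in the embedding, and plain gradient descent on the KL divergence. It leaves out the momentum, early-exaggeration, and Barnes-Hut tricks that real implementations rely on, and the function name tsne_sketch is my own, so treat it as an illustration of the math rather than production code.

```python
import numpy as np

def tsne_sketch(X, n_components=2, perplexity=30.0, learning_rate=200.0,
                n_iter=500, random_state=0):
    """Bare-bones t-SNE: no Barnes-Hut, momentum, or early exaggeration."""
    rng = np.random.default_rng(random_state)
    n = X.shape[0]

    # Step 1: pairwise squared Euclidean distances in the original space.
    sq_norms = (X ** 2).sum(axis=1)
    D = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    np.fill_diagonal(D, 0.0)

    # Step 2: Gaussian conditional probabilities p_{j|i}. Each point's
    # bandwidth is binary-searched so the distribution's entropy matches
    # log(perplexity).
    P = np.zeros((n, n))
    target_entropy = np.log(perplexity)
    for i in range(n):
        beta, beta_lo, beta_hi = 1.0, 0.0, np.inf   # beta = 1 / (2 * sigma_i^2)
        for _ in range(50):
            p = np.exp(-D[i] * beta)
            p[i] = 0.0
            p /= p.sum()
            entropy = -np.sum(p[p > 0] * np.log(p[p > 0]))
            if entropy > target_entropy:            # too flat: narrow the kernel
                beta_lo = beta
                beta = beta * 2.0 if np.isinf(beta_hi) else (beta + beta_hi) / 2.0
            else:                                   # too peaked: widen the kernel
                beta_hi = beta
                beta = (beta + beta_lo) / 2.0
        P[i] = p

    # Symmetrise into joint probabilities p_ij.
    P = np.maximum((P + P.T) / (2.0 * n), 1e-12)

    # Steps 3-5: random start, then gradient descent on KL(P || Q), where Q
    # uses a Student-t kernel (one degree of freedom) in the embedding.
    Y = rng.normal(scale=1e-4, size=(n, n_components))
    for _ in range(n_iter):
        sq_dists = ((Y[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
        num = 1.0 / (1.0 + sq_dists)                # heavy-tailed similarities
        np.fill_diagonal(num, 0.0)
        Q = np.maximum(num / num.sum(), 1e-12)

        PQ = (P - Q) * num                          # (p_ij - q_ij) / (1 + d_ij^2)
        grad = 4.0 * (np.diag(PQ.sum(axis=1)) - PQ) @ Y
        Y -= learning_rate * grad                   # plain gradient descent step
    return Y

# Tiny synthetic example so the loop finishes quickly.
X = np.random.default_rng(1).random((200, 20))
Y = tsne_sketch(X, perplexity=15, n_iter=250)
print(Y.shape)   # (200, 2)
```

Notice how perplexity, learning_rate, and n_iter appear exactly as the hyperparameters described in the last step; in practice you would reach for scikit-learn or openTSNE rather than a hand-rolled loop like this.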
https://www.youtube.com/watch?v=NEaUSP4YerM&ab_channel=StatQuestwithJoshStarmer
A t-SNE IPL use case, and the birth of the Golden Duck Club!
Let's take a diversion, as too much theory just makes things boring!
IPL 2023 is a few days away, you have just finished reading about t-SNE, and you desperately want to try it out on something. Let's suppose you are a passionate IPL fan and want to analyze the performance of different teams and players over the years. You have collected a large dataset of IPL matches, including information on the runs scored, wickets taken, strike rates, and other performance metrics for each player and team.
To visualize this high-dimensional data, you decide to use t-SNE. You want to see if there are any clear clusters of players or teams with similar performance patterns over the years.
After running t-SNE on your dataset, you get a beautiful 2D scatter plot that shows clear clusters of teams and players with similar performance patterns. You notice that some teams and players are clustered together, indicating that they have similar strengths and weaknesses.
As you start exploring the visualization, you notice a strange cluster of players that stand out from the rest. Upon closer inspection, you realize that these are all the players who have scored a golden duck in the IPL (getting out on the first ball without scoring any runs)!
You can’t help but chuckle at this discovery and decide to call this cluster the “Golden Duck Club”. You also notice that some players who have a history of getting out on golden ducks are clustered together, suggesting that they may have a similar batting style or weakness that makes them susceptible to early dismissals.
Overall, t-SNE has helped you uncover interesting patterns and clusters in your IPL dataset, including a funny discovery of the “Golden Duck Club”. You can use this visualization to gain insights into the performance of different teams and players and make more informed decisions when drafting your fantasy IPL team!
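If you wanted to actually run this experiment, it might look something like the sketch below. The data here are entirely made up to mirror the scenario, and every column name (including the golden_ducks count) is my own invention, so treat it as a template rather than a real analysis.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE
from sklearn.preprocessing import StandardScaler

# Synthetic player-level stats standing in for a real IPL dataset.
rng = np.random.default_rng(7)
n_players = 300
stats = pd.DataFrame({
    "runs": rng.poisson(250, n_players),
    "strike_rate": rng.normal(130, 15, n_players),
    "average": rng.normal(28, 8, n_players),
    "wickets": rng.poisson(8, n_players),
    "economy": rng.normal(8.2, 1.0, n_players),
    "golden_ducks": rng.poisson(1, n_players),
})

# Standardise, then embed into 2-D; colour by golden-duck count to see
# whether a "Golden Duck Club" cluster shows up.
X = StandardScaler().fit_transform(stats)
emb = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X)

plt.scatter(emb[:, 0], emb[:, 1], c=stats["golden_ducks"], cmap="viridis", s=10)
plt.colorbar(label="golden ducks")
plt.title("t-SNE of (synthetic) IPL player stats")
plt.show()
```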
To conclude:
t-SNE is an iterative algorithm that computes pairwise similarities between data points, computes similarity probabilities in high-dimensional and low-dimensional spaces, and minimizes the Kullback-Leibler divergence using gradient descent. The algorithm requires careful hyperparameter tuning to ensure that the resulting visualization captures the structure of the data.