How to build a clustering model focusing on explainability
Clustering is an unsupervised Machine Learning technique for identifying groups of similar data points in a given dataset. In theory, the points in each group share properties that help explain the data and reveal patterns. Clustering applications in real life include customer and product segmentation, facility location optimization, recommendation systems, medical imaging, sports scouting, and more.
One of the most widely used algorithms in unsupervised learning is K-Means, which we will use in this article. K-Means is a simple algorithm that groups similar data points: it starts from randomly selected centroids, assigns each point to its nearest centroid, moves each centroid to the mean of its assigned points, and repeats these two steps. The algorithm stops iterating when the centroids stabilize or the maximum number of iterations is reached.
Although it can be applied to high-dimensional data, a common form of clustering is to pick two variables of the data, apply the clustering algorithm (in our case K-Means), and then draw a scatter plot with colors differentiating the clusters. The image below shows an application with three clusters:
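To make the loop described above concrete, here is a minimal NumPy sketch of it. It is only an illustration: the function names, the tolerance, the seed, and the two synthetic blobs are assumptions of this example, and the rest of the article relies on a library implementation instead.

```python
import numpy as np


def nearest_centroid(points, centroids):
    """Index of the closest centroid for every point."""
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)


def kmeans(points, n_clusters, max_iter=100, tol=1e-6, seed=0):
    """Plain K-Means loop; returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    # Start from randomly selected data points as the initial centroids.
    start = rng.choice(len(points), size=n_clusters, replace=False)
    centroids = points[start].astype(float)

    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        labels = nearest_centroid(points, centroids)

        # Update step: move each centroid to the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = centroids.copy()
        for k in range(n_clusters):
            members = points[labels == k]
            if len(members) > 0:
                new_centroids[k] = members.mean(axis=0)

        # Stop once the centroids have stabilized.
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break

    return centroids, nearest_centroid(points, centroids)


# Two synthetic, well-separated blobs (made up for this example).
rng = np.random.default_rng(42)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(50, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(50, 2))
data = np.vstack([blob_a, blob_b])

centroids, labels = kmeans(data, n_clusters=2)
```

The max_iter and tol arguments correspond to the two stopping conditions mentioned above: the iteration cap and the stabilization of the centroids.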
With the labeled groups, business decisions can be made from there. Imagine an online retailer with a customer who has recently spent money on baby clothes and diapers. We can label this customer as interested in baby items and use a recommendation system to offer products that other people in the baby-items group bought, such as toys or a cradle.
Or imagine you are the scout of a basketball team responsible for identifying the best talents of a youth league, with the players’ stats available. You can define the most important stats and cluster the players. Afterward, it should be possible to identify the group of outstanding players you should take a closer look at and the ones that won’t be evaluated further.
The Plotly graphing library has a 3-D scatter method that lets us move past the 2-D barrier in data visualization and improves the explainability of the K-Means clustering. When the data is plotted in a 3-D scatter, it’s possible to visually relate the clusters to the features and identify the properties the data points have in common that led the algorithm to form those groups. Here is an example with six clusters:
In this article, we will present a walkthrough on how to build an interactive visualization of 3-D clustering like the one displayed above. For this, we have chosen the Basketball dataset from Kaggle, as it lets us illustrate the basketball players example using the stats of over 4000 former and current National Basketball Association (NBA) players. We’ll also go into an approach to define the ideal number of clusters. Finally, with the interactive visualization built, we will discuss the properties of each group, adding explainability to the model.
So let’s begin by loading the data. Our dataset is stored in a database, so we must create a connection to it using SQLite and define its path. Then, in SQL, we write the query to extract the desired data, in our case the players’ statistics, which are stored in the Player_Attributes table. As we don’t know yet which features will be used in the clustering, it’s a good idea to extract the full table with *. Using the pandas read_sql method, we can store the output of the SQL query in a pandas data frame, which we can then investigate to pick the best features for the clustering.
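A minimal sketch of this loading step could look like the following. The function name is hypothetical and the path of the database file has to be supplied by the reader; only the table name Player_Attributes comes from the text above.

```python
import sqlite3
from contextlib import closing

import pandas as pd


def load_player_stats(db_path):
    """Read the whole Player_Attributes table into a pandas data frame."""
    # SELECT * pulls every column, since the features are not chosen yet.
    query = "SELECT * FROM Player_Attributes"
    # closing() makes sure the connection is released after the read.
    with closing(sqlite3.connect(db_path)) as connection:
        return pd.read_sql(query, connection)
```

Calling load_player_stats("basketball.sqlite"), where the file name is a placeholder for wherever the Kaggle database was saved, would return the data frame that the rest of the article refers to as stats.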
So here are the available features for us to build the 3-D clustering:
Looking at the features, most of them are categorical, but we have the statistics we want in “PTS”, “AST” and “REB”, which are short for points per game (PPG), assists per game (APG), and rebounds per game (RPG), the three most common statistics of a basketball player. Therefore, in this article, we will build a 3-D clustering of over 4000 past and current NBA players with the three dimensions PPG, RPG, and APG.
When a player makes a shot, he scores points: three points if the shot is made from behind the three-point line, two points if it is made from inside the three-point line, and one point if it is made from the free-throw line, which happens only after a foul. A rebound is credited when a player grabs the ball after a missed shot, whether the shot was his own or taken by another player. An assist happens when a player passes the ball to a teammate who scores right away.
Now we have to answer the question: how many clusters should we use to group those over 4000 players in an optimal way? Logic tells us that the more clusters we have, the more variation will be explained by the model. That is true, except that after some point the model will overfit. In general, one must pick the number of clusters so that adding another cluster won’t add much to the modeling of the data. Our goal is to find that optimal point. A heuristic called the silhouette method can help us, and the Yellowbrick library has a function for it. Here is the code to use it with K-Means and three features:
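As a library-light sketch of such a scan, the example below computes the mean silhouette score for several cluster counts with scikit-learn’s silhouette_score instead of Yellowbrick. The function name, the seed, the choice to standardize before scoring, and the synthetic three-blob data standing in for the three stats are all assumptions of this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler


def silhouette_by_k(features, k_values, seed=0):
    """Return {k: mean silhouette score} for K-Means fitted with each k."""
    # Standardizing first is an assumption, made to match the clustering step.
    scaled = StandardScaler().fit_transform(features)
    scores = {}
    for k in k_values:
        model = KMeans(n_clusters=k, max_iter=100, n_init=10, random_state=seed)
        labels = model.fit_predict(scaled)
        scores[k] = silhouette_score(scaled, labels)
    return scores


# Synthetic stand-in for the three stats: three well-separated blobs.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0, 0.0], [10.0, 10.0, 10.0], [0.0, 10.0, 0.0]])
toy = np.vstack([rng.normal(c, 0.5, size=(60, 3)) for c in centers])

scores = silhouette_by_k(toy, range(2, 7))
best_k = max(scores, key=scores.get)  # cluster count with the highest score
```

On real data the highest score is only a starting point; the choice of the final number of clusters also depends on context.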
The metric used was the Silhouette Score, which measures how well the points are grouped into the clusters. It ranges from -1 to 1, with 1 being the optimal and -1 the worst grouping possible. For a single data point, the formula is S = (b - a) / max(a, b), where “a” is the average distance between that point and the other points in its own cluster, and “b” is the average distance between that point and the points of the nearest other cluster. The score of the model is the average of S over all data points. Here’s the plot:
The highest score was with two clusters, but rather than simply using the number of clusters with the highest score, it is important to think about the context of the data. In this article, we will cluster over 4000 NBA players based on their stats. The model wouldn’t have much explainability if we divided our dataset into only two clusters, as it would be just an above-average players cluster versus a below-average players cluster. Therefore, we need a higher number of clusters to later observe with clarity the properties that led the K-Means algorithm to form each cluster.
If we look at the step from five to six clusters, the silhouette score increases, meaning that six clusters are a better grouping than five. Given our goal of having the highest number of clusters while maintaining a good silhouette score, we will pick six clusters for our model. Later, when we explain the properties of every cluster, it will be possible to understand better why this is the best choice.
With the number of clusters defined it’s time to use Plotly and build our 3-D clustering, so we must build a function:
Detailing the steps of the function:
- Drop the rows that have NaNs on the columns of the chosen features.
- Reduce the dimensionality of the data frame to three features using the function inputs.
- Instantiate the standard scaler. We’ll need to standardize our clustering data as the scales of our features are different, and we do not want one of them to have more impact on the clustering just because the scale is broader.
- Fit and transform the data frame with the standard scaler and convert it to a NumPy array.
- Define the hover text that will be used in the 3-D scatter plot. In our case, we used the “DISPLAY_FIRST_LAST” column, which contains the full names of the players.
- Create the K-Means model using the function input for the number of clusters and set the maximum of iterations to 100.
- Fit the array to the model.
- Create a column in the reduced data frame with the labels provided by the model to each of the data points.
- Use Plotly’s “scatter_3d” to create the visualization using the previously defined hover and the labels column as the color parameter.
- Show the visualization.
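The steps above can be sketched as follows. This is a reconstruction under assumptions, not necessarily the exact original function: the name clusterized_scatter and the argument order follow the call used in this article, while the hover_col, show and seed parameters, the “cluster” column name, and returning the labeled data frame are choices made for this example.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


def clusterized_scatter(df, n_clusters, x, y, z,
                        hover_col="DISPLAY_FIRST_LAST", show=True, seed=0):
    """Cluster df on three features and plot the result in 3-D.

    Returns the reduced data frame with an added "cluster" column.
    """
    features = [x, y, z]
    # Drop rows with NaNs in the chosen features and keep only the
    # hover column plus the three features.
    reduced = df.dropna(subset=features)[[hover_col] + features].copy()

    # Standardize so that no feature dominates because of a broader scale;
    # fit_transform already returns a NumPy array.
    scaled = StandardScaler().fit_transform(reduced[features])

    # Fit K-Means on the scaled array and store the label of each row.
    model = KMeans(n_clusters=n_clusters, max_iter=100, n_init=10,
                   random_state=seed)
    reduced["cluster"] = model.fit_predict(scaled)

    if show:
        # Imported lazily so the clustering part also works without Plotly.
        import plotly.express as px

        fig = px.scatter_3d(reduced, x=x, y=y, z=z,
                            color="cluster", hover_name=hover_col)
        fig.show()

    return reduced
```

The plot uses the original, unscaled stats on the axes, so the values shown on hover stay readable as PPG, RPG, and APG, while the clustering itself runs on the standardized array.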
With the function built, we just have to call it with the previously discussed features and the number of clusters and see the result.
clusterization_nba_players = clusterized_scatter(stats, 6, 'PTS', 'REB', 'AST')
And we get this visualization:
The six clusters are well distributed in the 3-D space, except for clusters 1 (purple) and 5 (yellow). Both are close to the origin, so we can assume that those data points are players who did not have high PPG, RPG, and APG stats.
Before explaining the clusters, we must discuss basketball to create a context for our clustering explanation. Three positions exist in basketball: Guard, Forward, and Center. Guards are usually smaller and quicker players responsible for organizing the team, running the ball, and shooting. Although guards don’t have high rebound stats because they stay far from the basket, the best players in this position usually have high points and assists stats, as they are the ones with the ball most of the time. Forwards are taller than guards but shorter than centers. They are flexible players who can pass, shoot and also stay closer to the basket if necessary, either scoring or fighting for the rebounds. Centers, on the other hand, are the tallest players on the team with the main tasks of picking up the rebounds and scoring closer to the basket. So, from them, we must expect higher rebounds and points statistics but lower in assists.
Now we’ll look into each of the six clusters and try to explain their properties, relating them to the positions in the basketball game:
Yellow (5), 4th-tier players: According to the stats, players in this cluster are the ones that did not have much impact during their NBA careers. It’s interesting to observe that the yellow cluster is extremely dense, with 1693 data points, the most of any cluster. That means more than a third of the players in the dataset fit into this tier. Even though they managed to play in the league, their impact was not significant. Mean PPG: 2.5, Mean RPG: 1.3, Mean APG: 0.6
Dark Purple (1), 3rd-tier players: The players in this cluster also did not have memorable careers but contributed more to the teams they played for than the 4th-tier players. The dark purple is also a dense cluster, with 968 players. Mean PPG: 5.7, Mean RPG: 3.4, Mean APG: 0.9
Blue (0), 2nd-tier Forwards/Centers: Players that had solid NBA stats, with more emphasis on rebounding than on assists, from which we can assume that this group is formed mainly by forwards and centers. Mean PPG: 10.5, Mean RPG: 5.9, Mean APG: 1.6
Orange (3), 2nd-tier Guards: Also players with solid stats, but especially in assists, with a better APG than the blue cluster but worse rebounding. Therefore, we can assume the players in this cluster are mainly guards. Mean PPG: 8.4, Mean RPG: 2.3, Mean APG: 2.9
Mustard (4), 1st-tier Forwards/Centers: This cluster contains the players with high points per game and outstanding rebounding who do not shine in assists, as the mean APG of this group is even lower than that of the second-tier guards. Therefore, it’s safe to say that the players in this cluster are the elite forwards and centers in the history of the league. It’s the smallest cluster, with only 152 players. Mean PPG: 17.2, Mean RPG: 9.9, Mean APG: 2.6
Light Purple (2), 1st-tier Guards: This group is formed by the elite guards, as shown by the low rebounding and high assist numbers combined with a high points-per-game average. The mean APG of the players in this cluster is more than double that of the 1st-tier Forwards/Centers, while the RPG is less than half and the PPG is close. It is also not a dense cluster, with only 236 players. If we add up the mustard and light purple clusters, where the elite players are, we only have about 9% of the dataset. Mean PPG: 16.8, Mean RPG: 4.0, Mean APG: 5.4
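Per-cluster figures like the means and cluster sizes quoted above can be computed with a pandas groupby. The sketch below assumes a labeled data frame with a “cluster” column; the function name and the tiny made-up frame are only for illustration and are not taken from the dataset.

```python
import pandas as pd


def summarize_clusters(labeled, features, label_col="cluster"):
    """Per-cluster size, share of the dataset, and feature means."""
    grouped = labeled.groupby(label_col)
    summary = grouped[features].mean().round(1)
    # Number of players in each cluster, aligned on the cluster label.
    summary.insert(0, "players", grouped.size())
    # Fraction of the whole dataset that falls into each cluster.
    summary.insert(1, "share", (summary["players"] / len(labeled)).round(3))
    return summary


# Made-up rows for illustration only.
toy = pd.DataFrame({
    "cluster": [0, 0, 1, 1, 1],
    "PTS": [10.0, 12.0, 2.0, 3.0, 4.0],
    "REB": [6.0, 8.0, 1.0, 1.0, 1.0],
    "AST": [1.0, 2.0, 0.5, 0.5, 0.5],
})
summary = summarize_clusters(toy, ["PTS", "REB", "AST"])
```

Applied to the labeled NBA data frame, the same function would return one row per cluster with its size, its share of the dataset, and its mean PTS, REB, and AST.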
Now, to wrap up, a few observations on some outliers:
- Wilt Chamberlain (1st-tier Forwards/Centers): The historic center, who played in the 60s and early 70s, stands out in the 3-D scatter with his data point very far from any other. As the plot shows, Wilt was dominant in his era at scoring and rebounding. In the 1961–62 season, he averaged over 50 points and over 25 rebounds per game, widely considered one of the greatest feats in the history of sports.
- Dennis Rodman (1st-tier Forwards/Centers): Despite being “only” 2 m tall, which is short for his position, he is regarded as one of the best rebounders in the history of the league and is a five-time champion. Even though he did not contribute much on the offensive end with points and assists, his complete dominance on the rebounds guaranteed him a place in the elite cluster. Had the data not been scaled in the clustering process, Dennis probably would not be in the elite cluster: the PPG feature, in which he averaged only 7.3, has a broader range and would weigh more heavily in the K-Means algorithm.
- Draymond Green (1st-tier Guards): Green is a special kind of player. He is a 1.98 m forward, short like Dennis Rodman, except that he excels not at rebounding but at passing. He is regarded as one of the players who helped change the game with his passing and ability to space the floor, moving the ball quickly. Draymond is so good at passing that our model grouped him with the 1st-tier Guards, even though he is a Forward and sometimes plays at Center. Green is another player who would not rank in a 1st-tier cluster if the data were not standardized, because of his low PPG.
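The effect of standardization mentioned for Rodman and Green can be illustrated with a small synthetic experiment. Everything in this sketch is made up for the purpose: one broad-range feature that carries no group information and one narrow-range feature that separates two groups. The adjusted Rand index is used here only as a convenient way to compare the K-Means labels with the known groups.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200
# Two made-up groups that differ only in the narrow-range feature.
group = np.repeat([0, 1], n // 2)
narrow = np.where(group == 0, 2.0, 10.0) + rng.normal(0.0, 0.5, n)
# A broad-range feature that carries no information about the group.
broad = rng.uniform(0.0, 30.0, n)
raw = np.column_stack([broad, narrow])


def agreement(data):
    """Adjusted Rand index between K-Means labels and the true groups."""
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(data)
    return adjusted_rand_score(group, labels)


# Without scaling, the broad feature dominates the distances.
ari_raw = agreement(raw)
# After standardization, both features weigh the same.
ari_scaled = agreement(StandardScaler().fit_transform(raw))
```

In this setup the unscaled run tends to split the points along the broad feature and misses the real groups, while the standardized run recovers them. This is the same mechanism that keeps a low-scoring but elite rebounder or passer in a 1st-tier cluster.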
Diving deeper into the details, understanding the properties of each cluster, and creating a context for explainability extend a clustering analysis beyond the visualization. Explainability is a much-discussed topic in the Data Science community, as data scientists may focus on the model rather than the problem, and the explainability can get lost along the way.
I hope you enjoyed this read, and I am available for further questions and discussions!