## How to build a clustering model focusing on explainability

Clustering is an unsupervised Machine Learning technique that identifies groups of similar data points in a given dataset. In theory, the points in each group share properties that help explain the data and reveal patterns. Clustering applications in real life include customer and product segmentation, facility location optimization, recommendation systems, medical imaging, sports scouting, and more.

One of the most widely used algorithms in unsupervised learning is K-Means, which we will use in this article. K-Means is a simple algorithm that groups similar data points: it starts with randomly selected centroids, assigns each point to its nearest centroid, and then moves each centroid to the mean of the points assigned to it, repeating these two steps to optimize the position of the centroids. The algorithm stops iterating when the centroids stabilize or the maximum number of iterations is reached.
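The loop described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the implementation used later in the article: the function name `kmeans`, its parameters, and the synthetic two-blob data are all assumptions made for the example.

```python
import numpy as np


def kmeans(X, k, max_iter=100, tol=1e-6, seed=0):
    """Plain K-Means on an (n_samples, n_features) array (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random as centroids.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: each centroid moves to the mean of its points
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids have stabilized.
        converged = np.linalg.norm(new_centroids - centroids) < tol
        centroids = new_centroids
        if converged:
            break
    return labels, centroids


# Two well-separated synthetic blobs (made-up data, for illustration only).
rng = np.random.default_rng(42)
X = np.vstack([rng.normal(0, 0.5, (50, 2)), rng.normal(10, 0.5, (50, 2))])
labels, centroids = kmeans(X, k=2)
print(centroids)
```

In practice a library implementation such as scikit-learn's `KMeans` is usually preferable, since it adds smarter initialization and multiple restarts.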

Although it can be applied to high-dimensional data, a common form of clustering is to pick two variables of the data, apply the clustering algorithm (in our case K-Means), and then draw a scatter plot with color differentiating the clusters. The image below shows an application with three clusters:

With the groups labeled, business decisions can be made from there. Imagine an online retailer with a customer who has recently spent money on baby clothes and diapers. We can label this customer as interested in baby items and use a recommendation system to offer products that other people in the baby items group bought, such as toys or a cradle.

Or imagine you are a scout for a basketball team, responsible for identifying the best talents in a youth league, with the players’ stats available. You can define the most important stats and cluster the players. Afterward, it should be possible to identify the group of outstanding players you should take a closer look at and the ones that won’t be evaluated further.

The graphing library Plotly has a 3-D scatter method that lets us break the 2-D barrier in data visualization and improves the explainability of K-Means clustering. Because the data is plotted in a 3-D scatter, it’s possible to visually correlate the clusters with the features and identify the properties the data points have in common that led the algorithm to form those groups. Here is an example with six clusters:

In this article, we will present a walkthrough on how to build an interactive visualization of 3-D clustering like the one displayed above. For this, we have chosen the Basketball dataset from Kaggle, which lets us illustrate the basketball players example using the stats of over 4000 former and current National Basketball Association (NBA) players. We’ll also go through an approach to define the ideal number of clusters. Finally, with the interactive visualization built, we will discuss the properties of each group, adding explainability to the model.

So let’s begin by loading the data. Our dataset is stored in a SQLite database, so we must create a connection to it by passing its path. Then, in SQL, we write the query that extracts the desired data, in our case the players’ statistics, which are stored in the *Player_Attributes* table. As we don’t know yet which features will be used in the clustering, it’s a good idea to extract the full table with a SELECT * query. Using the pandas read_sql method, we can store the output of the SQL query in a pandas data frame, which we can then investigate to pick the best features for the clustering.
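The loading step described above could look roughly like the sketch below. To keep it runnable without the Kaggle file, it builds a tiny in-memory database with made-up rows; the column `PLAYER`, the sample values, and the file path mentioned in the comment are assumptions, and only the table name *Player_Attributes* and the columns PTS, AST, and REB come from the article.

```python
import sqlite3

import pandas as pd

# Assumption: with the real Kaggle file you would connect to its path instead,
# e.g. sqlite3.connect("path/to/basketball.sqlite") (hypothetical path).
# An in-memory database with made-up rows keeps this sketch self-contained.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Player_Attributes (PLAYER TEXT, PTS REAL, AST REAL, REB REAL)"
)
conn.executemany(
    "INSERT INTO Player_Attributes VALUES (?, ?, ?, ?)",
    [
        ("Player A", 20.1, 5.0, 7.2),
        ("Player B", 8.4, 1.3, 3.9),
        ("Player C", 12.0, 8.8, 2.5),
    ],
)
conn.commit()

# Pull the whole table, since the clustering features are not chosen yet.
query = "SELECT * FROM Player_Attributes"
df = pd.read_sql(query, conn)
conn.close()

# Inspect the available columns before picking features.
print(df.columns.tolist())
print(df.head())
```

Calling `df.info()` or `df.dtypes` afterwards is a quick way to see which columns are numeric and which are categorical.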

So here are the available features for us to build the 3-D clustering:

Looking at the features, most of them are categorical, but we have the statistics we want in “PTS”, “AST”, and “REB”, which are short for **points per game** (PPG), **assists per game** (APG), and **rebounds per game** (RPG), the three most common statistics of a basketball player. Therefore, in this article, we will build a 3-D clustering of over 4000 past and current NBA players with the three dimensions PPG, RPG, and APG.

When a player makes a shot, he scores points: three if the shot is made from behind the three-point line, two if it is made from inside the three-point line, and one if it is made from the free-throw line, which happens only after a foul. A rebound is credited when a player grabs the ball after a missed shot, whether the shot was his own, a teammate’s, or an opponent’s. An assist is credited when a player passes the ball to a teammate who scores right away.
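To connect these rules to the per-game columns, here is a small worked example of the arithmetic. The helper names `points_scored` and `per_game` and all the numbers are hypothetical, chosen only to illustrate how a value like PTS is derived.

```python
def points_scored(three_pointers, two_pointers, free_throws):
    """Total points from made shots: 3, 2, and 1 point each respectively."""
    return 3 * three_pointers + 2 * two_pointers + 1 * free_throws


def per_game(total, games_played):
    """Average of a counting stat (points, assists, rebounds) per game."""
    return total / games_played


# One game with 2 three-pointers, 5 two-pointers, and 4 free throws made.
game_points = points_scored(three_pointers=2, two_pointers=5, free_throws=4)
print(game_points)  # 3*2 + 2*5 + 1*4 = 20

# A made-up season: 1640 points over 82 games.
ppg = per_game(total=1640, games_played=82)
print(ppg)  # 20.0
```

Assists and rebounds are simple counts, so their per-game values follow from `per_game` alone.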

Now we have to answer the question: how many clusters should we use to group those over 4000 players in an optimal way? Intuitively, the more clusters we have, the more variation the model explains. That is true, but after some point the model will overfit. In general, one should pick the number of clusters so that adding another cluster doesn’t improve the modeling of the data much. Our goal is to find that optimal point. A heuristic called the silhouette method can help us, and the Yellowbrick library has a function for it. Here is the code to use it with K-Means and three features:
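As a minimal stand-in sketch, the idea can be reproduced with scikit-learn's `silhouette_score` directly instead of Yellowbrick (a deliberate swap, so this is not the Yellowbrick call itself). The silhouette score measures how close each point is to its own cluster compared with the nearest other cluster, and the candidate number of clusters with the highest average score is preferred. The three-feature data below is synthetic and only mimics the shape of a PTS, AST, REB table.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Made-up stand-in for the three features (PTS, AST, REB):
# three synthetic groups of 100 points each.
rng = np.random.default_rng(0)
centers = np.array([[5.0, 1.0, 2.0], [15.0, 3.0, 8.0], [25.0, 8.0, 5.0]])
X = np.vstack([rng.normal(c, 1.0, size=(100, 3)) for c in centers])

# Fit K-Means for each candidate k and record the average silhouette score.
scores = {}
for k in range(2, 9):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

# The k with the highest average silhouette score is the suggested choice.
best_k = max(scores, key=scores.get)
for k, score in scores.items():
    print(f"k={k}: silhouette={score:.3f}")
print("suggested number of clusters:", best_k)
```

With the real data frame, `X` would be the three selected columns, for example `df[["PTS", "AST", "REB"]].dropna().to_numpy()`, assuming those columns are numeric.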