Measuring differences within similar cluster points
Clustering is an unsupervised machine learning technique used to discover interesting patterns in data. Examples include grouping similar customers based on their behavior, building a spam filter, and identifying fraudulent or criminal activity.
In clustering, similar items (or data points) are grouped together. However, we do not only want to group similar items together; we would also like to measure how similar or different they are. To do this, we can create a simple scoring algorithm.
In this example, I use a simple k-means clustering method. You can read about it here. We generate isotropic Gaussian blobs for clustering with sklearn.datasets.make_blobs.
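Generating the blobs might look like the following sketch; the sample size, number of centers, and random seed here are illustrative choices, not values from the original article:

```python
from sklearn.datasets import make_blobs

# Generate a toy 2-D dataset of isotropic Gaussian blobs.
# n_samples, centers, and random_state are illustrative choices.
X, y = make_blobs(n_samples=300, centers=3, n_features=2, random_state=42)

print(X.shape)  # feature matrix: 300 points, 2 columns
```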
Next, we build a simple k-means model with 3 clusters and get the centroids of these clusters.
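A minimal version of this step, assuming the same sklearn setup and illustrative parameters as above:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Same illustrative toy dataset as before.
X, _ = make_blobs(n_samples=300, centers=3, n_features=2, random_state=42)

# Fit k-means with 3 clusters and assign each point to a cluster.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)

# The centroid (center) of each of the 3 clusters.
centroids = kmeans.cluster_centers_
print(centroids.shape)  # (3, 2): one 2-D center per cluster
```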
Now, to score each point in the different clusters, we can measure how close it is to the center of its cluster and compare that to the farthest point in the same cluster. In this example, our dataset has 2 columns, so we can easily measure the sum of the squared differences between each point and its centroid. These distances can then be converted to percentages for easy interpretation.
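One way to sketch this scoring idea: compute each point's squared distance to its centroid and normalise it by the farthest point in that cluster, so 0% means "at the center" and 100% means "as far out as the cluster's outermost point". The exact normalisation is my assumption; the article only says distances are compared to the farthest point and expressed as percentages.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Same illustrative setup as before.
X, _ = make_blobs(n_samples=300, centers=3, n_features=2, random_state=42)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)
centroids = kmeans.cluster_centers_

# Sum of squared differences between each point and its own centroid.
sq_dist = ((X - centroids[labels]) ** 2).sum(axis=1)

# Normalise within each cluster by its farthest point, as a percentage:
# 0% = at the centroid, 100% = the cluster's outermost point.
scores = np.empty_like(sq_dist)
for k in range(kmeans.n_clusters):
    mask = labels == k
    scores[mask] = 100 * sq_dist[mask] / sq_dist[mask].max()

print(scores.min(), scores.max())
```

Points with scores near 100% are the ones closest to "falling off" into a neighbouring cluster.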
The measurements not only give us an estimate of how far a point is from the center of its cluster, but also how close it is to falling into the next cluster. This is particularly interesting for problems like customer segmentation, where we would like to test how each marketing approach affects the customer.
You can also check out my article on “Building Customer Clusters Using Unsupervised Machine Learning” here.