The ideas of the preceding section are simple and intuitive, but they are not how we should formulate the problem. Just as we took a relatively formal approach to defining an optimization objective in order to find the optimal discriminant, here too we should formulate an objective rather than guess at an algorithm.
Standard K-means (left) characterizes each cluster by a prototype only, and so has no data-driven notion of class size. Generalizing the clustering problem to include a dependence on class statistics via a Mahalanobis distance seems like an excellent idea (middle), because it can accommodate nonspherical class shapes. However, this additional flexibility can easily lead to poor convergence (right), since the absence of class labels means that the K-means iteration lacks certain constraints. Here, for example, the right panel shows that Mahalanobis K-means has learned one large cluster that captures nearly all of the data points, and a second cluster (small arrow) of essentially zero size.
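As a concrete sketch of the Mahalanobis generalization, the distance from each point to a prototype is weighted by that cluster's sample covariance. This is a minimal illustration assuming NumPy; the function name mahalanobis_d2 and its signature are illustrative rather than taken from the text.

```python
import numpy as np

def mahalanobis_d2(X, mu_k, Sigma_k):
    """Squared Mahalanobis distance from every row of X (N x D) to
    prototype mu_k (D,), weighted by cluster covariance Sigma_k (D x D)."""
    diff = X - mu_k
    Sigma_inv = np.linalg.inv(Sigma_k)
    # Computes diff[i] @ Sigma_inv @ diff[i] for every row i.
    return np.einsum('ij,jk,ik->i', diff, Sigma_inv, diff)
```

Because Sigma_k is re-estimated from whichever points are currently assigned to cluster k, nothing prevents the degenerate solution of the right panel, in which one covariance grows to cover nearly all of the data while the other shrinks toward zero.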
Given an initial set of prototypes (left), the basic idea behind K-means is to iteratively classify all of the given data points and then to update the prototypes, as described in the algorithm above. A variety of stopping criteria are possible; typically, however, the prototypes stop changing within relatively few iterations. The method illustrated here is based on a Euclidean distance.
K-means is essentially an iterative distance-based classifier. Given a set of class prototypes and a distance function, each data point is assigned to the closest prototype. Then, given the classified data points, the sample mean of each class is used to update its prototype, as illustrated above. The algorithm as written here is based on a Euclidean distance, but it is easily generalized, by also computing a sample covariance per class, to allow for non-Euclidean distances.
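The following is a minimal sketch of this Euclidean K-means iteration, assuming NumPy; the function name kmeans and its parameters are illustrative. Each pass classifies every point to its nearest prototype, recomputes each prototype as the sample mean of its assigned points, and stops once the prototypes no longer change.

```python
import numpy as np

def kmeans(X, K, n_iter=100, rng=None):
    """Plain Euclidean K-means: assign each point to its nearest
    prototype, then update each prototype as the mean of its points."""
    rng = np.random.default_rng(rng)
    # Initialize prototypes by choosing K distinct data points at random.
    mu = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # Classification step: squared Euclidean distance to each prototype.
        d2 = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: sample mean of the points assigned to each cluster;
        # an empty cluster keeps its previous prototype.
        new_mu = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                           else mu[k] for k in range(K)])
        # Stopping criterion: the prototypes have stopped changing.
        if np.allclose(new_mu, mu):
            break
        mu = new_mu
    return mu, labels
```

The Mahalanobis variant discussed above would additionally re-estimate each cluster's sample covariance in the update step and use it in the distance calculation, as in the earlier sketch.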