Overview

With Gaussian mixture models, a cluster is represented using three elements: a mean, a variance and a weight. In this approach each sample belongs to every cluster with a given probability. In many situations, however, we would like to assign a single cluster to each sample. One way to do so is to use hard-clustering techniques.

A well-known hard-clustering algorithm is k-means clustering. The algorithm is frequently used because it is easy to interpret and implement. In k-means clustering we compute the distance between a sample point and the centroids of all clusters, and the point is assigned to the cluster whose centroid is closest.

K-means clustering

The k-means algorithm starts by arbitrarily selecting $k$ centroids, one per cluster. It then iteratively refines the partition by first assigning each point to the closest centroid and then updating the centroids. This process is repeated until some convergence criterion is satisfied.

Four steps

Thus, k-means can be broken into four steps. In particular,

  1. Initialization
  2. Classification
  3. Recentering
  4. Repeat steps 2 and 3

Steps 2-4 of the algorithm above compute exact distances between points and centroids and then update each centroid to the center of mass of its cluster. However, given the random initialization step, the algorithm also qualifies as a Monte Carlo randomized algorithm [1].

The initialization step is crucial for the algorithm [1]. It heavily influences the final result; a bad choice can slow down convergence and will likely lead to an early stop and a poor result [1]. Furthermore, because the algorithm removes centroids to which no points are assigned, an initial choice consisting of several centroids close to each other could lead to an unwanted reduction of the number of centroids in the early stages of the algorithm. Therefore, in practice, the algorithm is always used with a random-restart strategy: it is run several times on the dataset, each time with a different random initialization of the centroids, and the results are compared to choose the best clustering [1].

The aforementioned process minimizes the total intra-cluster variation across all clusters. Mathematically, k-means clustering minimizes a loss function; for instance, using the squared Euclidean distance,

$$L = \sum_{j=1}^k \sum_{\mathbf{x}_i \in C_j}|| \mathbf{x}_i - \boldsymbol{\mu}_j||^2$$

This quantity is also called the inertia [3]. Large values of the inertia imply low cohesion. Directly minimizing the loss function $L$, i.e. finding a global minimum, has exponential complexity [3]. The k-means algorithm employs the iterative approach described above to avoid this cost.
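
To make this concrete, here is a minimal NumPy sketch that evaluates the loss $L$ (the inertia) for a given assignment; the names X, labels and centroids are illustrative placeholders, not part of any library API.

import numpy as np

def inertia(X, labels, centroids):
    # Sum of squared Euclidean distances between each point and the
    # centroid of the cluster it is assigned to.
    return sum(
        np.sum((X[labels == j] - mu) ** 2)
        for j, mu in enumerate(centroids)
    )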

The algorithm can be seen as a search heuristic converging to a (local) optimum [1]. Convergence may be checked by testing whether the computed cluster centroids have stopped changing, or whether the classification of the points has not changed (which, of course, implies the former). However, this may not be enough to guarantee termination, so the algorithm is usually run with a maximum number of iterations.

Details

Now that we have an overview of what k-means does, let's dive deeper into the implementation details. To start with, the algorithm expects the following inputs (a skeleton of the main loop is sketched right after the list):

  • The number of clusters $k$ into which to partition the dataset
  • The maximum number of iterations to perform
  • A tolerance used to decide whether two points are effectively the same under the distance metric (used in the convergence check)
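
With these inputs in place, one possible skeleton of the main loop is sketched below; initialize, classify and recenter are hypothetical helpers developed in the following subsections, and this is only an illustration of the flow, not a reference implementation.

import numpy as np

def k_means(X, k, max_iterations=100, tolerance=1e-6):
    centroids = initialize(X, k)                        # step 1: initialization
    for _ in range(max_iterations):
        labels = classify(X, centroids)                 # step 2: classification
        new_centroids = recenter(X, labels, centroids)  # step 3: recentering
        # Stop once no centroid has moved by more than the tolerance.
        if np.all(np.linalg.norm(new_centroids - centroids, axis=1) < tolerance):
            centroids = new_centroids
            break
        centroids = new_centroids
    return labels, centroids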

As mentioned in the previous section, the first step is the initialization of the centroids $\boldsymbol{\mu}_j$.

Initialization

There are several alternatives we could use, e.g. randomly drawing each coordinate within the domain's boundaries, or using actual points from the dataset [1]. However, a solution that has several advantages is to randomly perturb $k$ points drawn at random from the dataset [1]; a sketch of this strategy follows the list below. This approach has the following advantages [1]:

  • If each coordinate of a centroid were generated completely at random, we would first have to scan the dataset to find the acceptable range for each coordinate. We avoid this by randomly perturbing $k$ points drawn from the dataset.
  • By drawing points uniformly from the dataset, the centroids will be close to actual data points, and centroids will be drawn with higher probability in areas with a higher density of points.
  • Randomly perturbing the points reduces the risk that all the generated centroids end up concentrated in the denser areas.
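
A minimal sketch of such an initialization could look as follows; the Gaussian noise and its scale are arbitrary illustrative choices, not prescribed by [1], and initialize is the hypothetical helper used in the skeleton above.

import numpy as np

def initialize(X, k, noise_scale=0.1, rng=None):
    rng = np.random.default_rng() if rng is None else rng
    # Draw k distinct points uniformly at random from the dataset.
    idx = rng.choice(len(X), size=k, replace=False)
    centroids = X[idx].astype(float)
    # Perturb them slightly so the centroids do not sit exactly on data
    # points and are less likely to crowd the densest areas.
    return centroids + rng.normal(scale=noise_scale, size=centroids.shape)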

Classification

The next step is to start the iterations and perform classification and recentering. Classification is performed by looping over the dataset and computing the distance between each $\mathbf{x}_i$ and the current centroids $\boldsymbol{\mu}_j$. The data point $\mathbf{x}_i$ is assigned to a cluster based on the following criterion

$$\mathbf{x}_i \in C_{j^*}, \quad j^* = \operatorname{argmin}_{j} \, d(\mathbf{x}_i, \boldsymbol{\mu}_j)$$

i.e. the data point is assigned to the cluster whose centroid is closest to it.
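
A vectorised sketch of this step, assuming the Euclidean distance used in the loss above (classify is the hypothetical helper from the skeleton earlier):

import numpy as np

def classify(X, centroids):
    # Squared Euclidean distances between every point and every centroid,
    # resulting in an array of shape (n_points, k).
    distances = np.sum((X[:, None, :] - centroids[None, :, :]) ** 2, axis=2)
    # Each point is assigned to the cluster with the closest centroid.
    return np.argmin(distances, axis=1)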

Recentering

Once all points have been clustered, the new centroids are computed according to

$$\boldsymbol{\mu}_j = \frac{1}{N_j} \sum_{\mathbf{x}_i \in C_j} \mathbf{x}_i, \quad \forall j$$

that is, the new $\boldsymbol{\mu}_j$ is the center of mass of the $N_j$ points that belong to cluster $C_j$.
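
A matching sketch of the recentering step is given below; if a cluster ends up empty, this version simply keeps its previous centroid, which is one possible choice rather than removing the centroid as discussed earlier.

import numpy as np

def recenter(X, labels, centroids):
    new_centroids = centroids.copy()
    for j in range(len(centroids)):
        members = X[labels == j]
        if len(members) > 0:
            # Center of mass of the points assigned to cluster j.
            new_centroids[j] = members.mean(axis=0)
    return new_centroids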

Issues

When the clusters are approximately spherical and linearly separable, the algorithm does a pretty good job. This is shown in the figure below. Note that the algorithm is able to correctly identify clusters with different densities.


Remark

You may want to go over the example Demonstration of k-means assumptions, which illustrates situations where k-means produces unintuitive and possibly unexpected clusters.


Figure 1. Clustering produced by k-means. Image from [1].

However, the algorithm has several weaknesses, which we discuss below.

The first weakness of the algorithm is that it cannot detect outliers [1]; outlier points are simply assigned to the closest cluster. Recall that, in general, the mean is sensitive to outliers. Since centroids are computed as the centers of mass of the clusters, outliers will pull the centroids away from the best position they could hold. Therefore, we should remove outliers from the dataset before performing k-means clustering.

The second point to notice is that the algorithm can only produce/distinguish spherical clusters [1]. However, in real data sets not all clusters are spherical.

Furthermore, if the clusters in the data set are not linearly separable, it is difficult to approximate them with spherical clusters. As a result, k-means cannot separate non-convex clusters, e.g. clusters shaped as two concentric rings [1]. In these situations the algorithm will have a hard time finding good solutions.

The fourth point to remember is that the algorithm cannot automatically determine the number of clusters present in the data set; instead, it expects this as an input. This means that, unless we have some domain knowledge indicating the number of clusters, we need to run the algorithm multiple times, trying different values for the number of centroids, and compare the results using some kind of metric.

In fact, one way to determine the number of clusters is the so-called elbow method. In this method, we select the optimal number of clusters by fitting the model with a range of values for $k$. If the line chart resembles an arm, then the “elbow” (the point of inflection on the curve) is a good indication that the underlying model fits best at that point [4].

Python example

Let's explore the scikit-learn API to perform some clustering using k-means on the well-known Iris dataset. Admittedly, this will not be terribly exciting. The example is taken from K-Means Elbow Method code for Python.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn import datasets

# Load the Iris dataset and keep only the feature columns.
iris = datasets.load_iris()
df = pd.DataFrame(iris['data'])

# Fit k-means with four clusters and store the predicted labels
# next to the true targets for comparison.
kmeanModel = KMeans(n_clusters=4)
kmeanModel.fit(df)
df['k_means'] = kmeanModel.predict(df)
df['target'] = iris['target']

fig, axes = plt.subplots(1, 2, figsize=(16,8))
axes[0].scatter(df[0], df[1], c=df['target'])
axes[1].scatter(df[0], df[1], c=df['k_means'], cmap=plt.cm.Set1)
axes[0].set_title('Actual', fontsize=18)
axes[1].set_title('K_Means', fontsize=18)
plt.show()

While we are at it, let's also use the elbow method to determine the right number of clusters.

distortions = []
K = range(1,10)
for k in K:
    kmeanModel = KMeans(n_clusters=k)
    kmeanModel.fit(df)
    distortions.append(kmeanModel.inertia_)
plt.figure(figsize=(16,8))
plt.plot(K, distortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Distortion')
plt.title('The Elbow Method showing the optimal k')
plt.show()

We can see that the right number of clusters to use is $k=3$.

Summary

In this section we reviewed the k-means algorithm. K-means is a centroid-based hard-clustering method. It is a good option only for low-to-medium-dimensional datasets (with at most around 20 dimensions) where we know that the clusters can be accurately approximated with hyperspheres, i.e. cluster shapes are roughly spherical, and where the number of clusters can be estimated a priori [1]. It also works well even if the dataset does not have a homogeneous distribution. On the other hand, the algorithm performs poorly on high-dimensional data and when clusters cannot be approximated with hyperspheres [1].

We have not touched upon evaluation metrics in order to assess the performance of the algorithm. We will do so in a future post.

References

  1. Marcello La Rocca, Advanced Algorithms and Data Structures, Manning Publications.
  2. k-means clustering
  3. Giuseppe Bonaccorso, Mastering Machine Learning Algorithms, Packt Publishing.
  4. Elbow method.