K-means clustering

A while ago, I wrote a C program implementing the K-means clustering algorithm. For my first real blog post, I've decided to go through how it works.

So, what is clustering?

Clustering is a way of classifying data, where the data are grouped together into categories based on how similar they are. For example: if we had a bunch of limes, and a bunch of lemons, we could put them into two groups based on their colour.

Delicious citrus

This gif shows the evolution of K-means clustering for the data set I was working on. The little dots are data points, and their colour tells you what category they have been classified as. We'll discuss what the outlined circles are later.

K-means iterations

But what is K-means?

In K-means, we have K clusters. Data points are put in the cluster whose mean they are closest to. The outlined circles in the gif above show the cluster means. You can see them move with each iteration of the algorithm until it reaches a local optimum.

How do we measure closeness?

I have said that clustering involves grouping data together based on similarity – but how is that measured? This depends on the algorithm. K-means works by iteratively finding the mean of each cluster, and assigning data points to the cluster whose mean they are closest to. We measure this closeness using the squared Euclidean distance.

Euclidean distance

In case you didn't know, the Euclidean distance is just the familiar "ruler" distance between two points.
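
If you'd like to see it in code, here's a minimal C sketch of the squared Euclidean distance, assuming each point is stored as a flat array of doubles (the function name and layout are just for illustration, not necessarily what my program uses):

```c
/* Squared Euclidean distance between two dim-dimensional points.
 * Points are assumed to be flat arrays of doubles (illustrative layout). */
double squared_distance(const double *a, const double *b, int dim)
{
    double sum = 0.0;
    for (int i = 0; i < dim; i++) {
        double diff = a[i] - b[i];
        sum += diff * diff;
    }
    return sum;
}
```

Note that we never take the square root: comparing squared distances gives the same nearest cluster, so we can skip it.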

Ok – so far we know that clustering groups data together based on similarity, and that K-means measures this similarity as the squared Euclidean distance between the data point and its cluster's mean. What we don't know is how the cluster means are calculated.

Expectation-Maximisation

K-means uses the Expectation-Maximisation algorithm, which consists of two steps: the expectation (E) step and the maximisation (M) step. The E step evaluates how well the current parameters fit the data; the M step updates the parameters to maximise (or minimise) some measure of that fit. For K-means, the parameters are the cluster means and the data points' assignments, and we want to minimise the total squared distance between the data points and the means of their assigned clusters.

EM for K-means

Let's call our means $\mu_k, k\in\{0,\dots,K-1\}$, where $k$ is the cluster index. Similarly, let's call our data points $x_n, n \in\{0,\dots,N-1\}$, where $n$ is the data point index. Now, let us introduce a variable that tells us which cluster $x_n$ has been assigned to: $r_{nk} \in \{0,1\}$. If data point $x_n$ is in cluster $k$, then $r_{nk}=1$; otherwise $r_{nk}=0$.

$r_{nk}$

Now we can write what we want to minimise with the EM algorithm:

$$J = \sum_{n=0}^{N-1} \sum_{k=0}^{K-1} r_{nk} \| x_n - \mu_k \|^2$$

This is the sum of the squared Euclidean distances between each data point and its assigned cluster.
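
As a rough sketch of what evaluating $J$ looks like in C (building on the squared_distance helper above, and assuming assign[n] stores the cluster index of point $x_n$, i.e. the $k$ for which $r_{nk}=1$):

```c
/* Sum of squared distances from each point to its assigned cluster mean.
 * x is N*dim doubles, mu is K*dim doubles, and assign[n] is the index
 * of the cluster that point n belongs to (the k with r_nk = 1). */
double objective(const double *x, const double *mu, const int *assign,
                 int N, int dim)
{
    double J = 0.0;
    for (int n = 0; n < N; n++)
        J += squared_distance(&x[n * dim], &mu[assign[n] * dim], dim);
    return J;
}
```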

In the E step, in order to evaluate how well the current means fit, we need to know how the data points would be assigned. So, we find $r_{nk}$ such that each data point is assigned to the cluster with the nearest mean.
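
In code, the E step might look something like this sketch (same assumed layout as before); returning the number of assignments that changed also gives us a convenient convergence check:

```c
/* E step: assign each point to the cluster with the nearest mean.
 * Returns how many assignments changed, which doubles as a
 * convergence test for the main loop. */
int e_step(const double *x, const double *mu, int *assign,
           int N, int K, int dim)
{
    int changed = 0;
    for (int n = 0; n < N; n++) {
        int best = 0;
        double best_d = squared_distance(&x[n * dim], &mu[0], dim);
        for (int k = 1; k < K; k++) {
            double d = squared_distance(&x[n * dim], &mu[k * dim], dim);
            if (d < best_d) {
                best_d = d;
                best = k;
            }
        }
        if (assign[n] != best) {
            assign[n] = best;
            changed++;
        }
    }
    return changed;
}
```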

In the M step, we find the means that minimise $J$:

$$ \frac{\partial J}{\partial \mu_{k}} = -2 \sum_{n=0}^{N-1} r_{nk} (x_n - \mu_k) = 0$$ $$\Rightarrow \mu_k = \frac{\sum_n r_{nk}x_n}{\sum_n r_{nk}}$$
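
And a sketch of the M step: each mean becomes the average of the points currently assigned to it, which is exactly the closed-form solution above (again, the names and layout are illustrative, and a real implementation also has to decide what to do with a cluster that ends up empty):

```c
#include <stdlib.h>

/* M step: set each cluster mean to the average of its assigned points.
 * Empty clusters are simply left where they are in this sketch. */
void m_step(const double *x, double *mu, const int *assign,
            int N, int K, int dim)
{
    double *sum = calloc((size_t)K * dim, sizeof *sum);
    int *count = calloc((size_t)K, sizeof *count);

    for (int n = 0; n < N; n++) {
        count[assign[n]]++;
        for (int d = 0; d < dim; d++)
            sum[assign[n] * dim + d] += x[n * dim + d];
    }
    for (int k = 0; k < K; k++)
        if (count[k] > 0)
            for (int d = 0; d < dim; d++)
                mu[k * dim + d] = sum[k * dim + d] / count[k];

    free(sum);
    free(count);
}
```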

We keep repeating until the parameters no longer change. In the gif above you can see it takes 17 iterations for the dataset I was working on. How long it takes depends a lot on how the cluster means are initialised.
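
Putting the two steps together, the whole algorithm is just a loop like this (a sketch with an iteration cap added for safety; once no assignment changes, the means can't change either, so we've reached a local optimum):

```c
/* Alternate E and M steps until no assignment changes (or max_iter
 * is hit). mu must hold some initial means; assign[] can start as
 * anything (e.g. all -1) -- the first E step overwrites it. */
void kmeans(const double *x, double *mu, int *assign,
            int N, int K, int dim, int max_iter)
{
    for (int iter = 0; iter < max_iter; iter++) {
        int changed = e_step(x, mu, assign, N, K, dim);
        m_step(x, mu, assign, N, K, dim);
        if (changed == 0)
            break;              /* local optimum reached */
    }
}
```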

K-means finds a local optimum – the result isn't necessarily the best possible one. Which optimum you end up with also depends on how the means are initialised. There are several algorithms you can use for this, but I think I've talked enough ;o I hope K-means makes some sense to you now if it didn't before.

Want to know more?

If you want to know more, I highly recommend Chris Bishop's textbook "Pattern Recognition and Machine Learning". You can also view my C code implementing this algorithm on my GitHub.