K-Means Clustering – One rule to group them all

Photo by Ellen Qin on Unsplash



This article will try to:
  • Explain graphically the basics of K-Means Clustering, an unsupervised machine learning algorithm
  • Discuss the method of finding an optimum value of K and centroid location of the clusters

What you are already supposed to know:
  • Basic mathematics and Euclidian geometry

In machine learning, one of the frequently encountered problems is grouping similar data together. You know the income level of people and now you want to group people with similar income levels together. You want to know who are the people with low-income power, the people with high or very high-income power, which you think can be helpful in devising a perfect marketing strategy. You have the shopping data of customers with you and now you want to group customers with similar shopping preferences together or you being the bio student want to know which cells share similar properties based upon the data about the cells you have at your hand.
All the above-stated problems come under the domain of an unsupervised machine learning method called clustering. Although there are a number of clustering algorithms there, when it comes to the simplest one, the award will go to K-Means clustering. To understand the algorithm let’s suppose we have the following data with us:
We have ten data points with us with their x and y coordinates given. Here we have variables represented as x and y but in practical situations, they can be different like monthly income or daily spent in market segmentation case. Let’s start with two clusters and first visualize the above data to get an idea about it:

The above graph shows the data we have at our hand (Graph 1) and the two possible clusters that can we make out of it (Graph 2). Note that, the decision of finding the number of clusters (2 in the present case) in any data is totally arbitrary & based upon a random guess. Later you will find out how this random guess at first, lead us to the optimum number of clusters possible in data. The question now is how to calculate and analytically find out the two clusters.
The logic behind the method
  1. First, assume two random points anywhere near the data & consider them as centre of two clusters (centroids)
  2. Find the distance of every data point from both the centers
  3. Assign every data point to the centroid to which it is nearest, hence make two clusters
  4. Calculate the center of both the formed clusters and shift the centroids there.
  5. Go to step 1 and repeat the process until there is no change in the clusters formed.

Let’s apply the above logic to the given data. We know in cartesian coordinate system the distance between two points (x1, y1) and (x2, y2) is given by:
Using the above formula, we can calculate the distance between each point from assumed centers. Let’s assume our centroids at C1 = (4,1) and C2 = (6,1), graphically speaking, as shown in the below graph:

If we calculate the distance of each point from two centroids the result will be as shown in the below table:
Table -1

Based upon the above distance values, each point will be assigned to the centroid to which its distance is minimum or to which it is nearer e.g. consider the first data point, its distance form C1 is 3.6 and from C2 is 5.4. As it is nearer to C1, it will be assigned to this particular centroid. Doing same to each and every point, the assignment will be as shown below:
Table-2


As you can see from the above table, each point, based upon its distance from the assumed centroids has been assigned to a centroid. After assignment, the new position of the centroids is calculated by calculating the centre of the points assigned to each cluster by calculating mean distance of each coordinate as shown in the table. Hence the new position of centroids will be C1 = (2.6, 3.8) and C2 = (7.6, 3.4) as shown in the graph below:

The above graph shows how the single loop of calculation has bought the two centroids closer to data. If you run another loop of calculations, you can see that the centroids do not move any further and there is no data point which changes its centroid from C1 to C2 or vice versa. This is where the calculation loop or recursion is stopped.
The above process of assigning data to centroids to make clusters out of it is called K-Means clustering and the value of K is the number of clusters formed e.g. the value of K in current case will be 2.
The final two clusters are as shown below:
How to decide the value of K?
We started with K=2 with random coordinates, i.e. we assigned random values to both, number of centroids and position of centroids. Although, we finally found out the way to calculate the final position of centroids the question that still remains is that what will be the value of K which would fit the given data in most optimum manner. We would try to find the answer to this question too but let’s first understand what cost function is.
Cost Function
It is a function which gives the measure of imperfection of the model with respect to every value of K. Ideally speaking, each point in a cluster should be as close to its centroid as possible or the distance between each point in a particular cluster from its centroid should be zero. We will calculate this distance & consider the sum of squares of the distances as the cost or measure of imperfection. We will repeat the same for each cluster & will find the value of K which will minimize this cost.
Refer to Table-1 and Table-2, we have each point sorted in a cluster and distance of the points from their respective centroids which can be summarized as below:

 Each of the distance is squared & added together, the total sum for both the clusters is 15.72 + 10.76 = 26.84. This figure is the cost of the model when we set K = 2. Likewise, the cost can be calculated for different values of K and when the cost values are plotted against K values, we will get a graph as shown below:
The optimum value of K, as per the above graph would be the one where it shows maximum deflection (marked in red) or where the cost curve makes an elbow. Although the higher values of K reduce the cost further but they lead to over-fitting. This method of finding out the value of optimum K is known as elbow method.
The Python code for whatever we have done so far is shown below:

This is all about K-Means clustering & how Optimum value of K and position of centroids is found out. Hope you like it. Do post your comments
Further reading:

Thanks,
Have a good time 😊
If you have any queries about the article, you can reach me on LinkedIn 

Comments