Photo by Ellen Qin on Unsplash

**This article will try to:**

- Explain graphically the basics of K-Means Clustering, an unsupervised machine learning algorithm
- Discuss the method of finding an optimum value of **K** and the centroid locations of the clusters

**What you are already supposed to know:**

- Basic mathematics and Euclidean geometry

In machine learning, one of the frequently encountered problems is grouping similar data together. Suppose you know the income levels of a group of people and want to group those with similar incomes together: you want to know who the low-income and high-income people are, which can be helpful in devising an effective marketing strategy. Or you have the shopping data of your customers and want to group customers with similar shopping preferences together. Or, being a biology student, you want to know which cells share similar properties based upon the data about the cells you have at hand.

All the above-stated problems come under the domain of an unsupervised machine learning method called clustering. Although there are a number of clustering algorithms out there, when it comes to the simplest one, the award goes to K-Means clustering. To understand the algorithm, let's suppose we have the following data with us:

We have ten data points, with their **x** and **y** coordinates given. Here the variables are represented as **x** and **y**, but in practical situations they can be different, like **monthly income** or **daily spend** in a market-segmentation case. Let's start with two clusters, and first visualize the above data to get an idea about it:
The above graph shows the data we have at hand (Graph 1) and the two possible clusters we can make out of it (Graph 2). Note that the decision on the number of clusters (2 in the present case) is, at this point, totally arbitrary and based upon a guess. Later you will find out how this initial guess leads us to the optimum number of clusters for the data. The question now is how to find the two clusters analytically.

**The logic behind the method**

- First, assume two random points anywhere near the data and consider them the centres of two clusters (centroids)
- Find the distance of every data point from both centroids
- Assign every data point to the centroid it is nearest to, thus forming two clusters
- Calculate the centre of each formed cluster and shift the centroids there
- Go to step 2 and repeat the process until there is no change in the clusters formed
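The steps above can be sketched in Python. This is a minimal sketch, not the article's own code; the points and initial centroids passed in are whatever (hypothetical) 2-D data you have:

```python
import math

def kmeans(points, centroids, max_iters=100):
    """Repeat the assignment and centroid-update steps until clusters stop changing."""
    clusters = [[] for _ in centroids]
    for _ in range(max_iters):
        # Steps 2-3: assign every point to its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            dists = [math.dist(p, c) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # Step 4: move each centroid to the mean of its cluster
        new_centroids = [
            (sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl))
            if cl else c
            for cl, c in zip(clusters, centroids)
        ]
        # Step 5: stop when nothing moved
        if new_centroids == centroids:
            break
        centroids = new_centroids
    return centroids, clusters
```

Here `points` is a list of `(x, y)` tuples and the initial `centroids` are the two random guesses from step 1.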

Let’s apply the above logic to the given data. We know that in the Cartesian coordinate system the distance between two points (x1, y1) and (x2, y2) is given by:

d = √((x2 − x1)² + (y2 − y1)²)
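As a quick sketch, this distance can be computed directly from the formula. The point (1, 3) below is chosen to be consistent with the first row of Table-1:

```python
import math

def distance(p1, p2):
    # Euclidean distance between two points (x1, y1) and (x2, y2)
    return math.sqrt((p2[0] - p1[0]) ** 2 + (p2[1] - p1[1]) ** 2)

point = (1, 3)           # a data point, consistent with Table-1
c1, c2 = (4, 1), (6, 1)  # the assumed centroids
print(round(distance(point, c1), 1))  # 3.6
print(round(distance(point, c2), 1))  # 5.4
```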

Using the above formula, we can calculate the distance between each point and the assumed centroids. Let’s assume our centroids are at C1 = (4, 1) and C2 = (6, 1), as shown in the graph below:

If we calculate the distance of each point from two
centroids the result will be as shown in the below table:

Table-1

Based upon the above distance values, each point is assigned to the centroid to which its distance is minimum, i.e. the one it is nearer to. For example, consider the first data point: its distance from **C1** is 3.6 and from **C2** is 5.4. As it is nearer to C1, it is assigned to that centroid. Doing the same for each and every point, the assignment is as shown below:

Table-2

As you can see from the above table, each point has been assigned to a centroid based upon its distance from the assumed centroids. After assignment, the new position of each centroid is calculated by taking the centre of the points assigned to its cluster, i.e. the mean of each coordinate, as shown in the table. Hence the new positions of the centroids are C1 = (2.6, 3.8) and C2 = (7.6, 3.4), as shown in the graph below:

The above graph shows how a single loop of calculation has brought the two centroids closer to the data. If you run another loop of calculations, you will see that the centroids do not move any further and no data point changes its centroid from C1 to C2 or vice versa. This is where the calculation loop, or recursion, stops.
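The centroid update in step 4 is just the coordinate-wise mean of the points assigned to a cluster; a minimal sketch, with hypothetical cluster members:

```python
def mean_centroid(cluster):
    """New centroid = mean x and mean y of the points assigned to the cluster."""
    xs = [x for x, _ in cluster]
    ys = [y for _, y in cluster]
    return (sum(xs) / len(xs), sum(ys) / len(ys))

# hypothetical cluster of three points
print(mean_centroid([(1, 3), (2, 4), (3, 5)]))  # (2.0, 4.0)
```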

The above process of assigning data to centroids to make clusters is called K-Means clustering, and **K** is the number of clusters formed; e.g. the value of **K** in the current case is 2.
The final two clusters are as shown below:

**How to decide the value of K?**

We started with K = 2 and random coordinates, i.e. we assigned random values to both the number of centroids and the positions of the centroids. Although we finally found a way to calculate the final positions of the centroids, the question that remains is: what value of **K** fits the given data in the most optimum manner? We will try to find the answer to this question too, but let’s first understand what a cost function is.

**Cost Function**

It is a function which gives a measure of the imperfection of the model for every value of **K**. Ideally speaking, each point in a cluster should be as close to its centroid as possible, i.e. the distance between each point in a cluster and its centroid should be zero. We will calculate this distance and take the sum of squares of the distances as the cost, or measure of imperfection. We will repeat the same for each cluster, and will find the value of **K** which minimizes this cost.
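This cost, often called the within-cluster sum of squares, can be sketched as follows; the clusters and centroids here are hypothetical:

```python
import math

def cost(clusters, centroids):
    """Sum of squared point-to-centroid distances over all clusters."""
    total = 0.0
    for cluster, centroid in zip(clusters, centroids):
        for p in cluster:
            total += math.dist(p, centroid) ** 2
    return total

# hypothetical two-cluster example: every point is distance 1 from its centroid
clusters = [[(0, 0), (0, 2)], [(10, 0), (10, 2)]]
centroids = [(0, 1), (10, 1)]
print(cost(clusters, centroids))  # 4.0
```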
Referring to Table-1 and Table-2, we have each point sorted into a cluster, and the distance of each point from its respective centroid, which can be summarized as below. The same cost can be calculated for different values of **K**, and when the cost values are plotted against the **K** values, we get a graph as shown below:

The optimum value of **K**, as per the above graph, would be the one where the curve shows maximum deflection (marked in red), i.e. where the cost curve makes an elbow. Although higher values of **K** reduce the cost further, they lead to over-fitting. This method of finding the optimum value of **K** is known as the elbow method. The Python code for whatever we have done so far is shown below:
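The original code embed is not reproduced here; the following is a minimal self-contained sketch under that assumption. It runs plain K-Means for several values of K on hypothetical data (the article's original ten points are not reproduced) and prints the cost, from which the elbow can be read off:

```python
import math
import random

def kmeans(points, k, seed=0, max_iters=100):
    """Plain K-Means: random initial centroids from the data, then assign/update until stable."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(max_iters):
        # assign every point to its nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            dists = [math.dist(p, c) for c in centroids]
            clusters[dists.index(min(dists))].append(p)
        # move each centroid to the mean of its cluster
        new_centroids = [
            (sum(x for x, _ in cl) / len(cl), sum(y for _, y in cl) / len(cl))
            if cl else c
            for cl, c in zip(clusters, centroids)
        ]
        if new_centroids == centroids:
            break
        centroids = new_centroids
    # cost = sum of squared point-to-centroid distances
    total_cost = sum(math.dist(p, c) ** 2
                     for cl, c in zip(clusters, centroids) for p in cl)
    return centroids, total_cost

# hypothetical 2-D data; replace with your own points
data = [(1, 3), (2, 4), (2, 3), (3, 5), (4, 4),
        (7, 2), (8, 3), (8, 4), (9, 2), (9, 3)]

for k in range(1, 5):
    _, c = kmeans(data, k)
    print(f"K={k}: cost={c:.2f}")
```

Plotting these costs against K and looking for the bend reproduces the elbow curve described above.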

This is all about K-Means clustering and how the optimum value of **K** and the positions of the centroids are found. Hope you liked it. Do post your comments.

**Further reading:**

Thanks,

Have a good time 😊

If you have any queries about the article, you
can reach me on LinkedIn
