DBSCAN – Make density-based clusters by hand

Photo by Alice Pasqual on Unsplash

DBSCAN stands for Density-Based Spatial Clustering Application with Noise. It is an unsupervised machine learning algorithm that makes clusters based upon the density of the data points or how close the data is. That said, the points which are outside the dense regions are excluded and treated as noise or outliers. This characteristic of the DBSCAN algorithm makes it a perfect fit for outlier detection and making clusters of arbitrary shape.  The algorithms like K-Means Clustering lack this property and make spherical clusters only and are very sensitive to outliers. By sensitivity, I mean the sphere-shaped clusters made through K-Means can easily get influenced by the introduction of a single outlier as they are included too.
Often the tutorials/articles about DBSCAN are included with heavy datasets and hardcore programs which makes it difficult for readers to absorb the core concept but in this article, we will apply this algorithm to a very small data set to explain it. Some calculations are skipped and the calculated tables are directly presented. It’s recommended to verify all such calculations, which will make the concept more vivid to you. Let’s take a dataset of 13 points as shown and plotted below:

A two-dimensional data is presented for easy visualization and understanding, else DBSCAN can handle multi-dimensional data too. The possible clusters from the data have been marked in the above graph to visualize the clusters that we want. The points (1,5) (4,3) (5,6) in the above graph fall outside the markings and hence should be treated as outliers. The DBSCAN algorithm should actually make clusters and exclude outliers as we did in the graph. Let’s first understand the algorithm and various steps involved in it.
Logic and Steps:
The DBSCAN algorithm takes two input parameters. Radius around each point (eps) and the minimum number of data points that should be around that point within that radius (MinPts). For example, consider the point (1.5,2.5), if we take eps = 0.3, then the circle around the point with radius = 0.3, will contain only one other point inside it (1.2,2.5) as shown below:

Hence for (1.5, 2.5) when eps = 0.3, the number of neighbourhood point(s) is just one. In DBSCAN each point is checked for these two parameters and the decision about the clustering is made as described through the below steps:
  1. Choose a value for eps and MinPts
  2. For a particular data point (x) calculate its distance from every other datapoint.
  3. Find all the neighbourhood points of x which fall inside the circle of radius (eps) or simply whose distance from x is smaller than or equal to eps.
  4. Treat x as visited and If the number of neighbourhood points around x are greater or equal to MinPts then treat x as a core point and if it is not assigned to any cluster, create a new cluster and assign it to that.
  5. If the number of neighbourhood points around x are less than MinPts and it has a core point in its neighbourhood, treat it as a border point.
  6. Include all the density connected points as a single cluster. (What density connected points mean is described later)
  7. Repeat the above steps for every unvisited point in the data set and find out all core, border and outlier points.

Please note that the above steps constitute a recursive process. Upon 1st loop of calculations, a point may not be treated as a border point to any core point but it may be treated so in next loop.
If the number of neighbourhood points around x is greater or equal to MinPts then x is treated as a core point, if the neighbourhood points around x are less than MinPts but is close to a core point then x is treated as a border point. If x is neither core nor border point then x is treated as an outlier. The below graph gives an idea about it. We choose eps = 0.6 and MinPts =4, the point tagged as core point has 4 other points (>= MinPts) in its neighbourhood & the one tagged as border point is in the neighbourhood of a core point but has only one point in its neighbourhood (< MinPts). The outlier point is one which is neither border point nor core point.

Algorithm in action
Let’s now apply the DBSCAN algorithm to the above dataset to find out clusters. We have to choose first the values for eps and MinPts. Let’s choose eps = 0.6 and MinPts = 4. Let’s consider the first data point in the dataset (1,2) & calculate its distance from every other data point in the data set. The Calculated values are shown below:

As evident from the above table, the point (1, 2) has only two other points in its neighbourhood (1, 2.5), (1.2, 2.5) for the assumed value of eps, as its less than MinPts, we can’t declare it as a core point. Let’s repeat the above process for every point in the dataset and find out the neighbourhood of each. The calculations when repeated can be summarized as below:

Observe the above table carefully, the left-most column contains all the points we have in our data set. To the right of them are the data points which are there in their neighbourhood i.e. the points whose distance from them is less or equal to the eps value. There are three points in the data set, (2.8, 4.5) (1.2, 2.5) (1, 2.5) that have 4 neighbourhood points around them, hence they would be called core points and as already mentioned, if the core point is not assigned to any cluster, a new cluster is formed. Hence, (2.8, 4.5) is assigned to a new cluster, Cluster 1 and so is the point (1.2, 2.5), Cluster 2. Also observe that that the core points (1.2, 2.5) and (1, 2.5) share at least one common neighbourhood point (1,2) so, they are assigned to the same cluster. The below table shows the categorization of all the data points into core, border and outlier points. Have a look:
There are three types of points in the dataset as detected by the DBSCAN algorithm, core, border and outliers. Every core point will be assigned to a new cluster, unless some of the core points share neighbourhood points, they will be included into the same cluster. Every border point will be assigned to the cluster based upon the core point in its neighbourhood e.g. the first point (1, 2) is a border point and has a core point (1.2, 2.5) in its neighbourhood, which is included in Cluster 2, hence, the point (1,2) will be included in the Cluster 2 too. The whole categorization can be summarized as below:
Three terms are necessary to understand in order to understand DBSCAN:
1.       Direct density reachable: A point is called direct density reachable, if it has a core point in its neighbourhood. Consider the point (1, 2), it has a core point (1.2, 2.5) in its neighbourhood, hence, it will be a direct density reachable point.
2.       Density Reachable: A point is called density reachable from another point if they are connected through a series of core points. For example, consider the points (1, 3) and (1.5, 2.5), since they are connected through a core point (1.2, 2.5), they are called density reachable from each other.
3.       Density Connected: Two points are called density connected, if there is a core point which is density reachable from both the points.
The algorithm of DBSCAN must be clear as of now. It is beautiful as it excludes the outliers and makes the clusters of arbitrary shape unlike k-means clustering.
For any suggestions regarding the article, you can post your comments below or reach me on LinkedIn.
Thanks,
Have a nice time 😊

Comments