Photo by Alice Pasqual on Unsplash |

DBSCAN stands for Density-Based Spatial Clustering
Application with Noise. It is an unsupervised machine learning algorithm that makes clusters based upon the density of the data points or how close the data
is. That said, the points which are outside the dense regions are excluded and
treated as noise or outliers. This characteristic of the DBSCAN algorithm makes it a perfect fit for outlier detection and making clusters of arbitrary shape. The algorithms like

**K-Means Clustering**lack this property and make spherical clusters only and are very sensitive to outliers. By sensitivity, I mean the sphere-shaped clusters made through K-Means can easily get influenced by the introduction of a single outlier as they are included too.
Often the tutorials/articles about DBSCAN are included with
heavy datasets and hardcore programs which makes it difficult for readers to
absorb the core concept but in this article, we will apply this algorithm to a
very small data set to explain it. Some calculations are skipped and the
calculated tables are directly presented. It’s recommended to verify all such calculations, which will make the concept more vivid to you. Let’s take a dataset of 13 points as shown and plotted below:

A two-dimensional data is presented for easy visualization
and understanding, else DBSCAN can handle multi-dimensional data too. The possible
clusters from the data have been marked in the above graph to visualize the
clusters that we want. The points (1,5) (4,3) (5,6) in the above graph fall
outside the markings and hence should be treated as outliers. The DBSCAN
algorithm should actually make clusters and exclude outliers as we did in the
graph. Let’s first understand the algorithm and various steps involved in it.

**Logic and Steps:**

The DBSCAN algorithm takes two input parameters. Radius
around each point (

*eps*) and the minimum number of data points that should be around that point within that radius (*MinPts*). For example, consider the point (1.5,2.5), if we take*eps*= 0.3, then the circle around the point with radius = 0.3, will contain only one other point inside it (1.2,2.5) as shown below:
Hence for (1.5, 2.5) when

*eps*= 0.3, the number of neighbourhood point(s) is just one. In DBSCAN each point is checked for these two parameters and the decision about the clustering is made as described through the below steps:- Choose a value for
*eps*and*MinPts* - For a particular data point (
**x**) calculate its distance from every other datapoint. - Find all the neighbourhood points of
**x**which fall inside the circle of radius (*eps*) or simply whose distance from**x**is smaller than or equal to*eps.* - Treat
**x**as**visited**and If the number of neighbourhood points around**x**are greater or equal to*MinPts*then treat**x**as a**core point**and if it is not assigned to any cluster, create a new cluster and assign it to that. - If the number of neighbourhood points around
**x**are less than MinPts and it has a core point in its neighbourhood, treat it as a border point. - Include all the
**density connected points**as a single cluster. (What density connected points mean is described later) - Repeat the above steps for every unvisited point in the data set and find out all core, border and outlier points.

Please note that the above steps constitute a recursive process. Upon 1

^{st}loop of calculations, a point may not be treated as a border point to any core point but it may be treated so in next loop.
If the number of neighbourhood points around

**x**is greater or equal to MinPts then**x**is treated as a core point, if the neighbourhood points around**x**are less than MinPts but is close to a core point then**x**is treated as a border point. If**x**is neither core nor border point then**x**is treated as an outlier. The below graph gives an idea about it. We choose eps = 0.6 and MinPts =4, the point tagged as core point has 4 other points (>= MinPts) in its neighbourhood & the one tagged as border point is in the neighbourhood of a core point but has only one point in its neighbourhood (< MinPts). The outlier point is one which is neither border point nor core point.**Algorithm in action**

Let’s now apply the DBSCAN algorithm to the above dataset to
find out clusters. We have to choose first the values for

*eps*and*MinPts*. Let’s choose*eps*= 0.6 and*MinPts*= 4. Let’s consider the first data point in the dataset (1,2) & calculate its distance from every other data point in the data set. The Calculated values are shown below:
As evident from the above table, the point (1, 2) has only two
other points in its neighbourhood (1, 2.5), (1.2, 2.5) for the assumed value of

*eps*, as its less than MinPts, we can’t declare it as a core point. Let’s repeat the above process for every point in the dataset and find out the neighbourhood of each. The calculations when repeated can be summarized as below:
Observe the above table carefully, the left-most column
contains all the points we have in our data set. To the right of them are the
data points which are there in their neighbourhood i.e. the points whose distance
from them is less or equal to the

*eps*value. There are three points in the data set, (2.8, 4.5) (1.2, 2.5) (1, 2.5) that have 4 neighbourhood points around them, hence they would be called core points and as already mentioned, if the core point is not assigned to any cluster, a new cluster is formed. Hence, (2.8, 4.5) is assigned to a new cluster, Cluster 1 and so is the point (1.2, 2.5), Cluster 2. Also observe that that the core points (1.2, 2.5) and (1, 2.5) share at least one common neighbourhood point (1,2) so, they are assigned to the same cluster. The below table shows the categorization of all the data points into core, border and outlier points. Have a look:
There are three types of points in the dataset as detected
by the DBSCAN algorithm, core, border and outliers. Every core point will be
assigned to a new cluster, unless some of the core points share neighbourhood
points, they will be included into the same cluster. Every border point will be
assigned to the cluster based upon the core point in its neighbourhood e.g. the
first point (1, 2) is a border point and has a core point (1.2, 2.5) in its
neighbourhood, which is included in Cluster 2, hence, the point (1,2) will be
included in the Cluster 2 too. The whole categorization can be summarized as
below:

Three terms are necessary to understand in order to
understand DBSCAN:

1.

**Direct density reachable**: A point is called direct density reachable, if it has a core point in its neighbourhood. Consider the point (1, 2), it has a core point (1.2, 2.5) in its neighbourhood, hence, it will be a direct density reachable point.
2.

**Density Reachable**: A point is called density reachable from another point if they are connected through a series of core points. For example, consider the points (1, 3) and (1.5, 2.5), since they are connected through a core point (1.2, 2.5), they are called density reachable from each other.
3.

**Density Connected**: Two points are called density connected, if there is a core point which is density reachable from both the points.
The algorithm of DBSCAN must be clear as of now. It is
beautiful as it excludes the outliers and makes the clusters of arbitrary shape
unlike

**k-means clustering**.
For any suggestions regarding the article, you can post your
comments below or reach me on **LinkedIn**.

Thanks,

Have a nice time 😊

## Comments

## Post a comment