This article will discuss:

- Decision Trees – a popular classification algorithm in supervised machine learning
- The intuition behind decision trees and how to build one by hand, without a computer

What you are expected to already know:

- Basics of Statistics

When it comes to classification problems in machine learning, algorithms like Logistic Regression and Discriminant Analysis are usually the first to come to mind. There is another type of classification algorithm that is highly intuitive and easy to interpret and understand: the Decision Tree algorithm. Tree-based models can be used for both regression and classification, but we will discuss only the classification case here. A typical decision tree is drawn below to make you familiar with the concept:

The above decision tree shows the chances of finding a TV in a random household. As can easily be read, the tree suggests that if a person has a monthly income of more than $1000 he will have a TV at home, and otherwise not. The tree can be seen as a visual representation of the model developed from the data below (Table 1). **Monthly income** is the predictor variable and **TV at home** is the response variable here.
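An equivalent data set, consistent with the counts used throughout this article, is shown here as Table 1 (the row order is illustrative):

| Monthly income (> $1000) | TV at home |
| --- | --- |
| True | Yes |
| True | Yes |
| True | Yes |
| True | Yes |
| True | No |
| True | No |
| False | Yes |
| False | No |
| False | No |
| False | No |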
Let's take a naive approach to the above data set. There are 10 data points in total, and under the **Monthly income** column there are 6 **True** values and 4 **False** values. Now, let's analyze only the **True** values first. Among the 6 **True** values, 4 have **Yes** in the corresponding row of the **TV at home** column, while 2 have **No**. The rows under consideration are shown below (Table 2):
The above data is said to be impure because there are mixed response values (4 **Yes** & 2 **No**) against the same predictor value of **True**. An ideal data set would have only **Yes** against the **True** values, or only **No** against them. In such a scenario the decision making would have been quite easy and straightforward. But since the response is mixed, or impure, we won't draw any conclusions here; let's analyze the other value under the **Monthly income** column:
The above data set is impure too: for the same predictor value of **False**, the response variable is again mixed. Since the majority of values against **True** is **Yes** and the majority against **False** is **No**, the probability of finding **Yes** against **True** is higher, and likewise **No** against **False**, and hence the decision tree is drawn as shown above. A single-predictor model is as easy as it appears. Things start to get complicated when more than one variable is involved in the model. We will see that soon, but first let's turn to a term introduced above: impurity.

**Impurity and its measurement**

Impurity means how heterogeneous our data is. As already mentioned, things would have been more straightforward in the above case if every **True** value in the predictor column had a **Yes** in the response column, or vice versa. In almost all practical situations this is not the case, and we usually get a mixed data set, so we need a way to measure impurity. There are two commonly used measures of the degree of impurity or heterogeneity of data: entropy and the Gini index.

Although both of them measure impurity comparably and result in the development of similar models, we will only consider entropy here. Let's discuss the various parameters in the entropy equation.
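For reference, the entropy equation is the standard one:

Entropy = − Σᵢ Pᵢ · log₂(Pᵢ)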

*Pᵢ* is the probability of the i-th class in the target/response variable. For example, if the target variable has classes distributed as [0,0,1,0,1,1,0,1,1], then P(0) = 4/9 and P(1) = 5/9. Applying the concept to Table 1, we will have P(Yes) = 5/10 and P(No) = 5/10. Moving to the other parameters, notice the use of the base-2 logarithmic function. If you want to know more about the use of base 2 in the equation, you should read **this**.
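As a quick sketch in Python (assuming the class labels arrive as a plain list), the entropy of a target column can be computed like this:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of class labels, using base-2 logarithms."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

# The distribution from the text: P(0) = 4/9, P(1) = 5/9
print(entropy([0, 0, 1, 0, 1, 1, 0, 1, 1]))   # ~0.991
```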
Now, having the equation of entropy at hand, let's first calculate the entropy of the overall data (Table 1). Since the **Yes** value is present 5 times and the **No** value is present 5 times too, we have:
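Entropy = −(5/10)·log₂(5/10) − (5/10)·log₂(5/10) = 0.5 + 0.5 = 1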
Let's now calculate the entropy of the data below the **True** split (Table 2). The target variable here is **TV at home**, having two classes in it, **Yes** and **No**. The **Yes** value is present 4 times and **No** is present 2 times:
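Entropy = −(4/6)·log₂(4/6) − (2/6)·log₂(2/6) ≈ 0.390 + 0.528 ≈ 0.918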
Suppose the target variable had only one class in it (completely pure); then the entropy would have been:
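Entropy = −(1)·log₂(1) = 0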

This shows that a data set with complete homogeneity has entropy = 0, and it is easy to show that a completely impure data set (each class present in equal numbers) has an entropy of 1.

Likewise, the entropy of the data under the **False** split would be:
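The **False** rows contain 1 **Yes** and 3 **No** (the remainder of the overall 5 **Yes** / 5 **No** counts), so:

Entropy = −(1/4)·log₂(1/4) − (3/4)·log₂(3/4) = 0.5 + 0.311 ≈ 0.811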
The net entropy of the data after the Income variable is introduced can be calculated by taking the weighted average of the two entropy values under its splits:
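E(Income) = (6/10)·0.918 + (4/10)·0.811 ≈ 0.875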

The flow of entropy needs to be clearly understood. Initially it was 1; the **True** split reduced it to 0.918 and the **False** split reduced it to 0.811, so the net entropy after introducing the Income variable is 0.875. This is what we expect: a reduction in entropy (impurity or heterogeneity) by the introduction of variables. The total reduction is 1 − 0.875 = 0.125. This number (0.125) is called **information gain** and is a very important parameter in decision tree model development.
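A short sketch of the same computation in Python, using the counts from Table 1 (the row order of the lists is only illustrative, and `information_gain` is just a helper name for this article):

```python
from collections import Counter
from math import log2

def entropy(labels):
    # entropy() as in the sketch above
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(predictor, target):
    """Entropy of the target minus the weighted entropy of the splits on the predictor."""
    n = len(target)
    net_entropy = 0.0
    for value in set(predictor):
        subset = [t for p, t in zip(predictor, target) if p == value]
        net_entropy += (len(subset) / n) * entropy(subset)
    return entropy(target) - net_entropy

# Table 1: 6 True / 4 False incomes; 4 Yes + 2 No under True, 1 Yes + 3 No under False
income = [True] * 6 + [False] * 4
tv = ["Yes"] * 4 + ["No"] * 2 + ["Yes"] * 1 + ["No"] * 3
print(information_gain(income, tv))   # ~0.125
```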
This is how the degree of impurity is calculated through entropy. Now let's apply this concept to model development.

**Model Development**

Now, in addition to the above information for guessing TV possession through monthly income, suppose it was lately brought to our notice that there is another variable which influences the result: the location of a person, i.e. whether he lives in an urban or a rural area. Let the updated data set be as shown below:
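For concreteness, a hypothetical version of the updated table, consistent with the conclusions drawn later in this article, could look like this:

| Monthly income (> $1000) | Location | TV at home |
| --- | --- | --- |
| True | Urban | Yes |
| True | Urban | Yes |
| True | Urban | Yes |
| True | Rural | Yes |
| True | Rural | No |
| True | Rural | No |
| False | Rural | Yes |
| False | Urban | No |
| False | Urban | No |
| False | Rural | No |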

The above data set presents a dilemma, given what we know so far: whether to start our decision tree with the income predictor or the location predictor. We will try to resolve this dilemma through entropy calculations for both predictors, but first let's present the location variable against the target variable in segregated form, as below:

Let's calculate the entropy under the Income and Location variables separately:

As you can see, the entropy down the Monthly income variable is lower than that down the Location variable, so we can say that the heterogeneity of the system is reduced more by the Monthly income variable than by the Location variable; its information gain is 1 − 0.875 = 0.125. Going by the information gain criterion, our decision tree should again start with the Monthly income variable, as shown below:
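A code sketch of this comparison, reusing the functions and lists defined above. The Location list follows the hypothetical table given earlier; it is illustrative, not the article's original data:

```python
# Hypothetical Location column, aligned row-by-row with the income/tv lists above
location = ["Urban", "Urban", "Urban", "Rural", "Rural", "Rural",
            "Rural", "Urban", "Urban", "Rural"]

gains = {
    "Monthly income": information_gain(income, tv),
    "Location": information_gain(location, tv),
}
print(gains)                                     # income ~0.125, location ~0.029 here
print("Root node:", max(gains, key=gains.get))   # Monthly income
```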

Although we have found our first variable, we can't just finish here. We have another variable at our disposal which can increase the efficiency of our model. The next question is where to place it down the tree. There are two possible places for the Location variable, and each will pass different data values to it and hence yield a different information gain. We will place it where the information gain is maximum. Let's calculate and find out.
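Sketched in code, using the same illustrative lists as above, the comparison looks like this (the detailed tables follow below):

```python
# Which branch benefits more from a further split on Location?
true_rows  = [(loc, t) for inc, loc, t in zip(income, location, tv) if inc]
false_rows = [(loc, t) for inc, loc, t in zip(income, location, tv) if not inc]

for name, rows in [("True branch", true_rows), ("False branch", false_rows)]:
    locs    = [loc for loc, _ in rows]
    targets = [t for _, t in rows]
    print(name, "gain from Location:", round(information_gain(locs, targets), 3))
# With the illustrative data: True branch ~0.459, False branch ~0.311,
# so Location is placed under the True split.
```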

The rows for the **True** values are shown below. The value under the **Monthly income** column is the same for all of them; segregating the Location column based on its values, we get the tables shown:

The above calculations can be visualized
as shown below:

If we place the Location variable under the **False** split instead, the calculations and data tables would be as follows. The value under the **Monthly income** column is again the same for every row; segregating the Location column based on its values, we get the tables shown:

The above calculations can be visualized as shown below:

Since the information gain under the **False** split is less than that under the **True** split, the decision tree will look like this:
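As a text sketch, the resulting tree is:

```
Monthly income > $1000?
├── False → No TV
└── True  → Location?
             ├── Urban → TV at home
             └── Rural → No TV
```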

The probability of finding a **Yes** value for the **Urban** predictor value is higher than that of a **No** value, hence the above arrangement.
The above tree suggests that if the income is less than $1000 there will be no TV in the household, and if the income is more than $1000 then we have to check the location of the household: if urban, a TV can be found, and otherwise not.

This is how decision trees are developed. The results from decision trees are often less accurate than those from other classification models like Logistic Regression or Linear Discriminant Analysis, but they are useful when the need to understand or interpret the system outweighs the need for raw predictive accuracy.

**Terminology:**

Now that you have understood the concept of developing a decision tree, let's go through the various terms used with decision trees.

**Node**: A node represents a variable in the decision tree, e.g. Income and Location are nodes.

**Root Node**: The node from which the tree starts, e.g. Income in the above decision tree.

**Leaf Nodes**: The bottom-most nodes, where the value of the target variable is decided; in the above tree these are the final **Yes**/**No** outcomes below the Location node.

**Splitting**: The process of dividing a node into two or more sub-nodes is called splitting.

The Python code for what we have done so far is shown below:
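Below is a minimal sketch of an equivalent model, assuming scikit-learn and the illustrative data used throughout this article (the column names are made up for readability):

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# Illustrative data set (the Location column is hypothetical, as noted earlier)
data = pd.DataFrame({
    "income_above_1000": [1, 1, 1, 1, 1, 1, 0, 0, 0, 0],
    "urban":             [1, 1, 1, 0, 0, 0, 0, 1, 1, 0],
    "tv_at_home":        ["Yes", "Yes", "Yes", "Yes", "No", "No",
                          "Yes", "No", "No", "No"],
})

X = data[["income_above_1000", "urban"]]
y = data["tv_at_home"]

# criterion="entropy" makes the splits follow the information-gain logic above
model = DecisionTreeClassifier(criterion="entropy", max_depth=2, random_state=0)
model.fit(X, y)

print(export_text(model, feature_names=list(X.columns)))

# High income, urban household -> the model should predict "Yes"
sample = pd.DataFrame({"income_above_1000": [1], "urban": [1]})
print(model.predict(sample))
```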

If you find the above concept interesting, you can further read about the topics mentioned below:

Thanks,

Have a good time :)
