This article will discuss:
- Decision Trees – a popular classification algorithm in supervised machine learning
- The basics behind decision trees and how to develop one without a computer
What you are already supposed to know:
- Basics of Statistics
When it comes to classification problems in machine learning, algorithms like Logistic Regression and Discriminant Analysis are usually the first to come to mind. There is another type of classification algorithm, called the Decision Tree algorithm, that is highly intuitive and easy to interpret and understand. Tree-based models can be used for both regression and classification, but we will discuss only the classification case here. A typical decision tree is drawn below to make you familiar with the concept:
The above decision tree shows the chances of finding a TV in a random household. As it can easily be read, the tree suggests that if a person has a monthly income of more than $1000, they will have a TV at home, and otherwise not. The tree can be seen as a visual representation of the model developed from the data below. "Monthly income" is the predictor variable and "TV at home" is the response variable here.
Let's take a naive approach to the above data set. There are 10 data points in total, and under the Monthly income column there are 6 True values and 4 False values. Now, let's analyze only the True values first. Among the 6 True values, 4 have Yes in the corresponding row of the TV at home column while 2 have No. The rows under consideration are shown below:
The above data is said to be impure because there are mixed response values (4 Yes & 2 No) against the same predictor value of True. An ideal data set would have had only Yes against the True values, or only No against them. In such a scenario the decision making would have been quite easy and straightforward. But since the response is mixed, or impure, we won't draw any conclusions here; let's analyze the other value under the Monthly income column:
The above data set is impure too: for the same predictor value of False, the response values are mixed. Since the majority of values against True is Yes and the majority against False is No, the probability of finding Yes against True is higher, and likewise No against False, and hence the decision tree is drawn as shown above. The single-predictor model is as easy as it appears. Things start to get complicated when more than one variable is involved in the model. We will see that soon, but first let's move our focus to a term introduced above called impurity.
Impurity and its measurement
Impurity means how heterogeneous our data is. As already mentioned for the case above, things would have been more straightforward if, for every True value in the predictor column, there was a Yes in the response column, or vice versa. In almost all practical situations this is not the case, and we usually get a mixed data set, so we need a way to measure impurity. There are two commonly used measures of the degree of impurity or heterogeneity of data:
- Gini index
- Entropy
Although both of them measure impurity to a similar extent and result in the development of similar models, we will consider only entropy here. Let's discuss the various parameters in the entropy equation. P_i is the probability of the ith class in the target/response variable, e.g. if the target variable has classes distributed as [0,0,1,0,1,1,0,1,1], then P(0) = 4/9 and P(1) = 5/9. Applying the concept to Table 1, we have P(Yes) = 5/10 and P(No) = 5/10. Moving to the other parameters, notice the use of the base-2 logarithmic function. If you want to know more about the use of base 2 in the equation, you should read this.
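The entropy equation under discussion is the standard one:

E = -\sum_{i=1}^{n} P_i \log_2 P_i

For the [0,0,1,0,1,1,0,1,1] example, this gives E = -(4/9)\log_2(4/9) - (5/9)\log_2(5/9) \approx 0.991.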
Now, having the equation of entropy at hand, let's first calculate the entropy of the overall data (Table 1). Since the Yes values are present 5 times and the No values are present 5 times too, we have:
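Working this out from the counts above:

E_{\text{overall}} = -\frac{5}{10}\log_2\frac{5}{10} - \frac{5}{10}\log_2\frac{5}{10} = 1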
Let's now calculate the entropy of the data under the True split (Table 2). The target variable here is "TV at home", which has two classes, "Yes" and "No". The Yes value is present 4 times and No is present 2 times.
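Plugging these counts into the entropy equation:

E_{\text{True}} = -\frac{4}{6}\log_2\frac{4}{6} - \frac{2}{6}\log_2\frac{2}{6} \approx 0.918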
Suppose the target variable had only one class in it (completely pure); then the entropy would have been:
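With a single class, the probability of that class is 1, so:

E_{\text{pure}} = -1 \cdot \log_2 1 = 0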
This shows that a data set with complete homogeneity has entropy = 0, and it is easy to show that a completely impure data set (each class present in equal numbers) has an entropy of 1.
Likewise, the entropy
of the data under the False split would be:
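Assuming the four False rows contain 1 Yes and 3 No (the split implied by the figures used below):

E_{\text{False}} = -\frac{1}{4}\log_2\frac{1}{4} - \frac{3}{4}\log_2\frac{3}{4} \approx 0.811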
The net entropy of the data after the Income variable is introduced can be calculated by taking the weighted average of the two entropy values under its splits:
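With 6 of the 10 rows under the True split and 4 under the False split:

E_{\text{Income}} = \frac{6}{10} \times 0.918 + \frac{4}{10} \times 0.811 \approx 0.875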
The flow of entropy needs to be clearly understood. Initially it was 1; the True split brings it down to 0.918 and the False split to 0.811, so the net entropy after the Income variable is introduced is 0.875. This is what we expect: a reduction in entropy (impurity or heterogeneity) when a variable is introduced. The total reduction is 1 – 0.875 = 0.125. This number (0.125) is called the information gain and is a very important parameter in decision tree model development.
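In symbols, the information gain of a variable is simply the entropy before the split minus the weighted entropy after it:

IG_{\text{Income}} = E_{\text{overall}} - E_{\text{Income}} = 1 - 0.875 = 0.125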
This is how the degree of impurity is calculated through entropy. A short code sketch of the calculation is given below, after which we will apply the concept to model development.
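Here is a minimal Python sketch of the entropy and information-gain arithmetic above. The function and variable names are my own, and the exact row order of data is assumed; only the counts are taken from the tables.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of class labels, using the base-2 logarithm."""
    total = len(labels)
    return -sum((count / total) * log2(count / total)
                for count in Counter(labels).values())

def information_gain(rows, split_col, target_col):
    """Reduction in entropy of target_col after splitting rows on split_col."""
    total = len(rows)
    before = entropy([r[target_col] for r in rows])
    after = 0.0
    for value in set(r[split_col] for r in rows):
        subset = [r[target_col] for r in rows if r[split_col] == value]
        after += (len(subset) / total) * entropy(subset)
    return before - after

# Table 1: monthly income > $1000 (True/False) vs. TV at home (Yes/No).
data = [
    {"income": True,  "tv": "Yes"}, {"income": True,  "tv": "Yes"},
    {"income": True,  "tv": "Yes"}, {"income": True,  "tv": "Yes"},
    {"income": True,  "tv": "No"},  {"income": True,  "tv": "No"},
    {"income": False, "tv": "Yes"}, {"income": False, "tv": "No"},
    {"income": False, "tv": "No"},  {"income": False, "tv": "No"},
]

print(entropy([r["tv"] for r in data]))        # 1.0
print(information_gain(data, "income", "tv"))  # ~0.125
```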
Model Development
Now, in addition to the above information for guessing TV possession from monthly income, suppose it was lately brought to our notice that monthly income is not the only variable that influences the result: the location of a person, whether they live in an urban or a rural area, matters too. Let the updated data set be as shown below:
The above data set presents a dilemma, given what we know so far: whether to start our decision tree with the income predictor or the location predictor. We will try to resolve this dilemma through entropy calculations for both predictors, but first let's present the location variable against the target variable in segregated form, as below:
Let's calculate the entropy under both the income & location variables separately:
As you can see, the entropy down the monthly income variable is lower than that down the location variable, so we can say that the heterogeneity of the system is reduced more by the monthly income variable than by the location variable, and the information gain is 1 – 0.875 = 0.125. Going by the information gain criterion, our decision tree should again start with the monthly income variable, as shown below:
Although we have found our first variable, we can't just finish up here. We have another variable at our disposal which can improve our model. The next question is where to place it down the tree. There are two possible places for the location variable, and each will receive different data values and hence yield a different information gain. We will place it where the information gain is maximum. Let's calculate and find out.
The data for the True values is shown below:
The value under the monthly income column is the same for all of these rows. Segregating the location column based on its values, we get the tables shown:
The above calculations can be visualized
as shown below:
If we place the Location variable under the False split, the calculations & data tables would be as follows:
Again, the value under the monthly income column is the same for all of these rows. Segregating the location column based on its values, we get the tables shown:
The above calculations can be visualized as shown below:
Since the information gain under the False split is less than that under the True split, the decision tree will look like:
The probability of finding a Yes value for the Urban predictor value is higher than that of a No value, hence the above arrangement.
The above tree suggests that if the income is less than $1000 there would be no TV in the household, and if the income is more than $1000 then we have to check the location of the household: if urban, a TV can be found, and otherwise not.
This is how decision trees are developed. The results from decision trees are often less accurate than those from other classification models like Logistic Regression or Linear Discriminant Analysis, but they are useful when the need to understand or interpret the system matters more than raw predictive performance.
Terminology:
Now that you have understood the concept of developing a decision tree, let's go over the various terms used in decision trees.
Node: A node represents a variable in the decision tree, e.g. income and location are the nodes.
Root Node: The node from which the tree starts, e.g. income in the above decision tree.
Leaf Nodes: The bottom-most nodes, where the value of the target variable is decided, e.g. location is the leaf node in the above tree.
Splitting: The process of dividing a node into two or more sub-nodes is called splitting.
The Python code for whatever we have done so far is shown
below:
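A minimal sketch of such code, assuming scikit-learn's DecisionTreeClassifier with the entropy criterion; the income/TV counts match the tables above, while the urban/rural values used here are illustrative assumptions rather than the exact Table 3 rows:

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Features per row: [income > $1000 (1/0), urban location (1/0)].
X = [
    [1, 1], [1, 1], [1, 1], [1, 0], [1, 0], [1, 0],   # income > $1000
    [0, 1], [0, 1], [0, 0], [0, 0],                   # income <= $1000
]
# Target: TV at home (Yes/No), matching the counts in the article.
y = ["Yes", "Yes", "Yes", "Yes", "No", "No",
     "No", "No", "Yes", "No"]

# Use entropy (information gain) as the splitting criterion, as in the article.
model = DecisionTreeClassifier(criterion="entropy", max_depth=2)
model.fit(X, y)

# Inspect the learned splits.
print(export_text(model, feature_names=["income", "urban"]))

# Predict for a high-income urban household and a low-income rural one.
print(model.predict([[1, 1], [0, 0]]))   # e.g. ['Yes' 'No']
```

With these assumed values, income has the larger information gain and is chosen at the root, and the urban split under the high-income branch separates Yes from No, mirroring the tree developed by hand.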
If you find the above concept interesting, you can read further on the below-mentioned topics:
Thanks,
Have a good time :)