
Regression is probably the first machine learning algorithm that one learns. It is basic, simple, and at the same time a very useful tool that solves many machine learning problems. This article is about ridge regression, a modification of **linear regression** that makes it more suitable for **feature selection**. The whole story is divided into four equally important parts, as listed below:

**Linear Regression:** The basic idea of ordinary least squares in linear regression is explained.

**Feature Selection:** What feature selection in machine learning is, and why it is important, is illustrated.

**Parameter Calculation:** Which parameters are calculated in linear regression, with a graphical representation.

**Ridge Regression:** The final section combines the above concepts to explain ridge regression.

To understand this concept, read all four parts carefully. Now, let's start:

**Linear Regression**

Suppose we are given a two-dimensional data set, shown and plotted below, and we intend to fit a linear model to it. The black dots in the graph are the data points listed in the table on the left. The blue line is the linear model fitted to the data, and the red arrows show the differences between the predictions and the actual values of the dependent variable (Y).

In a linear regression model, a line is fitted to the data in such a way that d1² + d2² + d3² + d4² + d5² is minimized, i.e. the sum of the squares of the residuals (the differences between the actual values and the predicted values) is minimized. In more general form it can be represented as shown below:
This quantity is called the residual sum of squares (RSS). The method of finding the linear model in this way is called the ordinary least squares method. The fitted model has the form given below:

**Bo** in the above equation is the intercept, **B1** is the slope of the model, and **X** and **Y** are the independent and dependent variables respectively.
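As a minimal sketch of ordinary least squares (the article's actual table is in an image, so the numbers below are hypothetical), we can fit **Bo** and **B1** and compute the RSS with numpy:

```python
import numpy as np

# Hypothetical two-dimensional data set (X: independent, Y: dependent).
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Design matrix with a column of ones for the intercept Bo.
A = np.column_stack([np.ones_like(X), X])

# Ordinary least squares: choose (Bo, B1) minimizing the residual sum of squares.
(B0, B1), *_ = np.linalg.lstsq(A, Y, rcond=None)

# Residuals d_i = Y_i - (Bo + B1 * X_i), and their sum of squares (RSS).
residuals = Y - (B0 + B1 * X)
rss = np.sum(residuals ** 2)
print(B0, B1, rss)
```

Any other choice of (Bo, B1) on this data gives a strictly larger RSS; that is exactly what "least squares" means.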

**Feature Selection**

In any machine learning problem, we are given predictors, also called features or independent variables, and based upon the data provided we need to understand the relationship of these variables to the response variable, also called the dependent variable. In simple terms, we need to find the relationship below:

We have **y** as the response variable and **p** predictors (x1, x2, …, xp), and the above equation represents the actual functional relationship between the two. In machine learning we need to find this relationship as accurately as possible. The search for the model brings a number of challenges with it, such as insufficient or missing data, missing predictors, irrelevant predictors, correlation among the predictors, data in the wrong format, etc. All these challenges are mitigated through different methods. The one that we will address in this topic is "irrelevant predictors". I will illustrate this problem with an example.
Consider modelling the relationship between the amount of rainfall and the temperature, humidity, geographical location, and hair colour of the people in an area. Intuition suggests that rainfall can have a relationship with the first three predictors, but there seems to be no logical link between rainfall and the hair colour of the people. Given such a data set, if we try to fit a regression model to it, the model will adjust itself to the hair-colour data too, which is wrong. Ideally, our model-training method should eliminate the unnecessary predictors, or at least give them less weight. Similarly, if there is correlation among predictors (as humidity has with geographical location; coastal areas are more humid, etc.), the model should learn that too. The process of eliminating or reducing the weight of unnecessary predictors is called feature selection. This concept matters here because ridge regression deals with it directly.
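The "irrelevant predictor" problem can be seen directly in code. Below is a hedged sketch with synthetic data: rainfall is generated from temperature and humidity only, yet plain least squares still assigns a non-zero weight to a random stand-in for "hair colour" (all names and numbers are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200

# Synthetic predictors: temperature and humidity drive rainfall;
# "hair colour" (encoded as random numbers) has no real relationship to it.
temperature = rng.normal(25, 5, n)
humidity = rng.normal(60, 10, n)
hair_colour = rng.normal(0, 1, n)
rainfall = 2.0 * temperature + 1.5 * humidity + rng.normal(0, 3, n)

A = np.column_stack([temperature, humidity, hair_colour])
coefs, *_ = np.linalg.lstsq(A, rainfall, rcond=None)

# Least squares fits noise: the hair-colour weight is small but not zero.
print(coefs)
```

A feature-selection-aware method should push that third weight toward zero instead of fitting the noise.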

**Parameter calculation in regression**

Although the least squares method has already been discussed, it's time to understand the concept graphically. Consider the data set given below, which represents the amount of rainfall as a function of temperature and humidity:

If we wish to fit a model to the above data through regression, we need to fit a linear equation of the type **Y = Bo + B1 X1 + B2 X2** to it, or, more simply, we need to calculate **Bo**, **B1** and **B2**. Here **X1** is temperature and **X2** is humidity. Let's forget **Bo** (the intercept) for a while and choose arbitrary values for **B1** and **B2**, say **B1** = 1 and **B2** = 2. These are not the desired values but are chosen at random to illustrate a concept. Also, let's take **Bo** to be zero. Now that we have parameter values, let's predict the rainfall values and calculate the RSS (the sum of squared errors). The table below shows the required calculations:
As is evident from the above table, for **B1** = 1 and **B2** = 2, the RSS value is 3743. Before concluding anything here, let's assume a different set of values for the regression parameters. Let B1 = 3.407 and B2 = 1, and let's keep ignoring **Bo**. If we make the prediction calculations again and compute the RSS, we find that it again comes out to (almost) 3743. The point I want to make is that there are many different values of the regression parameters for which the residual sum of squares is the same. These points, if plotted, generate a plot as shown below:
We initially assumed **Bo** to be zero, but for any value of **Bo** the plots will still be the same, only shifted upwards or downwards depending upon the sign and magnitude of **Bo**.
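The RSS calculation above is easy to reproduce. Here is a minimal sketch with hypothetical numbers (the article's actual rainfall table is in an image and not reproduced, so these values will not give 3743): a small helper evaluates the RSS for any chosen (B1, B2) with Bo fixed at zero.

```python
import numpy as np

# Hypothetical stand-in for the article's rainfall table.
temperature = np.array([20.0, 25.0, 30.0, 22.0, 28.0])      # X1
humidity    = np.array([70.0, 60.0, 80.0, 65.0, 75.0])      # X2
rainfall    = np.array([150.0, 140.0, 190.0, 152.0, 178.0]) # Y

def rss(b1, b2, b0=0.0):
    """Residual sum of squares for the model Y = Bo + B1*X1 + B2*X2."""
    pred = b0 + b1 * temperature + b2 * humidity
    return float(np.sum((rainfall - pred) ** 2))

# Evaluate the RSS at two arbitrary parameter pairs; sweeping (b1, b2)
# over a grid and joining points of equal RSS traces out the contours.
print(rss(1.0, 2.0))
print(rss(3.407, 1.0))
```

On the article's own data, both of these pairs happened to land on (almost) the same RSS contour of 3743; on any data set, each contour is the set of all (B1, B2) pairs sharing one RSS value.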
The above plots are called **cost contour plots of regression**. Each contour, or loop, is plotted between the parameters **B1** and **B2** and represents a constant RSS value. In regression, we aim to find the value represented by the dot at the centre, which is both unique and represents the minimum RSS.

**Ridge Regression**

Ridge regression is a modification of least squares regression that makes it more suitable for feature selection. In ridge regression we minimize not only the sum of the squares of the residuals but also another term: the sum of the squares of the regression parameters, multiplied by a tuning parameter. In other words, in ridge regression we try to minimize the equation below:

The first term in the above equation is the sum of squares of the residuals, and the second term is what ridge regression specially adds. Since this is the term that ridge regression introduces, let's try to understand it further. For a data set with two predictors it is *a* (B1² + B2²), where *a* is the tuning parameter. It is also called the penalty term, as it puts a constraint on the least squares method. In the quest to minimize the overall objective, this term is constrained to a particular value, as depicted by the equation below:
Look at the above equation carefully: it is the equation of a shaded (filled) circle whose squared radius equals *s/a* *(the constraint)*, a part of which is shown below:
Combining the above graph with the cost contour graph results in the graph shown below:

The above graph gives the idea behind ridge regression: the solution lies where the least squares contours meet the parameter constraint, or penalty, region. The radius of the circle representing the constraint depends directly on the tuning parameter (*a*). The larger the value of the tuning parameter, the smaller the circle and the higher the penalty. You can see this directly in the above graph: the larger the tuning parameter, the smaller the circle, the closer to the origin the meeting point of the two graphs, and hence the smaller the values of the regression parameters.

**Conclusion:**

In ridge regression, finding the parameters that correspond to the minimum residual sum of squares is not what is sought. A constraint is put on the parameters to keep them in check and prevent them from growing. This condition makes sure that different parameters are weighted differently, and hence ridge regression becomes an important tool for feature selection. Please note that in ridge regression no predictor's parameter is made exactly zero; only the parameter weights are varied.
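The whole story can be sketched in a few lines. For the penalized objective RSS + *a* (B1² + … + Bp²), a closed-form minimizer is B = (XᵀX + aI)⁻¹Xᵀy (intercept omitted for simplicity, matching the examples above; data below is synthetic, with a third predictor that is irrelevant to y):

```python
import numpy as np

def ridge_fit(X, y, a):
    """Minimize RSS + a * sum(B_j^2): closed form B = (X'X + aI)^-1 X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + a * np.eye(p), X.T @ y)

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
# The third column plays the role of "hair colour": it does not affect y.
y = X[:, 0] * 4.0 + X[:, 1] * 2.0 + rng.normal(0, 1.0, 100)

norms = []
for a in (0.0, 1.0, 100.0, 10000.0):
    b = ridge_fit(X, y, a)          # a = 0 recovers ordinary least squares
    norms.append(float(np.sum(b ** 2)))
    print(a, np.round(b, 4))
```

As *a* grows, all the weights shrink toward the origin, and the irrelevant predictor's weight becomes small, but none of them is ever made exactly zero, which is the behaviour the conclusion describes.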

That is all for ridge regression. Do post your
comments/suggestions. For any query regarding the topic, you can reach me on **LinkedIn**.

Thanks,

Have a nice time 😊
