Ridge Regression – A graphical tale of two concepts


Regression is often the first machine learning algorithm one learns. It is simple, yet a very useful tool that solves a lot of machine learning problems. This article is about Ridge Regression, a modification of linear regression that makes it more suitable for feature selection. The story is divided into four equally important parts, as listed below:
  1. Linear Regression: the basic idea of ordinary least squares in linear regression.
  2. Feature Selection: what feature selection is in machine learning and why it is important.
  3. Parameter calculation: which parameters are calculated in linear regression, with a graphical representation.
  4. Ridge Regression: the final section, which combines the above concepts to explain ridge regression.

To understand this concept, read all four parts carefully. Now, let’s start:
Linear Regression
Suppose we are given a two-dimensional data set, shown and plotted below, and we intend to fit a linear model to it. The black dots in the graph show the data points listed in the left table. The blue line is the linear model fitted to the data, and the red arrows show the differences between the predictions and the actual dependent-variable values (Y).

In a linear regression model, a line is fitted to the data in such a way that d1² + d2² + d3² + d4² + d5² is minimized, i.e. the sum of the squared residuals (the differences between the actual and predicted values) is minimized. In general form, this quantity is

RSS = (y1 − ŷ1)² + (y2 − ŷ2)² + … + (yn − ŷn)²

and is called the residual sum of squares (RSS). Finding the linear model in this way is called the ordinary least squares method. The fitted model has the form

Y = B0 + B1·X

where B0 is the intercept, B1 is the slope of the model, and X and Y are the independent and dependent variables respectively.
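As a minimal sketch of ordinary least squares, the snippet below fits a line with NumPy. The data values are made up for illustration, since the article's own table appears only as an image:

```python
import numpy as np

# Hypothetical 2-D data set standing in for the table above
# (the original values are in the article's figure).
X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])

# Ordinary least squares: choose B0 (intercept) and B1 (slope) that
# minimize RSS = sum((Y - (B0 + B1*X))**2).  np.polyfit solves exactly
# this least-squares problem for a degree-1 polynomial.
B1, B0 = np.polyfit(X, Y, deg=1)

predictions = B0 + B1 * X
residuals = Y - predictions
rss = np.sum(residuals ** 2)

print(f"intercept B0 = {B0:.3f}, slope B1 = {B1:.3f}, RSS = {rss:.3f}")
```

Any other choice of B0 and B1 on this data would give a larger RSS than the fitted pair.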
Feature Selection
In any machine learning problem, we are given predictors (also called features or independent variables), and based on the data provided we need to understand the relationship of these variables to the response variable (also called the dependent variable). In simple terms, we need to find the relationship

y = f(x1, x2, …, xp) + e

where e is an irreducible error term. We have y as the response variable and p predictors (x1, x2, …, xp), and the above equation represents the actual functional relationship between the two. In machine learning we need to find this relationship as accurately as possible. The search for the model brings a number of challenges with it, like insufficient or missing data, missing predictors, irrelevant predictors, correlation among the predictors, wrong data formats, etc. All these challenges are mitigated through different methods. The challenge we will address in this topic is “irrelevant predictors”. I will illustrate this problem with an example.
Consider modelling the relationship between the amount of rainfall and temperature, humidity, geographical location, and the hair colour of people in the area. Intuition suggests that rainfall may be related to the first three predictors, but there is no logical link between rainfall and hair colour. Given such a data set, if we fit a regression model to it, the model will adjust itself to the hair colour data too, which is wrong. Ideally, our model training method should eliminate the unnecessary predictors, or at least give them less weight. Similarly, if there is correlation among predictors (as humidity has with geographical location: coastal areas are more humid, etc.), the model should learn that too. The process of eliminating or down-weighting unnecessary predictors is called feature selection. This concept matters here because ridge regression deals with it directly.
Parameter calculation in regression
Although the least squares method in regression has already been discussed, it’s time to understand the concept graphically. Consider the data set given below, which records rainfall amount along with temperature and humidity:
If we wish to fit a model to the above data through regression, we need to fit a linear equation of the form Y = B0 + B1·X1 + B2·X2, or more simply, we need to calculate B0, B1 and B2. Here X1 is temperature and X2 is humidity. Let’s forget B0 (the intercept) for a while and choose arbitrary values for B1 and B2, say B1 = 1 and B2 = 2. These are not the desired values but randomly chosen to illustrate a concept. Also, let’s set B0 to zero. Now that we have parameter values, let’s predict the rainfall values and calculate the RSS (sum of squared errors). The table below shows the required calculations:
As the table shows, for B1 = 1 and B2 = 2 the RSS value is 3743. Before concluding anything, let’s try a different set of parameter values: B1 = 3.407 and B2 = 1, still ignoring B0. If we repeat the prediction calculations and compute the RSS again, it comes out to be (almost) 3743 once more. The point is that there are many different parameter values for which the residual sum of squares is the same. Plotted, these points generate a graph like the one shown below:
We initially assumed B0 to be zero, but for any value of B0 the plots remain the same, only shifted up or down depending on the sign and magnitude of B0.
The above plots are called cost contour plots of regression. Each contour, or loop, is plotted over the parameters B1 and B2 and represents a constant RSS value. In regression, we aim to find the point represented by the dot at the centre, which is unique and represents the minimum RSS.
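To make the contour idea concrete, the sketch below evaluates the RSS for a few candidate (B1, B2) pairs on made-up rainfall data (the article's table and its RSS value of 3743 come from a figure, so these numbers are illustrative only). Every contour line in the cost plot connects pairs with equal RSS:

```python
import numpy as np

# Made-up stand-ins for the article's rainfall table:
# X1 = temperature, X2 = humidity, Y = rainfall amount.
X1 = np.array([30.0, 25.0, 20.0, 35.0, 28.0])
X2 = np.array([60.0, 80.0, 75.0, 50.0, 70.0])
Y = np.array([110.0, 180.0, 160.0, 95.0, 150.0])

def rss(b1, b2):
    """RSS of the no-intercept model Y_hat = b1*X1 + b2*X2."""
    predictions = b1 * X1 + b2 * X2
    return float(np.sum((Y - predictions) ** 2))

# Evaluate the cost at a few arbitrary parameter pairs; a contour
# line in the cost plot is the set of pairs sharing one RSS value.
for b1, b2 in [(1.0, 2.0), (3.407, 1.0), (0.0, 2.0)]:
    print(f"RSS at (B1={b1}, B2={b2}) = {rss(b1, b2):.1f}")
```

Sweeping `rss` over a grid of (b1, b2) values and joining points of equal cost is exactly how the contour plot is drawn.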
Ridge Regression
Ridge regression is a modification of least squares regression to make it more suitable for feature selection. In ridge regression we minimize not only the sum of squared residuals but also an additional term: the sum of squared regression parameters multiplied by a tuning parameter. In other words, ridge regression minimizes

RSS + a·(B1² + B2² + … + Bp²)

The first term is the sum of squared residuals, and the second term is what ridge regression adds. Since this term is special to ridge regression, let’s examine it further. For a data set with two predictors it is a·(B1² + B2²), where a is the tuning parameter. It is also called the penalty term, as it puts a constraint on the least squares method of regression. Minimizing the penalized objective is equivalent to constraining the parameters to a particular region, depicted by the equation below:

B1² + B2² ≤ s
Look at the above equation carefully: it is the equation of a shaded disk with squared radius equal to s (the constraint), part of which is shown below:
Combining the above graph with the cost contour graph gives the picture shown below:
The above graph conveys the idea of ridge regression: the solution lies where the least squares contours meet the parameter constraint, or penalty, region. The radius of the constraint circle depends on the tuning parameter (a): the larger the tuning parameter, the smaller the circle and the stronger the penalty. You can see this directly in the graph. The smaller the circle, the closer to the origin the meeting point of the two graphs will be, and hence the smaller the values of the regression parameters.
In ridge regression, finding the parameters corresponding to the minimum residual sum of squares is not what is sought. A constraint is put on the parameters to keep them in check and stop them from growing. This condition ensures that different parameters are weighted differently, which makes ridge regression an important tool for feature selection. Note that in ridge regression no predictor’s parameter is made exactly zero; the parameter weights are only shrunk.
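The shrinkage effect can be sketched in a few lines using the closed-form ridge solution B = (XᵀX + a·I)⁻¹·Xᵀy. The data below are made up (again, the article's table is in a figure) and centred so that the intercept B0 can be ignored, as in the contour discussion above:

```python
import numpy as np

# Made-up rainfall data: columns are temperature (X1) and humidity (X2).
X = np.array([[30.0, 60.0],
              [25.0, 80.0],
              [20.0, 75.0],
              [35.0, 50.0],
              [28.0, 70.0]])
y = np.array([110.0, 180.0, 160.0, 95.0, 150.0])

# Centre predictors and response so the intercept drops out.
X = X - X.mean(axis=0)
y = y - y.mean()

def ridge(a):
    """Closed-form ridge coefficients: (X^T X + a*I)^-1 X^T y.

    a = 0 recovers ordinary least squares; larger a shrinks B1, B2.
    """
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + a * np.eye(n_features), X.T @ y)

for a in (0.0, 1.0, 10.0, 100.0):
    print(f"a = {a:6.1f} -> B = {ridge(a)}")
```

Running the loop shows the parameter vector moving towards the origin as a grows, which is exactly the constraint circle tightening in the graph above; no coefficient reaches exactly zero.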
That is all for ridge regression. Do post your comments/suggestions. For any query regarding the topic, you can reach me on LinkedIn.
Have a nice time 😊