### Logistic Regression – Idea and Application

• Discuss the idea behind logistic regression
• Explain it further through an example

What you are already expected to know: the basics of linear regression, since logistic regression builds directly on it.

Idea:
Idea:
You are given the problem of predicting a person's gender from their height. To start with, you are provided the data of 10 people whose height and gender are known. You are asked to fit a mathematical model to this data that will let you predict the gender of some other person whose height is known but whose gender is not. This type of problem falls under the classification domain of supervised machine learning. If the problem demands classifying into categories like True/False, or rich/middle class/poor, or Failure/Success, then you are dealing with a classification problem. Its counterpart in machine learning is a regression problem, where we are asked to predict a continuous value like marks = 33.4%, weight = 60 kg, etc. This article will discuss a classification algorithm called Logistic Regression.
Although there are many classification algorithms out there that vary in complexity, like Linear Discriminant Analysis, Decision Trees, Random Forests, etc., Logistic Regression is the most basic one and is perfect for learning about classification models. Let's jump to the above-stated problem and suppose we are given the data of ten people as shown below (1 represents the female class and 0 the male class, an encoding we will reuse in the code later):

| Height (cm) | 132 | 134 | 133 | 139 | 145 | 144 | 165 | 160 | 155 | 140 |
|---|---|---|---|---|---|---|---|---|---|---|
| Gender | 1 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |

This problem is wholly different from other mathematical prediction problems: on one hand we have continuous values of height, but on the other hand we have categorical values of gender. Our mathematical operations know how to deal with numbers, but dealing with categorical values poses a challenge. To overcome this challenge in classification problems, whether they are solved through logistic regression or some other algorithm, we always calculate the probability associated with a class. In the given context, we will calculate the probability associated with the male class or the female class. The probability of the other class need not be calculated explicitly; it can be obtained by subtracting the calculated probability from one.
Logistic Regression
In the given data set, height is the independent variable and gender the dependent variable. For the time being, if we assume it to be a regression problem, it would be solved by calculating the parameters of the regression model given below:

gender = Bo + B1*height
In short, we would have calculated Bo and B1, and the problem would be solved. The classification problem cannot be solved in this manner. As stated, we calculate not the value of gender itself but the probability associated with a particular gender class. In logistic regression we take inspiration from linear regression and use the linear model above to calculate that probability. We just need a function that takes the linear model as input and gives a probability value as output. In mathematical form, we should have something like this:

P(gender=male) = f(Bo + B1*height)
The above model calculates the probability of the male class, but we could use either of the two classes here. The function f on the right side of the equation must accept any real-number input but produce output only in the range of 0 to 1, the reason being obvious: probabilities live in that range. This condition is satisfied by a function called the Sigmoid, or logistic, function, shown below:

sigmoid(x) = 1 / (1 + e^(-x))
The Sigmoid function has a domain of -inf to inf and a range of 0 to 1, which makes it perfect for probability calculation in logistic regression. If we plug the linear model into the Sigmoid function, we get:

P(gender=male) = 1 / (1 + e^-(Bo + B1*height))
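Since the whole construction hinges on the Sigmoid function, here is a minimal sketch of it in code (the function name is mine, not from any library):

```python
import numpy as np

def sigmoid(x):
    """Logistic (Sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

# The output always stays between 0 and 1, which is exactly
# what a probability needs.
print(sigmoid(0))    # midpoint: 0.5
print(sigmoid(10))   # large positive input -> close to 1
print(sigmoid(-10))  # large negative input -> close to 0
```

Whatever real number the linear model Bo + B1*height produces, the Sigmoid squashes it into a valid probability.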

The above equation can be easily rearranged into a simpler, more easily understandable form:

ln(P / (1 - P)) = Bo + B1*height

The right-hand side of the equation is exactly what we have in the linear regression model, and the left-hand side is the log of the odds P/(1 - P), also called the logit. So the above equation can also be written as:
logit(gender=male) = Bo + B1*height
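The logit is the exact inverse of the Sigmoid, which is why the two forms of the equation are interchangeable. A quick numeric round-trip check (the function names and the sample value 1.7 are mine, purely illustrative):

```python
import numpy as np

def sigmoid(x):
    """Logistic function: real number -> probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def logit(p):
    """Log of the odds p / (1 - p); the inverse of the sigmoid."""
    return np.log(p / (1.0 - p))

# Pick an arbitrary linear-model output and round-trip it:
z = 1.7            # stands in for Bo + B1*height
p = sigmoid(z)     # probability in (0, 1)
print(np.isclose(logit(p), z))  # True -- logit undoes the sigmoid
```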
This is the idea behind the logistic regression. Now let’s solve the problem given to us to see its application.
Application

We will use Python code to train our model on the given data. Let's first import the necessary modules: we need numpy and the LogisticRegression class from sklearn.

import numpy as np
from sklearn.linear_model import LogisticRegression

Now that the modules are imported, we need to create an instance of the LogisticRegression class:

lg = LogisticRegression(solver = 'lbfgs')

The solver used is lbfgs. It’s now time to create the data set that we will use to train the model.

height = np.array([[132,134,133,139,145,144,165,160,155,140]])
gender = np.array([1,0,1,1,0,0,0,0,0,1])

Note that our mathematical model needs numerical values, so here we represent the female class with 1 and the male class with 0. Using the above data, let's train the model:

lg.fit(height.reshape(-1,1),gender.ravel())
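As a quick, optional sanity check that is not part of the original walkthrough, we can repeat the whole fit in one self-contained snippet and score the model on its own training data (max_iter is a real LogisticRegression parameter, raised here as a precaution because the height feature is unscaled):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Same data as above: 1 = female, 0 = male.
height = np.array([132, 134, 133, 139, 145, 144, 165, 160, 155, 140])
gender = np.array([1, 0, 1, 1, 0, 0, 0, 0, 0, 1])

# Unscaled features can make lbfgs need more iterations to converge,
# so the iteration budget is raised from its default.
lg = LogisticRegression(solver='lbfgs', max_iter=10000)
lg.fit(height.reshape(-1, 1), gender)

# Mean accuracy on the training data itself -- only a sanity check,
# not an estimate of real-world performance.
acc = lg.score(height.reshape(-1, 1), gender)
print(acc)
```

A single-feature linear boundary cannot separate these ten points perfectly (a 134 cm male sits among shorter females), so the training accuracy will be high but below 1.0.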

Once training finishes, the fit call returns the fitted estimator. So now we have a trained model; let's check its parameters, the intercept (Bo) and the slope (B1):
lg.coef_
lg.intercept_

Running the above lines shows an intercept value of 35.212 and a slope value of -0.252 (your exact values may differ slightly). Since we encoded the female class as 1, sklearn models the probability of that class, so our trained model can be written as:

logit(gender=female) = 35.212 - 0.252*height
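Plugging the reported coefficients back into the Sigmoid lets us read probabilities straight off the fitted model. A sketch, assuming the coefficient values printed above (the helper name p_female is mine, and it gives the probability of the class encoded as 1, i.e. female):

```python
import numpy as np

# Coefficients as reported in the text; an actual refit may differ slightly.
b0, b1 = 35.212, -0.252

def p_female(height_cm):
    """Probability of the class encoded as 1 (female) under the fitted model."""
    return 1.0 / (1.0 + np.exp(-(b0 + b1 * height_cm)))

print(round(p_female(132), 3))  # short person -> high probability of female
print(round(p_female(165), 3))  # tall person  -> probability near zero
print(round(p_female(140), 3))  # near the boundary -> close to 0.5
```

Notice how the negative slope makes the female probability fall as height grows, matching the pattern in the training data.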

We can use the above equation to predict the gender of any person given their height, or we can directly use the trained model, as shown below, to find the gender value of a person with height = 140 cm:

lg.predict(np.array([[140]]))

Give the above lines of code a try and you will get the idea. Note that under the hood the model actually gives us a probability value for the class encoded as 1 (female, in our encoding), and it is up to us to decide the threshold on that probability. The default is 0.5, i.e. all heights whose female-class probability is above 0.5 are classified as female, and those below 0.5 as male. Also, the separation boundary in logistic regression is linear, which can easily be confirmed graphically.
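The linear boundary can also be found with simple arithmetic: the predicted probability is exactly 0.5 where the logit Bo + B1*height crosses zero, which for one feature is a single cutoff height. A sketch using the coefficients reported above (your refit may land slightly elsewhere):

```python
# Decision boundary: solve Bo + B1*height = 0 for height.
# At that height the predicted probability is exactly 0.5.
b0, b1 = 35.212, -0.252   # coefficients as reported in the article
boundary = -b0 / b1
print(round(boundary, 2))  # heights below this -> female (1), above -> male (0)
```

With these coefficients the cutoff lands just below 140 cm, which is why the prediction for a 140 cm person sits so close to the 0.5 threshold.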