9 Steps in the Model Training Process That Demand a Sequence


Photo by Kelly Sikkema on Unsplash

Not just model training: everything in life demands a sequence. Take the example of lighting a gas stove:

1) You turn on the gas pipeline valve to the stove.

2) You turn the knob of the stove.

3) You pass a spark to light it.

Can you rearrange the above three steps and still light up the stove? The answer is no. Try any combination, say, passing a spark first, then turning the stove knob, and then opening the gas pipeline. Yes, you guessed it right: there would be no flame. Engineers often define such sequences to operate machines and call them SOPs (Standard Operating Procedures). A simple SOP for operating a pump would be to first line up the suction and discharge and then turn it on. SOPs are defined to make a process smooth, efficient, and error-free. Some jobs are so critical that not defining an SOP, or a sequence, would create chaos.

In this article, we will discuss the same kind of procedure for the model training process in Machine Learning. We will consider the example of rainfall prediction through Linear Regression, using predictors like humidity, temperature, etc. As in every other engineering process, a sequence is important here too. We can call it "An SOP for Model Training". So, let us begin.

Step 0: Load Modules

We will begin with step zero. Zero, because the position of this step is not fixed; you can run it whenever the need arises. The step is to import all the necessary modules and classes that you think you will need in the training process. Anything you miss here can be imported later on too. The modules and classes that we require are imported below:


import pandas as pd

from sklearn.linear_model import LinearRegression

from sklearn.model_selection import train_test_split

from sklearn.metrics import mean_squared_error

Step 1: Load Data

The first step in the model training process is to get, view, and understand the data that you require. The data we are going to use is the rainfall data with predictors like humidity, temperature, etc., as already mentioned. We use pandas' read_csv function to import the data frame and the head function to view it:

df = pd.read_csv('rain_data.csv')

print(df.shape)

df.head()

Let's understand the data first. This is the data of 10 countries, labeled A, B, C, ..., J. The dependent variable is Rainfall_amt, the amount of rainfall (mm) for a particular country. The independent variables are Temperature (degrees Celsius), Humidity (percentage), Wind_speed (High or Low), and the literacy rate of the country. The shape of the data frame is (10, 6), meaning it has 10 rows and 6 columns.
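Since rain_data.csv is not included with the article, a hypothetical stand-in frame matching the described schema can be built for experimentation. Every value below is invented purely for illustration; only the column names and shape follow the description above.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for rain_data.csv: 10 countries, 6 columns.
# All values are invented for illustration only.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'Country': list('ABCDEFGHIJ'),
    'Temperature': rng.uniform(10, 35, 10).round(1),    # degrees Celsius
    'Humidity': rng.uniform(40, 95, 10).round(1),       # percent
    'Wind_speed': rng.choice(['High', 'Low'], 10),
    'Literacy_rate': rng.uniform(50, 100, 10).round(1), # percent
    'Rainfall_amt': rng.uniform(100, 900, 10).round(0), # mm
})
print(df.shape)  # (10, 6)
```

With such a frame in hand, every later step of the article can be run end to end even without the original file.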

Step 2: Remove unnecessary features

Get rid of the unnecessary features. Your data frame may contain a number of independent variables that your intuition and experience tell you are not related to the dependent variable and hence are not needed. Dropping this extra information is necessary: why carry dead weight?

In the case of our data frame, we do not require the name of the country or the literacy rate, because these two variables have nothing to do with rainfall.

df.drop(['Country','Literacy_rate'], inplace = True, axis = 1)

df.head()


Step 3: Deal with the missing data

Deal with the missing data. Not every data set is clean; missing values are everywhere. It becomes necessary to deal with them beforehand so that they do not affect the rest of the process. You can handle missing values in many ways: drop the entire row containing a missing value, or replace it with some constant value or some statistic, etc.
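The three strategies just mentioned can be sketched side by side on a toy column (the values here are invented for illustration only):

```python
import pandas as pd

# Toy series with one gap, for illustration only
s = pd.Series([70.0, None, 80.0])

dropped  = s.dropna()          # drop rows with missing values
constant = s.fillna(0)         # replace with a constant
mean_fix = s.fillna(s.mean())  # replace with a statistic (here, the mean)

print(mean_fix.tolist())  # [70.0, 75.0, 80.0]
```

Which strategy is right depends on how much data you can afford to lose and how the variable is distributed; mean imputation, used below, is just one reasonable default.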

Let’s first check our data frame for missing values:

df.isnull().sum()

We have one missing value in the Humidity column. Let's replace it with the mean humidity:


df['Humidity'].fillna(df['Humidity'].mean(), inplace = True)

df.isnull().sum() 

Our data is now free of missing values, so we can proceed.

Step 4: Separate out variables

Separate out the dependent variable from your data set. Now that the missing values are gone, it is time to hold the dependent variable and the independent variables separately. Why? You will see shortly.

In our data frame, the Rainfall_amt column contains the data of the dependent variable and the rest of the variables are independent. Let’s separate out the Rainfall_amt and call it Y.

Y = df.pop('Rainfall_amt')

print(Y)

df.head()


Step 5: Deal with categorical variables

The model training program cannot process categorical variables; it needs numbers. So it is time to deal with the categorical variables. You can replace the categorical values with numbers, for example replacing female with 0 and male with 1, or you can use one-hot encoding, whatever suits you.

Our data frame contains categorical values in the Wind_speed column. Let's encode them with one-hot encoding.

df = pd.get_dummies(df)

df.head()

The Wind_speed column has been split into two columns, Wind_speed_High and Wind_speed_Low. A value of 1 shows presence and 0 shows absence.
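The other option mentioned above, mapping the two categories to a single numeric column instead of creating dummy columns, can be sketched like this (the toy frame and the 0/1 assignment are illustrative choices, not the article's method):

```python
import pandas as pd

# Toy frame for illustration only
toy = pd.DataFrame({'Wind_speed': ['High', 'Low', 'High']})

# Map the two categories to 1/0 instead of creating dummy columns
toy['Wind_speed'] = toy['Wind_speed'].map({'Low': 0, 'High': 1})
print(toy['Wind_speed'].tolist())  # [1, 0, 1]
```

For a binary variable the two encodings carry the same information; one-hot encoding matters more once a column has three or more unordered categories.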

Step 6: Split the data

Split the data into training and test sets. This step may be considered optional depending upon the situation. When we are not provided a separate test data file, we need to split the data given to us to make a test data frame for testing our model later on.

Our data frame has just 10 rows, so it is not really practical to split it, but we will still split it in a 70:30 train:test ratio to illustrate the concept.

x_train, x_test, y_train, y_test = train_test_split(df, Y, test_size = 0.30)
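One detail worth knowing: train_test_split shuffles the rows randomly, so each run can produce a different split. Passing random_state pins the shuffle and makes results reproducible. A minimal sketch on toy data (the frame and the seed value 42 are illustrative):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data for illustration only
X = pd.DataFrame({'x': range(10)})
y = pd.Series(range(10))

# random_state pins the shuffle, so the same split comes back every run
x_tr, x_te, y_tr, y_te = train_test_split(X, y, test_size=0.30, random_state=42)
print(len(x_tr), len(x_te))  # 7 3
```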

Step 7:  Train the model

This is the time to use that much-awaited function for fitting the model. The wait so far, and the above seven steps, are worth it. So let's fit the regression model on the training data:

lr = LinearRegression()

lr.fit(x_train, y_train)

Step 8: Feature Selection

This step is most important and often ignored by amateurs. Now that you have the fitted model, it is time to work out which features, or independent variables, really matter to it. In short, if possible, let's cut the dead weight again.

Feature Selection is a wide topic and involves numerous techniques. Many experts would argue about the position of this step, here, after fitting the model. We will leave this part without code and consider it a task for the reader to learn more about it.
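As one possible starting point among the many techniques, scikit-learn's RFE (recursive feature elimination) repeatedly refits the model and drops the weakest feature. The sketch below runs it on synthetic data, where the target depends only on the first two of four features; the data and parameter choices are illustrative, not the article's method.

```python
import numpy as np
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data for illustration: y depends on the first two features only
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

# Refit LinearRegression repeatedly, eliminating the weakest feature each round
selector = RFE(LinearRegression(), n_features_to_select=2).fit(X, y)
print(selector.support_)  # boolean mask of the features RFE kept
```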

Step 9: Calculate score

Use the test data set to make predictions and calculate metrics to see how well the fitted model is doing. We will calculate the mean squared error:

print('The mean squared error is ', mean_squared_error(y_test, lr.predict(x_test)))
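Mean squared error is only one option; r2_score is another common regression metric that scikit-learn provides. A small sketch on toy numbers (the predictions and actuals below are invented for illustration):

```python
from sklearn.metrics import mean_squared_error, r2_score

# Toy predictions vs. actuals, for illustration only
y_true = [3.0, 5.0, 7.0]
y_pred = [2.5, 5.0, 7.5]

print('MSE:', mean_squared_error(y_true, y_pred))  # mean of squared residuals
print('R2 :', r2_score(y_true, y_pred))            # 1 = perfect, 0 = predicting the mean
```

MSE depends on the units of the target, while R2 is scale-free, which makes the latter easier to compare across problems.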

All of the above steps together make a Model Training SOP. The sequence has its own meaning. As with the gas stove example, we cannot just rearrange the steps and proceed. Suppose, for example, that Step 4 ("Separate out variables") precedes Step 3 ("Deal with the missing data"). You would then already be holding separate data for the independent and dependent variables. If, while dealing with the missing values, a few rows are dropped here and there, the data frame lengths will no longer match and an error will be generated. Even if the sizes happen to match, the model can become biased: imagine dropping row 3 from the independent-variable data but row 5 from the dependent-variable data. It will either generate an error or make the model and its results misleading. The whole process sequence is shown graphically below:

The steps described above form one loop of an iterative training process. Most of the time, the score we get at the end is not satisfactory, and hence the whole process is repeated again and again.

I hope you liked the article. Do post your comments. You can also reach me on LinkedIn for further queries.

 

Thanks,

Have a nice time 😊
