Judging COVID-19 Infection through Symptoms — A Random Forest Based Classification Model

This article will try to:

  • Explain the classification model fitted to the symptom sets of Covid-19, cold and flu
  • Provide the reader with a tool to predict the possibility of Covid-19 infection for a particular set of symptoms
If you want to go straight to the prediction tool, click here; if you want to know how it was built, read on.

Covid-19 (the disease caused by the novel coronavirus) has been declared a pandemic by the WHO and has affected thousands across the world. Starting from China, it has spread to around 114 countries as of now, with Italy, Iran and Spain the most affected. Since medical science has not yet come up with a vaccine for the disease, it poses a great danger. What makes the disease more dangerous is the onset of its symptoms, which not only take days to develop but are also very similar to those of seasonal flu and the common cold. Individuals often confuse the symptoms of Covid-19 with flu or cold and, as a result, become carriers of the virus.
Many agencies around the world have tried, through different media, to explain the symptoms of Covid-19 and how they subtly differ from the symptoms of flu and cold. One such infographic was published by the science magazine PopSci (Popular Science) on its Twitter handle [https://twitter.com/PopSci/status/1237750527166996481], based on data provided by the WHO, and is shown below:

The image shows how common the various symptoms are for each condition, e.g. sneezing is shown to be very common with a cold but absent with Covid-19 and flu.
We can use the data represented in the infographic to train a classification model, which can then be used to predict the chances of Covid-19 infection for various combinations of symptoms. As an illustration, the infographic suggests that fever and shortness of breath without sneezing can indicate Covid-19 infection, but what if only some of these symptoms are present? The infographic cannot tell you anything about such a combination, but a model trained on this data can give you an estimated probability of Covid-19 infection in that case.

Development of Classification Model (Technical Stuff)
The infographic, when translated into tabular form with certain adjustments, looks as shown below:

The adjustments made to the data are:
  1. The symptoms that are common to all three conditions, like fatigue, cough, aches/pains, runny nose and sore throat, are removed.
  2. Symptoms that are rare for cold and flu are not considered.
  3. Any symptom present in Covid-19, even if it is very rare, is taken into consideration.

The training procedures are carried out in Jupyter Notebook (Python 3.x).

The above table is pulled into the notebook, and the categorical values are replaced with numerical ones:
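
As a rough illustration of this encoding step, here is a minimal sketch with pandas. The table below is a hypothetical stand-in and not the article's actual data: only the Sneezing column follows the description above, the other values are placeholders, and the ordinal scale in `frequency_codes` is an assumption.

```python
import pandas as pd

# Hypothetical stand-in for the symptom table. Only the Sneezing column
# follows the text; the remaining values are placeholders for illustration.
raw = pd.DataFrame(
    {
        "Disease": ["Covid-19", "Cold", "Flu"],
        "Sneezing": ["No", "Common", "No"],
        "Fever": ["Common", "Rare", "Common"],
        "Shortness of Breath": ["Sometimes", "No", "No"],
    }
)

# Assumed ordinal scale: a larger number means the symptom is more common.
frequency_codes = {"No": 0, "Rare": 1, "Sometimes": 2, "Common": 3}

# Encode every symptom column; the Disease column (the response) stays as text.
symptom_columns = [col for col in raw.columns if col != "Disease"]
encoded = raw.copy()
for col in symptom_columns:
    encoded[col] = encoded[col].map(frequency_codes)

print(encoded)
```

An ordinal mapping keeps the ordering between "No" and "Common"; a simple yes/no encoding would also work if only the presence of a symptom matters.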

Development of Decision Trees
Let's use the above data to develop a decision tree first. The disease column is the response, and all the symptom columns are treated as predictors. sklearn is used to fit the DecisionTreeClassifier model with the gini criterion, and the graphviz library is then used to visualize the fitted tree. The code is shown below:
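
A minimal sketch of such a fit is given here, assuming a small numerically encoded table. The column values are hypothetical placeholders, and `random_state=0` is added only to make the run repeatable. The tree is exported as Graphviz DOT text, so the sketch runs even without the graphviz package installed.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_graphviz

# Hypothetical, numerically encoded symptom table (placeholder values).
data = pd.DataFrame(
    {
        "Disease": ["Covid-19", "Cold", "Flu"],
        "Sneezing": [0, 3, 0],
        "Fever": [3, 1, 3],
        "Shortness of Breath": [2, 0, 0],
    }
)

X = data.drop(columns="Disease")  # predictors: the symptom columns
y = data["Disease"]               # response: the disease label

# Fit a single decision tree using the gini impurity criterion.
tree = DecisionTreeClassifier(criterion="gini", random_state=0)
tree.fit(X, y)

# Export the fitted tree as DOT source; out_file=None returns it as a string.
dot_source = export_graphviz(
    tree,
    out_file=None,
    feature_names=list(X.columns),
    class_names=list(tree.classes_),
    filled=True,
)
print(dot_source)
```

In a Jupyter notebook with the graphviz Python package installed, passing `dot_source` to `graphviz.Source` should render the tree as a diagram.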

As you can see from the above visualization, only two predictors (Sneezing and Shortness of Breath) are used by the fitted tree. Running the training again and again can result in different decision trees, because the decision tree model picks at random among features whose splits are equally good.

We can use the Random Forest algorithm to combine many such trees at the same time and make the disease prediction more accurate. The code is shown below:
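
A minimal sketch of a Random Forest fit is given here, again on a hypothetical placeholder table. The 1000 trees match the number quoted for the tool further down; `random_state=0` and the `new_case` symptom values are assumptions made for illustration.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Hypothetical, numerically encoded symptom table (placeholder values).
data = pd.DataFrame(
    {
        "Disease": ["Covid-19", "Cold", "Flu"],
        "Sneezing": [0, 3, 0],
        "Fever": [3, 1, 3],
        "Shortness of Breath": [2, 0, 0],
    }
)

X = data.drop(columns="Disease")
y = data["Disease"]

# Train 1000 trees; each one sees a bootstrap sample of the rows and
# considers a random subset of the features at every split.
forest = RandomForestClassifier(n_estimators=1000, criterion="gini", random_state=0)
forest.fit(X, y)

# Hypothetical symptom combination to score (same column order as X).
new_case = pd.DataFrame([{"Sneezing": 0, "Fever": 3, "Shortness of Breath": 2}])

# predict_proba averages the class probabilities over all trees in the forest.
probabilities = forest.predict_proba(new_case)[0]
for disease, probability in zip(forest.classes_, probabilities):
    print(f"{disease}: {probability:.2f}")
```

Unlike a single tree, which returns one hard answer, the forest reports a probability for each disease, which is what the prediction tool needs.
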
The Random Forest model trained above can now be used to make judgements. The whole program has been incorporated into the interactive form (built with Anvil) below:

Random Forest-Based Classification Model
There are 1000 decision trees that work for you in the background when you submit the symptoms in the form below: 

Please note that the tool was built for learning purposes only and cannot be considered a substitute for any medical test or disease detection procedure.

I hope you like this attempt to analyze the symptoms through a Machine Learning algorithm. All sorts of suggestions are welcome.

Have a good time :) 


  1. Your explanations are very good, I tried to test in GUI but Encountering some error related to server connection.
    Thank you for explanation, i like it.

  2. where can i find the data set?

