3 Pandas functions to rule the data frames


Introduction

Pandas is an extensive Python library used by Data Scientists in a varied number of fields. If you have worked in Microsoft Excel or MySQL, you are already familiar with relational data structures. You must be comfortable with the data arranged in rows and columns. I came to know about the library while learning machine learning in Python. Pandas not only offer you tools to carry out extensive data analysis and statistical inferences but because it is built upon Numpy (an another important Python library) it proves to be a platform to build highly complex machine learning and artificial neural network algorithms. The three main data structures provided by Pandas are Data Series (One dimensional), Data Frames (Two dimensional) and Data Panels (multi-dimensional). Pandas is equipped with an enormous number of functions and classes that give you enough power to deal with any data analysis problem. We will in this article discuss the three main functions for data frames that not only find application in almost every problem/project but if understood can prove pivotal. Let’s not stretch it further and directly jump over to the functions.

1.       Boolean Masking: Boolean masking is a data frame slicing technique. In most of the problems we require a subset of a data frame which fulfills certain conditions, as an example you may be asked to filter out the data of only those students who passed a certain test in the data frame of student marks and grades or from the census data frame filter out the data of those males who are literate. These type of problems in Pandas can be easily solved by Boolean masking. To understand this concept let’s consider a data-frame (df1) as shown below: 

Add caption

The above data table summarizes the percentage of marks obtained by 4 students (A, B, C, D) in 4 subjects (Maths, Phy., Chem., and Prog.). Only 10 data entries are shown to make understanding simple. In pandas, every data frame has an index associated with it. It is something which identifies the data entries or data rows. In the current data frame, the Student column is treated as an index (Also note that index in Pandas is different from Primary Key in MySQL as index entries can be duplicate). In Boolean masking we pass a Boolean series (a series of True, False values) to the data frame which is then matched against the data frame index. All the data entries where the index matches with True value are returned and those entries (rows) where the index matches with false values are not returned. We will understand Boolean masking more clearly by answering a few questions related to the above data frame. If we are asked to pull out only the data of student A from the above data frame, we will have to first create the conditional statement which will create a Boolean series that we desire. If we write df1[‘Student’] == ‘A’, it will match the value ‘A’ with the student column and return True if the condition is met & False otherwise. The above conditional statement will return a series as shown below:

True

False

False

False

False

False

True

False

False

False

Observe the above series carefully. The series contains True value only at the places where the given condition of df1[‘Student’] == ‘A’ is met. The above created Boolean series can be passed to the data frame as an indexing parameter to filter out data. The statement can be written as:

df1[ df1[‘Student’] == ‘A’ ]

The above-statement will return only those data entries where the student name is A and ignore all else:

The idea behind Boolean masking should be clear as of now. It is a powerful tool that can solve more complex problems too as an example: Pull out the data of those students who have scored more than 50% in maths

df1[ df1[‘Subject’] == ‘Maths’ and df1[‘%age Marks’] > 50  ]

The Boolean masking work in all sort of data frames and prove to be a handy tool in almost all sort of problems/projects

2.       Grouping Data: Grouping data is another requirement that data scientists face e.g. in the context of the current data frame we may be asked to find out the total marks or average marks obtained by each student or subject-wise total or average marks obtained by students etc. Pandas solve these types of problems by groupby function. We will directly use this function to understand it. Let’s try to solve the first question. “The total marks obtained by each student”. The problems require us to first group the data per student and then find the sum or average marks. Programmatically as shown below:

df1.groupby(‘Student’)  

It will group the same students and their associated data together. It can be understood by the below illustration:

When df1.groupby(‘Student’) the statement is run, the original data frames can be thought as split into 4 different data frames based upon the name of the student. Now that we have different data frames for different students, we can easily find out the sum or average marks obtained by each student by the aggregate functions provided by pandas.

 df1.groupby(‘Student’).sum() will give us the table of total marks by each student and df1.groupby(‘Student’).mean() will give us the average.

3.       Merging: So far we have discussed only the operations which are applicable to a single data frame, but this is not where the data analysis is restricted. Many a time we are asked to carry out the analysis across the multiple data frames. To get an idea about merging let’s introduce another student data frame (df2):


 The above data frame presents information about the sports enrolment of the students. Along with the former data frame, the sports data frame can be merged with it to extract a lot of meaningful information. Before going for programming stuff, let’s first understand the merging process. The two data frames can be mainly merged in 4 different ways as shown below:

The 2 circles show any two data frames and a, b, c represent three possible regions. The region b is called the intersection of the two data frames similarly region a is data frame 1 (df1) minus the intersected part and similar explanation goes for the region c.

The four merging processes that are possible are

    1. Left Merge: Region a + Region b. In the context of current data frames, it will be the sports details of all the students mentioned in the marks data frame. In Pandas it is implemented by running df1.merge(df2, on = ‘Student’, how = ‘left’). The on = ‘Student’ attribute says that the merging has to be done on the student column and how = ‘left’ says that the merging process has to be the left one. The above command will result in the table as shown below:

                   

Note that in the left merge the left data frame prevails with the additional information extracted from the right data frame. Also, note the Nan values, these are inserted wherever the information is missing.

2. Right-Merge: Right merge has the same meaning as that of the left merge only the data frames flipped i.e. it represents the regions b + c. In Pandas it is implemented by running the below command:

df1.merge(df2, on = ‘Student’, how = ‘right’). 

The attributes have the usual meaning and will result in the data frame as shown below:

The above table contains all the entries in the right table, with additional information extracted from the left table. Also, observe all those entries which are in the right table but not in left have Nan values against the columns of the left data frame

3.       Inner Merge: This represents the intersection of the two data frames i.e. the region b. The Pandas command for the same is as below:

df1.merge(df2, on = ‘Student’, how = ‘inner’)

The resultant data frame will be:

Note that only the information for those entries have been included that are present in both the data frames

4.       Outer Merge: The outer merge means the union of the two data frames. In the context of the above-mentioned diagram it means region a + region b + region c. In pandas the same is represented as:

df1.merge(df2, on = ‘Student’, how = ‘outer’)

and will result in the data frame as shown below:

In outer-merge, all the entries present in both the data frames are included and the data points which are presented are represented by Nan

Although Pandas is a package of a lot of powerful classes and functions, the three included above are ones frequently used and with few other functions are enough to carry out even highly complicated data analytics tasks. To further read about the topic you can take the Data Analysis course on Coursera by University of Michigan or read the book by Matt Harrison named Learning Pandas

Hope you like the article, you can put your queries below or reach me on LinkedIn for further queries.

 

Thanks,

Have a nice time 😊


Comments