Introduction
Pandas is an extensive Python library
used by Data Scientists in a varied number of fields. If you have worked in
Microsoft Excel or MySQL, you are already familiar with relational data
structures. You must be comfortable with the data arranged in rows and columns.
I came to know about the library while learning machine learning in Python. Pandas
not only offer you tools to carry out extensive data analysis and statistical
inferences but because it is built upon Numpy (an another important Python library)
it proves to be a platform to build highly complex machine learning and artificial
neural network algorithms. The three main data structures provided by Pandas
are Data Series (One dimensional), Data Frames (Two dimensional) and Data
Panels (multi-dimensional). Pandas is equipped with an enormous number of
functions and classes that give you enough power to deal with any data
analysis problem. We will in this article discuss the three main functions for
data frames that not only find application in almost every problem/project but
if understood can prove pivotal. Let’s not stretch it further and directly jump
over to the functions.
1.
Boolean Masking: Boolean masking is a
data frame slicing technique. In most of the problems we require a subset of a
data frame which fulfills certain conditions, as an example you may be asked to filter
out the data of only those students who passed a certain test in the data frame
of student marks and grades or from the census data frame filter out the data
of those males who are literate. These type of problems in Pandas can be easily
solved by Boolean masking. To understand this concept let’s consider a data-frame (df1) as shown below:
![]() |
Add caption |
The above data table summarizes the percentage of marks obtained by 4 students (A, B, C, D) in 4 subjects (Maths, Phy., Chem., and Prog.). Only 10 data entries are shown to make understanding simple. In pandas, every data frame has an index associated with it. It is something which identifies the data entries or data rows. In the current data frame, the Student column is treated as an index (Also note that index in Pandas is different from Primary Key in MySQL as index entries can be duplicate). In Boolean masking we pass a Boolean series (a series of True, False values) to the data frame which is then matched against the data frame index. All the data entries where the index matches with True value are returned and those entries (rows) where the index matches with false values are not returned. We will understand Boolean masking more clearly by answering a few questions related to the above data frame. If we are asked to pull out only the data of student A from the above data frame, we will have to first create the conditional statement which will create a Boolean series that we desire. If we write df1[‘Student’] == ‘A’, it will match the value ‘A’ with the student column and return True if the condition is met & False otherwise. The above conditional statement will return a series as shown below:
True
False
False
False
False
False
True
False
False
False
Observe the above
series carefully. The series contains True value only at the places where the
given condition of df1[‘Student’] == ‘A’ is met. The above created
Boolean series can be passed to the data frame as an indexing parameter to filter
out data. The statement can be written as:
df1[ df1[‘Student’]
== ‘A’ ]
The above-statement will return only those data entries where the student name is A and
ignore all else:
The idea behind
Boolean masking should be clear as of now. It is a powerful tool that can solve
more complex problems too as an example: Pull out the data of those students who
have scored more than 50% in maths
df1[ df1[‘Subject’]
== ‘Maths’ and df1[‘%age Marks’] > 50 ]
The Boolean
masking work in all sort of data frames and prove to be a handy tool in almost
all sort of problems/projects
2.
Grouping Data: Grouping data is another requirement that data scientists face e.g. in the context of the
current data frame we may be asked to find out the total marks or average marks
obtained by each student or subject-wise total or average marks obtained by
students etc. Pandas solve these types of problems by groupby function.
We will directly use this function to understand it. Let’s try to solve the
first question. “The total marks obtained by each student”. The problems require us to first group the data per student and then find the sum or average
marks. Programmatically as shown below:
df1.groupby(‘Student’)
It will group the same students and
their associated data together. It can be understood by the below illustration:
When df1.groupby(‘Student’) the statement is run, the original data frames can be thought as split into 4
different data frames based upon the name of the student. Now that we have
different data frames for different students, we can easily find out the sum or
average marks obtained by each student by the aggregate functions provided by
pandas.
df1.groupby(‘Student’).sum() will give us the table of total marks by each student and df1.groupby(‘Student’).mean() will give us the average.
3.
Merging: So far we have discussed only
the operations which are applicable to a single data frame, but this is not
where the data analysis is restricted. Many a time we are asked to carry out
the analysis across the multiple data frames. To get an idea about merging let’s
introduce another student data frame (df2):
The 2 circles
show any two data frames and a, b, c represent three possible regions.
The region b is called the intersection of the two data frames similarly
region a is data frame 1 (df1) minus the intersected part and similar
explanation goes for the region c.
The four merging processes that are possible are
- Left Merge: Region a + Region b. In the context of current data frames, it will be the sports details of all the students mentioned in the marks data frame. In Pandas it is implemented by running df1.merge(df2, on = ‘Student’, how = ‘left’). The on = ‘Student’ attribute says that the merging has to be done on the student column and how = ‘left’ says that the merging process has to be the left one. The above command will result in the table as shown below:
Note that in the left merge the left data frame prevails with the additional information extracted from the right data frame. Also, note the Nan values, these are inserted wherever the information is missing.
2. Right-Merge: Right merge has the same meaning as that of the left merge only the data frames flipped i.e. it represents the regions b + c. In Pandas it is implemented by running the below command:
df1.merge(df2, on = ‘Student’, how = ‘right’).
The
attributes have the usual meaning and will result in the data frame as shown
below:
The above table contains all the entries in the right table,
with additional information extracted from the left table. Also, observe all those
entries which are in the right table but not in left have Nan values against the
columns of the left data frame
3. Inner Merge: This represents the intersection of the two data frames i.e. the region b. The Pandas command for the same is as below:
df1.merge(df2,
on = ‘Student’, how = ‘inner’)
The resultant
data frame will be:
Note that only
the information for those entries have been included that are present in both
the data frames
4.
Outer Merge: The outer merge means
the union of the two data frames. In the context of the above-mentioned diagram
it means region a + region b + region c. In pandas the same is represented as:
df1.merge(df2, on = ‘Student’, how = ‘outer’)
and will result in the data frame as shown below:
In outer-merge, all the entries present in both the data frames are included and the data points which are presented are represented by Nan
Although Pandas is a package of a lot of powerful classes and functions, the three included above are ones
frequently used and with few other functions are enough to carry out even
highly complicated data analytics tasks. To further read about the topic you
can take the Data Analysis course on Coursera by University of Michigan or read
the book by Matt Harrison named Learning Pandas
Hope you like the article, you
can put your queries below or reach me on LinkedIn for further queries.
Thanks,
Have a nice time 😊
Comments
Post a comment