Pandas
While numpy arrays hold only homogeneous data types ( all integers, or all floats ), Pandas handles heterogeneous data. Think of Pandas as a library for manipulating heterogeneous grids of data ( pretty much like Excel )
Contents
Just enough Pandas
- Introduction
- Getting Started
- Dataframes
Introduction
Most data is heterogeneous and tabular in nature. For example, look at the following data, which shows some stats from the Google Play Store. There is text, integers, floats etc.

numpy is not suited to manipulating this kind of data. For that, we need Pandas.
What is Pandas
Pandas is essentially a data manipulation tool ( think data munging, wrangling, preparation etc ) for a grid of data ( text, numbers, floats etc ). For example, looking at the data grid above, say you want to
Filter
- a particular category ( say only ART_AND_DESIGN )
- or all rows with Rating > 4.1
- or all rows where the category is ART_AND_DESIGN and rating > 4.1
Collapse or Group by
- and find how many rows are there in a particular category
- or find how many rows are there with Rating > 4.1
- or a combination of both
Read
- data from different formats ( excel, csv, SQL databases etc )
Handle
- missing data ( like NAs, blanks etc )
- erroneous data ( data that does not comply with the data type of the column ) etc
Manipulate
- combine data from different sources into one
- or split data into a set of rows or columns or both
- or extract a sub-set of data into another data frame ( say create a new data set only for category ART_AND_DESIGN)
- or slice the dataset based on a variety of parameters
- or insert/delete columns or rows from the dataset
- Like add a new app category or delete the rating column
Think of Pandas as Excel on Steroids.
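To make the operations above concrete, here is a minimal sketch on a small made-up sample ( the app names, categories and ratings are invented for illustration, in the spirit of the Play Store data ):

```python
import pandas as pd

# A tiny invented sample, similar in shape to the Play Store data
apps = pd.DataFrame({
    "app":      ["Sketch It", "Budget Pro", "Doodle Kids", "Task List"],
    "category": ["ART_AND_DESIGN", "FINANCE", "ART_AND_DESIGN", "PRODUCTIVITY"],
    "rating":   [4.5, 3.9, 4.2, 4.7],
})

# Filter: rows where category is ART_AND_DESIGN and rating > 4.1
good_art = apps[(apps["category"] == "ART_AND_DESIGN") & (apps["rating"] > 4.1)]
print(good_art["app"].tolist())   # ['Sketch It', 'Doodle Kids']

# Collapse / group by: how many rows per category
counts = apps.groupby("category").size()
print(counts["ART_AND_DESIGN"])   # 2
```

All of these patterns ( filtering, grouping, reading, handling missing data ) are covered in detail in the sections that follow.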
Why learn Pandas
In the context of Machine Learning and Python, Pandas is the gold standard for in-memory data management ( reading and manipulating data ). Its performance-critical internals are written in C and Cython, so Pandas manipulates data at near-C speed. It is not uncommon for Pandas to comfortably handle large data sets ( around 5 to 10 GB ) without a hitch.
Our approach to Pandas
We will be using Pandas quite extensively in this Machine Learning course. However, we will cover most of the essential aspects of Pandas in this chapter and leave the more complicated options to later chapters, where we will encounter situations that lead us to explore them. For now, we will keep it simple and stick to small test datasets.
Getting Started
Installing Pandas
- pip
pip install pandas
If you are using the Anaconda distribution, pandas is installed by default. If you are just using the conda package manager for Python,
- conda
conda install pandas
You can verify whether Pandas is already installed in your Python environment using pip list.
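A quick sanity check from Python itself: if the import succeeds, pandas is installed, and the version string tells you which release you have.

```python
import pandas as pd

# A successful import means pandas is installed
print(pd.__version__)
```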

Dataframes
A DataFrame is the main data structure in pandas. Think of a data frame as an Excel grid – quite simply, rows and columns.

You can create, add, delete, filter data very easily in pandas. For starters, let’s see how easy it is to create a data set.
Create Dataframe
From List or Dictionary
For example, if you wanted to create a simple grid of data like this,

just create the columns ( names, population_m ) as 2 separate lists, combine them into a dictionary and pass it as an argument to the DataFrame() function.
import pandas as pd
names = ["India","United States","Canada"]
population_m = [1500,300,36]
d = {"names" :names , "population" : population_m}
df = pd.DataFrame(d)
df
names population
0 India 1500
1 United States 300
2 Canada 36
From an Empty Dataframe
You can create an empty data frame and start adding columns one by one.
df = pd.DataFrame()
df["names"] = names
df["population"] = population_m
print ( df )
names population
0 India 1500
1 United States 300
2 Canada 36
From Files
Here is a simple file with just 3 entries. It can directly be read into pandas using the read_csv ( ) function.

df = pd.read_csv("../data/simple_dataframe.csv")
df
names population
0 India 1500
1 United States 300
2 Canada 36
Display Dataframe
Once you have read a dataframe, you typically want to examine it – usually just the first few or the last few rows. For that, use the head ( ) or tail ( ) functions. For a change, instead of reading a CSV file, let’s read an Excel file. In case this does not work, install the Python module xlrd ( or openpyxl, which newer versions of pandas use for .xlsx files ) using
pip install xlrd
Say, we have read an excel file like this ( contains a list of all countries and their population ). Since the list is big, we just want to display the first few entries.

df = pd.read_excel("../data/countries_population.xlsx")
df.head() # Shows the first few rows by default.
Rank country continent population change
0 1 China Asia 1403500365 0.004
1 2 India Asia 1324171354 0.011
2 3 United States Americas 322179605 0.007
3 4 Indonesia Asia 261115456 0.011
4 5 Brazil Americas 207652865 0.008
df.tail() # Shows the last few rows by default.
rank country continent population Change
228 229 Saint Helena, Africa 4035 0.003
229 230 Falkland Islands Americas 2910 0
230 231 Niue Oceania 1624 −0.4%
231 232 Tokelau Oceania 1282 0.014
232 233 Vatican City Europe 801 −1.1%
df.head(10) # You can very well ask for a specific number of rows to be displayed.
rank country continent population change
0 1 China Asia 1403500365 0.004
1 2 India Asia 1324171354 0.011
2 3 United States Americas 322179605 0.007
3 4 Indonesia Asia 261115456 0.011
4 5 Brazil Americas 207652865 0.008
5 6 Pakistan Asia 193203476 0.02
6 7 Nigeria Africa 185989640 0.026
7 8 Bangladesh Asia 162951560 0.011
8 9 Russia Europe 146864513 −2.0%
9 10 Mexico Americas 127540423 0.013
But how big is the dataframe you have just read from the Excel file ?
Dataframe size
There are a couple of ways to find out how big a dataframe is. As we discussed, a dataframe has rows and columns, right ?

The shape attribute is a tuple that tells us the number of rows and columns in the dataframe. So, there are 233 rows and 5 columns in the population table.
df.shape
(233, 5)
If you want to find the total number of data points in the data frame ( think of all the cells in the Excel sheet ), use the size attribute.
df.size
1165
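The relationship between the two is simple: size is just rows times columns. A quick check, recreating the small dataframe from earlier ( since the Excel file may not be at hand ):

```python
import pandas as pd

# The small dataframe from earlier in the chapter
df = pd.DataFrame({"names": ["India", "United States", "Canada"],
                   "population": [1500, 300, 36]})

rows, cols = df.shape          # shape is a (rows, columns) tuple
print(len(df))                 # len() gives the number of rows: 3
print(df.size == rows * cols)  # size is simply rows x columns: True
```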
Selecting Data from Dataframe
Selecting Data from Dataframes is also called indexing – because we use some form of indices. Let’s see that with an example.
import pandas as pd
df = pd.read_excel("../data/countries_population.xlsx")
#------------ select a subset of the data to start with ---------------
df_small = df.head()
df_small
rank country continent population change
0 1 China Asia 1403500365 0.004
1 2 India Asia 1324171354 0.011
2 3 United States Americas 322179605 0.007
3 4 Indonesia Asia 261115456 0.011
4 5 Brazil Americas 207652865 0.008
Now, what if you want to just select the second column – country ?

df_small["country"]
0 China
1 India
2 United States
3 Indonesia
4 Brazil
Name: country, dtype: object
Another way to do this is to use the loc indexer.
df_small.loc[:,"country"]
0 China
1 India
2 United States
3 Indonesia
4 Brazil
Name: country, dtype: object
This deserves some explanation. Let’s dive deeper into the loc and iloc indexers.
loc and iloc indexers
The best way to extract data from a dataframe is via the loc or iloc indexers. To understand how to use them, look at the picture below – it shows the indexing of the rows and columns starting with 0

iloc – Integer Location
iloc ( or integer location ) is one way to extract data from a data frame. The syntax is shown below.

How do you specify the rows or columns ? Using integers or slices. Let’s see some examples.
- Get the first two rows and all columns
df.iloc[[0,1],[0,1,2,3,4]]
rank country continent population change
0 1 China Asia 1403500365 0.004
1 2 India Asia 1324171354 0.011

Now, instead of specifying all the column numbers, you can very well use the slicing notation and just use the : operator. So, the following would also yield the same result.
df.iloc[[0,1], :]
rank country continent population Change
0 1 China Asia 1403500365 0.004
1 2 India Asia 1324171354 0.011

Along the same lines, if you wanted to select all the rows as well, you could use the : operator. It brings out the entire dataframe.
df.iloc[:,:]
rank country continent population Change
0 1 China Asia 1403500365 0.004
1 2 India Asia 1324171354 0.011
... ... ... ... ... ...
232 233 Vatican City Europe 801 −1.1%
233 rows × 5 columns

What if you wanted to select the 2nd, 4th and 5th rows ?
df.iloc[[1,3,4],:]
rank country continent population Change
1 2 India Asia 1324171354 0.011
3 4 Indonesia Asia 261115456 0.011
4 5 Brazil Americas 207652865 0.008

Say you wanted just these highlighted cells only, how would you do it ? The same old indexing, using both row and column indices.

df.iloc[[1,3,4],[1,4]]
country Change
1 India 0.011
3 Indonesia 0.011
4 Brazil 0.008
How about a chunk like this ?

df.iloc[1:4,0:4]
rank country continent population
1 2 India Asia 1324171354
2 3 United States Americas 322179605
3 4 Indonesia Asia 261115456
Instead of listing every index, we are using slices to specify them. This results in a more compact syntax. However, remember that with iloc ( as with Python slices in general ), the end position is excluded.
loc – Location
Another way to select data from a dataframe is by labels ( row or column names, as opposed to numeric indices ). Note that, unlike iloc, loc slices include the end label. For example,

df.loc[ 1:4 , ["rank","country","population","change"] ]
Rank country population change
1 2 India 1324171354 0.011
2 3 United States 322179605 0.007
3 4 Indonesia 261115456 0.011
4 5 Brazil 207652865 0.008
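The difference in slicing behaviour between iloc and loc is worth seeing side by side. A small sketch with a default integer index, where the row labels happen to equal the positions:

```python
import pandas as pd

df = pd.DataFrame({"country": ["China", "India", "United States",
                               "Indonesia", "Brazil"]})

# iloc slices EXCLUDE the end position (plain Python slicing) ...
print(df.iloc[1:4]["country"].tolist())
# ['India', 'United States', 'Indonesia']

# ... while loc slices INCLUDE the end label
print(df.loc[1:4]["country"].tolist())
# ['India', 'United States', 'Indonesia', 'Brazil']
```

This is why df.loc[ 1:4 , ... ] above returned four rows, not three.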
Row names
Sometimes, dataframes have row names. For example the same dataframe could be looking like this.

df_country = df.set_index("country")
df_country.head()
rank continent population change
country
China 1 Asia 1403500365 0.004
India 2 Asia 1324171354 0.011
United States 3 Americas 322179605 0.007
Indonesia 4 Asia 261115456 0.011
Brazil 5 Americas 207652865 0.008
Now, you can use these label based indices to select rows ( as opposed to numeric indices )

df_country.loc[ ["India","Indonesia"], ["rank","continent","change"]]
rank continent change
country
India 2 Asia 0.011
Indonesia 4 Asia 0.011
You can even use slicing.
df_country.loc[ "India":"Brazil", ["rank","continent","change"]]
rank continent change
country
India 2 Asia 0.011
United States 3 Americas 0.007
Indonesia 4 Asia 0.011
Brazil 5 Americas 0.008
Boolean Mask
If you have heard of the WHERE clause in SQL, you will be right at home with boolean masks in dataframes. This concept is borrowed from other math/statistical languages like MATLAB and R. Let’s take an example.
Get all the rows where “continent” == “Asia”
df_1 = df.iloc[0:5,:]
df_1
rank country continent population change
0 1 China Asia 1403500365 0.004
1 2 India Asia 1324171354 0.011
2 3 United States Americas 322179605 0.007
3 4 Indonesia Asia 261115456 0.011
4 5 Brazil Americas 207652865 0.008

df_1.loc[ df_1["continent"] == "Asia", :]
rank country continent population change
0 1 China Asia 1403500365 0.004
1 2 India Asia 1324171354 0.011
3 4 Indonesia Asia 261115456 0.011
How did this work ? The loc indexer works not just with column/row labels, but also with boolean values. The expression
df_1["continent"] == "Asia"
results in a True/False ( boolean ) vector like below.
df_1["continent"] == "Asia"
0 True
1 True
2 False
3 True
4 False
Name: continent, dtype: bool
All the True rows are returned and the False rows are suppressed. This is the equivalent of the SQL WHERE clause. An equivalent SQL statement would be
SELECT * from df where continent = "Asia"
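If you prefer the SQL style, pandas also offers the query ( ) method, which takes the condition as a string – a minimal sketch on the same kind of data:

```python
import pandas as pd

df_1 = pd.DataFrame({
    "country":   ["China", "India", "United States", "Indonesia", "Brazil"],
    "continent": ["Asia", "Asia", "Americas", "Asia", "Americas"],
})

# query() accepts the condition as a string, much like a WHERE clause
asia = df_1.query('continent == "Asia"')
print(asia["country"].tolist())   # ['China', 'India', 'Indonesia']
```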
You are not limited to a single condition. Using boolean operations, you can make this as complicated as you want.

continent = df_1["continent"] == "Asia"
continent
0 True
1 True
2 False
3 True
4 False
Name: continent, dtype: bool
rank = df_1["rank"] <= 3
rank
0 True
1 True
2 True
3 False
4 False
Name: rank, dtype: bool
condition = continent & rank
condition
0 True
1 True
2 False
3 False
4 False
dtype: bool
df_1.loc[condition, :]
rank country continent population change
0 1 China Asia 1403500365 0.004
1 2 India Asia 1324171354 0.011
You can also write the conditions inline – here combining them with an OR ( | ) instead of the AND ( & ) used above.
df_1.loc[ (df_1["rank"] <=3 ) | (df_1["continent"] == "Asia") , :]
rank country continent population change
0 1 China Asia 1403500365 0.004
1 2 India Asia 1324171354 0.011
2 3 United States Americas 322179605 0.007
3 4 Indonesia Asia 261115456 0.011
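Two more mask-building tools worth knowing: isin ( ), which builds a mask from a list of values ( like SQL’s IN ), and the ~ operator, which negates a mask ( like SQL’s NOT ). A small sketch:

```python
import pandas as pd

df_1 = pd.DataFrame({
    "country":   ["China", "India", "United States", "Indonesia", "Brazil"],
    "continent": ["Asia", "Asia", "Americas", "Asia", "Americas"],
})

# isin() builds a boolean mask from a list of values
mask = df_1["country"].isin(["India", "Brazil"])

# ~ negates the mask: everything NOT in the list
others = df_1.loc[~mask, "country"]
print(others.tolist())   # ['China', 'United States', 'Indonesia']
```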
Rows
Add rows to Dataframe

df_small = df.head()
df_small
rank country continent population change
0 1 China Asia 1403500365 0.004
1 2 India Asia 1324171354 0.011
2 3 United States Americas 322179605 0.007
3 4 Indonesia Asia 261115456 0.011
4 5 Brazil Americas 207652865 0.008
new_rows = df.iloc[5:6,:]
print ( new_rows )
rank country continent population change
5 6 Pakistan Asia 193203476 0.02
df_small = df.head()
df_small
#--------- what is the next row ? -----------
new_rows = df.iloc[5:6,:]
#--------- Add rows from one dataframe to another
df_small.append(new_rows)
df_small
rank country continent population change
0 1 China Asia 1403500365 0.004
1 2 India Asia 1324171354 0.011
2 3 United States Americas 322179605 0.007
3 4 Indonesia Asia 261115456 0.011
4 5 Brazil Americas 207652865 0.008
Why didn’t it work ? That is because append ( ) does not modify the dataframe in place. Instead, it returns a new dataframe. So, try this.
df_new = df_small.append(new_rows)
df_new
rank country continent population change
0 1 China Asia 1403500365 0.004
1 2 India Asia 1324171354 0.011
2 3 United States Americas 322179605 0.007
3 4 Indonesia Asia 261115456 0.011
4 5 Brazil Americas 207652865 0.008
5 6 Pakistan Asia 193203476 0.02
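One caveat: append ( ) was deprecated and then removed in pandas 2.0. The pd.concat ( ) function ( covered later in this chapter ) achieves the same result and works on any pandas version – a minimal sketch on invented data:

```python
import pandas as pd

df_small = pd.DataFrame({"rank": [1, 2], "country": ["China", "India"]})
new_rows = pd.DataFrame({"rank": [6], "country": ["Pakistan"]}, index=[5])

# concat(), like append(), returns a new dataframe rather than
# modifying either input in place
df_new = pd.concat([df_small, new_rows])
print(df_new["country"].tolist())   # ['China', 'India', 'Pakistan']
```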
Delete rows from Dataframe
Use the drop ( ) function to delete rows from a dataframe. Here is how you drop a row.

df_new = df_new.drop(df.index[1])
df_new
rank country continent population change
0 1 China Asia 1403500365 0.004
2 3 United States Americas 322179605 0.007
3 4 Indonesia Asia 261115456 0.011
4 5 Brazil Americas 207652865 0.008
5 6 Pakistan Asia 193203476 0.02
or drop multiple rows.

df_new = df_new.drop(df.index[[1,2,4]])
df_new
rank country continent population change
0 1 China Asia 1403500365 0.004
3 4 Indonesia Asia 261115456 0.011
5 6 Pakistan Asia 193203476 0.02
or drop a slice of rows.

df_new = df_new.drop(df.index[1:4])
df_new
rank country continent population change
0 1 China Asia 1403500365 0.004
4 5 Brazil Americas 207652865 0.008
5 6 Pakistan Asia 193203476 0.02
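Notice that after dropping rows, the numeric index keeps its gaps ( 0, 4, 5 above ). If you want a clean 0, 1, 2 … index again, reset_index ( drop=True ) renumbers the rows – a small sketch:

```python
import pandas as pd

df = pd.DataFrame({"country": ["China", "India", "United States", "Indonesia"]})

# Dropping a row leaves a gap in the index
df_new = df.drop(df.index[1])
print(df_new.index.tolist())   # [0, 2, 3]

# reset_index(drop=True) renumbers the rows from 0 again
df_new = df_new.reset_index(drop=True)
print(df_new.index.tolist())   # [0, 1, 2]
```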
Columns
Add columns to Dataframe
We have already seen some examples of adding new columns to a dataframe. Just use the indexing syntax – if you have a list that you want to add as a column, assign it to the new column name as follows.
df_1 = df_small.iloc[:,0:4]
df_1
rank country continent population
0 1 China Asia 1403500365
1 2 India Asia 1324171354
2 3 United States Americas 322179605
3 4 Indonesia Asia 261115456
4 5 Brazil Americas 207652865
Then assign the list to the new column name.
change = [0.004,0.011,0.007,0.011,0.008]
df_1["change"] = change
df_1
rank country continent population change
0 1 China Asia 1403500365 0.004
1 2 India Asia 1324171354 0.011
2 3 United States Americas 322179605 0.007
3 4 Indonesia Asia 261115456 0.011
4 5 Brazil Americas 207652865 0.008
If you wanted to insert the column at a specific index, use the insert ( ) function. Just specify the column name, the index at which it should be located and the actual data itself.
df_1 = df_small.iloc[:,0:4]
df_1
rank country continent population
0 1 China Asia 1403500365
1 2 India Asia 1324171354
2 3 United States Americas 322179605
3 4 Indonesia Asia 261115456
4 5 Brazil Americas 207652865
change = [0.004,0.011,0.007,0.011,0.008]
df_1.insert(3,"change",change)
df_1
rank country continent change population
0 1 China Asia 0.004 1403500365
1 2 India Asia 0.011 1324171354
2 3 United States Americas 0.007 322179605
3 4 Indonesia Asia 0.011 261115456
4 5 Brazil Americas 0.008 207652865
Delete columns from Dataframe
Deleting columns from a dataframe is just as easy. For example, to drop the “rank” column from the dataframe above, use the drop ( ) function. Just don’t forget to include the axis parameter – axis=1 means along the columns.

df_new = df_new.drop("rank",axis=1)
df_new
country continent population change
0 China Asia 1403500365 0.004
1 India Asia 1324171354 0.011
2 United States Americas 322179605 0.007
3 Indonesia Asia 261115456 0.011
4 Brazil Americas 207652865 0.008
5 Pakistan Asia 193203476 0.02
To do the same using the column index rather than the column name, use the df.columns attribute.
df_new = df_new.drop(df_new.columns[0],axis=1)
You could delete multiple columns as well.

df_new = df_new.drop(["rank","change","population"],axis=1)
# if you wanted to do it by column indices ( note the double brackets )
# df_new = df_new.drop(df_new.columns[[0,3,4]],axis=1)
df_new
country continent
0 China Asia
1 India Asia
2 United States Americas
3 Indonesia Asia
4 Brazil Americas
5 Pakistan Asia
Grouping

In Pandas, there is an equivalent of the SQL GROUP BY syntax. Look at the example below – how do you compute the average population of each continent ? Pandas has a built-in groupby ( ) function that can do this for us.
df_small.groupby("continent").mean()
rank population
continent
Americas 4.000000 2.649162e+08
Asia 2.333333 9.962624e+08
d = df_small.groupby("continent")
d
Without an aggregator function ( mean ( ) in this case ), groupby ( ) returns a DataFrameGroupBy object. It is not a dataframe yet – not until you apply an aggregator. mean ( ) is just one example. You can use generic functions like
- size ( ) ## like count()
- sum ( )
- first ( )
- last ( )
or statistical functions like
- mean ( )
- std ( ) ## Standard Deviation
- var ( ) ## Variance
- min ( )
- max ( )
etc
df_small.groupby("continent").size()
continent
Americas 2
Asia 3
dtype: int64
df_small.groupby("continent").sum()
rank population
continent
Americas 8 529832470
Asia 7 2988787175
You are not limited to a single criterion for grouping or aggregation. For example,

df_1 = df.iloc[0:5,:].copy() # .copy() avoids a SettingWithCopyWarning when adding a column
df_1["nato"] = [False, False, True, False , False]
df_1.groupby(["continent","nato"]).mean()
rank population
continent nato
Americas False 5.000000 2.076529e+08
True 3.000000 3.221796e+08
Asia False 2.333333 9.962624e+08
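You can also apply several aggregators in one go with agg ( ). A minimal sketch on the same five-country data used above:

```python
import pandas as pd

df_small = pd.DataFrame({
    "continent":  ["Asia", "Asia", "Americas", "Asia", "Americas"],
    "population": [1403500365, 1324171354, 322179605, 261115456, 207652865],
})

# agg() applies a list of aggregators to each group at once
summary = df_small.groupby("continent")["population"].agg(["count", "sum", "mean"])
print(summary)
```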
Merge Dataframes
Merging dataframes is a bit involved. We will start with the simplest of cases and move towards more complicated ones.
Concatenate Dataframes

# Rows 1 to 2
df1 = df.iloc[0:2,0:3]
# Rows 3 to 4
df2 = df.iloc[2:4,0:3]
# Rows 5 to 6
df3 = df.iloc[4:6,0:3]
df_new = pd.concat([df1,df2,df3])
df_new
rank country continent
0 1 China Asia
1 2 India Asia
2 3 United States Americas
3 4 Indonesia Asia
4 5 Brazil Americas
5 6 Pakistan Asia
What then is the difference between append ( ) and concat ( ) ? Think of concat as the modern version of append – in fact, append ( ) was deprecated and removed in pandas 2.0, so prefer concat ( ) in new code. That said, append ( ) was not limited to just 2 dataframes either. For example,
df_new = df1.append([df2,df3])
df_new
rank country continent
0 1 China Asia
1 2 India Asia
2 3 United States Americas
3 4 Indonesia Asia
4 5 Brazil Americas
5 6 Pakistan Asia
The major difference between them is the flexibility that concat provides. For example, what if you wanted to concatenate along the columns ? Like this

df1 = df.iloc[0:3,0:3]
df2 = df.iloc[0:3,3:5]
print ( df1 )
print ( df2 )
rank country continent
0 1 China Asia
1 2 India Asia
2 3 United States Americas
population change
0 1403500365 0.004
1 1324171354 0.011
2 322179605 0.007
df_new = pd.concat([df1,df2],axis=1)
df_new
rank country continent population change
0 1 China Asia 1403500365 0.004
1 2 India Asia 1324171354 0.011
2 3 United States Americas 322179605 0.007
The result is a regular dataframe that you can modify like any other – for example, blanking out the rank column.
df_new["rank"] = None
df_new
rank country continent population change
0 None China Asia 1403500365 0.004
1 None India Asia 1324171354 0.011
2 None United States Americas 322179605 0.007
Merge
Merge is similar to database joins. Look at this example.

df_1 = df.iloc[0:5,:]
df_2 = pd.DataFrame()
df_2["country"] = [ "China","India","United States","Indonesia","Brazil"]
df_2["nato"] = [False,False,True,False,False]
df_1
rank country continent population change
0 1 China Asia 1403500365 0.004
1 2 India Asia 1324171354 0.011
2 3 United States Americas 322179605 0.007
3 4 Indonesia Asia 261115456 0.011
4 5 Brazil Americas 207652865 0.008
df_2
country nato
0 China False
1 India False
2 United States True
3 Indonesia False
4 Brazil False
df_1.merge(df_2)
rank country continent population change nato
0 1 China Asia 1403500365 0.004 False
1 2 India Asia 1324171354 0.011 False
2 3 United States Americas 322179605 0.007 True
3 4 Indonesia Asia 261115456 0.011 False
4 5 Brazil Americas 207652865 0.008 False
Inner Join
By default, the type of join is an inner join. Think of this like an intersection. Expanding on the example above,

df_3 = pd.DataFrame()
df_3["country"] = [ "China","India","United States"]
df_3["nato"] = [False,False,True]
df_3
country nato
0 China False
1 India False
2 United States True
df_1.merge(df_3)
rank country continent population change nato
0 1 China Asia 1403500365 0.004 False
1 2 India Asia 1324171354 0.011 False
2 3 United States Americas 322179605 0.007 True
Left Join
What if the left dataframe has more rows and we want to retain all of them ? Then we use a left join

df_1.merge(df_3,how="left")
rank country continent population change nato
0 1 China Asia 1403500365 0.004 False
1 2 India Asia 1324171354 0.011 False
2 3 United States Americas 322179605 0.007 True
3 4 Indonesia Asia 261115456 0.011 NaN
4 5 Brazil Americas 207652865 0.008 NaN
Right Join
Conversely, if the right dataframe has more rows and you want to preserve all of them, use a right join.

df_1 = df.iloc[0:3,:]
df_1
rank country continent population change
0 1 China Asia 1403500365 0.004
1 2 India Asia 1324171354 0.011
2 3 United States Americas 322179605 0.007
df_3 = pd.DataFrame()
df_3["country"] = [ "China","India","United States","Indonesia","Brazil"]
df_3["nato"] = [False,False,True,False,False]
df_3
country nato
0 China False
1 India False
2 United States True
3 Indonesia False
4 Brazil False
df_1.merge(df_3,how="right")
rank country continent population change nato
0 1.0 China Asia 1.403500e+09 0.004 False
1 2.0 India Asia 1.324171e+09 0.011 False
2 3.0 United States Americas 3.221796e+08 0.007 True
3 NaN Indonesia NaN NaN NaN False
4 NaN Brazil NaN NaN NaN False
Outer Join
There is another type of join called outer join. Let’s try this on the same datasets as above.
df_1.merge(df_3,how="outer")
rank country continent population change nato
0 1.0 China Asia 1.403500e+09 0.004 False
1 2.0 India Asia 1.324171e+09 0.011 False
2 3.0 United States Americas 3.221796e+08 0.007 True
3 NaN Indonesia NaN NaN NaN False
4 NaN Brazil NaN NaN NaN False
Surprisingly, you see the same results, right ? That is because every country in df_1 also appears in df_3 – so the union of keys ( outer join ) coincides with the right side’s keys here. Let’s take a better example to illustrate this.

df_1 = df.iloc[[0,1,2,5,6],:]
df_1
rank country continent population change
0 1 China Asia 1403500365 0.004
1 2 India Asia 1324171354 0.011
2 3 United States Americas 322179605 0.007
5 6 Pakistan Asia 193203476 0.02
6 7 Nigeria Africa 185989640 0.026
df_3 = df_3.iloc[[0,2,3],:]
df_3
country nato
0 China False
2 United States True
3 Indonesia False
df_1.merge(df_3,how="outer")
rank country continent population change nato
0 1.0 China Asia 1.403500e+09 0.004 False
1 2.0 India Asia 1.324171e+09 0.011 NaN
2 3.0 United States Americas 3.221796e+08 0.007 True
3 6.0 Pakistan Asia 1.932035e+08 0.02 NaN
4 7.0 Nigeria Africa 1.859896e+08 0.026 NaN
5 NaN Indonesia NaN NaN NaN False
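When debugging merges, the indicator parameter of merge ( ) is handy: it adds a _merge column recording whether each row came from the left frame, the right frame, or both. A small sketch on invented data:

```python
import pandas as pd

df_1 = pd.DataFrame({"country": ["China", "India", "Pakistan"]})
df_3 = pd.DataFrame({"country": ["China", "Indonesia"], "nato": [False, False]})

# indicator=True adds a _merge column: 'left_only', 'right_only', or 'both'
result = df_1.merge(df_3, how="outer", indicator=True)
print(result[["country", "_merge"]])
```

Here China appears in both frames, India and Pakistan only on the left, and Indonesia only on the right.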