Pandas

Pandas


  Machine Learning in Python

While numpy deals only with homogeneous data types ( all numbers or all floats ), Pandas is heterogenous in dealing with data. Think of Pandas as a library that can deal with manipulating heterogenous data grids ( pretty much like excel )

Contents

Just enough Pandas


Introduction

Most data is heterogenous and tabular in nature. For example, look at the following data which shows some stats from google play store. There is text, numbers, floats etc.

numpy is not suited to manipulate this kind of data. For that we need Pandas


What is Pandas

Pandas is pretty much like a data manipulation tool ( think data munging, wrangling, preparation etc ) on a grid of data ( text, numbers, floats etc ). For example, if you look at the data grid above, and say you want to

Filter

  • a particular category ( say only ART_AND_DESIGN )
  • or all rows with Rating > 4.1
  • or all rows where the category is ART_AND_DESIGN and rating > 4.1

Collapse or Group by

  • and find how many rows are there in a particular category
  • or find how many rows are there with Rating > 4.1
  • or a combination of both

Read

  • data from different formats ( excel, csv, SQL databases etc )

Handle

  • missing data ( like NAs, blanks etc )
  • erroneous data ( data that does not comply with the data type of the column ) etc

Manipulate

  • combine data from different sources into one
  • or split data into a set of rows or columns or both
  • or extract a sub-set of data into another data frame ( say create a new data set only for category ART_AND_DESIGN)
  • or slice the dataset based on a variety of parameters
  • or insert/delete columns or rows from/to to the dataset
    • Like add a new app category or delete the rating column

Think of Pandas as Excel on Steroids.


Why learn Pandas

In the context of Machine Learning and Python, Pandas is the gold standard in in-memory data management ( read or manipulate ). Written in C or Cython, Pandas is as fast as any C library in manipulating data. It is not uncommon for Pandas to comfortably handle large data sets ( around 5 to 10 GB ) without a hitch.


Our approach to Pandas

We will be using Pandas quite extensively in this Machine Learning course. However, we will cover most of the essential aspects of Pandas in this chapter and leave the more complicated options to later chapters where we would be encountering situations that would lead us to explore them. For now, we will keep it pretty simple and to small test datasets.


Getting Started

Installing Pandas

  • pip
pip install pandas

If you are using Anaconda distribution, pandas is installed by default – you just have to enable it ( if necessary ). If you are just using the conda package manager for Python,

  • conda
 conda install pandas

You can verify if Pandas is already installed on your python installation using pip list.


Dataframes

Data Frame is the main data structure in pandas. Think of a data frame as an excel grid. It is quite simply just rows and columns.

You can create, add, delete, filter data very easily in pandas. For starters, let’s see how easy it is to create a data set.


Create Dataframe

From List or Dictionary

For example, if you wanted to create a simple grid of data like this,

just create the columns ( names, population_m ) as 2 separate lists, combine them into a dictionary and pass it as an argument to the DataFrame() function.

import pandas as pd

names        = ["India","United States","Canada"]
population_m = [1500,300,36]

d = {"names" :names , "population" : population_m}

df = pd.DataFrame(d)
df
        names           population
0   	India           1500
1   	United States   300
2   	Canada          36

From an Empty Dataframe

You can create an empty data frame and start adding columns one by one.

df = pd.DataFrame()
df["names"]      = names
df["population"] = population_m
print ( df )

        names           population
0   	India           1500
1   	United States   300
2   	Canada          36

From Files

Here is a simple file with just 3 entries. It can directly be read into pandas using the read_csv ( ) function.

df = pd.read_csv("../data/simple_dataframe.csv")
df

        names           population
0   	India           1500
1   	United States   300
2   	Canada          36

Display Dataframe

Once you read a dataframe, typically, you would want to examine it. We typically want to just see the first few rows or the last few rows. For that, you use the head ( ) or tail ( ) functions. For a change, instead of reading a CSV file, let’s read an excel file. In case this does not work, please install the python module xlrd using

<pre>
pip install xlrd
</pre>

Say, we have read an excel file like this ( contains a list of all countries and their population ). Since the list is big, we just want to display the first few entries.

df = pd.read_excel("../data/countries_population.xlsx")
df.head() # Shows the first few rows by default.
   	Rank  	country         continent   population  change
0   1   	China           Asia        1403500365  0.004
1   2   	India           Asia        1324171354  0.011
2   3   	United States   Americas    322179605   0.007
3   4   	Indonesia   	Asia        261115456   0.011
4   5   	Brazil          Americas    207652865   0.008 
df.tail() # Shows the last few rows by default.
 	rank	country	             continent   population	Change
228	229	Saint Helena,         Africa     4035	        0.003
229	230	Falkland Islands      Americas   2910	        0
230	231	Niue	              Oceania	 1624	       −0.4%
231	232	Tokelau	              Oceania	 1282	        0.014
232	233   	Vatican City	      Europe	 801	       −1.1%

df.head(10)  # You can very well ask for a specific number of rows to be displayed.
    rank    country         continent  population   change
0   1   	China           Asia       1403500365   0.004
1   2   	India           Asia       1324171354   0.011
2   3   	United States   Americas   322179605    0.007
3   4   	Indonesia   	Asia       261115456    0.011
4   5   	Brazil          Americas   207652865    0.008
5   6   	Pakistan    	Asia       193203476    0.02
6   7   	Nigeria         Africa     185989640    0.026
7   8   	Bangladesh  	Asia       162951560    0.011
8   9  		Russia          Europe     146864513   −2.0%
9   10  	Mexico          Americas   127540423    0.013

But do you know how big the dataframe is that you have read from the excel file ?


Dataframe size

There are a couple of ways to find out how big the dataframe is. Like we discussed, a dataframe has rows and columns, right ?

The shape tuple tells us the the number of rows and columns are there in the dataframe. so, there are 233 rows and 5 columns in the population table.

df.shape
(233, 5)

If you wanted to find out the total number of data points in the data frame ( think of all the cells in the excel ), then you can just use the size tuple.

df.size
1165

Selecting Data from Dataframe

Selecting Data from Dataframes is also called indexing – because we use some form of indices. Let’s see that with an example.

import pandas as pd

df = pd.read_excel("../data/countries_population.xlsx")
#------------ select a subset of the data to start with ---------------

df_small = df.head()
df_small
 	rank	country	        continent  population	change
0	1	China	        Asia	   1403500365	0.004
1	2	India	        Asia	   1324171354	0.011
2	3	United States	Americas   322179605	0.007
3	4	Indonesia	Asia	   261115456	0.011
4	5	Brazil	        Americas   207652865	0.008

Now, what if you want to just select the second column – country ?

df_small["country"]
0            China
1            India
2    United States
3        Indonesia
4           Brazil
Name: country, dtype: object

Another way to do this is to use the loc ( ) function

df_small.loc[:,"country"]

0            China
1            India
2    United States
3        Indonesia
4           Brazil
Name: country, dtype: object

This deserves some explanation. Let’s dive deeper into loc ( ) and iloc ( ) functions


loc and iloc functions

The best way to extract data from a dataframe is via the loc or iloc functions. To understand how to use these functions look at the picture below – It shows the indexing of the rows and columns starting with 0


iloc – Integer Location

iloc ( or integer location ) is one way to extract data from a data frame. The syntax is show below.

How do you specify the rows or columns ? Using integers or slices. Let’s see some examples.

  • Get the first two row and all columns
df.iloc[[0,1],[0,1,2,3,4]]
      rank	country	        continent  population	change
0	1	China	        Asia	   1403500365	0.004
1	2	India	        Asia	   1324171354	0.011

Now, instead of specifying all the column numbers, you can very well use the slicing notation and just use the : operator. So, the following would also yield the same result.

df.iloc[[0,1], :]
	rank	country	continent population	Change
0	1	China	Asia	  1403500365	0.004
1	2	India	Asia	  1324171354	0.011

Along the same lines, if you wanted to select all the rows as well, you could use the : operator. It brings out the entire dataframe.

df.iloc[:,:]
 	rank	country			continent	population	Change
0	1	China			Asia		1403500365	0.004
1	2	India			Asia		1324171354	0.011
...	...	...	...	...	...
232	233	Vatican City		Europe		801		−1.1%	

233 rows × 5 columns

What if you wanted to select the 2nd, 3rd and 5th rows ?

df.iloc[[1,3,4],:]

rank	country		continent	population	Change
1	2	India		Asia		1324171354	0.011
3	4	Indonesia	Asia		261115456	0.011
4	5	Brazil		Americas	207652865	0.008

Say you wanted just these highlighted rows only, how would you do it ? The same old indexing using indices.

df.iloc[[1,3,4],[1,4]]
	country	    Change
1	India	    0.011
3	Indonesia   0.011
4	Brazil	    0.008
df.iloc[ 1:4 , 0:4 ]
	rank	country	  	continent	population
1	2	India		Asia		1324171354
2	3	United States	Americas	322179605
3	4	Indonesia	Asia		261115456

How about a chunk like this ?

df.iloc[1:4,0:4]
	rank	country		continent	population
1	2	India		Asia		1324171354
2	3	United States	Americas	322179605
3	4	Indonesia	Asia		261115456

Instead of using the entire list of indices in a list, we are using slices to specify the indices. This results in a compact syntax. However, remember that in Python, slices exclude the last element.

loc – Location

Another way to select data from a dataframe is using the labels ( row or column names – as opposed to numeric indices ). For example,

df.loc[ 1:4 , ["rank","country","population","change"] ]
	Rank	country		population	change
1	2	India		1324171354	0.011
2	3	United States	322179605	0.007
3	4	Indonesia	261115456	0.011
4	5	Brazil		207652865	0.008

Row names

Sometimes, dataframes have row names. For example the same dataframe could be looking like this.

df_country = df.set_index("country")
df_country.head()
		rank	continent	population	change
country				
China		1	Asia		1403500365	0.004
India		2	Asia		1324171354	0.011
United States	3	Americas	322179605	0.007
Indonesia	4	Asia		261115456	0.011
Brazil		5	Americas	207652865	0.008

Now, you can use these label based indices to select rows ( as opposed to numeric indices )

df_country.loc[ ["India","Indonesia"], ["rank","continent","change"]]
           rank	continent	change
country			
India	   2	Asia	        0.011
Indonesia  4	Asia    	0.011

You can even use slicing.

<img src=”./pics/loc-syntax-3.png

df_country.loc[ "India":"Brazil", ["rank","continent","change"]]
      		rank	continent	change
country			
India		2	Asia		0.011
United States	3	Americas	0.007
Indonesia	4	Asia		0.011
Brazil		5	Americas	0.008

Boolean Mask

If you have heard about the WHERE clause in SQL, you will be right at home with boolean mask in dataframes. This concept has been borrowed from other math/statistical languages like MATLAB and R. Let’s take an example.

Get all the rows where the “Continent” = “Asia”

df_1 = df.iloc[0:5,:]
df_1
  	rank	country		continent	population	change
0	1	China		Asia		1403500365	0.004
1	2	India		Asia		1324171354	0.011
2	3	United States	Americas	322179605	0.007
3	4	Indonesia	Asia		261115456	0.011
4	5	Brazil		Americas	207652865	0.008
df_1.loc[ df_1["continent"] == "Asia", :]

	rank	country		continent	population	change
0	1	China		Asia		1403500365	0.004
1	2	India		Asia		1324171354	0.011
3	4	Indonesia	Asia		261115456	0.011

How did this work ? The loc function works not just on column/row labels, but it also works with boolean values. The syntax

 df_1["continent"] == "Asia"

results in a True/False ( boolean ) vector like below.

df_1["continent"] == "Asia"
0     True
1     True
2    False
3     True
4    False
Name: continent, dtype: bool

And all True rows are returned and False rows are suppressed. This is equivalent to the SQL WHERE clause. An equivalent SQL statement would be

SELECT * from df where continent = "Asia"

You are not limited to a single condition. Using boolean operations, you can make this as complicated as you want.

continent = df_1["continent"] == "Asia" 
continent

0     True
1     True
2    False
3     True
4    False
Name: continent, dtype: bool
rank =  df_1["rank"] <= 3
rank
0     True
1     True
2     True
3    False
4    False
Name: rank, dtype: bool
condition = continent & rank
condition

0     True
1     True
2    False
3    False
4    False
dtype: bool
df_1.loc[condition, :]

rank	country	continent	population	change
0	1	China	Asia		1403500365	0.004
1	2	India	Asia		1324171354	0.011

You can put all of it together as below.

df_1.loc[ (df_1["rank"] <=3 ) | (df_1["continent"] == "Asia") , :]
rank	country	continent	population	change
0	1	China	Asia	1403500365	0.004
1	2	India	Asia	1324171354	0.011
2	3	United States	Americas	322179605	0.007
3	4	Indonesia	Asia	261115456	0.011

Rows

Add rows to Dataframe

df_small = df.head()
df_small
rank	country	continent	population	change
0	1	China	Asia	1403500365	0.004
1	2	India	Asia	1324171354	0.011
2	3	United States	Americas	322179605	0.007
3	4	Indonesia	Asia	261115456	0.011
4	5	Brazil	Americas	207652865	0.008
new_rows = df.iloc[5:6,:]
print ( new_rows )
      rank   country    continent  population change
5     6      Pakistan      Asia   193203476   0.02
df_small = df.head()
df_small
#--------- what is the next row ? -----------
new_rows = df.iloc[5:6,:]

#--------- Add rows from one dataframe to another
df_small.append(new_rows)
df_small

        rank	country       continent	 population	change
0	1	China	        Asia	 1403500365	0.004
1	2	India	        Asia	 1324171354	0.011
2	3	United States	America	 322179605	0.007
3	4	Indonesia	Asia	 261115456	0.011
4	5	Brazil	        Americas 207652865	0.008

Why didn’t it work ? That is because this operation doesn’t do it inplace. Instead, it returns a new dataframe. So, try this.In [27]:

df_new = df_small.append(new_rows)
df_new

	rank	country	  	continent	population	change
0	1	China		Asia		1403500365	0.004
1	2	India		Asia		1324171354	0.011
2	3	United States	Americas	322179605	0.007
3	4	Indonesia	Asia		261115456	0.011
4	5	Brazil		Americas	207652865	0.008
5	6	Pakistan	Asia		193203476	0.02

Delete rows from Dataframe

Use the drop ( ) function to delete rows from a dataframe. Here is how you drop a row.

df_new = df_new.drop(df.index[1])
df_new
   	rank	country	continent	population	change
0	1	China	Asia	1403500365	0.004
2	3	United States	Americas	322179605	0.007
3	4	Indonesia	Asia	261115456	0.011
4	5	Brazil	Americas	207652865	0.008
5	6	Pakistan	Asia	193203476	0.02

or drop multiple rows.

df_new = df_new.drop(df.index[[1,2,4]])
df_new
rank	country	continent	population	change
0	1	China	Asia	1403500365	0.004
3	4	Indonesia	Asia	261115456	0.011
5	6	Pakistan	Asia	193203476	0.02

or drop a slice of rows.

df_new = df_new.drop(df.index[1:4])
df_new
rank	country	continent	population	change
0	1	China	Asia	1403500365	0.004
4	5	Brazil	Americas	207652865	0.008
5	6	Pakistan	Asia	193203476	0.02

Columns

Add columns to Dataframe

We have already seen some examples of adding new columns to a dataframe. Just use the indexing syntax. For example, if you have a list that you wanted to add as a column, you can use the index name as follows.

df_1 = df_small.iloc[:,0:4]
df_1

rank	country	continent	population
0	1	China	Asia	1403500365
1	2	India	Asia	1324171354
2	3	United States	Americas	322179605
3	4	Indonesia	Asia	261115456
4	5	Brazil	Americas	207652865

or use the index numbering.

change = [0.004,0.011,0.007,0.011,0.008]
df_1["change"] = change
df_1

       	rank    	country  	       continent	population		4
0	1	China	        Asia	           	1403500365	0.004
1	2	India	        Asia	       	 1324171354	0.011
2	3	United States     Americas	322179605	0.007
3	4	Indonesia	        Asia	      	  261115456	0.011
4	5	Brazil	        Americas	207652865	0.008

If you wanted to insert the column at a specific index, use the insert ( ) function. Just specify the column name, the index at which it should be located and the actual data itself.

df_1 = df_small.iloc[:,0:4]
df_1

        rank	country	        continent	population
0	1	China	         Asia	        1403500365
1	2	India	         Asia	        1324171354
2	3	United States	 Americas	322179605
3	4	Indonesia	 Asia	        261115456
4	5	Brazil	        Americas	207652865
change = [0.004,0.011,0.007,0.011,0.008]
df_1.insert(3,"change",change)
df_1
rank	country	continent	change	population
0	1	China	Asia	0.004	1403500365
1	2	India	Asia	0.011	1324171354
2	3	United States	Americas	0.007	322179605
3	4	Indonesia	Asia	0.011	261115456
4	5	Brazil	Americas	0.008	207652865

Delete columns from Dataframe

Deleting rows from a dataframe is just as easy. For example, if you wanted to drop the “rank” column from the dataframe above, use the drop ( ) function. Just don’t forget to include the axis parameter. Axis = 1 means, along the columns

df_new = df_new.drop("rank",axis=1)
df_new

country	continent	population	change
0	China	Asia	1403500365	0.004
1	India	Asia	1324171354	0.011
2	United States	Americas	322179605	0.007
3	Indonesia	Asia	261115456	0.011
4	Brazil	Americas	207652865	0.008
5	Pakistan	Asia	193203476	0.02

To do the same using column index, rather than column names, use the df.columns ( ) function.

df_new = df_new.drop(df_new.columns[0],axis=1)

You could delete multiple columns as well.

df_new = df_new.drop(["rank","change","population"],axis=1)
# if you wanted to do it by column indices
# df_new = df_new.drop(df_new.columns[0,3,4],axis=1)
df_new

        country	continent
0	China	Asia
1	India	Asia
2	United States	Americas
3	Indonesia Asia
4	Brazil  Americas
5	Pakistan Asia

Grouping

In Pandas, there is an equivalent to the SQL GROUP BY sytax. Look at the example below. How do you compute the average population of each continent ? Pandas has an in-built groupby ( ) function that can do it for us.

df_small.groupby("continent").mean()
	rank	population
continent		
Americas	4.000000	2.649162e+08
Asia	2.333333	9.962624e+08
d = df_small.groupby("continent")
d

Without the aggregator function ( mean() in this case ), groupby ( ) returns a DataFrameGroupBy object. In itself, it is not a dataframe yet, until you apply the aggregator function. mean ( ) is just one example. You can use generic functions like.

  • size ( ) ## like count()
  • sum ( )
  • first ( )
  • last ( )

or statistical functions like

  • mean ( )
  • std ( ) ## Standard Deviation
  • var ( ) ## Variance
  • min ( )
  • max ( )

etc

df_small.groupby("continent").size()

continent
Americas    2
Asia        3
dtype: int64
df_small.groupby("continent").sum()
	rank	population
continent		
Americas	8	529832470
Asia	7	2988787175

You are not limited to just a single criteria for grouping or aggregation. For example,

df_1 = df.iloc[0:5,:]
df_1.loc[:,"nato"] = [False, False, True, False , False]
df_1.groupby(["continent","nato"]).mean()

		rank	population
continent	nato		
Americas	False	5.000000	2.076529e+08
True	3.000000	3.221796e+08
Asia	False	2.333333	9.962624e+08

Merge Dataframes

Merging dataframes is a bit involved. We will start with the simplest of cases and move towards more complicated ones.

Concatenate Dataframes

# Rows 1 to 2
df1 = df.iloc[0:2,0:3]
# Rows 3 to 4
df2 = df.iloc[2:4,0:3]
# Rows 5 to 6
df3 = df.iloc[4:6,0:3]

df_new = pd.concat([df1,df2,df3])
df_new
	rank	country	continent
0	1	China	Asia
1	2	India	Asia
2	3	United States	Americas
3	4	Indonesia	Asia
4	5	Brazil	Americas
5	6	Pakistan	Asia

What then is the difference between append ( ) and concat ( ) ? Think of concat as a modern version of append. By the way, append ( ) is not limited to just 2 dataframes. For example,

df_new = df1.append([df2,df3]) 
df_new 
rank	country	continent
0	1	China	Asia
1	2	India	Asia
2	3	United States	Americas
3	4	Indonesia	Asia
4	5	Brazil	Americas
5	6	Pakistan	Asia

The major difference between them is flexibility that concat provides. For example, what if you wanted to concatenate along the columns ? Like this

df1 = df.iloc[0:3,0:3]
df2 = df.iloc[0:3,3:5]
print ( df1 ) 
print ( df2 )
rank        country continent
0     1          China      Asia
1     2          India      Asia
2     3  United States  Americas
   population change
0  1403500365  0.004
1  1324171354  0.011
2   322179605  0.007
df_new = pd.concat([df1,df2],axis=1)
df_new
	rank	country	continent	population	change
0	1	China	Asia	1403500365	0.004
1	2	India	Asia	1324171354	0.011
2	3	United States	Americas	322179605	0.007
df_new["rank"] = None
df_new
rank	country	continent	population	change
0	None	China	Asia	1403500365	0.004
1	None	India	Asia	1324171354	0.011
2	None	United States	Americas	322179605	0.007

Merge

Merge is similar to database joins. Look at this example.

df_1 = df.iloc[0:5,:]
df_2 = pd.DataFrame()
df_2["country"] = [ "China","India","United States","Indonesia","Brazil"]
df_2["nato"] = [False,False,True,False,False]
df_1

rank	country	continent	population	change
0	1	China	Asia	1403500365	0.004
1	2	India	Asia	1324171354	0.011
2	3	United States	Americas	322179605	0.007
3	4	Indonesia	Asia	261115456	0.011
4	5	Brazil	Americas	207652865	0.008
df_2
country	nato
0	China	False
1	India	False
2	United States	True
3	Indonesia	False
4	Brazil	False
df_1.merge(df_2)
rank	country	continent	population	change	nato
0	1	China	Asia	1403500365	0.004	False
1	2	India	Asia	1324171354	0.011	False
2	3	United States	Americas	322179605	0.007	True
3	4	Indonesia	Asia	261115456	0.011	False
4	5	Brazil	Americas	207652865	0.008	False

Inner Join

By default the type of join is called inner join. Think of this like an intersection. Expanding on the example above,

df_3 = pd.DataFrame()
df_3["country"] = [ "China","India","United States"]
df_3["nato"] = [False,False,True]
df_3

country	nato
0	China	False
1	India	False
2	United States	True
df_1.merge(df_3)
rank	country	continent	population	change	nato
0	1	China	Asia	1403500365	0.004	False
1	2	India	Asia	1324171354	0.011	False
2	3	United States	Americas	322179605	0.007	True

Left Join

What if the left dataframe has more rows and we want to retain all of them ? Then we use a left join

df_1.merge(df_3,how="left")
rank	country	continent	population	change	nato
0	1	China	Asia	1403500365	0.004	False
1	2	India	Asia	1324171354	0.011	False
2	3	United States	Americas	322179605	0.007	True
3	4	Indonesia	Asia	261115456	0.011	NaN
4	5	Brazil	Americas	207652865	0.008	NaN

Right Join

On the contrary, if the right dataframe has more rows and if you want to preserve all the rows in the right dataframe, use right join

df_1 = df.iloc[0:3,:]
df_1
rank	country	continent	population	change
0	1	China	Asia	1403500365	0.004
1	2	India	Asia	1324171354	0.011
2	3	United States	Americas	322179605	0.007
df_3 = pd.DataFrame()
df_3["country"] = [ "China","India","United States","Indonesia","Brazil"]
df_3["nato"] = [False,False,True,False,False]
df_3

country	nato
0	China	False
1	India	False
2	United States	True
3	Indonesia	False
4	Brazil	False
df_1.merge(df_3,how="right")

rank	country	continent	population	change	nato
0	1.0	China	Asia	1.403500e+09	0.004	False
1	2.0	India	Asia	1.324171e+09	0.011	False
2	3.0	United States	Americas	3.221796e+08	0.007	True
3	NaN	Indonesia	NaN	NaN	NaN	False
4	NaN	Brazil	NaN	NaN	NaN	False

Outer Join

There is another type of join called outer join. Let’s try this on the same datasets as above.

df_1.merge(df_3,how="outer")
rank	country	continent	population	change	nato
0	1.0	China	Asia	1.403500e+09	0.004	False
1	2.0	India	Asia	1.324171e+09	0.011	False
2	3.0	United States	Americas	3.221796e+08	0.007	True
3	NaN	Indonesia	NaN	NaN	NaN	False
4	NaN	Brazil	NaN	NaN	NaN	False

Surprisingly, you see the same results, right ? Let’s take a better example to illustrate this.

df_1 = df.iloc[[0,1,2,5,6],:]
df_1

	rank	country	continent	population	change
0	1	China	Asia	1403500365	0.004
1	2	India	Asia	1324171354	0.011
2	3	United States	Americas	322179605	0.007
5	6	Pakistan	Asia	193203476	0.02
6	7	Nigeria	Africa	185989640	0.026
df_3 = df_3.iloc[[0,2,3],:]
df_3


country	nato
0	China	False
2	United States	True
3	Indonesia	False
df_1.merge(df_3,how="outer")

rank	country	continent	population	change	nato
0	1.0	China	Asia	1.403500e+09	0.004	False
1	2.0	India	Asia	1.324171e+09	0.011	NaN
2	3.0	United States	Americas	3.221796e+08	0.007	True
3	6.0	Pakistan	Asia	1.932035e+08	0.02	NaN
4	7.0	Nigeria	Africa	1.859896e+08	0.026	NaN
5	NaN	Indonesia	NaN	NaN	NaN	False

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: