How to “merge” data frames in R

How to “merge” data frames in R


  R Interview Questions

Sometimes, all of your data might not come from a single source. You might have to merge the data after you load it into R.

Scenario 1 – When the data from all the sources have the same structure.

For example, let’s look at the iris data set in R.

> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
..         ..          ..           ..          ..   ..

The column species contains 3 species of flowers.

> unique(iris$Species)
[1] setosa     versicolor virginica 
Levels: setosa versicolor virginica

Say, the data for each of these species come from a separate source. We could very well combine them outside of R ( say in excel ) or bring them into R and merge the data here in R. In order to merge 2 data frames in R, use the merge or rbind functions. rbind() is the simpler version whereas the merge() function can do a whole lot more.

Just for convenience, let’s pick the first 5 rows for each species

# Pick up all the rows where the variable/column Species = "setosa".
> setosa = iris[which(iris$Species == "setosa"),]
# Pick up just the first 3 rows
> setosa = setosa[1:3,]  
> setosa
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa

Do the same thing for other species as well.

> versicolor = iris[which(iris$Species == "versicolor"),]
> versicolor = versicolor[1:3,]
> virginica = iris[which(iris$Species == "virginica"),]
> virginica = virginica[1:3,]

Now, we got 3 data sets, each with 5 rows and we want to merge them into 1 set. 

> iris_new = rbind(setosa,virginica,versicolor)
> iris_new
    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1            5.1         3.5          1.4         0.2     setosa
2            4.9         3.0          1.4         0.2     setosa
3            4.7         3.2          1.3         0.2     setosa
4            4.6         3.1          1.5         0.2     setosa
5            5.0         3.6          1.4         0.2     setosa
101          6.3         3.3          6.0         2.5  virginica
102          5.8         2.7          5.1         1.9  virginica
103          7.1         3.0          5.9         2.1  virginica
104          6.3         2.9          5.6         1.8  virginica
51           7.0         3.2          4.7         1.4 versicolor
52           6.4         3.2          4.5         1.5 versicolor
53           6.9         3.1          4.9         1.5 versicolor
54           5.5         2.3          4.0         1.3 versicolor

That was simple, right ? The rbind() function works well when the structure of the data sets is exactly the same – Same set of columns.

Scenario 2 – When some of the variables are missing.

Say we get this data from 3 different source ( one for each species each month ) and this month, we did not get the “Petal” data for the setosa species. We still want to do the analysis. How do we merge the data then ?

# Let's drop the "Petal" data and include only the "Sepal" data. 
> setosa = setosa[,c("Sepal.Length","Sepal.Width","Species")]
> setosa
  Sepal.Length Sepal.Width Species
1          5.1         3.5  setosa
2          4.9         3.0  setosa
3          4.7         3.2  setosa
4          4.6         3.1  setosa
5          5.0         3.6  setosa
> iris_new = rbind(versicolor,setosa)
Error in rbind(deparse.level, ...) : 
  numbers of columns of arguments do not match

As you can see the rbind fails in this case. A simpler way out in this case is to set dummy columns for petals and set them to NA.

# Let's drop the "Petal" columns and say the "Sepal" data is the only data we got this month
> setosa = setosa[,c("Sepal.Length","Sepal.Width","Species")]
> setosa
  Sepal.Length Sepal.Width Species
1          5.1         3.5  setosa
2          4.9         3.0  setosa
3          4.7         3.2  setosa
4          4.6         3.1  setosa
5          5.0         3.6  setosa

Now that we know rbind() does not work in this case, let’s add 2 dummy columns for “Petal” and make the values as NA.

> setosa$Petal.Length = NA
> setosa$Petal.Width = NA
> setosa
  Sepal.Length Sepal.Width Species Petal.Length Petal.Width
1          5.1         3.5  setosa           NA          NA
2          4.9         3.0  setosa           NA          NA
3          4.7         3.2  setosa           NA          NA
4          4.6         3.1  setosa           NA          NA
5          5.0         3.6  setosa           NA          NA

Now that we have got all the data into the same format ( columns ), we can do an rbind just like we did before.

> iris_new = rbind(setosa,versicolor,virginica)
> iris_new
    Sepal.Length Sepal.Width    Species Petal.Length Petal.Width
1            5.1         3.5     setosa           NA          NA
2            4.9         3.0     setosa           NA          NA
3            4.7         3.2     setosa           NA          NA
4            4.6         3.1     setosa           NA          NA
5            5.0         3.6     setosa           NA          NA
51           7.0         3.2 versicolor          4.7         1.4
52           6.4         3.2 versicolor          4.5         1.5
53           6.9         3.1 versicolor          4.9         1.5
54           5.5         2.3 versicolor          4.0         1.3
101          6.3         3.3  virginica          6.0         2.5
102          5.8         2.7  virginica          5.1         1.9
103          7.1         3.0  virginica          5.9         2.1
104          6.3         2.9  virginica          5.6         1.8

Scenario 3 – When we get some variables from one source and other variables from a different source.

Say, you are an analyst at amazon and you are asked to analyze their order data. So, you approach their systems engineer to give you their order data. Typically order data is stored in many tables – that is how relational data is designed ( normalization ). So, the engineer gives you the order header data ( from one table ) into 1 file and order details data ( from another table ) into another file – makes his life easy. However, we have to analyze the data from both these sources together in order to make any meaningful conclusions.

> orders
  order_no order_status
1    13001    Delivered
2    13002      Pending
3    13003    Delivered
4    13004    Delivered
5    13005      Pending
6    13006      Pending
> order_detail
   order_no item     details price
1     13001    1  Samsung S9   600
2     13001    2    iPhone 7   650
3     13002    1 Motorola X6   350
4     13002    2   iPad Mini   450
5     13002    3  Samsung S8   550
6     13003    1  Samsung S9   600
7     13004    1    iPhone 7   650
8     13004    2 Motorola X6   350
9     13004    3   iPad Mini   450
10    13004    4  Samsung S8   550
11    13005    1  Samsung S9   600
12    13005    2    iPhone 7   650
13    13006    1 Motorola X6   350
14    13006    2   iPad Mini   450
15    13006    3  Samsung S8   550

orders.csv

order_details.csv

item_refurbished.csv

When you merge these 2 data sets together, you have to find a common variable/s between these 2 data sets. Luckily, we have the order number field in this case ( field order_no ). In this case, all we have to do is use the merge function as-is and you get the desired results

> order = merge(order_detail,orders)
> order
   order_no item     details price order_status
1     13001    1  Samsung S9   600    Delivered
2     13001    2    iPhone 7   650    Delivered
3     13002    1 Motorola X6   350      Pending
4     13002    2   iPad Mini   450      Pending
5     13002    3  Samsung S8   550      Pending
6     13003    1  Samsung S9   600    Delivered
7     13004    1    iPhone 7   650    Delivered
8     13004    2 Motorola X6   350    Delivered
9     13004    3   iPad Mini   450    Delivered
10    13004    4  Samsung S8   550    Delivered
11    13005    1  Samsung S9   600      Pending
12    13005    2    iPhone 7   650      Pending
13    13006    1 Motorola X6   350      Pending
14    13006    2   iPad Mini   450      Pending
15    13006    3  Samsung S8   550      Pending

Based on the common variable ( order_no ) R has made an implicit decision to combine the data based on order_no variable.

What if the name of the column was different ? just the name is different, but the data is still the same – In this case, how do we tell R that the merge has to happen on that column ( although the names across the data sets are different ).

# Change the column name of order_detail data frame
> colnames(order_detail)[1] = "order_number"
> order_detail
   order_number item     details price
1         13001    1  Samsung S9   600
2         13001    2    iPhone 7   650
3         13002    1 Motorola X6   350
..        ..      ..          ..    ..

Let us see if R can make some sense out of the data

> order = merge(order_detail,orders)
> order
   order_number item     details price order_no order_status
1         13001    1  Samsung S9   600    13001    Delivered
2         13001    2    iPhone 7   650    13001    Delivered
3         13002    1 Motorola X6   350    13001    Delivered
4         13002    2   iPad Mini   450    13001    Delivered
5         13002    3  Samsung S8   550    13001    Delivered
6         13003    1  Samsung S9   600    13001    Delivered
7         13004    1    iPhone 7   650    13001    Delivered
8         13004    2 Motorola X6   350    13001    Delivered
9         13004    3   iPad Mini   450    13001    Delivered
10        13004    4  Samsung S8   550    13001    Delivered
11        13005    1  Samsung S9   600    13001    Delivered
12        13005    2    iPhone 7   650    13001    Delivered
13        13006    1 Motorola X6   350    13001    Delivered
14        13006    2   iPad Mini   450    13001    Delivered
15        13006    3  Samsung S8   550    13001    Delivered
16        13001    1  Samsung S9   600    13002      Pending
17        13001    2    iPhone 7   650    13002      Pending
18        13002    1 Motorola X6   350    13002      Pending
19        13002    2   iPad Mini   450    13002      Pending
20        13002    3  Samsung S8   550    13002      Pending
21        13003    1  Samsung S9   600    13002      Pending
22        13004    1    iPhone 7   650    13002      Pending
23        13004    2 Motorola X6   350    13002      Pending
24        13004    3   iPad Mini   450    13002      Pending
25        13004    4  Samsung S8   550    13002      Pending
26        13005    1  Samsung S9   600    13002      Pending
27        13005    2    iPhone 7   650    13002      Pending
28        13006    1 Motorola X6   350    13002      Pending
29        13006    2   iPad Mini   450    13002      Pending
30        13006    3  Samsung S8   550    13002      Pending
31        13001    1  Samsung S9   600    13003    Delivered
32        13001    2    iPhone 7   650    13003    Delivered
33        13002    1 Motorola X6   350    13003    Delivered
34        13002    2   iPad Mini   450    13003    Delivered
35        13002    3  Samsung S8   550    13003    Delivered
36        13003    1  Samsung S9   600    13003    Delivered
37        13004    1    iPhone 7   650    13003    Delivered
38        13004    2 Motorola X6   350    13003    Delivered
39        13004    3   iPad Mini   450    13003    Delivered
40        13004    4  Samsung S8   550    13003    Delivered
41        13005    1  Samsung S9   600    13003    Delivered
42        13005    2    iPhone 7   650    13003    Delivered
43        13006    1 Motorola X6   350    13003    Delivered
44        13006    2   iPad Mini   450    13003    Delivered
45        13006    3  Samsung S8   550    13003    Delivered
46        13001    1  Samsung S9   600    13004    Delivered
47        13001    2    iPhone 7   650    13004    Delivered
48        13002    1 Motorola X6   350    13004    Delivered
49        13002    2   iPad Mini   450    13004    Delivered
50        13002    3  Samsung S8   550    13004    Delivered
51        13003    1  Samsung S9   600    13004    Delivered
52        13004    1    iPhone 7   650    13004    Delivered
53        13004    2 Motorola X6   350    13004    Delivered
54        13004    3   iPad Mini   450    13004    Delivered
55        13004    4  Samsung S8   550    13004    Delivered
56        13005    1  Samsung S9   600    13004    Delivered
57        13005    2    iPhone 7   650    13004    Delivered
58        13006    1 Motorola X6   350    13004    Delivered
59        13006    2   iPad Mini   450    13004    Delivered
60        13006    3  Samsung S8   550    13004    Delivered
61        13001    1  Samsung S9   600    13005      Pending
62        13001    2    iPhone 7   650    13005      Pending
63        13002    1 Motorola X6   350    13005      Pending
64        13002    2   iPad Mini   450    13005      Pending
65        13002    3  Samsung S8   550    13005      Pending
66        13003    1  Samsung S9   600    13005      Pending
67        13004    1    iPhone 7   650    13005      Pending
68        13004    2 Motorola X6   350    13005      Pending
69        13004    3   iPad Mini   450    13005      Pending
70        13004    4  Samsung S8   550    13005      Pending
71        13005    1  Samsung S9   600    13005      Pending
72        13005    2    iPhone 7   650    13005      Pending
73        13006    1 Motorola X6   350    13005      Pending
74        13006    2   iPad Mini   450    13005      Pending
75        13006    3  Samsung S8   550    13005      Pending
76        13001    1  Samsung S9   600    13006      Pending
77        13001    2    iPhone 7   650    13006      Pending
78        13002    1 Motorola X6   350    13006      Pending
79        13002    2   iPad Mini   450    13006      Pending
80        13002    3  Samsung S8   550    13006      Pending
81        13003    1  Samsung S9   600    13006      Pending
82        13004    1    iPhone 7   650    13006      Pending
83        13004    2 Motorola X6   350    13006      Pending
84        13004    3   iPad Mini   450    13006      Pending
85        13004    4  Samsung S8   550    13006      Pending
86        13005    1  Samsung S9   600    13006      Pending
87        13005    2    iPhone 7   650    13006      Pending
88        13006    1 Motorola X6   350    13006      Pending
89        13006    2   iPad Mini   450    13006      Pending
90        13006    3  Samsung S8   550    13006      Pending

Does this make any sense ? The data has exploded. Why did R do this ? Since R does not know how to combine the data, it has done what is called as a “Cartesian Join” – For every row in table 1 it has combined all the data in the second row. In our case, this is basically useless.

Let’s tell R which column to merge by this time.

# x represents the first data frame and y the second
> order = merge(order_detail,orders,by.x = "order_number",by.y = "order_no")
> order
   order_number item     details price order_status
1         13001    1  Samsung S9   600    Delivered
2         13001    2    iPhone 7   650    Delivered
3         13002    1 Motorola X6   350      Pending
4         13002    2   iPad Mini   450      Pending
5         13002    3  Samsung S8   550      Pending
6         13003    1  Samsung S9   600    Delivered
7         13004    1    iPhone 7   650    Delivered
8         13004    2 Motorola X6   350    Delivered
9         13004    3   iPad Mini   450    Delivered
10        13004    4  Samsung S8   550    Delivered
11        13005    1  Samsung S9   600      Pending
12        13005    2    iPhone 7   650      Pending
13        13006    1 Motorola X6   350      Pending
14        13006    2   iPad Mini   450      Pending
15        13006    3  Samsung S8   550      Pending

Much better, isn’t it ? In fact, merging data need not just be based on just 1 column ( in this case order number ). It could be based on multiple columns – as long as the column names are properly specified and the data is correct.

For example, if there is another table (data frame ) that contains “refurbished” data at the line item level, like below.

> item_refurbished
   order_no item refurbished
1     13001    1         yes
2     13001    2          no
3     13002    1         yes
4     13002    2         yes
5     13002    3          no
6     13003    1         yes
7     13004    1         yes
8     13004    2         yes
9     13004    3          no
10    13004    4          no
11    13005    1          no
12    13005    2          no
13    13006    1          no
14    13006    2         yes
15    13006    3         yes

You can now combine the order_detail and item_refurbished data based on 2 columns –

  1. order_no
  2. item
> order_detail_all = merge(order_detail,item_refurbished)
> order_detail_all
   order_no item     details price refurbished
1     13001    1  Samsung S9   600         yes
2     13001    2    iPhone 7   650          no
3     13002    1 Motorola X6   350         yes
4     13002    2   iPad Mini   450         yes
5     13002    3  Samsung S8   550          no
6     13003    1  Samsung S9   600         yes
7     13004    1    iPhone 7   650         yes
8     13004    2 Motorola X6   350         yes
9     13004    3   iPad Mini   450          no
10    13004    4  Samsung S8   550          no
11    13005    1  Samsung S9   600          no
12    13005    2    iPhone 7   650          no
13    13006    1 Motorola X6   350          no
14    13006    2   iPad Mini   450         yes
15    13006    3  Samsung S8   550         yes

Based on the column names R was able to implicitly merge the data sets.

Scenario 4 – One data frame is bigger ( more rows ) than the other.

For example, some of the orders don’t have items – what happens in that case ?

# Let's just add another order to the orders table, but that row would not exist in the details table
> orders[7,] = c(13007,"Delivered")
> orders
  order_no order_status
1    13001    Delivered
2    13002      Pending
3    13003    Delivered
4    13004    Delivered
5    13005      Pending
6    13006      Pending
7    13007    Delivered
# When you do a default merge, the unmatched order ( 13007 ) will not come up. 
> orders_all = merge(orders,order_detail)
> orders_all
   order_no order_status item     details price
1     13001    Delivered    1  Samsung S9   600
2     13001    Delivered    2    iPhone 7   650
3     13002      Pending    1 Motorola X6   350
4     13002      Pending    2   iPad Mini   450
5     13002      Pending    3  Samsung S8   550
6     13003    Delivered    1  Samsung S9   600
7     13004    Delivered    1    iPhone 7   650
8     13004    Delivered    2 Motorola X6   350
9     13004    Delivered    3   iPad Mini   450
10    13004    Delivered    4  Samsung S8   550
11    13005      Pending    1  Samsung S9   600
12    13005      Pending    2    iPhone 7   650
13    13006      Pending    1 Motorola X6   350
14    13006      Pending    2   iPad Mini   450
15    13006      Pending    3  Samsung S8   550

If we want to include the new order, then the item details would not have any values against them. And the way to do it is via the all.x or all.y parameter.

# all.x or all.y is set to FALSE by default. That's why only "common" rows are merged.
> orders_all = merge(orders,order_detail,all.x = TRUE)
> orders_all
   order_no order_status item     details price
1     13001    Delivered    1  Samsung S9   600
2     13001    Delivered    2    iPhone 7   650
3     13002      Pending    1 Motorola X6   350
4     13002      Pending    2   iPad Mini   450
5     13002      Pending    3  Samsung S8   550
6     13003    Delivered    1  Samsung S9   600
7     13004    Delivered    1    iPhone 7   650
8     13004    Delivered    2 Motorola X6   350
9     13004    Delivered    3   iPad Mini   450
10    13004    Delivered    4  Samsung S8   550
11    13005      Pending    1  Samsung S9   600
12    13005      Pending    2    iPhone 7   650
13    13006      Pending    1 Motorola X6   350
14    13006      Pending    2   iPad Mini   450
15    13006      Pending    3  Samsung S8   550
16    13007    Delivered   NA        <NA&gt;    NA

Scenario 5 – Merge by row names

This is a special case where data set A has some rows and data set B has another set of rows and we want to combine these ( more or less like a cbind ).The merging however is not based on common variables, but based on common row names.

For example, look at the inbuilt mtcars dataset.

&gt; mtcars
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
..                  ..     ..      ..  ..   ..    ..     ..

Say, 2 teams are testing each of these cars and team A gives the first 5 variables and team B gives the remaining 6 variables. There is no common columns though. The only common parameter is the row names in the data frames.

# Let's split the mtcars into team A and team B variables
&gt; team_a = mtcars[,1:5]
&gt; team_a
                     mpg cyl  disp  hp drat
Mazda RX4           21.0   6 160.0 110 3.90
Mazda RX4 Wag       21.0   6 160.0 110 3.90
Datsun 710          22.8   4 108.0  93 3.85
Hornet 4 Drive      21.4   6 258.0 110 3.08
&gt; team_b = mtcars[,6:11]
&gt; team_b
                       wt  qsec vs am gear carb
Mazda RX4           2.620 16.46  0  1    4    4
Mazda RX4 Wag       2.875 17.02  0  1    4    4
Datsun 710          2.320 18.61  1  1    4    1
Hornet 4 Drive      3.215 19.44  1  0    3    1

If we have to merge these two datasets, it should be by row names ( since there are no common columns )

&gt; mtcars_all = merge(team_a, team_b, by = "row.names")
&gt; mtcars_all
             Row.names  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
1          AMC Javelin 15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
2   Cadillac Fleetwood 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
3           Camaro Z28 13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
4    Chrysler Imperial 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
5           Datsun 710 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: