How to “merge” data frames in R
Sometimes, all of your data might not come from a single source. You might have to merge the data after you load it into R.
Scenario 1 – When the data from all the sources have the same structure.
For example, let’s look at the iris data set in R.
> head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
.. .. .. .. .. ..
The column species contains 3 species of flowers.
> unique(iris$Species)
[1] setosa versicolor virginica
Levels: setosa versicolor virginica
Say, the data for each of these species come from a separate source. We could very well combine them outside of R ( say in excel ) or bring them into R and merge the data here in R. In order to merge 2 data frames in R, use the merge or rbind functions. rbind() is the simpler version whereas the merge() function can do a whole lot more.
Just for convenience, let’s pick the first 5 rows for each species
# Pick up all the rows where the variable/column Species = "setosa".
> setosa = iris[which(iris$Species == "setosa"),]
# Pick up just the first 3 rows
> setosa = setosa[1:3,]
> setosa
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
Do the same thing for other species as well.
> versicolor = iris[which(iris$Species == "versicolor"),]
> versicolor = versicolor[1:3,]
> virginica = iris[which(iris$Species == "virginica"),]
> virginica = virginica[1:3,]
Now, we got 3 data sets, each with 5 rows and we want to merge them into 1 set.
> iris_new = rbind(setosa,virginica,versicolor)
> iris_new
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
101 6.3 3.3 6.0 2.5 virginica
102 5.8 2.7 5.1 1.9 virginica
103 7.1 3.0 5.9 2.1 virginica
104 6.3 2.9 5.6 1.8 virginica
51 7.0 3.2 4.7 1.4 versicolor
52 6.4 3.2 4.5 1.5 versicolor
53 6.9 3.1 4.9 1.5 versicolor
54 5.5 2.3 4.0 1.3 versicolor
That was simple, right ? The rbind() function works well when the structure of the data sets is exactly the same – Same set of columns.
Scenario 2 – When some of the variables are missing.
Say we get this data from 3 different source ( one for each species each month ) and this month, we did not get the “Petal” data for the setosa species. We still want to do the analysis. How do we merge the data then ?
# Let's drop the "Petal" data and include only the "Sepal" data.
> setosa = setosa[,c("Sepal.Length","Sepal.Width","Species")]
> setosa
Sepal.Length Sepal.Width Species
1 5.1 3.5 setosa
2 4.9 3.0 setosa
3 4.7 3.2 setosa
4 4.6 3.1 setosa
5 5.0 3.6 setosa
> iris_new = rbind(versicolor,setosa)
Error in rbind(deparse.level, ...) :
numbers of columns of arguments do not match
As you can see the rbind fails in this case. A simpler way out in this case is to set dummy columns for petals and set them to NA.
# Let's drop the "Petal" columns and say the "Sepal" data is the only data we got this month
> setosa = setosa[,c("Sepal.Length","Sepal.Width","Species")]
> setosa
Sepal.Length Sepal.Width Species
1 5.1 3.5 setosa
2 4.9 3.0 setosa
3 4.7 3.2 setosa
4 4.6 3.1 setosa
5 5.0 3.6 setosa
Now that we know rbind() does not work in this case, let’s add 2 dummy columns for “Petal” and make the values as NA.
> setosa$Petal.Length = NA
> setosa$Petal.Width = NA
> setosa
Sepal.Length Sepal.Width Species Petal.Length Petal.Width
1 5.1 3.5 setosa NA NA
2 4.9 3.0 setosa NA NA
3 4.7 3.2 setosa NA NA
4 4.6 3.1 setosa NA NA
5 5.0 3.6 setosa NA NA
Now that we have got all the data into the same format ( columns ), we can do an rbind just like we did before.
> iris_new = rbind(setosa,versicolor,virginica)
> iris_new
Sepal.Length Sepal.Width Species Petal.Length Petal.Width
1 5.1 3.5 setosa NA NA
2 4.9 3.0 setosa NA NA
3 4.7 3.2 setosa NA NA
4 4.6 3.1 setosa NA NA
5 5.0 3.6 setosa NA NA
51 7.0 3.2 versicolor 4.7 1.4
52 6.4 3.2 versicolor 4.5 1.5
53 6.9 3.1 versicolor 4.9 1.5
54 5.5 2.3 versicolor 4.0 1.3
101 6.3 3.3 virginica 6.0 2.5
102 5.8 2.7 virginica 5.1 1.9
103 7.1 3.0 virginica 5.9 2.1
104 6.3 2.9 virginica 5.6 1.8
Scenario 3 – When we get some variables from one source and other variables from a different source.
Say, you are an analyst at amazon and you are asked to analyze their order data. So, you approach their systems engineer to give you their order data. Typically order data is stored in many tables – that is how relational data is designed ( normalization ). So, the engineer gives you the order header data ( from one table ) into 1 file and order details data ( from another table ) into another file – makes his life easy. However, we have to analyze the data from both these sources together in order to make any meaningful conclusions.
> orders
order_no order_status
1 13001 Delivered
2 13002 Pending
3 13003 Delivered
4 13004 Delivered
5 13005 Pending
6 13006 Pending
> order_detail
order_no item details price
1 13001 1 Samsung S9 600
2 13001 2 iPhone 7 650
3 13002 1 Motorola X6 350
4 13002 2 iPad Mini 450
5 13002 3 Samsung S8 550
6 13003 1 Samsung S9 600
7 13004 1 iPhone 7 650
8 13004 2 Motorola X6 350
9 13004 3 iPad Mini 450
10 13004 4 Samsung S8 550
11 13005 1 Samsung S9 600
12 13005 2 iPhone 7 650
13 13006 1 Motorola X6 350
14 13006 2 iPad Mini 450
15 13006 3 Samsung S8 550
When you merge these 2 data sets together, you have to find a common variable/s between these 2 data sets. Luckily, we have the order number field in this case ( field order_no ). In this case, all we have to do is use the merge function as-is and you get the desired results
> order = merge(order_detail,orders)
> order
order_no item details price order_status
1 13001 1 Samsung S9 600 Delivered
2 13001 2 iPhone 7 650 Delivered
3 13002 1 Motorola X6 350 Pending
4 13002 2 iPad Mini 450 Pending
5 13002 3 Samsung S8 550 Pending
6 13003 1 Samsung S9 600 Delivered
7 13004 1 iPhone 7 650 Delivered
8 13004 2 Motorola X6 350 Delivered
9 13004 3 iPad Mini 450 Delivered
10 13004 4 Samsung S8 550 Delivered
11 13005 1 Samsung S9 600 Pending
12 13005 2 iPhone 7 650 Pending
13 13006 1 Motorola X6 350 Pending
14 13006 2 iPad Mini 450 Pending
15 13006 3 Samsung S8 550 Pending
Based on the common variable ( order_no ) R has made an implicit decision to combine the data based on order_no variable.
What if the name of the column was different ? just the name is different, but the data is still the same – In this case, how do we tell R that the merge has to happen on that column ( although the names across the data sets are different ).
# Change the column name of order_detail data frame
> colnames(order_detail)[1] = "order_number"
> order_detail
order_number item details price
1 13001 1 Samsung S9 600
2 13001 2 iPhone 7 650
3 13002 1 Motorola X6 350
.. .. .. .. ..
Let us see if R can make some sense out of the data
> order = merge(order_detail,orders)
> order
order_number item details price order_no order_status
1 13001 1 Samsung S9 600 13001 Delivered
2 13001 2 iPhone 7 650 13001 Delivered
3 13002 1 Motorola X6 350 13001 Delivered
4 13002 2 iPad Mini 450 13001 Delivered
5 13002 3 Samsung S8 550 13001 Delivered
6 13003 1 Samsung S9 600 13001 Delivered
7 13004 1 iPhone 7 650 13001 Delivered
8 13004 2 Motorola X6 350 13001 Delivered
9 13004 3 iPad Mini 450 13001 Delivered
10 13004 4 Samsung S8 550 13001 Delivered
11 13005 1 Samsung S9 600 13001 Delivered
12 13005 2 iPhone 7 650 13001 Delivered
13 13006 1 Motorola X6 350 13001 Delivered
14 13006 2 iPad Mini 450 13001 Delivered
15 13006 3 Samsung S8 550 13001 Delivered
16 13001 1 Samsung S9 600 13002 Pending
17 13001 2 iPhone 7 650 13002 Pending
18 13002 1 Motorola X6 350 13002 Pending
19 13002 2 iPad Mini 450 13002 Pending
20 13002 3 Samsung S8 550 13002 Pending
21 13003 1 Samsung S9 600 13002 Pending
22 13004 1 iPhone 7 650 13002 Pending
23 13004 2 Motorola X6 350 13002 Pending
24 13004 3 iPad Mini 450 13002 Pending
25 13004 4 Samsung S8 550 13002 Pending
26 13005 1 Samsung S9 600 13002 Pending
27 13005 2 iPhone 7 650 13002 Pending
28 13006 1 Motorola X6 350 13002 Pending
29 13006 2 iPad Mini 450 13002 Pending
30 13006 3 Samsung S8 550 13002 Pending
31 13001 1 Samsung S9 600 13003 Delivered
32 13001 2 iPhone 7 650 13003 Delivered
33 13002 1 Motorola X6 350 13003 Delivered
34 13002 2 iPad Mini 450 13003 Delivered
35 13002 3 Samsung S8 550 13003 Delivered
36 13003 1 Samsung S9 600 13003 Delivered
37 13004 1 iPhone 7 650 13003 Delivered
38 13004 2 Motorola X6 350 13003 Delivered
39 13004 3 iPad Mini 450 13003 Delivered
40 13004 4 Samsung S8 550 13003 Delivered
41 13005 1 Samsung S9 600 13003 Delivered
42 13005 2 iPhone 7 650 13003 Delivered
43 13006 1 Motorola X6 350 13003 Delivered
44 13006 2 iPad Mini 450 13003 Delivered
45 13006 3 Samsung S8 550 13003 Delivered
46 13001 1 Samsung S9 600 13004 Delivered
47 13001 2 iPhone 7 650 13004 Delivered
48 13002 1 Motorola X6 350 13004 Delivered
49 13002 2 iPad Mini 450 13004 Delivered
50 13002 3 Samsung S8 550 13004 Delivered
51 13003 1 Samsung S9 600 13004 Delivered
52 13004 1 iPhone 7 650 13004 Delivered
53 13004 2 Motorola X6 350 13004 Delivered
54 13004 3 iPad Mini 450 13004 Delivered
55 13004 4 Samsung S8 550 13004 Delivered
56 13005 1 Samsung S9 600 13004 Delivered
57 13005 2 iPhone 7 650 13004 Delivered
58 13006 1 Motorola X6 350 13004 Delivered
59 13006 2 iPad Mini 450 13004 Delivered
60 13006 3 Samsung S8 550 13004 Delivered
61 13001 1 Samsung S9 600 13005 Pending
62 13001 2 iPhone 7 650 13005 Pending
63 13002 1 Motorola X6 350 13005 Pending
64 13002 2 iPad Mini 450 13005 Pending
65 13002 3 Samsung S8 550 13005 Pending
66 13003 1 Samsung S9 600 13005 Pending
67 13004 1 iPhone 7 650 13005 Pending
68 13004 2 Motorola X6 350 13005 Pending
69 13004 3 iPad Mini 450 13005 Pending
70 13004 4 Samsung S8 550 13005 Pending
71 13005 1 Samsung S9 600 13005 Pending
72 13005 2 iPhone 7 650 13005 Pending
73 13006 1 Motorola X6 350 13005 Pending
74 13006 2 iPad Mini 450 13005 Pending
75 13006 3 Samsung S8 550 13005 Pending
76 13001 1 Samsung S9 600 13006 Pending
77 13001 2 iPhone 7 650 13006 Pending
78 13002 1 Motorola X6 350 13006 Pending
79 13002 2 iPad Mini 450 13006 Pending
80 13002 3 Samsung S8 550 13006 Pending
81 13003 1 Samsung S9 600 13006 Pending
82 13004 1 iPhone 7 650 13006 Pending
83 13004 2 Motorola X6 350 13006 Pending
84 13004 3 iPad Mini 450 13006 Pending
85 13004 4 Samsung S8 550 13006 Pending
86 13005 1 Samsung S9 600 13006 Pending
87 13005 2 iPhone 7 650 13006 Pending
88 13006 1 Motorola X6 350 13006 Pending
89 13006 2 iPad Mini 450 13006 Pending
90 13006 3 Samsung S8 550 13006 Pending
Does this make any sense ? The data has exploded. Why did R do this ? Since R does not know how to combine the data, it has done what is called as a “Cartesian Join” – For every row in table 1 it has combined all the data in the second row. In our case, this is basically useless.
Let’s tell R which column to merge by this time.
# x represents the first data frame and y the second
> order = merge(order_detail,orders,by.x = "order_number",by.y = "order_no")
> order
order_number item details price order_status
1 13001 1 Samsung S9 600 Delivered
2 13001 2 iPhone 7 650 Delivered
3 13002 1 Motorola X6 350 Pending
4 13002 2 iPad Mini 450 Pending
5 13002 3 Samsung S8 550 Pending
6 13003 1 Samsung S9 600 Delivered
7 13004 1 iPhone 7 650 Delivered
8 13004 2 Motorola X6 350 Delivered
9 13004 3 iPad Mini 450 Delivered
10 13004 4 Samsung S8 550 Delivered
11 13005 1 Samsung S9 600 Pending
12 13005 2 iPhone 7 650 Pending
13 13006 1 Motorola X6 350 Pending
14 13006 2 iPad Mini 450 Pending
15 13006 3 Samsung S8 550 Pending
Much better, isn’t it ? In fact, merging data need not just be based on just 1 column ( in this case order number ). It could be based on multiple columns – as long as the column names are properly specified and the data is correct.
For example, if there is another table (data frame ) that contains “refurbished” data at the line item level, like below.
> item_refurbished
order_no item refurbished
1 13001 1 yes
2 13001 2 no
3 13002 1 yes
4 13002 2 yes
5 13002 3 no
6 13003 1 yes
7 13004 1 yes
8 13004 2 yes
9 13004 3 no
10 13004 4 no
11 13005 1 no
12 13005 2 no
13 13006 1 no
14 13006 2 yes
15 13006 3 yes
You can now combine the order_detail and item_refurbished data based on 2 columns –
- order_no
- item
> order_detail_all = merge(order_detail,item_refurbished)
> order_detail_all
order_no item details price refurbished
1 13001 1 Samsung S9 600 yes
2 13001 2 iPhone 7 650 no
3 13002 1 Motorola X6 350 yes
4 13002 2 iPad Mini 450 yes
5 13002 3 Samsung S8 550 no
6 13003 1 Samsung S9 600 yes
7 13004 1 iPhone 7 650 yes
8 13004 2 Motorola X6 350 yes
9 13004 3 iPad Mini 450 no
10 13004 4 Samsung S8 550 no
11 13005 1 Samsung S9 600 no
12 13005 2 iPhone 7 650 no
13 13006 1 Motorola X6 350 no
14 13006 2 iPad Mini 450 yes
15 13006 3 Samsung S8 550 yes
Based on the column names R was able to implicitly merge the data sets.
Scenario 4 – One data frame is bigger ( more rows ) than the other.
For example, some of the orders don’t have items – what happens in that case ?
# Let's just add another order to the orders table, but that row would not exist in the details table
> orders[7,] = c(13007,"Delivered")
> orders
order_no order_status
1 13001 Delivered
2 13002 Pending
3 13003 Delivered
4 13004 Delivered
5 13005 Pending
6 13006 Pending
7 13007 Delivered
# When you do a default merge, the unmatched order ( 13007 ) will not come up.
> orders_all = merge(orders,order_detail)
> orders_all
order_no order_status item details price
1 13001 Delivered 1 Samsung S9 600
2 13001 Delivered 2 iPhone 7 650
3 13002 Pending 1 Motorola X6 350
4 13002 Pending 2 iPad Mini 450
5 13002 Pending 3 Samsung S8 550
6 13003 Delivered 1 Samsung S9 600
7 13004 Delivered 1 iPhone 7 650
8 13004 Delivered 2 Motorola X6 350
9 13004 Delivered 3 iPad Mini 450
10 13004 Delivered 4 Samsung S8 550
11 13005 Pending 1 Samsung S9 600
12 13005 Pending 2 iPhone 7 650
13 13006 Pending 1 Motorola X6 350
14 13006 Pending 2 iPad Mini 450
15 13006 Pending 3 Samsung S8 550
If we want to include the new order, then the item details would not have any values against them. And the way to do it is via the all.x or all.y parameter.
# all.x or all.y is set to FALSE by default. That's why only "common" rows are merged.
> orders_all = merge(orders,order_detail,all.x = TRUE)
> orders_all
order_no order_status item details price
1 13001 Delivered 1 Samsung S9 600
2 13001 Delivered 2 iPhone 7 650
3 13002 Pending 1 Motorola X6 350
4 13002 Pending 2 iPad Mini 450
5 13002 Pending 3 Samsung S8 550
6 13003 Delivered 1 Samsung S9 600
7 13004 Delivered 1 iPhone 7 650
8 13004 Delivered 2 Motorola X6 350
9 13004 Delivered 3 iPad Mini 450
10 13004 Delivered 4 Samsung S8 550
11 13005 Pending 1 Samsung S9 600
12 13005 Pending 2 iPhone 7 650
13 13006 Pending 1 Motorola X6 350
14 13006 Pending 2 iPad Mini 450
15 13006 Pending 3 Samsung S8 550
16 13007 Delivered NA <NA> NA
Scenario 5 – Merge by row names
This is a special case where data set A has some rows and data set B has another set of rows and we want to combine these ( more or less like a cbind ).The merging however is not based on common variables, but based on common row names.
For example, look at the inbuilt mtcars dataset.
> mtcars
mpg cyl disp hp drat wt qsec vs am gear carb
Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
.. .. .. .. .. .. .. ..
Say, 2 teams are testing each of these cars and team A gives the first 5 variables and team B gives the remaining 6 variables. There is no common columns though. The only common parameter is the row names in the data frames.
# Let's split the mtcars into team A and team B variables
> team_a = mtcars[,1:5]
> team_a
mpg cyl disp hp drat
Mazda RX4 21.0 6 160.0 110 3.90
Mazda RX4 Wag 21.0 6 160.0 110 3.90
Datsun 710 22.8 4 108.0 93 3.85
Hornet 4 Drive 21.4 6 258.0 110 3.08
> team_b = mtcars[,6:11]
> team_b
wt qsec vs am gear carb
Mazda RX4 2.620 16.46 0 1 4 4
Mazda RX4 Wag 2.875 17.02 0 1 4 4
Datsun 710 2.320 18.61 1 1 4 1
Hornet 4 Drive 3.215 19.44 1 0 3 1
If we have to merge these two datasets, it should be by row names ( since there are no common columns )
> mtcars_all = merge(team_a, team_b, by = "row.names")
> mtcars_all
Row.names mpg cyl disp hp drat wt qsec vs am gear carb
1 AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
2 Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
3 Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
4 Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
5 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1