How do you search/replace text in a data frame

  R Interview Questions

Finding rows of a data frame that contain a particular piece of text (a pattern, rather than an exact value) is not possible with the standard filters. For example, take the iris data set (abridged):

> iris_a = iris[c(1:3,51:53,101:103),]
> iris_a
    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1            5.1         3.5          1.4         0.2     setosa
2            4.9         3.0          1.4         0.2     setosa
3            4.7         3.2          1.3         0.2     setosa
51           7.0         3.2          4.7         1.4 versicolor
52           6.4         3.2          4.5         1.5 versicolor
53           6.9         3.1          4.9         1.5 versicolor
101          6.3         3.3          6.0         2.5  virginica
102          5.8         2.7          5.1         1.9  virginica
103          7.1         3.0          5.9         2.1  virginica

Say the “Species” column contains many different flower species. How do you find out if there is a species that starts with “vir”?

> iris_a[iris_a$Species == "vir*",]
[1] Sepal.Length Sepal.Width  Petal.Length Petal.Width  Species     
<0 rows> (or 0-length row.names)

Obviously, this does not work. That is because the standard filtering of a data frame is based on exact comparisons, not regular expressions. Try grep() instead.

# grep() searches a vector ( here, a single column ) for a pattern
# and returns the indices of the elements that match
> rows = grep("vir",iris_a$Species)
> rows
[1] 7 8 9
> iris_a[rows,]
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
101          6.3         3.3          6.0         2.5 virginica
102          5.8         2.7          5.1         1.9 virginica
103          7.1         3.0          5.9         2.1 virginica

Since the first parameter in the grep function is a regular expression, this works too.

# The pattern can match anywhere inside the string, not just at the beginning
> rows = grep("gin",iris_a$Species)
> iris_a[rows,]
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
101          6.3         3.3          6.0         2.5 virginica
102          5.8         2.7          5.1         1.9 virginica
103          7.1         3.0          5.9         2.1 virginica
# If you want to ignore case, use the option ignore.case = TRUE
# For example, this still picks up all the virginica rows
> rows = grep("gIn",iris_a$Species,ignore.case = TRUE)
> iris_a[rows,]
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
101          6.3         3.3          6.0         2.5 virginica
102          5.8         2.7          5.1         1.9 virginica
103          7.1         3.0          5.9         2.1 virginica

There are many ways in which this search can be expressed:

  • “^vir” – Find all the species that start with “vir” (the ^ anchors the match to the beginning of the string, not the middle).
  • “.irg” – Find all the species where any single character is followed by “irg” (the . matches any one character).

All of these variations can be captured through the use of regular expressions – which is a subject in itself.

grep() only works on vectors – that is why we supply just the Species column as the argument. grep() does not work on an entire data frame at once.

How to replace text in a data frame

Once you have found the data, you might want to replace some of the text. How do you do it? Use the R function gsub(). For example, let's look at a subset of the iris data.

> iris_small
    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1            5.1         3.5          1.4         0.2     setosa
2            4.9         3.0          1.4         0.2     setosa
3            4.7         3.2          1.3         0.2     setosa
51           7.0         3.2          4.7         1.4 versicolor
52           6.4         3.2          4.5         1.5 versicolor
53           6.9         3.1          4.9         1.5 versicolor
101          6.3         3.3          6.0         2.5  virginica
102          5.8         2.7          5.1         1.9  virginica
103          7.1         3.0          5.9         2.1  virginica

Say you want the species names that start with “v” to begin with a capital “V”. Try this.

> iris_small$Species = gsub("^v","V",iris_small$Species)
> iris_small
    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1            5.1         3.5          1.4         0.2     setosa
2            4.9         3.0          1.4         0.2     setosa
3            4.7         3.2          1.3         0.2     setosa
51           7.0         3.2          4.7         1.4 Versicolor
52           6.4         3.2          4.5         1.5 Versicolor
53           6.9         3.1          4.9         1.5 Versicolor
101          6.3         3.3          6.0         2.5  Virginica
102          5.8         2.7          5.1         1.9  Virginica
103          7.1         3.0          5.9         2.1  Virginica

You can use any regular expression in the first argument to gsub(). There is another function called sub() that replaces only the first match within each string, but it is used less often.

How to “merge” data frames in R

  R Interview Questions

Sometimes, all of your data might not come from a single source. You might have to merge the data after you load it into R.

Scenario 1 – When the data from all the sources have the same structure.

For example, let’s look at the iris data set in R.

> head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
..         ..          ..           ..          ..   ..

The column species contains 3 species of flowers.

> unique(iris$Species)
[1] setosa     versicolor virginica 
Levels: setosa versicolor virginica

Say the data for each of these species comes from a separate source. We could combine them outside of R (say, in Excel) or bring them into R and combine them there. To combine two data frames in R, use the rbind() or merge() functions. rbind() is the simpler one, whereas merge() can do a whole lot more.

Just for convenience, let's pick the first few rows of each species.

# Pick up all the rows where the variable/column Species = "setosa".
> setosa = iris[which(iris$Species == "setosa"),]
# Pick up just the first 3 rows
> setosa = setosa[1:3,]  
> setosa
   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa

Do the same thing for other species as well.

> versicolor = iris[which(iris$Species == "versicolor"),]
> versicolor = versicolor[1:3,]
> virginica = iris[which(iris$Species == "virginica"),]
> virginica = virginica[1:3,]

Now we have three data frames, one per species, and we want to combine them into one.

> iris_new = rbind(setosa,virginica,versicolor)
> iris_new
    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1            5.1         3.5          1.4         0.2     setosa
2            4.9         3.0          1.4         0.2     setosa
3            4.7         3.2          1.3         0.2     setosa
4            4.6         3.1          1.5         0.2     setosa
5            5.0         3.6          1.4         0.2     setosa
101          6.3         3.3          6.0         2.5  virginica
102          5.8         2.7          5.1         1.9  virginica
103          7.1         3.0          5.9         2.1  virginica
104          6.3         2.9          5.6         1.8  virginica
51           7.0         3.2          4.7         1.4 versicolor
52           6.4         3.2          4.5         1.5 versicolor
53           6.9         3.1          4.9         1.5 versicolor
54           5.5         2.3          4.0         1.3 versicolor

That was simple, right? The rbind() function works well when the structure of the data sets is exactly the same – the same set of columns.

Scenario 2 – When some of the variables are missing.

Say we get this data from 3 different sources (one per species, every month) and this month we did not get the “Petal” data for the setosa species. We still want to do the analysis. How do we combine the data then?

# Let's drop the "Petal" data and include only the "Sepal" data. 
> setosa = setosa[,c("Sepal.Length","Sepal.Width","Species")]
> setosa
  Sepal.Length Sepal.Width Species
1          5.1         3.5  setosa
2          4.9         3.0  setosa
3          4.7         3.2  setosa
4          4.6         3.1  setosa
5          5.0         3.6  setosa
> iris_new = rbind(versicolor,setosa)
Error in rbind(deparse.level, ...) : 
  numbers of columns of arguments do not match

As you can see, rbind() fails in this case because the columns do not match. A simple way out is to add dummy “Petal” columns to setosa and set them to NA.

# Let's drop the "Petal" columns and say the "Sepal" data is the only data we got this month
> setosa = setosa[,c("Sepal.Length","Sepal.Width","Species")]
> setosa
  Sepal.Length Sepal.Width Species
1          5.1         3.5  setosa
2          4.9         3.0  setosa
3          4.7         3.2  setosa
4          4.6         3.1  setosa
5          5.0         3.6  setosa

Let's add the 2 dummy “Petal” columns and set their values to NA.

> setosa$Petal.Length = NA
> setosa$Petal.Width = NA
> setosa
  Sepal.Length Sepal.Width Species Petal.Length Petal.Width
1          5.1         3.5  setosa           NA          NA
2          4.9         3.0  setosa           NA          NA
3          4.7         3.2  setosa           NA          NA
4          4.6         3.1  setosa           NA          NA
5          5.0         3.6  setosa           NA          NA

Now that all the data frames have the same columns, we can do an rbind() just like we did before.

> iris_new = rbind(setosa,versicolor,virginica)
> iris_new
    Sepal.Length Sepal.Width    Species Petal.Length Petal.Width
1            5.1         3.5     setosa           NA          NA
2            4.9         3.0     setosa           NA          NA
3            4.7         3.2     setosa           NA          NA
4            4.6         3.1     setosa           NA          NA
5            5.0         3.6     setosa           NA          NA
51           7.0         3.2 versicolor          4.7         1.4
52           6.4         3.2 versicolor          4.5         1.5
53           6.9         3.1 versicolor          4.9         1.5
54           5.5         2.3 versicolor          4.0         1.3
101          6.3         3.3  virginica          6.0         2.5
102          5.8         2.7  virginica          5.1         1.9
103          7.1         3.0  virginica          5.9         2.1
104          6.3         2.9  virginica          5.6         1.8

Scenario 3 – When we get some variables from one source and other variables from a different source.

Say you are an analyst at Amazon and you are asked to analyze order data, so you ask a systems engineer for it. Typically, order data is stored across many tables – that is how relational data is designed (normalization). So the engineer gives you the order header data (from one table) in one file and the order detail data (from another table) in a second file. However, we have to analyze the data from both sources together to draw any meaningful conclusions.

> orders
  order_no order_status
1    13001    Delivered
2    13002      Pending
3    13003    Delivered
4    13004    Delivered
5    13005      Pending
6    13006      Pending
> order_detail
   order_no item     details price
1     13001    1  Samsung S9   600
2     13001    2    iPhone 7   650
3     13002    1 Motorola X6   350
4     13002    2   iPad Mini   450
5     13002    3  Samsung S8   550
6     13003    1  Samsung S9   600
7     13004    1    iPhone 7   650
8     13004    2 Motorola X6   350
9     13004    3   iPad Mini   450
10    13004    4  Samsung S8   550
11    13005    1  Samsung S9   600
12    13005    2    iPhone 7   650
13    13006    1 Motorola X6   350
14    13006    2   iPad Mini   450
15    13006    3  Samsung S8   550

( Data files: orders.csv, order_details.csv, item_refurbished.csv )

To merge these 2 data sets, you have to find a common variable (or variables) between them. Luckily, both have an order number field (order_no). In this case, all we have to do is use the merge() function as-is and we get the desired result.

> order = merge(order_detail,orders)
> order
   order_no item     details price order_status
1     13001    1  Samsung S9   600    Delivered
2     13001    2    iPhone 7   650    Delivered
3     13002    1 Motorola X6   350      Pending
4     13002    2   iPad Mini   450      Pending
5     13002    3  Samsung S8   550      Pending
6     13003    1  Samsung S9   600    Delivered
7     13004    1    iPhone 7   650    Delivered
8     13004    2 Motorola X6   350    Delivered
9     13004    3   iPad Mini   450    Delivered
10    13004    4  Samsung S8   550    Delivered
11    13005    1  Samsung S9   600      Pending
12    13005    2    iPhone 7   650      Pending
13    13006    1 Motorola X6   350      Pending
14    13006    2   iPad Mini   450      Pending
15    13006    3  Samsung S8   550      Pending

R has implicitly decided to combine the data on the common variable (order_no).

What if the column name were different in the two data frames? The data is still the same, only the name differs. In that case, how do we tell R that the merge has to happen on that column even though the names differ across the data sets?

# Change the column name of order_detail data frame
> colnames(order_detail)[1] = "order_number"
> order_detail
   order_number item     details price
1         13001    1  Samsung S9   600
2         13001    2    iPhone 7   650
3         13002    1 Motorola X6   350
..        ..      ..          ..    ..

Let us see if R can make some sense out of the data

> order = merge(order_detail,orders)
> order
   order_number item     details price order_no order_status
1         13001    1  Samsung S9   600    13001    Delivered
2         13001    2    iPhone 7   650    13001    Delivered
3         13002    1 Motorola X6   350    13001    Delivered
4         13002    2   iPad Mini   450    13001    Delivered
5         13002    3  Samsung S8   550    13001    Delivered
6         13003    1  Samsung S9   600    13001    Delivered
7         13004    1    iPhone 7   650    13001    Delivered
8         13004    2 Motorola X6   350    13001    Delivered
9         13004    3   iPad Mini   450    13001    Delivered
10        13004    4  Samsung S8   550    13001    Delivered
11        13005    1  Samsung S9   600    13001    Delivered
12        13005    2    iPhone 7   650    13001    Delivered
13        13006    1 Motorola X6   350    13001    Delivered
14        13006    2   iPad Mini   450    13001    Delivered
15        13006    3  Samsung S8   550    13001    Delivered
16        13001    1  Samsung S9   600    13002      Pending
17        13001    2    iPhone 7   650    13002      Pending
18        13002    1 Motorola X6   350    13002      Pending
19        13002    2   iPad Mini   450    13002      Pending
20        13002    3  Samsung S8   550    13002      Pending
21        13003    1  Samsung S9   600    13002      Pending
22        13004    1    iPhone 7   650    13002      Pending
23        13004    2 Motorola X6   350    13002      Pending
24        13004    3   iPad Mini   450    13002      Pending
25        13004    4  Samsung S8   550    13002      Pending
26        13005    1  Samsung S9   600    13002      Pending
27        13005    2    iPhone 7   650    13002      Pending
28        13006    1 Motorola X6   350    13002      Pending
29        13006    2   iPad Mini   450    13002      Pending
30        13006    3  Samsung S8   550    13002      Pending
31        13001    1  Samsung S9   600    13003    Delivered
32        13001    2    iPhone 7   650    13003    Delivered
33        13002    1 Motorola X6   350    13003    Delivered
34        13002    2   iPad Mini   450    13003    Delivered
35        13002    3  Samsung S8   550    13003    Delivered
36        13003    1  Samsung S9   600    13003    Delivered
37        13004    1    iPhone 7   650    13003    Delivered
38        13004    2 Motorola X6   350    13003    Delivered
39        13004    3   iPad Mini   450    13003    Delivered
40        13004    4  Samsung S8   550    13003    Delivered
41        13005    1  Samsung S9   600    13003    Delivered
42        13005    2    iPhone 7   650    13003    Delivered
43        13006    1 Motorola X6   350    13003    Delivered
44        13006    2   iPad Mini   450    13003    Delivered
45        13006    3  Samsung S8   550    13003    Delivered
46        13001    1  Samsung S9   600    13004    Delivered
47        13001    2    iPhone 7   650    13004    Delivered
48        13002    1 Motorola X6   350    13004    Delivered
49        13002    2   iPad Mini   450    13004    Delivered
50        13002    3  Samsung S8   550    13004    Delivered
51        13003    1  Samsung S9   600    13004    Delivered
52        13004    1    iPhone 7   650    13004    Delivered
53        13004    2 Motorola X6   350    13004    Delivered
54        13004    3   iPad Mini   450    13004    Delivered
55        13004    4  Samsung S8   550    13004    Delivered
56        13005    1  Samsung S9   600    13004    Delivered
57        13005    2    iPhone 7   650    13004    Delivered
58        13006    1 Motorola X6   350    13004    Delivered
59        13006    2   iPad Mini   450    13004    Delivered
60        13006    3  Samsung S8   550    13004    Delivered
61        13001    1  Samsung S9   600    13005      Pending
62        13001    2    iPhone 7   650    13005      Pending
63        13002    1 Motorola X6   350    13005      Pending
64        13002    2   iPad Mini   450    13005      Pending
65        13002    3  Samsung S8   550    13005      Pending
66        13003    1  Samsung S9   600    13005      Pending
67        13004    1    iPhone 7   650    13005      Pending
68        13004    2 Motorola X6   350    13005      Pending
69        13004    3   iPad Mini   450    13005      Pending
70        13004    4  Samsung S8   550    13005      Pending
71        13005    1  Samsung S9   600    13005      Pending
72        13005    2    iPhone 7   650    13005      Pending
73        13006    1 Motorola X6   350    13005      Pending
74        13006    2   iPad Mini   450    13005      Pending
75        13006    3  Samsung S8   550    13005      Pending
76        13001    1  Samsung S9   600    13006      Pending
77        13001    2    iPhone 7   650    13006      Pending
78        13002    1 Motorola X6   350    13006      Pending
79        13002    2   iPad Mini   450    13006      Pending
80        13002    3  Samsung S8   550    13006      Pending
81        13003    1  Samsung S9   600    13006      Pending
82        13004    1    iPhone 7   650    13006      Pending
83        13004    2 Motorola X6   350    13006      Pending
84        13004    3   iPad Mini   450    13006      Pending
85        13004    4  Samsung S8   550    13006      Pending
86        13005    1  Samsung S9   600    13006      Pending
87        13005    2    iPhone 7   650    13006      Pending
88        13006    1 Motorola X6   350    13006      Pending
89        13006    2   iPad Mini   450    13006      Pending
90        13006    3  Samsung S8   550    13006      Pending

Does this make any sense? The data has exploded. Why did R do this? Since R could not find a common column to combine the data on, it has done what is called a “Cartesian join” – every row in the first table is paired with every row in the second table. In our case, this is basically useless.

Let’s tell R which column to merge by this time.

# x represents the first data frame and y the second
> order = merge(order_detail,orders,by.x = "order_number",by.y = "order_no")
> order
   order_number item     details price order_status
1         13001    1  Samsung S9   600    Delivered
2         13001    2    iPhone 7   650    Delivered
3         13002    1 Motorola X6   350      Pending
4         13002    2   iPad Mini   450      Pending
5         13002    3  Samsung S8   550      Pending
6         13003    1  Samsung S9   600    Delivered
7         13004    1    iPhone 7   650    Delivered
8         13004    2 Motorola X6   350    Delivered
9         13004    3   iPad Mini   450    Delivered
10        13004    4  Samsung S8   550    Delivered
11        13005    1  Samsung S9   600      Pending
12        13005    2    iPhone 7   650      Pending
13        13006    1 Motorola X6   350      Pending
14        13006    2   iPad Mini   450      Pending
15        13006    3  Samsung S8   550      Pending

Much better, isn't it? In fact, a merge need not be based on just 1 column (in this case, the order number). It can be based on multiple columns – as long as the column names are properly specified and the data is consistent.

For example, if there is another table (data frame ) that contains “refurbished” data at the line item level, like below.

> item_refurbished
   order_no item refurbished
1     13001    1         yes
2     13001    2          no
3     13002    1         yes
4     13002    2         yes
5     13002    3          no
6     13003    1         yes
7     13004    1         yes
8     13004    2         yes
9     13004    3          no
10    13004    4          no
11    13005    1          no
12    13005    2          no
13    13006    1          no
14    13006    2         yes
15    13006    3         yes

You can now combine the order_detail and item_refurbished data based on 2 common columns (note: this assumes order_detail still has its original order_no column name – if you renamed it above, rename it back first):

  1. order_no
  2. item
> order_detail_all = merge(order_detail,item_refurbished)
> order_detail_all
   order_no item     details price refurbished
1     13001    1  Samsung S9   600         yes
2     13001    2    iPhone 7   650          no
3     13002    1 Motorola X6   350         yes
4     13002    2   iPad Mini   450         yes
5     13002    3  Samsung S8   550          no
6     13003    1  Samsung S9   600         yes
7     13004    1    iPhone 7   650         yes
8     13004    2 Motorola X6   350         yes
9     13004    3   iPad Mini   450          no
10    13004    4  Samsung S8   550          no
11    13005    1  Samsung S9   600          no
12    13005    2    iPhone 7   650          no
13    13006    1 Motorola X6   350          no
14    13006    2   iPad Mini   450         yes
15    13006    3  Samsung S8   550         yes

Based on the column names R was able to implicitly merge the data sets.

Scenario 4 – One data frame is bigger ( more rows ) than the other.

For example, some of the orders don’t have items – what happens in that case ?

# Let's just add another order to the orders table, but that row would not exist in the details table
> orders[7,] = c(13007,"Delivered")
> orders
  order_no order_status
1    13001    Delivered
2    13002      Pending
3    13003    Delivered
4    13004    Delivered
5    13005      Pending
6    13006      Pending
7    13007    Delivered
# When you do a default merge, the unmatched order ( 13007 ) will not come up. 
> orders_all = merge(orders,order_detail)
> orders_all
   order_no order_status item     details price
1     13001    Delivered    1  Samsung S9   600
2     13001    Delivered    2    iPhone 7   650
3     13002      Pending    1 Motorola X6   350
4     13002      Pending    2   iPad Mini   450
5     13002      Pending    3  Samsung S8   550
6     13003    Delivered    1  Samsung S9   600
7     13004    Delivered    1    iPhone 7   650
8     13004    Delivered    2 Motorola X6   350
9     13004    Delivered    3   iPad Mini   450
10    13004    Delivered    4  Samsung S8   550
11    13005      Pending    1  Samsung S9   600
12    13005      Pending    2    iPhone 7   650
13    13006      Pending    1 Motorola X6   350
14    13006      Pending    2   iPad Mini   450
15    13006      Pending    3  Samsung S8   550

If we want to include the new order, its item details will simply have no values (NA) against them. The way to do that is via the all.x or all.y argument.

# all.x or all.y is set to FALSE by default. That's why only "common" rows are merged.
> orders_all = merge(orders,order_detail,all.x = TRUE)
> orders_all
   order_no order_status item     details price
1     13001    Delivered    1  Samsung S9   600
2     13001    Delivered    2    iPhone 7   650
3     13002      Pending    1 Motorola X6   350
4     13002      Pending    2   iPad Mini   450
5     13002      Pending    3  Samsung S8   550
6     13003    Delivered    1  Samsung S9   600
7     13004    Delivered    1    iPhone 7   650
8     13004    Delivered    2 Motorola X6   350
9     13004    Delivered    3   iPad Mini   450
10    13004    Delivered    4  Samsung S8   550
11    13005      Pending    1  Samsung S9   600
12    13005      Pending    2    iPhone 7   650
13    13006      Pending    1 Motorola X6   350
14    13006      Pending    2   iPad Mini   450
15    13006      Pending    3  Samsung S8   550
16    13007    Delivered   NA        <NA>    NA

Scenario 5 – Merge by row names

This is a special case where data set A has some of the columns and data set B has the remaining columns for the same observations, and we want to combine them (more or less like a cbind()). The merging, however, is not based on common variables but on common row names.

For example, look at the inbuilt mtcars dataset.

> mtcars
                     mpg cyl  disp  hp drat    wt  qsec vs am gear carb
Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
..                  ..     ..      ..  ..   ..    ..     ..

Say 2 teams are testing each of these cars: team A provides the first 5 variables and team B provides the remaining 6 variables. There are no common columns, though. The only thing in common is the row names of the data frames.

# Let's split the mtcars into team A and team B variables
> team_a = mtcars[,1:5]
> team_a
                     mpg cyl  disp  hp drat
Mazda RX4           21.0   6 160.0 110 3.90
Mazda RX4 Wag       21.0   6 160.0 110 3.90
Datsun 710          22.8   4 108.0  93 3.85
Hornet 4 Drive      21.4   6 258.0 110 3.08
> team_b = mtcars[,6:11]
> team_b
                       wt  qsec vs am gear carb
Mazda RX4           2.620 16.46  0  1    4    4
Mazda RX4 Wag       2.875 17.02  0  1    4    4
Datsun 710          2.320 18.61  1  1    4    1
Hornet 4 Drive      3.215 19.44  1  0    3    1

To merge these two data sets, it has to be by row names (since there are no common columns).

> mtcars_all = merge(team_a, team_b, by = "row.names")
> mtcars_all
             Row.names  mpg cyl  disp  hp drat    wt  qsec vs am gear carb
1          AMC Javelin 15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
2   Cadillac Fleetwood 10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
3           Camaro Z28 13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
4    Chrysler Imperial 14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
5           Datsun 710 22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1

Dimensionality Reduction

  Machine Learning in Python


What is Dimensionality Reduction

When you can do with just 2 variables, why do you need 4? What do I mean by that? Say you are trying to do classification on the iris dataset. You already know by now that there are 4 predictors:

  • Sepal Length
  • Sepal Width
  • Petal Length
  • Petal Width

To classify the target species, you can use any of the classification methods you have studied so far. What if I told you that you can get the same level of classification accuracy with just 2 predictors? Let me show you with an example.

# Load the dataset
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=100)
X_train[0:4,:]

array([[5.5, 2.4, 3.7, 1. ],
       [5.7, 2.8, 4.1, 1.3],
       [6. , 2.2, 5. , 1.5],
       [4.8, 3. , 1.4, 0.1]])
from sklearn import linear_model
from scipy.special import expit

model = linear_model.LogisticRegression(C=1e5, solver='lbfgs')
model.fit(X_train, y_train)
model.score(X_test,y_test)
c:\program files (x86)\python37-32\lib\site-packages\sklearn\linear_model\logistic.py:460: FutureWarning: Default multi_class will be changed to 'auto' in 0.22. Specify the multi_class option to silence this warning.
  "this warning.", FutureWarning)

0.9666666666666667

That's an accuracy of about 96% using logistic regression on all 4 predictors. Let's test our claim (that we can get the same level of accuracy with just 2 variables).

# magic code
from sklearn.decomposition import PCA

pca = PCA(n_components = 2)
pca.fit(iris.data)

# Transform the iris data down to just 2 dimensions
iris_red = pca.transform(iris.data)
X_train, X_test, y_train, y_test = train_test_split(iris_red, iris.target, test_size=0.2, random_state=100)
X_train[0:4,:]

array([[-0.18962247, -0.68028676],
       [ 0.29900084, -0.34889781],
       [ 1.30079171, -0.76114964],
       [-2.78610927, -0.235112  ]])
# Now, let's fit a new model with just these 2 variables.
model_red = linear_model.LogisticRegression(C=1e5, solver='lbfgs')
model_red.fit(X_train, y_train)
model_red.score(X_test,y_test)

c:\program files (x86)\python37-32\lib\site-packages\sklearn\linear_model\logistic.py:460: FutureWarning: Default multi_class will be changed to 'auto' in 0.22. Specify the multi_class option to silence this warning.
  "this warning.", FutureWarning)
0.9666666666666667

Voila!! That's still a 96% accuracy. How did we achieve this magic with just 2 variables?

import matplotlib.pyplot as plt
%matplotlib inline

# the 2 PCA components separate the three species almost perfectly
plt.scatter(X_train[:,0],X_train[:,1],c = y_train)

import matplotlib.pyplot as plt
%matplotlib inline

# compare with a plot of just the first two original features (sepal length vs sepal width)
plt.scatter(iris.data[:,0],iris.data[:,1],c = iris.target)

The ability to reduce the number of predictors (variables) while still preserving the predictive power of the original predictors is called Dimensionality Reduction.

The question that comes up next is: why reduce dimensions? In the case of the iris dataset, there are just 4 dimensions, so there is not much of an obvious advantage. But imagine hundreds or thousands of dimensions – and that is not an unrealistic problem either.

  • Even simple 100 x 100 pixel images form a dataset with 10,000 dimensions (one feature per pixel) – see the short sketch below.

In cases like these, it makes sense to reduce the number of dimensions.
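As a quick illustration (a toy sketch with random pixel values, not a real image dataset), flattening a single 100 x 100 image already gives one sample with 10,000 features:

import numpy as np

# one 100 x 100 grayscale "image" with made-up random pixel values
image = np.random.rand(100, 100)

# flatten it into a single feature vector: one sample, 10,000 features
x = image.reshape(-1)
print(x.shape)   # (10000,)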

Hughes Phenomenon

In 1968, G. Hughes observed a strange phenomenon while studying statistical pattern recognizers: adding features improves a model's accuracy only up to a point, after which accuracy starts to drop.

Intuitively, you would expect that the more inputs (features) you provide to the model, the more predictive power it has. However, after a certain point the accuracy of the prediction drops off. This is the essence of the Hughes phenomenon. It stems from the fact that measured data (features or variables) typically carries some level of error. When you compound this error over a large number of variables, the error grows so much that accuracy suffers.

Curse of Dimensionality

Also, in terms of the data required for accurate prediction, more dimensions mean more data. The phrase “Curse of Dimensionality” means

As the number of dimensions (or features) grows, the amount of data we need to generalize accurately grows exponentially.

The reason for this curse is that as the number of dimensions increases, the sparsity of the data increases. The short sketch below contrasts the sparsity of the same points viewed in 1-D vs 2-D.
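Here is a minimal sketch of that effect (made-up uniform data, bin counts chosen arbitrarily): the same 100 points fill practically every 1-D bin, but leave most cells of a 10 x 10 grid empty.

import numpy as np

np.random.seed(0)
n, bins = 100, 10
pts = np.random.rand(n, 2)

# 1-D: bucket the first coordinate into 10 bins
filled_1d = len(np.unique((pts[:, 0] * bins).astype(int)))

# 2-D: bucket both coordinates into a 10 x 10 grid
filled_2d = len({(int(x * bins), int(y * bins)) for x, y in pts})

print(filled_1d / bins)       # close to 1.0 -> the 1-D bins are nearly all occupied
print(filled_2d / bins ** 2)  # roughly 0.6  -> many 2-D cells stay empty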

More features doesn’t necessarily mean better accuracy.

The Solution

The solution to this problem is dimensionality reduction. However, this is not the same as feature selection: instead of keeping a subset of the original features, we construct new, reduced dimensions from combinations of them.

The reduced dimensions are not the same as the original dimensions.

There are many ways in which dimensionality reduction can be done. Here are some of them:

  • Principal Component Analysis (PCA)
  • Linear Discriminant Analysis (LDA)
  • Random Projection
  • Feature Agglomeration etc

Scikit-learn supports several of these methods. However, the specific technique is not all that important – as long as you understand why it is used and apply it when required. In this section, we will try to understand just one of the ways in which dimensionality reduction can be done – PCA.

Principal Component Analysis

Principal Component Analysis is based on identifying the directions (combinations of variables/predictors) along which the data has the largest variance and projecting the data onto those directions. For example, imagine two scatter plots:

  • Left – a plot of weight vs height used to classify men vs women
  • Right – a plot of nationality vs height used to classify men vs women

In the plot on the left, both height and weight seem to have a significant impact in classifying the person as male or female. In the plot on the right, on the other hand, only height seems to matter. Nationality has no impact at all – effectively there is just one useful predictor, height. It is easy to see this visually in such a simple case.

PCA generalizes this idea to reduce dimensions automatically. The metric PCA works with is variance – specifically, the covariance matrix of the features. Let's understand this a bit more mathematically.

Variance 

In a previous section (Probability and Statistics) we saw what variance is: a measure of how spread out the data is. Principal components are the directions in which the data has the most variance. We will see a small worked example right after the next subsection.

Eigen Vectors and Eigen Values

Think of an eigenvalue as the variance in the direction of its eigenvector. The math behind PCA boils down to finding the eigenvectors of the covariance matrix with the largest eigenvalues. Let's not get into the math of how the eigenvectors themselves are computed.

Once the data is projected onto the principal component (the eigenvector with the largest eigenvalue), you have basically reduced the number of dimensions from two to one.
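Here is a minimal NumPy sketch of this idea on made-up 2-D data (not scikit-learn's implementation): the eigen-decomposition of the covariance matrix gives the directions, each eigenvalue is the variance along its eigenvector, and projecting onto the top eigenvector takes us from two dimensions to one.

import numpy as np

# toy 2-D data: the second feature is mostly a scaled copy of the first,
# so nearly all of the variance lies along a single direction
rng = np.random.default_rng(0)
x = rng.normal(0, 3, 200)
X = np.column_stack([x, 2 * x + rng.normal(0, 0.5, 200)])

# eigen-decomposition of the covariance matrix
cov = np.cov(X, rowvar=False)
eig_values, eig_vectors = np.linalg.eigh(cov)

# each eigenvalue is the variance along its eigenvector;
# project onto the eigenvector with the largest eigenvalue -> 1 dimension left
pc = eig_vectors[:, np.argmax(eig_values)]
X_1d = X @ pc

print(eig_values)    # one large value, one tiny one
print(X_1d.shape)    # (200,) - two dimensions reduced to one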

Scikit-learn also has multiple variations of PCA. For example, Kernel PCA does dimensionality reduction using kernels, so even data that is not linearly separable – like the two concentric circles below – can be separated.

import numpy as np
import matplotlib.pyplot as plt

from sklearn.decomposition import PCA, KernelPCA
from sklearn.datasets import make_circles

np.random.seed(0)

# two concentric circles - not linearly separable
X, y = make_circles(n_samples=400, factor=.3, noise=.05)
plt.scatter(X[:, 0], X[:, 1], c= y)

# reduce to a single component with an RBF kernel
# (the default linear kernel would not separate the circles)
pca = KernelPCA(n_components = 1, kernel = "rbf", gamma = 10)
pca.fit(X)
new_data_kernel = pca.transform(X)

# the two circles now fall into two distinct bands along the single component
plt.scatter(new_data_kernel, y, c = y)

Associative Rule Learning

  Machine Learning in Python


What is Association Rule Learning

Associative Rule Learning (or mining) is a machine learning technique for discovering relationships between variables. What is new about this, you must be wondering? Standard statistical methods like correlation or regression also do the same thing, right?

Well, for starters, those are typically either supervised algorithms or ways to quantify the relationship between a known set of variables – for example, finding the relationship between:

  • smoking and cancer
  • cholesterol and heart disease etc

Associative Rule Learning, on the other hand, discovers relationships between variables that you might not even be aware of. That is why it is classified as an unsupervised machine learning algorithm. It was introduced in 1993, when a group of researchers set out to find relationships between items sold in supermarkets, based on data obtained from point-of-sale systems. Here are two classic examples.

  • The classic example of an unusual relationship that human intuition would easily miss is the relationship between baby diapers and beer in supermarket sales.
  • Another example of doing this on a large scale is the movie recommender systems of Netflix and Amazon Prime Video. Even if you have not used those, you have probably experienced the same idea in YouTube video recommendations – and it is pretty accurate.

Key Terms

Before we get into the actual algorithms, let's understand a few key terms:

  • Support
  • Lift
  • Confidence

Once we understand these terms, we can move on to the actual algorithm itself.

Imagine I have made the following transactions in a supermarket over the course of a week.

Txn ID – Items

  • 154265 – { Milk }
  • 645858 – { Milk, Cheese }
  • 588455 – { Milk, Water, Vegetables }
  • 554855 – { Milk, Cheese, Yoghurt, Water }
  • 558965 – { Water, Vegetables }

Say we are trying to quantify the association (or rule) between the items Milk and Cheese. Specifically the association

  • Milk -> Cheese

and not the other way around (NOT Cheese -> Milk). That is, we are trying to quantify how likely it is that I buy Cheese given that I already have Milk in my basket.

Support

Support is a measure of how frequently an item or an item set appears in the dataset: Support(X) = (number of transactions containing X) / (total number of transactions). For example, what is the support of the item set { Milk, Cheese }?

  • 1 – { Milk }
  • 2 – { Milk, Cheese }
  • 3 – { Milk, Water, Vegetables }
  • 4 – { Milk, Cheese, Yoghurt, Water }
  • 5 – { Water, Vegetables }

{ Milk, Cheese } appears in transactions 2 and 4, so Support( Milk -> Cheese ) = 2/5 = 0.4.

Confidence

Confidence is a measure of how often the rule is actually found to be true: Confidence( X -> Y ) = Support( X and Y together ) / Support( X ).

For example, in our case, the confidence of { Milk -> Cheese } is 0.4 / 0.8 = 0.5 – half of the transactions that contain Milk also contain Cheese.

Lift

The lift of a rule is defined as: Lift( X -> Y ) = Support( X and Y together ) / ( Support( X ) * Support( Y ) ). For { Milk -> Cheese } this is 0.4 / ( 0.8 * 0.4 ) = 1.25.

Now that we have the math behind us, let's describe in simple terms what these measures mean (a short sketch that computes them for the toy basket follows this list).

  • Support( X -> Y ) – How frequent is this combination? This is relatively straightforward – it is simply the fraction of all transactions that contain the combination.
  • Confidence( X -> Y ) – How often is this combination true? Or, how likely is it that Y is purchased when X is purchased.
  • Lift( X -> Y ) – The strength of the relationship.
    • Lift = 1
      • X and Y are independent – buying one tells you nothing about the other.
    • Lift > 1
      • X and Y are positively related. In the example above, since Lift > 1, Milk is strongly associated with Cheese – in other words, Milk and Cheese occur together more often than they would by chance.
    • Lift < 1
      • X and Y have a negative relationship. If the lift for Milk and Cheese were < 1, Milk would tend NOT to occur together with Cheese.
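To make these definitions concrete, here is a tiny plain-Python sketch (the helper function is made up purely for illustration) that computes all three measures for the Milk -> Cheese rule from the five toy transactions above:

# the five toy transactions from above
transactions = [
    {"Milk"},
    {"Milk", "Cheese"},
    {"Milk", "Water", "Vegetables"},
    {"Milk", "Cheese", "Yoghurt", "Water"},
    {"Water", "Vegetables"},
]
n = len(transactions)

def support(items):
    # fraction of transactions that contain all of the given items
    return sum(items <= t for t in transactions) / n

support_rule = support({"Milk", "Cheese"})                        # 2/5 = 0.4
confidence = support_rule / support({"Milk"})                     # 0.4 / 0.8 = 0.5
lift = support_rule / (support({"Milk"}) * support({"Cheese"}))   # 0.4 / (0.8 * 0.4) = 1.25
print(support_rule, confidence, lift)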

Let's now move on to the actual rule mining algorithms. We will focus on just 2 of them:

  • Apriori
  • Eclat

Apriori Algorithm

Apriori is an algorithm that combs through large datasets to identify rules (associations). At a high level, this is how it works.

  • Step 1 – Identify single items and how frequently they occur – call this set 1. To reduce the complexity, we typically set a minimum support level.
    • For example, in a supermarket dataset, this step identifies all the individual items (Milk, Cheese, Water etc.) and how frequently they occur. If some items (say, exotic spices) are not all that frequent, they are removed from this set (not from the actual dataset).

set 1 ( item – frequency ): Milk – 4, Cheese – 2, Water – 3, Vegetables – 2, Yoghurt – 1

Say we set a cut-off at a frequency count of 2; we are left with only 4 items (Yoghurt is left out).

  • Step 2 – Prepare all 2-item combinations of the items in set 1. Once again, go through the original dataset to find the frequency of occurrence of each 2-item combination, and once again set a minimum support level to reduce the complexity.
    • For example, from set 1 above we identified that { Milk, Cheese, Water, Vegetables } each occur in at least 40% of the transactions. Now, identify all 2-item combinations of these.

set 2 ( item set – frequency ): { Milk, Cheese } – 2, { Milk, Water } – 2, { Milk, Vegetables } – 1, { Cheese, Water } – 1, { Cheese, Vegetables } – 0, { Water, Vegetables } – 2

Once again, with a cut-off of 2, only 3 item sets remain:

{ Milk, Cheese }
{ Milk, Water }
{ Water, Vegetables }

  • Step 3 – Increase the combination size (3-item sets, 4-item sets, ...) and repeat step 2 recursively until no more frequent sets are found. A short sketch of these steps on the toy basket follows.
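Here is a small plain-Python sketch of these steps on the same toy transactions (just to illustrate the candidate generation and pruning – the helper function is made up, and the actual implementation we use below is mlxtend's apriori):

from itertools import combinations

transactions = [
    {"Milk"},
    {"Milk", "Cheese"},
    {"Milk", "Water", "Vegetables"},
    {"Milk", "Cheese", "Yoghurt", "Water"},
    {"Water", "Vegetables"},
]
min_count = 2   # the cut-off used above

def frequent(candidates):
    # count how many transactions contain each candidate set and keep the frequent ones
    counts = {c: sum(c <= t for t in transactions) for c in candidates}
    return {c: cnt for c, cnt in counts.items() if cnt >= min_count}

# Step 1 - frequent single items (Yoghurt, with a count of 1, gets pruned)
level = frequent({frozenset([i]) for t in transactions for i in t})
print(level)

# Steps 2 and 3 - grow the item sets and re-count until nothing frequent is left
while level:
    items = set().union(*level)
    size = len(next(iter(level))) + 1
    level = frequent({frozenset(c) for c in combinations(items, size)})
    if level:
        print(level)

On this toy basket, the second print shows the three frequent 2-item sets listed above, and no 3-item set survives the cut-off.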

Implementation

Scikit-learn does not have association rule mining algorithms. Luckily, there are several implementations of the Apriori algorithm in standard Python. One of them is the MLxtend package, which you can install using pip.

pip install mlxtend

You can find more details on the implementation here

Let's first increase the size of the dataset. The University of California, Irvine (UCI) has a public version of an online retail dataset from the UK. You can get it from here; we have also downloaded it for you into the data directory. It is a 25 MB dataset that contains data from many different countries in the EU and Asia.

Different implementations require the data in different formats. For example, mlxtend expects a one-hot encoded basket format – one row per invoice, one column per item, 0/1 values. Let's do the required steps below.

  • Read the dataset
  • Process the dataset into sparse matrix format
  • Run Apriori Algorithm

1. Read the Dataset

import pandas as pd

data = pd.read_excel('./data/online_retail.xlsx')
data.head()

   InvoiceNo StockCode                          Description  Quantity          InvoiceDate  UnitPrice  CustomerID         Country
0     536365    85123A   WHITE HANGING HEART T-LIGHT HOLDER         6  2010-12-01 08:26:00       2.55     17850.0  United Kingdom
1     536365     71053                  WHITE METAL LANTERN         6  2010-12-01 08:26:00       3.39     17850.0  United Kingdom
2     536365    84406B       CREAM CUPID HEARTS COAT HANGER         8  2010-12-01 08:26:00       2.75     17850.0  United Kingdom
3     536365    84029G  KNITTED UNION FLAG HOT WATER BOTTLE         6  2010-12-01 08:26:00       3.39     17850.0  United Kingdom
4     536365    84029E       RED WOOLLY HOTTIE WHITE HEART.         6  2010-12-01 08:26:00       3.39     17850.0  United Kingdom

2. Process the data into a Sparse Matrix

  • Make sure the invoice number and description columns are strings:
data["Description"] = data["Description"].astype(str)
data["InvoiceNo"] = data["InvoiceNo"].astype(str)

  • Drop all rows with missing values:
data = data.dropna()
  • We are specifically interested in 3 columns
    • Invoice Number
    • Description
    • Quantity

Since there is data related to many different countries, let’s restrict ourselves to just one country – say Germany. You could pick any country you like.

# Filter out just German data
data_de = data[data["Country"] == "Germany"]

In this dataset, each line item of an invoice (order) sits on its own row. So we essentially have to group the line items by invoice.

data_de = data_de.groupby(["InvoiceNo","Description"])["Quantity"].sum()
data_de.head(12)

InvoiceNo  Description                        
536527     3 HOOK HANGER MAGIC GARDEN             12
           5 HOOK HANGER MAGIC TOADSTOOL          12
           5 HOOK HANGER RED MAGIC TOADSTOOL      12
           ASSORTED COLOUR LIZARD SUCTION HOOK    24
           CHILDREN'S CIRCUS PARADE MUG           12
           HOMEMADE JAM SCENTED CANDLES           12
           HOT WATER BOTTLE BABUSHKA               4
           JUMBO BAG OWLS                         10
           JUMBO BAG WOODLAND ANIMALS             10
           MULTI COLOUR SILVER T-LIGHT HOLDER     12
           PACK 3 FIRE ENGINE/CAR PATCHES         12
           PICTURE DOMINOES                       12
Name: Quantity, dtype: int64

This is where we pivot the dataframe sideways into the wide basket format (one row per invoice, one column per item).

data_de = data_de.unstack()
data_de.head()

Description	50'S CHRISTMAS GIFT BAG LARGE	DOLLY GIRL BEAKER	I LOVE LONDON MINI BACKPACK	RED SPOT GIFT BAG LARGE	SET 2 TEA TOWELS I LOVE LONDON	SPACEBOY BABY GIFT SET	10 COLOUR SPACEBOY PEN	12 COLOURED PARTY BALLOONS	12 IVORY ROSE PEG PLACE SETTINGS	12 MESSAGE CARDS WITH ENVELOPES	...	YULETIDE IMAGES GIFT WRAP SET	ZINC HEART T-LIGHT HOLDER	ZINC STAR T-LIGHT HOLDER	ZINC BOX SIGN HOME	ZINC FOLKART SLEIGH BELLS	ZINC HEART LATTICE T-LIGHT HOLDER	ZINC METAL HEART DECORATION	ZINC T-LIGHT HOLDER STAR LARGE	ZINC T-LIGHT HOLDER STARS SMALL	ZINC WILLIE WINKIE CANDLE STICK
InvoiceNo																					
536527	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
536840	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
536861	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
536967	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
536983	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
5 rows × 1703 columns

Wherever there is no quantity, Pandas just puts a NaN. The algorithm wants a zero in that case. So, let’s fill all NaN’s with 0.

data_de = data_de.fillna(0)
data_de.head()
Description	50'S CHRISTMAS GIFT BAG LARGE	DOLLY GIRL BEAKER	I LOVE LONDON MINI BACKPACK	RED SPOT GIFT BAG LARGE	SET 2 TEA TOWELS I LOVE LONDON	SPACEBOY BABY GIFT SET	10 COLOUR SPACEBOY PEN	12 COLOURED PARTY BALLOONS	12 IVORY ROSE PEG PLACE SETTINGS	12 MESSAGE CARDS WITH ENVELOPES	...	YULETIDE IMAGES GIFT WRAP SET	ZINC HEART T-LIGHT HOLDER	ZINC STAR T-LIGHT HOLDER	ZINC BOX SIGN HOME	ZINC FOLKART SLEIGH BELLS	ZINC HEART LATTICE T-LIGHT HOLDER	ZINC METAL HEART DECORATION	ZINC T-LIGHT HOLDER STAR LARGE	ZINC T-LIGHT HOLDER STARS SMALL	ZINC WILLIE WINKIE CANDLE STICK
InvoiceNo																					
536527	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
536840	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
536861	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
536967	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
536983	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0

5 rows × 1703 columns


Each cell in the dataframe is now the total quantity of an item in a particular invoice. The algorithm, however, expects each cell to be binary – 1 if the invoice contains the item and 0 if it does not. Let's first write a small function for that.

def reduce_to_binary(qty) : 
    if qty >= 1 :
        return 1
    if qty <= 0 :
        return 0

Now, let’s apply it to all the cells in the dataframe.

data_de = data_de.applymap(reduce_to_binary)
data_de.head()
Description	50'S CHRISTMAS GIFT BAG LARGE	DOLLY GIRL BEAKER	I LOVE LONDON MINI BACKPACK	RED SPOT GIFT BAG LARGE	SET 2 TEA TOWELS I LOVE LONDON	SPACEBOY BABY GIFT SET	10 COLOUR SPACEBOY PEN	12 COLOURED PARTY BALLOONS	12 IVORY ROSE PEG PLACE SETTINGS	12 MESSAGE CARDS WITH ENVELOPES	...	YULETIDE IMAGES GIFT WRAP SET	ZINC HEART T-LIGHT HOLDER	ZINC STAR T-LIGHT HOLDER	ZINC BOX SIGN HOME	ZINC FOLKART SLEIGH BELLS	ZINC HEART LATTICE T-LIGHT HOLDER	ZINC METAL HEART DECORATION	ZINC T-LIGHT HOLDER STAR LARGE	ZINC T-LIGHT HOLDER STARS SMALL	ZINC WILLIE WINKIE CANDLE STICK
InvoiceNo																					
536527	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
536840	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
536861	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
536967	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
536983	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0

5 rows × 1703 columns

from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

frequent_items = apriori(data_de, min_support=0.07, use_colnames=True)

rules = association_rules(frequent_items, metric="lift", min_threshold=1)
rules.head()

antecedents	consequents	antecedent support	consequent support	support	confidence	lift	leverage	conviction
0	(PLASTERS IN TIN CIRCUS PARADE )	(POSTAGE)	0.087894	0.620232	0.076285	0.867925	1.399354	0.021771	2.875385
1	(POSTAGE)	(PLASTERS IN TIN CIRCUS PARADE )	0.620232	0.087894	0.076285	0.122995	1.399354	0.021771	1.040023
2	(PLASTERS IN TIN SPACEBOY)	(POSTAGE)	0.081260	0.620232	0.076285	0.938776	1.513587	0.025885	6.202875
3	(POSTAGE)	(PLASTERS IN TIN SPACEBOY)	0.620232	0.081260	0.076285	0.122995	1.513587	0.025885	1.047587
4	(PLASTERS IN TIN WOODLAND ANIMALS)	(POSTAGE)	0.104478	0.620232	0.089552	0.857143	1.381971	0.024752	2.658375

Looks like there is an item called “POSTAGE” that is basically the shipping cost. We don't need that, right? Let's get rid of it, lower the support threshold a bit, and run the algorithm again.

data_de = data_de.drop(columns=['POSTAGE'])
frequent_items = apriori(data_de, min_support=0.05, use_colnames=True)
rules = association_rules(frequent_items, metric="lift", min_threshold=1)
rules.head()

antecedents	consequents	antecedent support	consequent support	support	confidence	lift	leverage	conviction
0	(PLASTERS IN TIN CIRCUS PARADE )	(PLASTERS IN TIN WOODLAND ANIMALS)	0.087894	0.104478	0.051410	0.584906	5.598383	0.042227	2.157395
1	(PLASTERS IN TIN WOODLAND ANIMALS)	(PLASTERS IN TIN CIRCUS PARADE )	0.104478	0.087894	0.051410	0.492063	5.598383	0.042227	1.795709
2	(PLASTERS IN TIN WOODLAND ANIMALS)	(ROUND SNACK BOXES SET OF4 WOODLAND )	0.104478	0.185738	0.056385	0.539683	2.905612	0.036979	1.768914
3	(ROUND SNACK BOXES SET OF4 WOODLAND )	(PLASTERS IN TIN WOODLAND ANIMALS)	0.185738	0.104478	0.056385	0.303571	2.905612	0.036979	1.285878
4	(ROUND SNACK BOXES SET OF 4 FRUITS )	(ROUND SNACK BOXES SET OF4 WOODLAND )	0.119403	0.185738	0.099502	0.833333	4.486607	0.077325	4.885572

Now we are getting somewhere. We have specifically chosen a minimum support (min_support) of 0.05 – meaning, only look at item sets that occur in at least 5% of the invoices. You can cast your net wider and capture rarer items as well if you want.

However, we are specifically interested in the lift column. The higher the lift, the stronger the association. So, let's sort by “lift”.

rules.sort_values(by = ["lift"],ascending=False)
antecedents	consequents	antecedent support	consequent support	support	confidence	lift	leverage	conviction
0	(PLASTERS IN TIN CIRCUS PARADE )	(PLASTERS IN TIN WOODLAND ANIMALS)	0.087894	0.104478	0.051410	0.584906	5.598383	0.042227	2.157395
1	(PLASTERS IN TIN WOODLAND ANIMALS)	(PLASTERS IN TIN CIRCUS PARADE )	0.104478	0.087894	0.051410	0.492063	5.598383	0.042227	1.795709
5	(ROUND SNACK BOXES SET OF4 WOODLAND )	(ROUND SNACK BOXES SET OF 4 FRUITS )	0.185738	0.119403	0.099502	0.535714	4.486607	0.077325	1.896670
4	(ROUND SNACK BOXES SET OF 4 FRUITS )	(ROUND SNACK BOXES SET OF4 WOODLAND )	0.119403	0.185738	0.099502	0.833333	4.486607	0.077325	4.885572
6	(SPACEBOY LUNCH BOX )	(ROUND SNACK BOXES SET OF4 WOODLAND )	0.077944	0.185738	0.053068	0.680851	3.665653	0.038591	2.551354
7	(ROUND SNACK BOXES SET OF4 WOODLAND )	(SPACEBOY LUNCH BOX )	0.185738	0.077944	0.053068	0.285714	3.665653	0.038591	1.290879
3	(ROUND SNACK BOXES SET OF4 WOODLAND )	(PLASTERS IN TIN WOODLAND ANIMALS)	0.185738	0.104478	0.056385	0.303571	2.905612	0.036979	1.285878
2	(PLASTERS IN TIN WOODLAND ANIMALS)	(ROUND SNACK BOXES SET OF4 WOODLAND )	0.104478	0.185738	0.056385	0.539683	2.905612	0.036979	1.768914

Clustering

  Machine Learning in Python


What is Clustering

Imagine the weights of students from three different classes (grades). Just to make the weight groups clearly distinguishable from each other, let's pick grades a bit far apart from each other – say 1st, 4th and 7th.

# code to reproduce the data

import numpy as np

# set the seed to make this reproducible
np.random.seed(100)

# samples per bucket
size = 5

# generate `size` numbers from a normal distribution with a given mean and sd
# and convert them into integers for simplicity

weights_1 = np.around(np.random.normal(20, 5, size)).astype(int)
weights_4 = np.around(np.random.normal(40, 5, size)).astype(int)
weights_7 = np.around(np.random.normal(70, 5, size)).astype(int)

Let’s plot the data and mark that these are different sets of data.

import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
mpl.rcParams['figure.dpi'] = 200

plt.scatter(weights_1, np.zeros(size),alpha=0.5)
plt.scatter(weights_4, np.zeros(size),alpha=0.5)
plt.scatter(weights_7, np.zeros(size),alpha=0.5)

# draw vertical lines at the mean points to show that these are three distinct groups
plt.axvline(20, linestyle="--",alpha=0.5, color="blue")
plt.axvline(40, linestyle="--",alpha=0.5, color="orange")
plt.axvline(70, linestyle="--",alpha=0.5, color="green")

The data seems pretty straightforward (now that we have colored the groups and marked the means). But clustering is the process of grouping data points into clusters automatically. Imagine giving this data to somebody without telling them the specific buckets – something like this.

import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
mpl.rcParams['figure.dpi'] = 200

plt.scatter(weights_1, np.zeros(size),alpha=0.5, color="gray")
plt.scatter(weights_4, np.zeros(size),alpha=0.5, color="gray")
plt.scatter(weights_7, np.zeros(size),alpha=0.5, color="gray")

How would that person cluster the data into buckets? Since this is data in just one dimension, the human mind can do a whole lot of this intuitively. Just by looking at the data, anybody would be able to make out the following groups.

A couple of points might land here or there, but mostly the conclusion is pretty reasonable. When a machine does the same thing, it is called clustering. There are many algorithms for clustering.

  • K-means clustering
  • Mean shift clustering
  • Hierarchical clustering etc

k-means clustering

Let’s see how k-means clustering can cluster this data automatically. Here are the steps of the algorithm.

  • Step 1 – Pick k points – call them cluster centroids. At this point they are just random.
  • Step 2 – For each point in the dataset, find the closest cluster centroid and assign that point to that cluster. After finishing Step 2, each of the points belongs to one of the k clusters.
  • Step 3 – Calculate the new centroid (mean) of each cluster and move the cluster centroid (randomly set in Step 1) to that new position.

Continue Step 2 and Step 3 until the centroids no longer change.
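Here is what those three steps look like as a bare-bones NumPy loop – a sketch for one-dimensional data that assumes no cluster ever goes empty; scikit-learn’s implementation (which we use later) is far more robust, but the idea is the same.

import numpy as np

def kmeans_1d(points, k, iterations=100):
    # Step 1 - pick k random points from the data as the initial centroids
    centroids = np.random.choice(points, k, replace=False).astype(float)
    for _ in range(iterations):
        # Step 2 - assign every point to its closest centroid
        labels = np.argmin(np.abs(points[:, None] - centroids[None, :]), axis=1)
        # Step 3 - move each centroid to the mean of the points assigned to it
        new_centroids = np.array([points[labels == j].mean() for j in range(k)])
        # stop once the centroids no longer change
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids, labels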

Let’s do this manually for just two of the weight groups

  • weights_1
  • weights_7
# combine these arrays into a single vector.
weights = np.concatenate((weights_1.ravel(), weights_7.ravel()))
weights
array([11, 22, 26, 19, 25, 68, 72, 67, 74, 73])
# Step 1
# pick 2 centroids at random. 
from random import randrange

centroid_1 = randrange( np.amin(weights),  np.amax(weights) )
centroid_2 = randrange( np.amin(weights),  np.amax(weights) )

centroid_1
65
centroid_2
55
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
mpl.rcParams['figure.dpi'] = 200

plt.scatter(weights, np.zeros(10),alpha=0.5, color="gray")
plt.scatter(centroid_1,0,color="green", marker="*")
plt.scatter(centroid_2,0,color="red", marker="*")

Step 2 – Mark each point as closer to one cluster point or the other.

Visually, you can see that all the points to the left are closer to the red dot (centroid_2 at 55) and all the points to the right are closer to the green dot (centroid_1 at 65).
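The same assignment can be done in code by comparing each point’s distance to the two centroids – a small sketch, reusing the weights array and the centroid_1 = 65 / centroid_2 = 55 values picked above.

# assign each weight to whichever centroid it is closer to
closer_to_1 = np.abs(weights - centroid_1) < np.abs(weights - centroid_2)

weights[closer_to_1]    # the five large weights, closer to centroid_1 (65, the green star)
weights[~closer_to_1]   # the five small weights, closer to centroid_2 (55, the red star)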

Step 3 – Calculate the new centroids.

weights = np.sort(weights)
weights

array([11, 19, 22, 25, 26, 67, 68, 72, 73, 74])
# the new centroid of each cluster is the mean of the points assigned to it
# (centroid_1 is now the centroid of the left/red cluster, centroid_2 of the right/green cluster)
centroid_1 = np.array([11, 19, 22, 25, 26]).mean()
centroid_2 = np.array([67, 68, 72, 73, 74]).mean()

import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
mpl.rcParams['figure.dpi'] = 200

plt.scatter(weights, np.zeros(10),alpha=0.5, color="gray")
plt.scatter(centroid_1,0,color="red", marker="*")
plt.scatter(centroid_2,0,color="green", marker="*")

plt.scatter(55,0,color="red", marker="*", alpha=0.5)
plt.scatter(65,0,color="green", marker="*",alpha=0.5)

plt.annotate("old centroid", (55,0), arrowprops= dict(arrowstyle="->"),xytext=(60,0.002),color="red");
plt.annotate("new centroid", (centroid_1,0), arrowprops= dict(arrowstyle="->"),xytext=(30,0.002),color="red");

plt.annotate("new centroid", (centroid_2,0), arrowprops= dict(arrowstyle="->"),xytext=(60,-0.002),color="green");
plt.annotate("old centroid", (65,0), arrowprops= dict(arrowstyle="->"),xytext=(40,-0.002),color="green");

# re-assign the points to the updated centroids - the assignments do not change,
# so the new means (and hence the centroids) stay exactly where they are
centroid_1_third = np.array([11, 19, 22, 25, 26]).mean()
centroid_2_third = np.array([67, 68, 72, 73, 74]).mean()

import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
mpl.rcParams['figure.dpi'] = 200

plt.scatter(weights, np.zeros(10),alpha=0.5, color="gray")
plt.scatter(centroid_1,0,color="red", marker="*")
plt.scatter(centroid_2,0,color="green", marker="*")

# initial centroids (iteration 1), faded out
plt.scatter(55, 0, color="red", marker="*", alpha=0.2)
plt.scatter(65, 0, color="green", marker="*", alpha=0.2)

# second centroids (iteration 2)
plt.scatter(centroid_1, 0, color="red", marker="*", alpha=0.5)
plt.scatter(centroid_2, 0, color="green", marker="*", alpha=0.5)

# third centroids (iteration 3)
plt.scatter(centroid_1_third, 0, color="red", marker="*")
plt.scatter(centroid_2_third, 0, color="green", marker="*")

plt.annotate("new centroid", (centroid_1_third,0), arrowprops= dict(arrowstyle="->"),xytext=(30,0.002),color="red");
plt.annotate("new centroid", (centroid_2_third,0), arrowprops= dict(arrowstyle="->"),xytext=(60,-0.002),color="green");

Do this repeatedly until the centroids no longer move. That’s it – that is k-means clustering. Now, let’s do it in Python using scikit-learn.

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=2, random_state=0).fit(weights.reshape(-1,1))

kmeans.cluster_centers_
array([[70.8],
       [20.6]])

There you go, those are the cluster centers – right where our plots above suggested they should be. Now, we can try to predict new points – say we want to know which cluster each of the following points falls into:

  • 30
  • 60
  • 90
  • 5
  • 40
  • 45
data = np.array([30,60,90,5,40,45]).reshape(-1,1)

predict_class = kmeans.predict(data)
predict_class
array([1, 0, 0, 1, 1, 1])
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline

plt.scatter(weights, np.zeros(10),alpha=0.5, color="gray")
plt.scatter(data,np.zeros(len(data)), c=predict_class)

All the yellow dots (30, 5, 40 and 45) were assigned to the cluster centered around 20.6, and the dark blue dots (60 and 90) to the cluster centered around 70.8.
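To tie each prediction back to an actual cluster center rather than a color, you can look up the predicted label in kmeans.cluster_centers_ – a small sketch, reusing data and predict_class from above.

# show which cluster center each new point was assigned to
for point, label in zip(data.ravel(), predict_class):
    print(point, "-> cluster", label, "centered near", kmeans.cluster_centers_[label][0])
# e.g. 30 -> cluster 1 centered near 20.6, while 60 -> cluster 0 centered near 70.8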

Here is a quick visual of a smaller dataset in 2-dimensional space. What I want you to focus on is how the centroids move with each iteration.

Optimum k value

Now, for the most important question – how do you find the optimum k value? In the previous example, we told the machine to split the data into 2 clusters, but most of the time you won’t know the number of clusters in advance. In cases like that, you need a way to estimate the optimum k. That is where the sum of squared distances of the points to their cluster centers comes in. It is also called inertia, and it is available ready-made as an attribute of the fitted clustering model.

kmeans.inertia_
184.0
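To see exactly what that number measures, the inertia can be recomputed by hand as the sum of squared distances from every point to the center of its own cluster – a quick sketch, reusing the fitted kmeans model and the weights array from above.

# center of the cluster each point was assigned to
assigned_centers = kmeans.cluster_centers_[kmeans.labels_]

# sum of squared distances - should match kmeans.inertia_ (184.0)
((weights.reshape(-1, 1) - assigned_centers) ** 2).sum()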

Now, to find the optimum k value, let’s iterate over k values from 1 to 9 and record the inertia (sum of squared distances) for each.

# hold the inertia (sum of squared distances) for each k in this list
sum_squares = []

for i in range(1, 10):
    model = KMeans(n_clusters=i, max_iter=100)
    model.fit(weights.reshape(-1, 1))

    sum_squares.append(model.inertia_)

plt.plot(range(1, 10), sum_squares)
plt.xlabel("K value")
plt.ylabel("Sum of Squared distances or Inertia")
Text(0, 0.5, 'Sum of Squared distances or Inertia')

This is called an elbow plot – it looks pretty much like an elbow, right? Beyond the bend (the inflexion point) at a k value of 2, there is very little further decrease in the sum of squared distances, or inertia. So, the optimum k value in this case is 2 – which is correct.
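You can also see the flattening numerically by checking how much the inertia drops each time k goes up by one – a small sketch, reusing the sum_squares list computed above.

# drop in inertia from one k value to the next
drops = -np.diff(sum_squares)

# the drop going to k = 2 is large; after that the reductions are marginal
for k, drop in zip(range(2, 10), drops):
    print("k =", k, "reduces inertia by about", round(drop, 1))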

Let’s add the third weight category as well and see if this changes.

# combine these arrays into a single vector.
weights = np.concatenate((weights_1.ravel(), weights_4.ravel(), weights_7.ravel()))

# hold the inertia (sum of squared distances) for each k in this list
sum_squares = []

for i in range(1, 10):
    model = KMeans(n_clusters=i, max_iter=100)
    model.fit(weights.reshape(-1, 1))

    sum_squares.append(model.inertia_)

plt.plot(range(1, 10), sum_squares)
plt.xlabel("K value")
plt.ylabel("Sum of Squared distances or Inertia")

Text(0, 0.5, 'Sum of Squared distances or Inertia')

It does – now the elbow point has moved to a k value of 3, which shows that the elbow method adapts as the data changes. Yay!
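As a final check, refitting KMeans with n_clusters=3 on the combined weights should recover three centers near the means the three groups were generated from (roughly 20, 40 and 70) – a quick sketch, reusing the weights array that now contains all three groups.

# fit with the k value suggested by the new elbow plot
kmeans_3 = KMeans(n_clusters=3, random_state=0).fit(weights.reshape(-1, 1))

# the three centers should land close to the group means the data was drawn from
kmeans_3.cluster_centers_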