Association Rule Learning in Python


What is Association Rule Learning?

Association Rule Learning (or mining) is a Machine Learning algorithm for discovering relationships between variables. What is new about this, you must be wondering? Standard statistical methods like correlation or regression also do the same thing, right?

Well, for starters, those are typically either supervised algorithms or ways to quantify the relationship between a known set of variables – for example, find the relationship between

  • smoking and cancer
  • cholesterol and heart disease etc

Association Rule Learning, on the other hand, discovers or learns relationships between variables that you might not be aware of. That is why it is classified as an unsupervised Machine Learning algorithm. It was first introduced in 1993, when a group of researchers were interested in finding relationships between items sold in supermarkets, based on data obtained from their Point-of-Sale systems. Here are two classic examples.

  • The classic example of an unusual relationship that human intuition could easily miss is the association between baby diapers and beer in supermarket sales.
  • Another example, on a much larger scale, is the movie recommender systems in Netflix and Amazon Prime Video. Even if you have not used Netflix or Amazon Prime Video, you have most likely already experienced this through YouTube video recommendations. They are actually pretty accurate.

Key Terms

Before we get into the actual algorithms, let’s understand a few key terms

  • Support
  • Lift
  • Confidence

Once we understand these terms, we can move on to the actual algorithm itself.

Imagine I have made the following transactions in a supermarket over the course of a week.

Txn ID – Items

  • 154265 – { Milk }
  • 645858 – { Milk, Cheese }
  • 588455 – { Milk, Water, Vegetables }
  • 554855 – { Milk, Cheese, Yoghurt, Water }
  • 558965 – { Water, Vegetables }

Say we are trying to quantify the association (or rule) between the items Milk and Cheese. Specifically the association

  • Milk -> Cheese

and not the other way around ( NOT Cheese -> Milk ). Meaning, we are trying to quantify the association that implies that I buy Cheese if I already have Milk in my basket.

Support

Support is a measure of how frequently an item or an item set appears in a dataset. It is defined as

Support(X -> Y) = (Number of transactions containing both X and Y) / (Total number of transactions)

For example, what is the support for the item set { Milk, Cheese } ?

  • 1 – { Milk }
  • 2 – { Milk, Cheese }
  • 3 – { Milk, Water, Vegetables }
  • 4 – { Milk, Cheese, Yoghurt, Water }
  • 5 – { Water, Vegetables }

{ Milk, Cheese } appears in 2 of the 5 transactions (numbers 2 and 4), so the support for { Milk, Cheese } is 2/5 = 0.4.

Confidence

Confidence is a measure of how often this rule is found to be true. It is defined as follows.

Confidence(X -> Y) = Support(X ∪ Y) / Support(X)

For example, in our case, the Confidence for the rule { Milk -> Cheese } would be

Confidence(Milk -> Cheese) = Support({ Milk, Cheese }) / Support({ Milk }) = 0.4 / 0.8 = 0.5

Lift

Lift of a rule is defined as follows.

Lift(X -> Y) = Confidence(X -> Y) / Support(Y)

In our case, Lift(Milk -> Cheese) = 0.5 / 0.4 = 1.25.

Now that we have got the math behind us, let’s restate in simple terms what these measures mean.

  • Support(X -> Y) – How frequent is this combination ? This is relatively straightforward – it is quite simply the number of occurrences of the combination across all the transactions, divided by the total number of transactions.
  • Confidence(X -> Y) – How often is this rule true ? Or, how likely is it that Y is purchased when X is purchased.
  • Lift(X -> Y) – Defines the strength of the relationship.
    • Lift = 1
      • P(X and Y) = P(X) × P(Y) – meaning the two events are independent.
    • Lift > 1
      • X and Y are positively related. For example, in the example above, since Lift > 1, Milk is strongly associated with Cheese – in other words, Milk & Cheese occur together more often than would be expected if they were independent.
    • Lift < 1
      • X and Y have a negative relationship. In the case of Milk & Cheese above, if Lift were < 1, Milk would occur together with Cheese less often than expected by chance.
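The three measures can be checked with a quick Python sketch over the five toy transactions from earlier:

```python
# Toy transactions from the supermarket example above
transactions = [
    {"Milk"},
    {"Milk", "Cheese"},
    {"Milk", "Water", "Vegetables"},
    {"Milk", "Cheese", "Yoghurt", "Water"},
    {"Water", "Vegetables"},
]

def support(itemset, transactions):
    # Fraction of transactions that contain every item in the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

# Rule: Milk -> Cheese
sup_xy = support({"Milk", "Cheese"}, transactions)  # 2/5 = 0.4
sup_x = support({"Milk"}, transactions)             # 4/5 = 0.8
sup_y = support({"Cheese"}, transactions)           # 2/5 = 0.4

confidence = sup_xy / sup_x  # 0.5
lift = confidence / sup_y    # 1.25
print(confidence, lift)
```

With these five baskets, the confidence comes out to 0.5 and the lift to 1.25.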

Now that we have got the math behind us, let’s go on to the implementation of the actual algorithms. We are going to focus on just 2 of the rule mining algorithms

  • Apriori
  • Eclat

Apriori Algorithm

Apriori is an algorithm that combs through large datasets to identify different rules (associations). At a high level, this is how it works.

  • Step 1 – Identify single items and how frequently they occur – Call this set 1. To reduce the complexity, we typically set a minimum support level.
    • For example, in a supermarket dataset, this step identifies all the individual items (Milk, Cheese, Water etc.) and how frequently they occur. If some items (say, exotic spices) are not all that frequent, they are removed from this set (not from the actual dataset)

set 1

Item – Frequency

  • Milk – 4
  • Cheese – 2
  • Water – 3
  • Vegetables – 2
  • Yoghurt – 1

Say we set a cut-off at a frequency count of 2; we will only be left with 4 items (leave out Yoghurt).
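Step 1 can be done in a few lines of Python over the toy transactions from earlier (collections.Counter is just one convenient way to tally items):

```python
from collections import Counter

# Toy transactions from the supermarket example above
transactions = [
    {"Milk"},
    {"Milk", "Cheese"},
    {"Milk", "Water", "Vegetables"},
    {"Milk", "Cheese", "Yoghurt", "Water"},
    {"Water", "Vegetables"},
]

# Tally how many transactions each single item appears in
counts = Counter(item for t in transactions for item in t)

# Keep only items meeting the minimum support (frequency count of 2)
set_1 = {item for item, c in counts.items() if c >= 2}
print(set_1)  # the 4 items that survive the cut-off
```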

  • Step 2 – Prepare all 2-item combinations of the items in set 1. Once again, go through the original dataset to find the frequency of occurrence of each 2-item combination, and once again set a minimum support level to reduce the complexity.
    • For example, we identified in set 1 above that Milk, Cheese, Water and Vegetables each occur in at least 40% of the transactions. Now, identify all 2-item combinations of these.

set 2

Item set – Frequency

  • { Milk, Cheese } – 2
  • { Milk, Water } – 2
  • { Milk, Vegetables } – 1
  • { Cheese, Water } – 1
  • { Cheese, Vegetables } – 0
  • { Water, Vegetables } – 2

Once again, with a cut-off of 2, only 3 item sets remain

{Milk, Cheese}
{Milk, Water}
{Water, Vegetables}

  • Step 3 – Increase the combination size by one and repeat step 2 recursively until no more frequent sets are found.
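The steps above can be sketched in plain Python. This is a naive, illustrative implementation on the toy transactions – real libraries prune candidates far more aggressively:

```python
def apriori_sketch(transactions, min_count=2):
    """Naive Apriori: grow frequent itemsets one item at a time."""
    items = {i for t in transactions for i in t}

    def count(itemset):
        # How many transactions contain every item in the itemset
        return sum(itemset <= t for t in transactions)

    # Step 1: frequent single items
    frequent = [{frozenset([i]) for i in items if count(frozenset([i])) >= min_count}]

    # Steps 2 and 3: extend frequent k-itemsets by one frequent single item
    while frequent[-1]:
        candidates = {a | b for a in frequent[-1] for b in frequent[0] if not b <= a}
        frequent.append({c for c in candidates if count(c) >= min_count})

    return [s for level in frequent for s in level]

transactions = [
    {"Milk"},
    {"Milk", "Cheese"},
    {"Milk", "Water", "Vegetables"},
    {"Milk", "Cheese", "Yoghurt", "Water"},
    {"Water", "Vegetables"},
]
for itemset in apriori_sketch(transactions):
    print(set(itemset))
```

On the toy data this yields the 4 frequent single items plus the 3 frequent pairs, and stops because no 3-item set occurs at least twice.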

Implementation

Scikit Learn does not have Association Rule mining algorithms. Luckily, there are many implementations of the Apriori algorithm in standard Python. For example, one of the packages is MLxtend, available as a standard Python package that you can install using pip.

pip install mlxtend

You can find more details on the implementation here

Let’s first increase the size of the dataset. The University of California, Irvine (UCI) has a public version of an online retail dataset from the UK. You can get it from here. We have also downloaded it for you in the data directory. It is a 25 MB dataset that contains data from many different countries in the EU and Asia.

Different implementations require the data to be in different formats. For example, mlxtend requires the data to be in a sparse matrix format. Let’s do the required steps below.

  • Read the dataset
  • Process the dataset into sparse matrix format
  • Run Apriori Algorithm

1. Read the Dataset

import pandas as pd

data = pd.read_excel('./data/online_retail.xlsx')
data.head()

	InvoiceNo	StockCode	Description	Quantity	InvoiceDate	UnitPrice	CustomerID	Country
0	536365	85123A	WHITE HANGING HEART T-LIGHT HOLDER	6	2010-12-01 08:26:00	2.55	17850.0	United Kingdom
1	536365	71053	WHITE METAL LANTERN	6	2010-12-01 08:26:00	3.39	17850.0	United Kingdom
2	536365	84406B	CREAM CUPID HEARTS COAT HANGER	8	2010-12-01 08:26:00	2.75	17850.0	United Kingdom
3	536365	84029G	KNITTED UNION FLAG HOT WATER BOTTLE	6	2010-12-01 08:26:00	3.39	17850.0	United Kingdom
4	536365	84029E	RED WOOLLY HOTTIE WHITE HEART.	6	2010-12-01 08:26:00	3.39	17850.0	United Kingdom

2. Process the data into a Sparse Matrix

  • Just in case Description or InvoiceNo is not a string, convert both to strings.
data["Description"] = data["Description"].astype(str)
data["InvoiceNo"] = data["InvoiceNo"].astype(str)

  • Drop all blanks
data = data.dropna()
  • We are specifically interested in 3 columns
    • Invoice Number
    • Description
    • Quantity

Since there is data related to many different countries, let’s restrict ourselves to just one country – say Germany. You could pick any country you like.

# Filter out just German data
data_de = data[data["Country"] == "Germany"]

In this dataset, each line item of an invoice (order) is spread over its own row. So, we essentially have to compress all the line items per invoice.

data_de = data_de.groupby(["InvoiceNo","Description"])["Quantity"].sum()
data_de.head(12)

InvoiceNo  Description                        
536527     3 HOOK HANGER MAGIC GARDEN             12
           5 HOOK HANGER MAGIC TOADSTOOL          12
           5 HOOK HANGER RED MAGIC TOADSTOOL      12
           ASSORTED COLOUR LIZARD SUCTION HOOK    24
           CHILDREN'S CIRCUS PARADE MUG           12
           HOMEMADE JAM SCENTED CANDLES           12
           HOT WATER BOTTLE BABUSHKA               4
           JUMBO BAG OWLS                         10
           JUMBO BAG WOODLAND ANIMALS             10
           MULTI COLOUR SILVER T-LIGHT HOLDER     12
           PACK 3 FIRE ENGINE/CAR PATCHES         12
           PICTURE DOMINOES                       12
Name: Quantity, dtype: int64

This is where we turn the dataframe sideways (into a sparse matrix).

data_de = data_de.unstack()
data_de.head()

Description	50'S CHRISTMAS GIFT BAG LARGE	DOLLY GIRL BEAKER	I LOVE LONDON MINI BACKPACK	RED SPOT GIFT BAG LARGE	SET 2 TEA TOWELS I LOVE LONDON	SPACEBOY BABY GIFT SET	10 COLOUR SPACEBOY PEN	12 COLOURED PARTY BALLOONS	12 IVORY ROSE PEG PLACE SETTINGS	12 MESSAGE CARDS WITH ENVELOPES	...	YULETIDE IMAGES GIFT WRAP SET	ZINC HEART T-LIGHT HOLDER	ZINC STAR T-LIGHT HOLDER	ZINC BOX SIGN HOME	ZINC FOLKART SLEIGH BELLS	ZINC HEART LATTICE T-LIGHT HOLDER	ZINC METAL HEART DECORATION	ZINC T-LIGHT HOLDER STAR LARGE	ZINC T-LIGHT HOLDER STARS SMALL	ZINC WILLIE WINKIE CANDLE STICK
InvoiceNo																					
536527	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
536840	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
5 rows × 1703 columns

Wherever there is no quantity, Pandas just puts a NaN. The algorithm wants a zero in that case. So, let’s fill all NaNs with 0.

data_de = data_de.fillna(0)
data_de.head()
Description	50'S CHRISTMAS GIFT BAG LARGE	DOLLY GIRL BEAKER	I LOVE LONDON MINI BACKPACK	RED SPOT GIFT BAG LARGE	SET 2 TEA TOWELS I LOVE LONDON	SPACEBOY BABY GIFT SET	10 COLOUR SPACEBOY PEN	12 COLOURED PARTY BALLOONS	12 IVORY ROSE PEG PLACE SETTINGS	12 MESSAGE CARDS WITH ENVELOPES	...	YULETIDE IMAGES GIFT WRAP SET	ZINC HEART T-LIGHT HOLDER	ZINC STAR T-LIGHT HOLDER	ZINC BOX SIGN HOME	ZINC FOLKART SLEIGH BELLS	ZINC HEART LATTICE T-LIGHT HOLDER	ZINC METAL HEART DECORATION	ZINC T-LIGHT HOLDER STAR LARGE	ZINC T-LIGHT HOLDER STARS SMALL	ZINC WILLIE WINKIE CANDLE STICK
InvoiceNo																					
536527	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
536840	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0

5 rows × 1703 columns


The bulk of the cells in the dataframe are the summation of quantities associated with a particular invoice. The algorithm, however, expects each cell to be binary – either a 0 or a 1. So, wherever the quantity is > 0, we reset it to 1. Let’s first write a function for that.

def reduce_to_binary(qty):
    # 1 if the item appears on the invoice at all, 0 otherwise
    return 1 if qty >= 1 else 0

Now, let’s apply it to all the cells in the dataframe.

data_de = data_de.applymap(reduce_to_binary)
data_de.head()
Description	50'S CHRISTMAS GIFT BAG LARGE	DOLLY GIRL BEAKER	I LOVE LONDON MINI BACKPACK	RED SPOT GIFT BAG LARGE	SET 2 TEA TOWELS I LOVE LONDON	SPACEBOY BABY GIFT SET	10 COLOUR SPACEBOY PEN	12 COLOURED PARTY BALLOONS	12 IVORY ROSE PEG PLACE SETTINGS	12 MESSAGE CARDS WITH ENVELOPES	...	YULETIDE IMAGES GIFT WRAP SET	ZINC HEART T-LIGHT HOLDER	ZINC STAR T-LIGHT HOLDER	ZINC BOX SIGN HOME	ZINC FOLKART SLEIGH BELLS	ZINC HEART LATTICE T-LIGHT HOLDER	ZINC METAL HEART DECORATION	ZINC T-LIGHT HOLDER STAR LARGE	ZINC T-LIGHT HOLDER STARS SMALL	ZINC WILLIE WINKIE CANDLE STICK
InvoiceNo																					
536527	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
536840	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0

5 rows × 1703 columns
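As an aside, the same binarization can be done in one vectorized expression instead of applymap, which is typically much faster on a frame this wide. Shown here on a tiny stand-in dataframe:

```python
import pandas as pd

# Tiny stand-in for the unstacked invoice x item quantity matrix
df = pd.DataFrame({"Milk": [0.0, 12.0], "Cheese": [4.0, 0.0]})

# True wherever quantity > 0, then cast the booleans to 0/1
binary = (df > 0).astype(int)
print(binary)
```

On the real data, `data_de = (data_de > 0).astype(int)` produces the same result as applying reduce_to_binary cell by cell.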

from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

frequent_items = apriori(data_de, min_support=0.07, use_colnames=True)

rules = association_rules(frequent_items, metric="lift", min_threshold=1)
rules.head()

antecedents	consequents	antecedent support	consequent support	support	confidence	lift	leverage	conviction
0	(PLASTERS IN TIN CIRCUS PARADE )	(POSTAGE)	0.087894	0.620232	0.076285	0.867925	1.399354	0.021771	2.875385
1	(POSTAGE)	(PLASTERS IN TIN CIRCUS PARADE )	0.620232	0.087894	0.076285	0.122995	1.399354	0.021771	1.040023
2	(PLASTERS IN TIN SPACEBOY)	(POSTAGE)	0.081260	0.620232	0.076285	0.938776	1.513587	0.025885	6.202875
3	(POSTAGE)	(PLASTERS IN TIN SPACEBOY)	0.620232	0.081260	0.076285	0.122995	1.513587	0.025885	1.047587
4	(PLASTERS IN TIN WOODLAND ANIMALS)	(POSTAGE)	0.104478	0.620232	0.089552	0.857143	1.381971	0.024752	2.658375

Looks like there is an item called “POSTAGE” that is basically the Shipping Costs. We don’t need that, right ? Let’s get rid of it and run it again.

data_de = data_de.drop(columns=['POSTAGE'])
frequent_items = apriori(data_de, min_support=0.05, use_colnames=True)
rules = association_rules(frequent_items, metric="lift", min_threshold=1)
rules.head()

antecedents	consequents	antecedent support	consequent support	support	confidence	lift	leverage	conviction
0	(PLASTERS IN TIN CIRCUS PARADE )	(PLASTERS IN TIN WOODLAND ANIMALS)	0.087894	0.104478	0.051410	0.584906	5.598383	0.042227	2.157395
1	(PLASTERS IN TIN WOODLAND ANIMALS)	(PLASTERS IN TIN CIRCUS PARADE )	0.104478	0.087894	0.051410	0.492063	5.598383	0.042227	1.795709
2	(PLASTERS IN TIN WOODLAND ANIMALS)	(ROUND SNACK BOXES SET OF4 WOODLAND )	0.104478	0.185738	0.056385	0.539683	2.905612	0.036979	1.768914
3	(ROUND SNACK BOXES SET OF4 WOODLAND )	(PLASTERS IN TIN WOODLAND ANIMALS)	0.185738	0.104478	0.056385	0.303571	2.905612	0.036979	1.285878
4	(ROUND SNACK BOXES SET OF 4 FRUITS )	(ROUND SNACK BOXES SET OF4 WOODLAND )	0.119403	0.185738	0.099502	0.833333	4.486607	0.077325	4.885572

Now, we are getting somewhere. We have specifically chosen a minimum support (min_support) of 0.05 – meaning, only look at item sets that occur in at least 5% of the transactions. You can cast your net wider and capture rarer items as well if you want.

However, we are specifically interested in the lift column. The higher the lift, the stronger the association. So, let’s sort by “lift”.

rules.sort_values(by = ["lift"],ascending=False)
antecedents	consequents	antecedent support	consequent support	support	confidence	lift	leverage	conviction
0	(PLASTERS IN TIN CIRCUS PARADE )	(PLASTERS IN TIN WOODLAND ANIMALS)	0.087894	0.104478	0.051410	0.584906	5.598383	0.042227	2.157395
1	(PLASTERS IN TIN WOODLAND ANIMALS)	(PLASTERS IN TIN CIRCUS PARADE )	0.104478	0.087894	0.051410	0.492063	5.598383	0.042227	1.795709
5	(ROUND SNACK BOXES SET OF4 WOODLAND )	(ROUND SNACK BOXES SET OF 4 FRUITS )	0.185738	0.119403	0.099502	0.535714	4.486607	0.077325	1.896670
4	(ROUND SNACK BOXES SET OF 4 FRUITS )	(ROUND SNACK BOXES SET OF4 WOODLAND )	0.119403	0.185738	0.099502	0.833333	4.486607	0.077325	4.885572
6	(SPACEBOY LUNCH BOX )	(ROUND SNACK BOXES SET OF4 WOODLAND )	0.077944	0.185738	0.053068	0.680851	3.665653	0.038591	2.551354
7	(ROUND SNACK BOXES SET OF4 WOODLAND )	(SPACEBOY LUNCH BOX )	0.185738	0.077944	0.053068	0.285714	3.665653	0.038591	1.290879
3	(ROUND SNACK BOXES SET OF4 WOODLAND )	(PLASTERS IN TIN WOODLAND ANIMALS)	0.185738	0.104478	0.056385	0.303571	2.905612	0.036979	1.285878
2	(PLASTERS IN TIN WOODLAND ANIMALS)	(ROUND SNACK BOXES SET OF4 WOODLAND )	0.104478	0.185738	0.056385	0.539683	2.905612	0.036979	1.768914
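Beyond sorting, you can filter the rules dataframe with ordinary pandas boolean indexing – for example, keeping only rules with lift of at least 2 and confidence of at least 0.5. The thresholds are arbitrary choices for illustration, and the miniature dataframe below merely mimics the columns of the mlxtend output:

```python
import pandas as pd

# A miniature stand-in for the mlxtend `rules` dataframe above
rules = pd.DataFrame({
    "antecedents": ["A", "B", "C"],
    "consequents": ["B", "A", "D"],
    "support": [0.10, 0.10, 0.06],
    "confidence": [0.58, 0.49, 0.30],
    "lift": [5.6, 5.6, 1.2],
})

# Keep only strong rules: lift >= 2 AND confidence >= 0.5
strong = rules[(rules["lift"] >= 2) & (rules["confidence"] >= 0.5)]
print(strong)
```

The same expression works unchanged on the real `rules` dataframe produced by association_rules.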