Association Rule Learning
What is Association Rule Learning
Association Rule Learning (or mining) is a Machine Learning algorithm for discovering relationships between variables. What is new about this, you must be wondering? Standard statistical methods like correlation or regression also do the same thing, right?
Well, for starters, those are typically either supervised algorithms or ways to quantify the relationship between a known set of variables. For example, find the relationship between
- smoking and cancer
- cholesterol and heart disease, etc.
Association Rule Learning, on the other hand, discovers or learns relationships between variables that you might not be aware of. That is why it is classified as an unsupervised Machine Learning algorithm. It was first described in 1993, when a group of researchers were interested in finding relationships between items sold in supermarkets, based on data obtained from their Point-of-Sale systems. Here are two classic examples.
- The classic example of an unusual relationship that human intuition could easily miss is the relationship between baby diapers and beer in supermarket sales.
- Another example, on a much larger scale, is the movie recommender systems of Netflix and Amazon Prime Video. Even if you have not used Netflix or Amazon Prime Video, you have probably experienced the same idea in YouTube video recommendations. It is pretty accurate, actually.
Key Terms
Before we get into the actual algorithms, let's understand a few key terms
- Support
- Confidence
- Lift
Once we understand these terms, we can move on to the actual algorithm itself.
Imagine I have made the following transactions in a supermarket over the course of a week.
Txn ID – Items
- 154265 – { Milk }
- 645858 – { Milk, Cheese }
- 588455 – { Milk, Water, Vegetables }
- 554855 – { Milk, Cheese, Yoghurt, Water }
- 558965 – { Water, Vegetables }
Say we are trying to quantify the association (or rule) between the items Milk and Cheese. Specifically, the association
- Milk -> Cheese
and not the other way around (NOT Cheese -> Milk). Meaning, we are trying to quantify the association that implies that I buy Cheese if I already have Milk in my basket.
Support
Support is a measure of how frequently an item or an item set appears in a dataset. For example, what is the support for the item set { Milk, Cheese }?
- 1 – { Milk }
- 2 – { Milk, Cheese }
- 3 – { Milk, Water, Vegetables }
- 4 – { Milk, Cheese, Yoghurt, Water }
- 5 – { Water, Vegetables }
$$ \text{Support}(X) = \frac{\text{number of transactions containing } X}{\text{total number of transactions}} $$

{ Milk, Cheese } appears together in 2 of the 5 transactions (transactions 2 and 4), so

$$ \text{Support}(\{\text{Milk, Cheese}\}) = \frac{2}{5} = 0.4 $$
Confidence
Confidence is a measure of how often the rule is found to be true. It is defined as follows.
$$ \text{Confidence}(X \rightarrow Y) = \frac{\text{Support}(X \cup Y)}{\text{Support}(X)} $$
For example, in our case, the Confidence for the combination { Milk -> Cheese } would be
$$ \text{Confidence}(\text{Milk} \rightarrow \text{Cheese}) = \frac{\text{Support}(\{\text{Milk, Cheese}\})}{\text{Support}(\{\text{Milk}\})} = \frac{2/5}{4/5} = 0.5 $$
Lift
The Lift of a rule is defined as follows.
$$ \text{Lift}(X \rightarrow Y) = \frac{\text{Support}(X \cup Y)}{\text{Support}(X) \times \text{Support}(Y)} $$

For example, in our case, the Lift for { Milk -> Cheese } would be

$$ \text{Lift}(\text{Milk} \rightarrow \text{Cheese}) = \frac{2/5}{(4/5) \times (2/5)} = 1.25 $$
Now that we have the math behind us, let's define in simple terms what these measures mean.
- Support(X -> Y) – How frequent is this combination? This is relatively straightforward – it is quite simply the fraction of all transactions in which the combination occurs.
- Confidence(X -> Y) – How often is this combination found to be true? Or, how likely is it that Y is purchased when X is purchased?
- Lift(X -> Y) – Defines the strength of the relationship between the two.
- Lift = 1
- P(X and Y) = P(X) × P(Y) – meaning the two events are independent of each other.
- Lift > 1
- X is positively related to Y. For example, since Lift = 1.25 > 1 in the example above, Milk is strongly associated with Cheese – or in other words, Milk & Cheese occur together more often than they would if they were independent.
- Lift < 1
- X and Y have a negative relationship. In the case of Milk & Cheese above, if Lift were < 1, then Milk and Cheese would occur together less often than chance would suggest.
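To make these measures concrete, here is a minimal sketch in plain Python that computes all three for the toy transactions above (the function names are my own, not from any library):

transactions = [
    {"Milk"},
    {"Milk", "Cheese"},
    {"Milk", "Water", "Vegetables"},
    {"Milk", "Cheese", "Yoghurt", "Water"},
    {"Water", "Vegetables"},
]

def support(itemset, transactions):
    # Fraction of transactions containing every item in the itemset
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(X, Y, transactions):
    # How often Y is bought, given that X is in the basket
    return support(X | Y, transactions) / support(X, transactions)

def lift(X, Y, transactions):
    # Observed co-occurrence vs. what independence would predict
    return support(X | Y, transactions) / (support(X, transactions) * support(Y, transactions))

print(support({"Milk", "Cheese"}, transactions))       # 0.4
print(confidence({"Milk"}, {"Cheese"}, transactions))  # 0.5
print(lift({"Milk"}, {"Cheese"}, transactions))        # 1.25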
With all the math behind us, let's go on to the implementation of the actual algorithms. We are going to focus on just two of the rule mining algorithms
- Apriori
- Eclat
Apriori Algorithm
Apriori is an algorithm that combs through large datasets to identify different rules (associations). At a high level, this is how it works.
- Step 1 – Identify single items and how frequently they occur – call this set 1. To reduce the complexity, we typically set a minimum support level.
- For example, in a supermarket dataset, this step identifies all the individual items (Milk, Cheese, Water, etc.) and how frequently they occur. If some items (say, exotic spices) are not all that frequent, they are removed from this set (not from the actual dataset).

set 1
Item – Frequency
- { Milk } – 4
- { Cheese } – 2
- { Water } – 3
- { Vegetables } – 2
- { Yoghurt } – 1

If we set a cut-off at a frequency count of 2, we are left with only 4 items (leaving Yoghurt out).
- Step 2 – Prepare all 2-item combinations of the items in set 1. Once again, go through the original dataset to find the frequency of occurrence of each of the 2-item combinations, and once again set a minimum support level to reduce the complexity.
- For example, in set 1 above we have identified that Milk, Cheese, Water and Vegetables each occur in at least 40% of the transactions. Now, identify all 2-item combinations of these.
set 2
Item set – Frequency
- { Milk, Cheese } – 2
- { Milk, Water } – 2
- { Milk, Vegetables } – 1
- { Cheese, Water } – 1
- { Cheese, Vegetables } – 0
- { Water, Vegetables } – 2
Once again, with a cut-off of 2, only three item sets remain
{Milk, Cheese}
{Milk, Water}
{Water, Vegetables}
- Step 3 – Increase the combination size and repeat step 2 recursively until no more frequent sets are found. (A short code sketch of the whole procedure follows below.)
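Before moving to a library implementation, here is a minimal, unoptimized sketch of these three steps in plain Python – all names are my own, and a real implementation would prune candidates far more aggressively:

from itertools import combinations

def apriori_sketch(transactions, min_count=2):
    # Step 1: start with all single items as candidates
    items = sorted({item for t in transactions for item in t})
    candidates = [frozenset([i]) for i in items]
    frequent = {}  # maps itemset -> frequency count
    size = 1
    while candidates:
        # Count each candidate against the full dataset
        counts = {c: sum(c <= t for t in transactions) for c in candidates}
        # Keep only the candidates that meet the minimum support (cut-off)
        survivors = {c: n for c, n in counts.items() if n >= min_count}
        frequent.update(survivors)
        # Steps 2 & 3: build (size+1)-item candidates from the survivors and repeat
        size += 1
        candidates = list({a | b for a, b in combinations(survivors, 2)
                           if len(a | b) == size})
    return frequent

transactions = [{"Milk"}, {"Milk", "Cheese"}, {"Milk", "Water", "Vegetables"},
                {"Milk", "Cheese", "Yoghurt", "Water"}, {"Water", "Vegetables"}]
print(apriori_sketch(transactions))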
Implementation
Scikit-learn does not have association rule mining algorithms. Luckily, there are many implementations of the Apriori algorithm in standard Python. For example, one of the packages is mlxtend, available as a standard Python package that you can install using pip.
pip install mlxtend
You can find more details on the implementation here
Let's first increase the size of the dataset. The University of California, Irvine (UCI) hosts a public version of an online retail dataset from a UK-based retailer. You can get it from here. We have also downloaded it for you into the data directory. It is a 25 MB dataset that contains data from many different countries in the EU and Asia.
Different implementations require the data to be in different formats. For example, mlxtend requires the data to be in a sparse matrix format (one row per invoice, one column per item). Let's do the required steps below.
- Read the dataset
- Process the dataset into sparse matrix format
- Run Apriori Algorithm
1. Read the Dataset
import pandas as pd

# Read the UCI online retail dataset (pd.read_excel needs an Excel
# engine like xlrd or openpyxl installed)
data = pd.read_excel('./data/online_retail.xlsx')
data.head()
InvoiceNo StockCode Description Quantity InvoiceDate UnitPrice CustomerID Country
0 536365 85123A WHITE HANGING HEART T-LIGHT HOLDER 6 2010-12-01 08:26:00 2.55 17850.0 United Kingdom
1 536365 71053 WHITE METAL LANTERN 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom
2 536365 84406B CREAM CUPID HEARTS COAT HANGER 8 2010-12-01 08:26:00 2.75 17850.0 United Kingdom
3 536365 84029G KNITTED UNION FLAG HOT WATER BOTTLE 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom
4 536365 84029E RED WOOLLY HOTTIE WHITE HEART. 6 2010-12-01 08:26:00 3.39 17850.0 United Kingdom
2. Process the data into a Sparse Matrix
- Drop all rows with blanks (missing values) first. We do this before any string conversion, because converting first would turn missing descriptions into the literal string 'nan', which dropna() would then no longer catch.
data = data.dropna()
- Just in case InvoiceNo or Description is not a string, convert both to strings.
data["Description"] = data["Description"].astype(str)
data["InvoiceNo"] = data["InvoiceNo"].astype(str)
- We are specifically interested in 3 columns
- Invoice Number
- Description
- Quantity
Since there is data related to many different countries, let’s restrict ourselves to just one country – say Germany. You could pick any country you like.
# Filter out just German data
data_de = data[data["Country"] == "Germany"]
In this dataset, each item in an invoice (order) occupies its own line. So, we essentially have to compress all the line items per invoice.
data_de = data_de.groupby(["InvoiceNo","Description"])["Quantity"].sum()
data_de.head(12)
InvoiceNo Description
536527 3 HOOK HANGER MAGIC GARDEN 12
5 HOOK HANGER MAGIC TOADSTOOL 12
5 HOOK HANGER RED MAGIC TOADSTOOL 12
ASSORTED COLOUR LIZARD SUCTION HOOK 24
CHILDREN'S CIRCUS PARADE MUG 12
HOMEMADE JAM SCENTED CANDLES 12
HOT WATER BOTTLE BABUSHKA 4
JUMBO BAG OWLS 10
JUMBO BAG WOODLAND ANIMALS 10
MULTI COLOUR SILVER T-LIGHT HOLDER 12
PACK 3 FIRE ENGINE/CAR PATCHES 12
PICTURE DOMINOES 12
Name: Quantity, dtype: int64
This is where we turn the dataframe sideways (into a sparse matrix).
data_de = data_de.unstack()
data_de.head()
Description 50'S CHRISTMAS GIFT BAG LARGE DOLLY GIRL BEAKER I LOVE LONDON MINI BACKPACK RED SPOT GIFT BAG LARGE SET 2 TEA TOWELS I LOVE LONDON SPACEBOY BABY GIFT SET 10 COLOUR SPACEBOY PEN 12 COLOURED PARTY BALLOONS 12 IVORY ROSE PEG PLACE SETTINGS 12 MESSAGE CARDS WITH ENVELOPES ... YULETIDE IMAGES GIFT WRAP SET ZINC HEART T-LIGHT HOLDER ZINC STAR T-LIGHT HOLDER ZINC BOX SIGN HOME ZINC FOLKART SLEIGH BELLS ZINC HEART LATTICE T-LIGHT HOLDER ZINC METAL HEART DECORATION ZINC T-LIGHT HOLDER STAR LARGE ZINC T-LIGHT HOLDER STARS SMALL ZINC WILLIE WINKIE CANDLE STICK
InvoiceNo
536527 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
536840 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
536861 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
536967 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
536983 NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
5 rows × 1703 columns
Wherever there is no quantity, pandas just puts a NaN. The algorithm wants a zero in that case. So, let's fill all NaNs with 0.
data_de = data_de.fillna(0)
data_de.head()
Description 50'S CHRISTMAS GIFT BAG LARGE DOLLY GIRL BEAKER I LOVE LONDON MINI BACKPACK RED SPOT GIFT BAG LARGE SET 2 TEA TOWELS I LOVE LONDON SPACEBOY BABY GIFT SET 10 COLOUR SPACEBOY PEN 12 COLOURED PARTY BALLOONS 12 IVORY ROSE PEG PLACE SETTINGS 12 MESSAGE CARDS WITH ENVELOPES ... YULETIDE IMAGES GIFT WRAP SET ZINC HEART T-LIGHT HOLDER ZINC STAR T-LIGHT HOLDER ZINC BOX SIGN HOME ZINC FOLKART SLEIGH BELLS ZINC HEART LATTICE T-LIGHT HOLDER ZINC METAL HEART DECORATION ZINC T-LIGHT HOLDER STAR LARGE ZINC T-LIGHT HOLDER STARS SMALL ZINC WILLIE WINKIE CANDLE STICK
InvoiceNo
536527 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
536840 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
536861 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
536967 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
536983 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
5 rows × 1703 columns
The bulk of the cells in the dataframe hold the summed quantities for a particular invoice. The algorithm, however, expects each cell to be binary – either a 0 or a 1. So wherever the quantity is 1 or more, we need to reset it to 1, and to 0 otherwise. Let's first write a function for that.
def reduce_to_binary(qty):
    # 1 if the item appears on the invoice at all, 0 otherwise
    if qty >= 1:
        return 1
    return 0
Now, let’s apply it to all the cells in the dataframe.
data_de = data_de.applymap(reduce_to_binary)
data_de.head()
Description 50'S CHRISTMAS GIFT BAG LARGE DOLLY GIRL BEAKER I LOVE LONDON MINI BACKPACK RED SPOT GIFT BAG LARGE SET 2 TEA TOWELS I LOVE LONDON SPACEBOY BABY GIFT SET 10 COLOUR SPACEBOY PEN 12 COLOURED PARTY BALLOONS 12 IVORY ROSE PEG PLACE SETTINGS 12 MESSAGE CARDS WITH ENVELOPES ... YULETIDE IMAGES GIFT WRAP SET ZINC HEART T-LIGHT HOLDER ZINC STAR T-LIGHT HOLDER ZINC BOX SIGN HOME ZINC FOLKART SLEIGH BELLS ZINC HEART LATTICE T-LIGHT HOLDER ZINC METAL HEART DECORATION ZINC T-LIGHT HOLDER STAR LARGE ZINC T-LIGHT HOLDER STARS SMALL ZINC WILLIE WINKIE CANDLE STICK
InvoiceNo
536527 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
536840 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
536861 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
536967 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
536983 0 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
5 rows × 1703 columns
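As an aside, the same binarization can be done without a helper function, using a vectorized comparison – this is just an equivalent alternative to the applymap call above:

data_de = (data_de > 0).astype(int)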
from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules
frequent_items = apriori(data_de, min_support=0.07, use_colnames=True)
rules = association_rules(frequent_items, metric="lift", min_threshold=1)
rules.head()
antecedents consequents antecedent support consequent support support confidence lift leverage conviction
0 (PLASTERS IN TIN CIRCUS PARADE ) (POSTAGE) 0.087894 0.620232 0.076285 0.867925 1.399354 0.021771 2.875385
1 (POSTAGE) (PLASTERS IN TIN CIRCUS PARADE ) 0.620232 0.087894 0.076285 0.122995 1.399354 0.021771 1.040023
2 (PLASTERS IN TIN SPACEBOY) (POSTAGE) 0.081260 0.620232 0.076285 0.938776 1.513587 0.025885 6.202875
3 (POSTAGE) (PLASTERS IN TIN SPACEBOY) 0.620232 0.081260 0.076285 0.122995 1.513587 0.025885 1.047587
4 (PLASTERS IN TIN WOODLAND ANIMALS) (POSTAGE) 0.104478 0.620232 0.089552 0.857143 1.381971 0.024752 2.658375
Looks like there is an item called "POSTAGE" that basically represents shipping costs. We don't need that, right? Let's get rid of it and run the algorithm again.
# Drop the POSTAGE column; errors='ignore' makes this cell safe to re-run
data_de = data_de.drop(columns=['POSTAGE'], errors='ignore')
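As an alternative to dropping the column and re-mining, you could also filter POSTAGE out of the rules dataframe directly. A quick sketch – rules_no_postage is a name of my own choosing, and since the antecedents/consequents columns hold frozensets, we compare on their string form:

rules_no_postage = rules[~rules["antecedents"].astype(str).str.contains("POSTAGE") &
                         ~rules["consequents"].astype(str).str.contains("POSTAGE")]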
frequent_items = apriori(data_de, min_support=0.05, use_colnames=True)
rules = association_rules(frequent_items, metric="lift", min_threshold=1)
rules.head()
antecedents consequents antecedent support consequent support support confidence lift leverage conviction
0 (PLASTERS IN TIN CIRCUS PARADE ) (PLASTERS IN TIN WOODLAND ANIMALS) 0.087894 0.104478 0.051410 0.584906 5.598383 0.042227 2.157395
1 (PLASTERS IN TIN WOODLAND ANIMALS) (PLASTERS IN TIN CIRCUS PARADE ) 0.104478 0.087894 0.051410 0.492063 5.598383 0.042227 1.795709
2 (PLASTERS IN TIN WOODLAND ANIMALS) (ROUND SNACK BOXES SET OF4 WOODLAND ) 0.104478 0.185738 0.056385 0.539683 2.905612 0.036979 1.768914
3 (ROUND SNACK BOXES SET OF4 WOODLAND ) (PLASTERS IN TIN WOODLAND ANIMALS) 0.185738 0.104478 0.056385 0.303571 2.905612 0.036979 1.285878
4 (ROUND SNACK BOXES SET OF 4 FRUITS ) (ROUND SNACK BOXES SET OF4 WOODLAND ) 0.119403 0.185738 0.099502 0.833333 4.486607 0.077325 4.885572
Now, we are getting somewhere. We have specifically chosen a minimum support (min_support) of 0.05 – meaning, only look at item sets that occur in at least 5% of the transactions. You can cast your net wider and capture rarer items as well if you want.
However, we are more interested specifically in the lift column. The higher the lift, the stronger the association. So, let's sort by lift.
rules.sort_values(by = ["lift"],ascending=False)
antecedents consequents antecedent support consequent support support confidence lift leverage conviction
0 (PLASTERS IN TIN CIRCUS PARADE ) (PLASTERS IN TIN WOODLAND ANIMALS) 0.087894 0.104478 0.051410 0.584906 5.598383 0.042227 2.157395
1 (PLASTERS IN TIN WOODLAND ANIMALS) (PLASTERS IN TIN CIRCUS PARADE ) 0.104478 0.087894 0.051410 0.492063 5.598383 0.042227 1.795709
5 (ROUND SNACK BOXES SET OF4 WOODLAND ) (ROUND SNACK BOXES SET OF 4 FRUITS ) 0.185738 0.119403 0.099502 0.535714 4.486607 0.077325 1.896670
4 (ROUND SNACK BOXES SET OF 4 FRUITS ) (ROUND SNACK BOXES SET OF4 WOODLAND ) 0.119403 0.185738 0.099502 0.833333 4.486607 0.077325 4.885572
6 (SPACEBOY LUNCH BOX ) (ROUND SNACK BOXES SET OF4 WOODLAND ) 0.077944 0.185738 0.053068 0.680851 3.665653 0.038591 2.551354
7 (ROUND SNACK BOXES SET OF4 WOODLAND ) (SPACEBOY LUNCH BOX ) 0.185738 0.077944 0.053068 0.285714 3.665653 0.038591 1.290879
3 (ROUND SNACK BOXES SET OF4 WOODLAND ) (PLASTERS IN TIN WOODLAND ANIMALS) 0.185738 0.104478 0.056385 0.303571 2.905612 0.036979 1.285878
2 (PLASTERS IN TIN WOODLAND ANIMALS) (ROUND SNACK BOXES SET OF4 WOODLAND ) 0.104478 0.185738 0.056385 0.539683 2.905612 0.036979 1.768914
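Finally, once the rules are sorted, you will typically want to keep only the strongest ones. A short sketch – the thresholds of 2.0 and 0.5 here are arbitrary, chosen purely for illustration:

# Keep only rules with both a healthy lift and a healthy confidence
strong_rules = rules[(rules["lift"] > 2.0) & (rules["confidence"] > 0.5)]
strong_rules.sort_values(by=["lift"], ascending=False).head()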