Dimensionality Reduction


What is Dimensionality Reduction

When you can do with just 2 variables, why do you need 4? What do I mean by that? Say you are trying to do classification using the iris dataset. You already know by now that there are 4 predictors

  • Sepal Length
  • Sepal Width
  • Petal Length
  • Petal Width

To classify the target species, you can use any of the classification methods that you have studied so far. What if I told you that you can get the same level of classification accuracy in determining the target species with just 2 predictors? Let me show you with an example.

# Load the dataset
from sklearn import datasets
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2, random_state=100)
X_train[0:4,:]

array([[5.5, 2.4, 3.7, 1. ],
       [5.7, 2.8, 4.1, 1.3],
       [6. , 2.2, 5. , 1.5],
       [4.8, 3. , 1.4, 0.1]])
from sklearn import linear_model
from scipy.special import expit

model = linear_model.LogisticRegression(C=1e5, solver='lbfgs')
model.fit(X_train, y_train)
model.score(X_test,y_test)

0.9666666666666667

That’s an accuracy of about 96.7% with the standard 4-predictor dataset using Logistic Regression. Let’s test our claim – that we can get the same level of accuracy with just 2 variables.

# magic code
from sklearn.decomposition import PCA

pca = PCA(n_components = 2)
pca.fit(iris.data)

# Transform the iris data down to just 2 dimensions
iris_red = pca.transform(iris.data)
X_train, X_test, y_train, y_test = train_test_split(iris_red, iris.target, test_size=0.2, random_state=100)
X_train[0:4,:]

array([[-0.18962247, -0.68028676],
       [ 0.29900084, -0.34889781],
       [ 1.30079171, -0.76114964],
       [-2.78610927, -0.235112  ]])
# Now, let's fit the model with just these 2 variables.
model_red = linear_model.LogisticRegression(C=1e5, solver='lbfgs')
model_red.fit(X_train, y_train)
model_red.score(X_test,y_test)

0.9666666666666667

Voila!! That’s still the same 96.7% accuracy. How did we achieve this magic with just 2 variables?

import matplotlib.pyplot as plt
%matplotlib inline

# plot the 2 PCA components - the three species separate quite cleanly
plt.scatter(X_train[:,0],X_train[:,1],c = y_train)

import matplotlib.pyplot as plt
%matplotlib inline

# for comparison, plot the first two original features (sepal length vs sepal width)
plt.scatter(iris.data[:,0],iris.data[:,1],c = iris.target)

The ability to reduce the number of predictors (variables) while still preserving the predictive power of the original predictors is called Dimensionality Reduction.

The question that comes up next is, why reduce dimensions? In the case of the iris dataset, there are just 4 dimensions, so there is not much of an obvious advantage. But imagine hundreds or thousands of dimensions – and that is not an unrealistic scenario either.

  • Simple 100 x 100 pixel images are actually data points with 10,000 dimensions – one per pixel.

In cases like these, it makes sense to reduce the number of dimensions.
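
To make that image example concrete, here is a tiny sketch – using a synthetic random image rather than a real dataset – showing how a 100 x 100 pixel image becomes a 10,000-dimensional feature vector once you flatten it.

import numpy as np

# a synthetic 100 x 100 grayscale "image" - just random pixel intensities
image = np.random.rand(100, 100)

# flattening it turns every pixel into one feature
feature_vector = image.reshape(-1)
feature_vector.shape     # (10000,)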

Hughes Phenomenon

In 1968, G. Hughes described a strange phenomenon that shows up when working with datasets that have many features. It is best summarized by the picture below.

Intuitively, it would be easy to assume that the more inputs (features) you give the model, the more predictive power it has. However, after a certain point the accuracy of the prediction drops off. This is the essence of the Hughes Phenomenon. It is based on the fact that measuring data (features or variables) typically involves some level of error. When you compound this error over a large number of variables, the error grows so much that the accuracy is affected.

Curse of Dimensionality

Also, in terms of the data required for accurate prediction, more dimensions mean more data. The phrase “Curse of Dimensionality” means

As the number of dimensions(or features) grow, the amount of data we need to generalize accurately grows exponentially.

The reason for this curse is that as the number of dimensions increases, the sparsity of the data increases. You can see this in the picture below, which contrasts the data sparsity of 1-D vs 2-D.

More features doesn’t necessarily mean better accuracy.
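
One quick way to feel this sparsity is to check how far apart random points end up as the number of dimensions grows. This is just an illustrative sketch with made-up sample sizes, not part of the iris example.

import numpy as np

np.random.seed(0)

# 100 random points in a unit "cube", for an increasing number of dimensions
for dims in [1, 2, 10, 100, 1000]:
    points = np.random.rand(100, dims)
    # average distance from the first point to all the others
    distances = np.linalg.norm(points - points[0], axis=1)
    print(dims, round(distances[1:].mean(), 2))

The same 100 points that comfortably cover a 1-D interval are hopelessly spread out in 1000 dimensions – which is exactly why more dimensions demand much more data.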

The Solution

The solution to this problem is Dimensionality Reduction. However, this is not the same as feature selection. Look at the picture below –

As you can see, the reduced dimensions are not the same as the original dimensions.

There are many ways in which dimensionality reduction can be done. Here are some of them:

  • Principal Component Analysis (PCA)
  • Linear Discriminant Analysis (LDA)
  • Random Projection
  • Feature Agglomeration etc

Scikit learn supports some of the methods. However, the actual technique is not all that important – as long as you understand why it is used and apply it when required. In this section, we will try to understand just one of the ways in which dimensionality reduction can be done – PCA.
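
For reference, here is a quick sketch showing where the four methods listed above live in Scikit Learn, and that each one can squeeze the iris data down to 2 features. The parameter values are defaults picked just for illustration, not tuned recommendations.

from sklearn import datasets
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.random_projection import GaussianRandomProjection
from sklearn.cluster import FeatureAgglomeration

iris = datasets.load_iris()

reducers = {
    "PCA": PCA(n_components=2),
    "LDA": LinearDiscriminantAnalysis(n_components=2),
    "Random Projection": GaussianRandomProjection(n_components=2),
    "Feature Agglomeration": FeatureAgglomeration(n_clusters=2),
}

for name, reducer in reducers.items():
    # LDA is supervised, so it also needs the target
    if name == "LDA":
        reduced = reducer.fit_transform(iris.data, iris.target)
    else:
        reduced = reducer.fit_transform(iris.data)
    print(name, reduced.shape)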

Principal Component Analysis

Principal Component Analysis is based on identifying dimensions(variables/predictors) with the largest variance and projecting the other variables onto those dimensions. For example, look at the picture below.

  • Left – Plot of weight vs height to classify men vs women
  • Right – Plot of nationality vs height to classify men vs women

In the plot on the left, both height and weight seem to have a significant impact on classifying the person as male or female. In the plot on the right, on the other hand, only the dimension height seems to have an impact; nationality has virtually no effect at all – effectively there is just one predictor, height. It is easy to see visually in this case.

PCA leverages this idea and approximates it to reduce dimensions. The metric PCA uses is variance – specifically, covariance. Let’s understand this a bit more mathematically.

Variance 

In a previous section ( Probability and Statistics ) we have seen what variance is. It is a measure of how spread out the data is. Principal components are the directions in which there is the most variance. Let’s see an example.
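
As a small stand-in for that example, the sketch below projects two of the iris features onto a few candidate directions (the choice of features and angles is arbitrary, purely for illustration) and prints the variance along each direction. The direction with the largest variance is, roughly, the first principal component in that plane.

import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# use just two of the four features so a "direction" is easy to picture
X = iris.data[:, [0, 2]]          # sepal length and petal length
X = X - X.mean(axis=0)            # centre the data

# variance of the data projected onto a few candidate directions (unit vectors)
for angle in [0, 30, 60, 90]:
    theta = np.radians(angle)
    direction = np.array([np.cos(theta), np.sin(theta)])
    projected = X @ direction
    print(angle, "degrees :", round(projected.var(), 2))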

Eigen Vectors and Eigen Values

Think of the Eigen Value as the variance in the direction of its Eigen Vector. The math behind PCA boils down to finding the Eigen Vector with the largest Eigen Value. Let’s not get into the math of actually finding the Eigen Vectors here.

Once the data is projected onto the Principal Component (the Eigen Vector with the largest Eigen Value), you have reduced the number of dimensions from two to one.
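
If you do want to peek under the hood, here is a minimal sketch that does the same thing without the PCA class: build the covariance matrix of the (centred) iris data, take its Eigen Vectors, and project onto the two with the largest Eigen Values. Up to a possible sign flip, the result should match what pca.transform produced earlier.

import numpy as np
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data - iris.data.mean(axis=0)     # centre the data first

# covariance matrix of the 4 features, then its Eigen Values / Eigen Vectors
cov = np.cov(X, rowvar=False)
eigen_values, eigen_vectors = np.linalg.eigh(cov)

# eigh returns the Eigen Values in ascending order - take the top 2
top_two = eigen_vectors[:, [-1, -2]]

# projecting onto the top Eigen Vectors reduces 4 dimensions to 2
X_reduced = X @ top_two
print(eigen_values)     # the variance along each Eigen Vector
print(X_reduced[:4])    # compare with pca.transform(iris.data)[:4] - signs may flip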

Scikit Learn also has multiple variations of PCA. For example, Kernel PCA is an algorithm that can do dimensionality reduction using kernels, so even data that can’t be linearly separated – like concentric circles – can be handled.

import numpy as np
import matplotlib.pyplot as plt

from sklearn.decomposition import PCA, KernelPCA
from sklearn.datasets import make_circles

np.random.seed(0)

X, y = make_circles(n_samples=400, factor=.3, noise=.05)

plt.scatter(X[:, 0], X[:, 1], c= y)
# use an RBF kernel - the default linear kernel would not separate the circles
pca = KernelPCA(n_components = 1, kernel = "rbf", gamma = 10)
pca.fit(X)
new_data_kernel = pca.transform(X)
plt.scatter(new_data_kernel,y,c=y)

Associative Rule Learning

What is Association Rule Learning

Associative Rule Learning (or mining) is a Machine Learning algorithm for discovering relationships between variables. What is new about this, you must be wondering? Standard statistical methods like correlation or regression also do the same thing, right?

Well, for starters, those are typically either supervised algorithms or ways to quantify the relationship between a known set of variables – for example, find the relationship between

  • smoking and cancer
  • cholesterol and heart disease etc

Associative Rule Learning, on the other hand, discovers or learns relationships between variables that you might not even be aware of. That is why it is classified as an unsupervised Machine Learning algorithm. It goes back to 1993, when a group of researchers set out to find relationships between items sold in supermarkets based on data obtained from their Point-of-Sale systems. Here are two classic examples.

  • The classic example of an unusual relationship that human intuition would easily miss is the relationship between baby diapers and beer in supermarket sales.
  • Another example of doing this on a large scale is the movie recommender systems of Netflix and Amazon Prime Video. Even if you have not used Netflix or Amazon Prime Video, you have almost certainly experienced this through YouTube video recommendations. It is pretty accurate, actually.

Key Terms

Before we get into the actual algorithms, let’s understand a few key terms

  • Support
  • Confidence
  • Lift

Once we understand these terms, we can move on to the actual algorithm itself.

Imagine I have made the following transactions in a supermarket over the course of a week.

Txn ID – Items

  • 154265 – { Milk }
  • 645858 – { Milk, Cheese }
  • 588455 – { Milk, Water, Vegetables }
  • 554855 – { Milk, Cheese, Yoghurt, Water }
  • 558965 – { Water, Vegetables }

Say we are trying to quantify the association (or rule) between the items Milk and Cheese. Specifically the association

  • Milk -> Cheese

and not the other way around ( NOT Cheese -> Milk ). Meaning, we are trying to quantify the association that implies that I buy Cheese if I already have Milk in my basket.

Support

Support is a measure of how frequently an item or an item set appears in a dataset. For example, what is the support for the item set { Milk, Cheese } in the transactions below?

  • 1 – { Milk }
  • 2 – { Milk, Cheese }
  • 3 – { Milk, Water, Vegetables }
  • 4 – { Milk, Cheese, Yoghurt, Water }
  • 5 – { Water, Vegetables }
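
Counting from the transactions above, { Milk, Cheese } appears in transactions 2 and 4 – 2 out of 5 transactions. So

Support( Milk -> Cheese ) = (transactions containing both Milk and Cheese) / (total transactions) = 2 / 5 = 0.4

Two more numbers that we will need below: Support( Milk ) = 4 / 5 = 0.8 and Support( Cheese ) = 2 / 5 = 0.4.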

Confidence

Confidence is a measure of how often this rule is found to be true – of all the transactions that contain X, what fraction also contain Y? It is defined as follows.

Confidence( X -> Y ) = Support( X and Y together ) / Support( X )

For example, in our case, the Confidence for the combination { Milk -> Cheese } would be

Confidence( Milk -> Cheese ) = 0.4 / 0.8 = 0.5

In other words, half of the baskets that contain Milk also contain Cheese.

Lift

Lift of a rule is defined as the confidence of the rule divided by the support of the item on the right-hand side.

Lift( X -> Y ) = Confidence( X -> Y ) / Support( Y ) = Support( X and Y together ) / ( Support( X ) x Support( Y ) )

For { Milk -> Cheese }, that works out to 0.5 / 0.4 = 1.25.

Now that we have got the math behind us, let’s define in simple terms what these terms mean.

  • Support( X -> Y ) – How frequent is this combination? This is relatively straightforward – it is quite simply the fraction of all transactions in which the combination occurs.
  • Confidence( X -> Y ) – How often is this rule true? Or, how likely is it that Y is purchased when X is already in the basket?
  • Lift( X -> Y ) – Defines the strength of the relationship (a small worked sketch follows this list).
    • Lift = 1
      • P(X and Y) = P(X) x P(Y) – meaning the two events are independent (unrelated).
    • Lift > 1
      • X and Y are positively related. In the example above, since Lift > 1, Milk is strongly associated with Cheese – in other words, Milk & Cheese occur together more often than you would expect if they were independent.
    • Lift < 1
      • X and Y have a negative relationship. In the case of Milk & Cheese above, if Lift were < 1, Milk would tend NOT to occur together with Cheese.
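
To tie the three formulas together, here is a small sketch that computes Support, Confidence and Lift for Milk -> Cheese directly from the five toy transactions above, in plain Python with no library.

# the five toy transactions from above
transactions = [
    {"Milk"},
    {"Milk", "Cheese"},
    {"Milk", "Water", "Vegetables"},
    {"Milk", "Cheese", "Yoghurt", "Water"},
    {"Water", "Vegetables"},
]

def support(items):
    # fraction of transactions that contain every item in `items`
    return sum(1 for txn in transactions if items <= txn) / len(transactions)

support_milk_cheese = support({"Milk", "Cheese"})       # 2 / 5   = 0.4
confidence = support_milk_cheese / support({"Milk"})    # 0.4 / 0.8 = 0.5
lift = confidence / support({"Cheese"})                 # 0.5 / 0.4 = 1.25

print(support_milk_cheese, confidence, lift)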

With the math behind us, let’s go on to the implementation of the actual algorithms. We are going to focus on just 2 of the rule mining algorithms

  • Apriori
  • Eclat

Apriori Algorithm

Apriori is an algorithm that combs through large datasets to identify different rules(associations). At a high level this is how it works.

  • Step 1 – Identify single items and how frequently they occur – call this set 1. To reduce the complexity, we typically set a minimum support level.
    • For example, in a supermarket dataset, this step identifies all the individual items (Milk, Cheese, Water etc.) and how frequently they occur. If some items (say exotic spices) are not all that frequent, they are removed from this set (not from the actual dataset).

set 1
Item – Frequency

  • Milk – 4
  • Cheese – 2
  • Water – 3
  • Vegetables – 2
  • Yoghurt – 1

Say we set a cut-off at a frequency count of 2; we are then left with only 4 items (Yoghurt is left out).

  • Step 2 – Prepare all 2-item combinations of the items in set 1. Once again, go through the original dataset to find the frequency of occurrence of each of the 2-item combinations, and once again set a minimum support level to reduce the complexity.
    • For example, we identified in set 1 above that Milk, Cheese, Water and Vegetables each occur in at least 40% of the transactions. Now, identify all 2-item combinations of them.

set 2
Item set – Frequency

  • { Milk, Cheese } – 2
  • { Milk, Water } – 2
  • { Milk, Vegetables } – 1
  • { Cheese, Water } – 1
  • { Cheese, Vegetables } – 0
  • { Water, Vegetables } – 2

Once again, with a cut-off of 2, only 3 item sets remain

{Milk, Cheese}
{Milk, Water}
{Water, Vegetables}

  • Step 3 – Increase the combination size (3-item sets, 4-item sets and so on) and repeat step 2 recursively until no more frequent item sets are found. A compact sketch of these steps follows.
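
Here is a compact sketch of those three steps in plain Python, run on the toy transactions above with a minimum support count of 2. It is only meant to make the generate-candidates-and-prune loop visible, not to be an efficient implementation.

from itertools import combinations

transactions = [
    {"Milk"},
    {"Milk", "Cheese"},
    {"Milk", "Water", "Vegetables"},
    {"Milk", "Cheese", "Yoghurt", "Water"},
    {"Water", "Vegetables"},
]
min_count = 2

def count(itemset):
    # how many transactions contain every item in the item set
    return sum(itemset <= txn for txn in transactions)

# Step 1 - frequent 1-item sets (the minimum support count of 2 drops Yoghurt)
frequent = [frozenset([item])
            for item in {i for txn in transactions for i in txn}
            if count(frozenset([item])) >= min_count]

size = 2
while frequent:
    print([sorted(s) for s in frequent])
    # Steps 2 and 3 - combine the surviving sets into candidates one item larger,
    # then keep only the candidates that meet the minimum support count
    candidates = {a | b for a, b in combinations(frequent, 2) if len(a | b) == size}
    frequent = [c for c in candidates if count(c) >= min_count]
    size += 1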

Implementation

Scikit Learn does not have association rule mining algorithms. Luckily, there are many implementations of the Apriori algorithm in standard Python. For example, one of the packages is mlxtend, available as a standard Python package that you can install using pip.

pip install mlxtend

You can find more details on the implementation here

Let’s first increase the size of the dataset. The University of California, Irvine (UCI) has a public version of an online retail dataset from the UK. You can get it from here. We have also downloaded it for you into the data directory. It is a 25 MB dataset that contains data from many different countries in Europe and Asia.

Different implementations require the data to be in different formats. For example, mlxtend requires the data to be in a sparse matrix format. Let’s do the required steps below.

  • Read the dataset
  • Process the dataset into sparse matrix format
  • Run Apriori Algorithm

1. Read the Dataset

import pandas as pd

data = pd.read_excel('./data/online_retail.xlsx')
data.head()

   InvoiceNo  StockCode  Description                          Quantity  InvoiceDate          UnitPrice  CustomerID  Country
0  536365     85123A     WHITE HANGING HEART T-LIGHT HOLDER   6         2010-12-01 08:26:00  2.55       17850.0     United Kingdom
1  536365     71053      WHITE METAL LANTERN                  6         2010-12-01 08:26:00  3.39       17850.0     United Kingdom
2  536365     84406B     CREAM CUPID HEARTS COAT HANGER       8         2010-12-01 08:26:00  2.75       17850.0     United Kingdom
3  536365     84029G     KNITTED UNION FLAG HOT WATER BOTTLE  6         2010-12-01 08:26:00  3.39       17850.0     United Kingdom
4  536365     84029E     RED WOOLLY HOTTIE WHITE HEART.       6         2010-12-01 08:26:00  3.39       17850.0     United Kingdom

2. Process the data into a Sparse Matrix

  • Just in case the Invoice Number and Description columns are not strings, convert them to strings.
data["Description"] = data["Description"].astype(str)
data["InvoiceNo"] = data["InvoiceNo"].astype(str)

  • Drop all blanks
data = data.dropna()
  • We are specifically interested in 3 columns
    • Invoice Number
    • Description
    • Quantity

Since there is data related to many different countries, let’s restrict ourselves to just one country – say Germany. You could pick any country you like.

# Filter out just German data
data_de = data[data["Country"] == "Germany"]

In this dataset, each item of an invoice (order) sits on its own row. So we essentially have to aggregate the line items per invoice.

data_de = data_de.groupby(["InvoiceNo","Description"])["Quantity"].sum()
data_de.head(12)

InvoiceNo  Description                        
536527     3 HOOK HANGER MAGIC GARDEN             12
           5 HOOK HANGER MAGIC TOADSTOOL          12
           5 HOOK HANGER RED MAGIC TOADSTOOL      12
           ASSORTED COLOUR LIZARD SUCTION HOOK    24
           CHILDREN'S CIRCUS PARADE MUG           12
           HOMEMADE JAM SCENTED CANDLES           12
           HOT WATER BOTTLE BABUSHKA               4
           JUMBO BAG OWLS                         10
           JUMBO BAG WOODLAND ANIMALS             10
           MULTI COLOUR SILVER T-LIGHT HOLDER     12
           PACK 3 FIRE ENGINE/CAR PATCHES         12
           PICTURE DOMINOES                       12
Name: Quantity, dtype: int64

This is where we turn the dataframe sideways (pivot it into a wide, sparse matrix) – one row per invoice, one column per item.

data_de = data_de.unstack()
data_de.head()

Description	50'S CHRISTMAS GIFT BAG LARGE	DOLLY GIRL BEAKER	I LOVE LONDON MINI BACKPACK	RED SPOT GIFT BAG LARGE	SET 2 TEA TOWELS I LOVE LONDON	SPACEBOY BABY GIFT SET	10 COLOUR SPACEBOY PEN	12 COLOURED PARTY BALLOONS	12 IVORY ROSE PEG PLACE SETTINGS	12 MESSAGE CARDS WITH ENVELOPES	...	YULETIDE IMAGES GIFT WRAP SET	ZINC HEART T-LIGHT HOLDER	ZINC STAR T-LIGHT HOLDER	ZINC BOX SIGN HOME	ZINC FOLKART SLEIGH BELLS	ZINC HEART LATTICE T-LIGHT HOLDER	ZINC METAL HEART DECORATION	ZINC T-LIGHT HOLDER STAR LARGE	ZINC T-LIGHT HOLDER STARS SMALL	ZINC WILLIE WINKIE CANDLE STICK
InvoiceNo																					
536527	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
536840	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
536861	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
536967	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
536983	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
5 rows × 1703 columns

Wherever there is no quantity, Pandas just puts a NaN. The algorithm wants a zero in that case. So, let’s fill all NaN’s with 0.

data_de = data_de.fillna(0)
data_de.head()
Description	50'S CHRISTMAS GIFT BAG LARGE	DOLLY GIRL BEAKER	I LOVE LONDON MINI BACKPACK	RED SPOT GIFT BAG LARGE	SET 2 TEA TOWELS I LOVE LONDON	SPACEBOY BABY GIFT SET	10 COLOUR SPACEBOY PEN	12 COLOURED PARTY BALLOONS	12 IVORY ROSE PEG PLACE SETTINGS	12 MESSAGE CARDS WITH ENVELOPES	...	YULETIDE IMAGES GIFT WRAP SET	ZINC HEART T-LIGHT HOLDER	ZINC STAR T-LIGHT HOLDER	ZINC BOX SIGN HOME	ZINC FOLKART SLEIGH BELLS	ZINC HEART LATTICE T-LIGHT HOLDER	ZINC METAL HEART DECORATION	ZINC T-LIGHT HOLDER STAR LARGE	ZINC T-LIGHT HOLDER STARS SMALL	ZINC WILLIE WINKIE CANDLE STICK
InvoiceNo																					
536527	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
536840	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
536861	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
536967	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0
536983	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0	0.0

5 rows × 1703 columns


Most of the cells in the dataframe contain the sum of quantities for a particular item on a particular invoice. The algorithm expects each cell to be binary – either a 0 or a 1 – so wherever the quantity is greater than 0 we reset it to 1. Let’s first write a function for that.

def reduce_to_binary(qty) :
    # anything that was purchased at least once becomes 1, everything else 0
    if qty >= 1 :
        return 1
    return 0

Now, let’s apply it to all the cells in the dataframe.

data_de = data_de.applymap(reduce_to_binary)
data_de.head()
Description	50'S CHRISTMAS GIFT BAG LARGE	DOLLY GIRL BEAKER	I LOVE LONDON MINI BACKPACK	RED SPOT GIFT BAG LARGE	SET 2 TEA TOWELS I LOVE LONDON	SPACEBOY BABY GIFT SET	10 COLOUR SPACEBOY PEN	12 COLOURED PARTY BALLOONS	12 IVORY ROSE PEG PLACE SETTINGS	12 MESSAGE CARDS WITH ENVELOPES	...	YULETIDE IMAGES GIFT WRAP SET	ZINC HEART T-LIGHT HOLDER	ZINC STAR T-LIGHT HOLDER	ZINC BOX SIGN HOME	ZINC FOLKART SLEIGH BELLS	ZINC HEART LATTICE T-LIGHT HOLDER	ZINC METAL HEART DECORATION	ZINC T-LIGHT HOLDER STAR LARGE	ZINC T-LIGHT HOLDER STARS SMALL	ZINC WILLIE WINKIE CANDLE STICK
InvoiceNo																					
536527	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
536840	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
536861	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
536967	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
536983	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0

5 rows × 1703 columns

from mlxtend.frequent_patterns import apriori
from mlxtend.frequent_patterns import association_rules

# find item sets that appear in at least 7% of the (German) invoices
frequent_items = apriori(data_de, min_support=0.07, use_colnames=True)

rules = association_rules(frequent_items, metric="lift", min_threshold=1)
rules.head()

antecedents	consequents	antecedent support	consequent support	support	confidence	lift	leverage	conviction
0	(PLASTERS IN TIN CIRCUS PARADE )	(POSTAGE)	0.087894	0.620232	0.076285	0.867925	1.399354	0.021771	2.875385
1	(POSTAGE)	(PLASTERS IN TIN CIRCUS PARADE )	0.620232	0.087894	0.076285	0.122995	1.399354	0.021771	1.040023
2	(PLASTERS IN TIN SPACEBOY)	(POSTAGE)	0.081260	0.620232	0.076285	0.938776	1.513587	0.025885	6.202875
3	(POSTAGE)	(PLASTERS IN TIN SPACEBOY)	0.620232	0.081260	0.076285	0.122995	1.513587	0.025885	1.047587
4	(PLASTERS IN TIN WOODLAND ANIMALS)	(POSTAGE)	0.104478	0.620232	0.089552	0.857143	1.381971	0.024752	2.658375

Looks like there is an item called “POSTAGE” that is basically the shipping cost. We don’t need that, right? Let’s get rid of it and run it again.

# drop the POSTAGE column. errors='ignore' keeps this from failing if the
# column has already been dropped in an earlier run of the notebook.
data_de = data_de.drop(columns=['POSTAGE'], errors='ignore')

frequent_items = apriori(data_de, min_support=0.05, use_colnames=True)
rules = association_rules(frequent_items, metric="lift", min_threshold=1)
rules.head()

antecedents	consequents	antecedent support	consequent support	support	confidence	lift	leverage	conviction
0	(PLASTERS IN TIN CIRCUS PARADE )	(PLASTERS IN TIN WOODLAND ANIMALS)	0.087894	0.104478	0.051410	0.584906	5.598383	0.042227	2.157395
1	(PLASTERS IN TIN WOODLAND ANIMALS)	(PLASTERS IN TIN CIRCUS PARADE )	0.104478	0.087894	0.051410	0.492063	5.598383	0.042227	1.795709
2	(PLASTERS IN TIN WOODLAND ANIMALS)	(ROUND SNACK BOXES SET OF4 WOODLAND )	0.104478	0.185738	0.056385	0.539683	2.905612	0.036979	1.768914
3	(ROUND SNACK BOXES SET OF4 WOODLAND )	(PLASTERS IN TIN WOODLAND ANIMALS)	0.185738	0.104478	0.056385	0.303571	2.905612	0.036979	1.285878
4	(ROUND SNACK BOXES SET OF 4 FRUITS )	(ROUND SNACK BOXES SET OF4 WOODLAND )	0.119403	0.185738	0.099502	0.833333	4.486607	0.077325	4.885572

Now we are getting somewhere. We have specifically chosen a minimum support (min_support) of 0.05 – meaning, only look at item sets that occur in at least 5% of the transactions. You can cast your net wider and capture rarer items as well if you want.

However, we are specifically interested in the lift column. The higher the lift, the stronger the association. So, let’s sort by “lift”.

rules.sort_values(by = ["lift"],ascending=False)
antecedents	consequents	antecedent support	consequent support	support	confidence	lift	leverage	conviction
0	(PLASTERS IN TIN CIRCUS PARADE )	(PLASTERS IN TIN WOODLAND ANIMALS)	0.087894	0.104478	0.051410	0.584906	5.598383	0.042227	2.157395
1	(PLASTERS IN TIN WOODLAND ANIMALS)	(PLASTERS IN TIN CIRCUS PARADE )	0.104478	0.087894	0.051410	0.492063	5.598383	0.042227	1.795709
5	(ROUND SNACK BOXES SET OF4 WOODLAND )	(ROUND SNACK BOXES SET OF 4 FRUITS )	0.185738	0.119403	0.099502	0.535714	4.486607	0.077325	1.896670
4	(ROUND SNACK BOXES SET OF 4 FRUITS )	(ROUND SNACK BOXES SET OF4 WOODLAND )	0.119403	0.185738	0.099502	0.833333	4.486607	0.077325	4.885572
6	(SPACEBOY LUNCH BOX )	(ROUND SNACK BOXES SET OF4 WOODLAND )	0.077944	0.185738	0.053068	0.680851	3.665653	0.038591	2.551354
7	(ROUND SNACK BOXES SET OF4 WOODLAND )	(SPACEBOY LUNCH BOX )	0.185738	0.077944	0.053068	0.285714	3.665653	0.038591	1.290879
3	(ROUND SNACK BOXES SET OF4 WOODLAND )	(PLASTERS IN TIN WOODLAND ANIMALS)	0.185738	0.104478	0.056385	0.303571	2.905612	0.036979	1.285878
2	(PLASTERS IN TIN WOODLAND ANIMALS)	(ROUND SNACK BOXES SET OF4 WOODLAND )	0.104478	0.185738	0.056385	0.539683	2.905612	0.036979	1.768914

Clustering


What is Clustering

Imagine the weights of students from three different classes (grades). Just to make the weight groups clearly distinguishable from each other, let’s pick classes a bit far apart – say 1st, 4th and 7th.

# code to reproduce the data

import numpy as np

# set the seed to make this reproducible
np.random.seed(100)

# samples per bucket
size = 5

# generate `size` numbers per group from a normal distribution with a set mean and sd.
# convert them into integers for simplicity

weights_1 = np.around(np.random.normal(20, 5, size)).astype(int)
weights_4 = np.around(np.random.normal(40, 5, size)).astype(int)
weights_7 = np.around(np.random.normal(70, 5, size)).astype(int)

Let’s plot the data and mark that these are different sets of data.

import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
mpl.rcParams['figure.dpi'] = 200

plt.scatter(weights_1, np.zeros(size),alpha=0.5)
plt.scatter(weights_4, np.zeros(size),alpha=0.5)
plt.scatter(weights_7, np.zeros(size),alpha=0.5)

# draw vertical lines at the mean points to show that these are three distinct groups
plt.axvline(20, linestyle="--",alpha=0.5, color="blue")
plt.axvline(40, linestyle="--",alpha=0.5, color="orange")
plt.axvline(70, linestyle="--",alpha=0.5, color="green")

The data seems pretty straightforward (now that we have colored the groups and marked the means). But clustering is the process of grouping data points into clusters automatically. Imagine giving this data to somebody without telling them the specific buckets – something like this.

import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
mpl.rcParams['figure.dpi'] = 200

plt.scatter(weights_1, np.zeros(size),alpha=0.5, color="gray")
plt.scatter(weights_4, np.zeros(size),alpha=0.5, color="gray")
plt.scatter(weights_7, np.zeros(size),alpha=0.5, color="gray")

How would that person cluster the data into buckets? Since this is data in just one dimension, the human mind can intuitively do a whole lot of the work. For example, just by looking at the data, anybody would be able to make out the following groups.

A couple of points might land here or there, but mostly the conclusion is pretty reasonable. When a machine does the same thing, it is called clustering. There are many algorithms for clustering.

  • K-means clustering
  • Mean shift clustering
  • Hierarchical clustering etc

k-means clustering

Let’s see how k-means clustering can cluster this data automatically. Here are the steps of the algorithm.

  • Step 1 – Pick k points – call them cluster centroids. At this point they are just random.
  • Step 2 – For each point in the dataset, find the closest cluster centroid and assign the point to that cluster. After step 2, each point belongs to one of the k clusters.
  • Step 3 – Calculate the new centroid (mean) of each cluster and move the cluster centroids (randomly set in step 1) to these new centroids.

Continue Step 2 and Step 3 until the centroids no longer change.
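
Before walking through these steps by hand, here is a minimal numpy sketch of the whole loop for 1-dimensional data with k = 2. The starting centroids are simply the first k data points – an arbitrary choice, just for illustration – and the sketch assumes no cluster ever ends up empty.

import numpy as np

def kmeans_1d(points, k=2, iterations=10):
    # Step 1 - start with k arbitrary centroids (here, the first k data points)
    centroids = points[:k].astype(float)
    for _ in range(iterations):
        # Step 2 - assign every point to its closest centroid
        labels = np.argmin(np.abs(points[:, None] - centroids[None, :]), axis=1)
        # Step 3 - move each centroid to the mean of the points assigned to it
        centroids = np.array([points[labels == j].mean() for j in range(k)])
    return centroids, labels

Running kmeans_1d on the combined weights array that we build next should end up with centroids close to 20.6 and 70.8 – which is also what scikit learn reports at the end of this section.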

Let’s do this manually for just two of the weight groups

  • weights_1
  • weights_7
# combine these arrays into a single vector.
weights = np.concatenate((weights_1.ravel(), weights_7.ravel()))
weights
array([11, 22, 26, 19, 25, 68, 72, 67, 74, 73])
# Step 1
# pick 2 centroids at random. 
from random import randrange

centroid_1 = randrange( np.amin(weights),  np.amax(weights) )
centroid_2 = randrange( np.amin(weights),  np.amax(weights) )

centroid_1
65
centroid_2
55
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
mpl.rcParams['figure.dpi'] = 200

plt.scatter(weights, np.zeros(10),alpha=0.5, color="gray")
plt.scatter(centroid_1,0,color="green", marker="*")
plt.scatter(centroid_2,0,color="red", marker="*")

Step 2 – Mark each point as closer to one cluster point or the other.

Visually, you can see that all the points to the left are closer to the red dot and all the points to the right are closer to the green dot.

Step 3 – Calculate the new centroids.

weights = np.sort(weights)
weights

array([11, 19, 22, 25, 26, 67, 68, 72, 73, 74])
# new centroids - the average of the points assigned to each cluster
# (this walkthrough also folds the old centroid values, 55 and 65, into the averages)
centroid_1 = np.array ([11, 19, 22, 25, 26, 55]).mean()
centroid_2 = np.array ([ 67, 68, 72, 73, 74, 65]).mean()

import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
mpl.rcParams['figure.dpi'] = 200

plt.scatter(weights, np.zeros(10),alpha=0.5, color="gray")
plt.scatter(centroid_1,0,color="red", marker="*")
plt.scatter(centroid_2,0,color="green", marker="*")

plt.scatter(55,0,color="red", marker="*", alpha=0.5)
plt.scatter(65,0,color="green", marker="*",alpha=0.5)

plt.annotate("old centroid", (55,0), arrowprops= dict(arrowstyle="-&gt;"),xytext=(60,0.002),color="red");
plt.annotate("new centroid", (centroid_1,0), arrowprops= dict(arrowstyle="-&gt;"),xytext=(30,0.002),color="red");

plt.annotate("new centroid", (centroid_2,0), arrowprops= dict(arrowstyle="-&gt;"),xytext=(60,-0.002),color="green");
plt.annotate("old centroid", (65,0), arrowprops= dict(arrowstyle="-&gt;"),xytext=(40,-0.002),color="green");

centroid_1_third = np.array ([11, 19, 22, 25, 26,centroid_1]).mean()
centroid_2_third = np.array ([ 67, 68, 72, 73, 74, centroid_2]).mean()

import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
mpl.rcParams['figure.dpi'] = 200

plt.scatter(weights, np.zeros(10),alpha=0.5, color="gray")
plt.scatter(centroid_1,0,color="red", marker="*")
plt.scatter(centroid_2,0,color="green", marker="*")

# centroid 1
plt.scatter(55,0,color="red", marker="*", alpha=0.2)
plt.scatter(65,0,color="green", marker="*",alpha=0.2)

# centroid 2
plt.scatter(centroid_1,0,color="red", marker="*", alpha=0.5)
plt.scatter(centroid_2,0,color="green", marker="*",alpha=0.5)

#centroid 3
plt.scatter(centroid_1_third,0,color="red", marker="*")
plt.scatter(centroid_2_third,0,color="green", marker="*")

plt.annotate("new centroid", (centroid_1_third,0), arrowprops= dict(arrowstyle="-&gt;"),xytext=(30,0.002),color="red");
plt.annotate("new centroid", (centroid_2_third,0), arrowprops= dict(arrowstyle="-&gt;"),xytext=(60,-0.002),color="green");

Do this repeatedly until the centroids no longer move. That’s it – that is k-means clustering. Now, let’s do it in Python using scikit learn.

from sklearn.cluster import KMeans

kmeans = KMeans(n_clusters=2, random_state=0).fit(weights.reshape(-1,1))

kmeans.cluster_centers_
array([[70.8],
       [20.6]])

There you go – those are the cluster centers, pretty close to our picture above, right? Now we can use the model to predict new points – say, we want to see where the following points fall.

  • 30
  • 60
  • 90
  • 5
  • 40
  • 45
data = np.array([30,60,90,5,40,45]).reshape(-1,1)

predict_class = kmeans.predict(data)
predict_class
array([1, 0, 0, 1, 1, 1])
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline

plt.scatter(weights, np.zeros(10),alpha=0.5, color="gray")
plt.scatter(data,np.zeros(len(data)), c=predict_class)

All the yellow dots belong to the first cluster and the dark blue dots belong to the second cluster.

Here is a quick visual of a smaller dataset in 2 dimensional space. What I want you to focus on is how the centroid moves with each iteration.

Optimum k value

Now, for the most important question: how do you find the optimum k value? In the previous example, we told the machine to classify the data into 2 clusters, but most of the time you wouldn’t know the number of clusters up front. In cases like that, you need to find the optimum k value. That is where the sum of squared distances of the points to their cluster centers comes in. It is also called inertia, and it is available ready-made as an attribute of the fitted clustering model.

kmeans.inertia_
184.0

Now, to find the optimum k value, let’s fit models with k from 1 to 9 and record the inertia (sum of squared distances) for each.

# Hold the inertia (sum of squared distances) for each k in this array
sum_squares = []

for i in range(1,10) :
    model = KMeans(n_clusters = i, max_iter = 100)
    model.fit(weights.reshape(-1,1))
    
    sum_squares.append(model.inertia_)
plt.plot( range(1,10), sum_squares)
plt.xlabel(" K value ")
plt.ylabel("Sum of Squared distances or Inertia")
Text(0, 0.5, 'Sum of Squared distances or Inertia')

This is called an elbow plot – it looks pretty much like an elbow, right? Beyond the bend in the elbow (the inflection point) at a k value of 2, there is not much further decrease in the sum of squared distances (inertia). So, the optimum k value in this case is 2 – which is correct.

Let’s add the third weight category as well and see if this changes.

# combine these arrays into a single vector.
weights = np.concatenate((weights_1.ravel(), weights_4.ravel(), weights_7.ravel()))

# Hold the inertia (sum of squared distances) for each k in this array
sum_squares = []

for i in range(1,10) :
    model = KMeans(n_clusters = i, max_iter = 100)
    model.fit(weights.reshape(-1,1))
    
    sum_squares.append(model.inertia_)
    
plt.plot( range(1,10), sum_squares)
plt.xlabel(" K value ")
plt.ylabel("Sum of Squared distances or Inertia")

Text(0, 0.5, 'Sum of Squared distances or Inertia')

It does – now the elbow point has moved to a k value of 3, which shows that the elbow method works. Yay!!