Decision Trees


What are Decision Trees?

Let me take you back to the number guessing game that we played on day 1 of the course. It is a simple game: the computer chooses a random number between 1 and 100 and you have to guess it. After each guess, the program helps you by telling you whether your guess is higher or lower than the chosen number. Say the number chosen is 60. Let’s visualize this.

Basically, it is a series of decisions based on the clue you get from the program. For lack of a better strategy, we just guess the middle number of whichever half remains (higher or lower). We can think of the same process as a decision tree.
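
To make this concrete, here is a minimal sketch of that guessing strategy – a plain binary search (the function and variable names are just for illustration):

def guess_number(secret, low=1, high=100):
    guesses = 0
    while low <= high:
        guess = (low + high) // 2     # always guess the middle of the remaining range
        guesses += 1
        if guess == secret:
            return guesses
        elif guess < secret:          # "higher" clue - discard the lower half
            low = guess + 1
        else:                         # "lower" clue - discard the upper half
            high = guess - 1

print(guess_number(60))    # 6 guesses to reach 60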

A decision tree is essentially a series of decisions based on the data you are working with. For example, if you were guessing a number between 1 and 1000, the decision tree would be much bigger. In this case, the guesses (cuts in the number line) fall exactly in the middle – for lack of a better guessing method. A real decision tree, however, makes much more informed cuts. Once again, let me show this with a simple example.

Take an apple that is rotten on one side.

Our goal is to find a series of cuts that maximizes the fresh apple portion (and minimizes the rotten portion) with the least possible number of cuts. How would you do it?

Something like this – the criterion you would use for each cut is the maximum area (volume) you can carve off that is not rotten.

Decision trees also work the same way. For example, let’s take the iris dataset. To make things simple, let’s just focus on

  • setosa and versicolor
  • sepal length and sepal width.

If you were asked to carve out one species from the other using just horizontal and vertical lines, how would you do it? It is not easy to do efficiently. We would probably end up with something like this.

What were we basing our decisions (cut-off points) on? Visually, we were eye-balling the cuts to minimize the mix of species (or maximize the grouping of a single species), so that more of one species fell on one side than the other.

Decision tree algorithms do just this – except that they use a bit of math to do it. Scikit-learn provides two cost functions for this:

  • Gini Index ( default )
  • Entropy
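
Both are selected via the criterion parameter of DecisionTreeClassifier – for example (just a sketch; the rest of this tutorial sticks with the default):

from sklearn import tree

# "gini" is the default; pass criterion="entropy" to switch cost functions
classifier = tree.DecisionTreeClassifier(criterion="entropy")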

We will start with the basic implementation and then focus on understanding the Gini index in a bit more detail.


Implementation

from sklearn import datasets

# first 100 rows = setosa and versicolor; first 2 columns = sepal length/width
iris              = datasets.load_iris()
iris_sepal_data   = iris.data[0:100,0:2]
iris_sepal_target = iris.target[0:100]
from sklearn import tree

classifier = tree.DecisionTreeClassifier()
model      = classifier.fit(iris_sepal_data, iris_sepal_target)

# predict back on the training data itself
y_predict = model.predict(iris_sepal_data)
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

print ( "confusion matrix = \n" , confusion_matrix(iris_sepal_target, y_predict) )
print ( "accuracy score = ",accuracy_score(iris_sepal_target,y_predict) )

confusion matrix = 
 [[50  0]
 [ 0 50]]
accuracy score =  1.0


Visualization

One of the biggest advantages of decision trees is that the whole process is very intuitive to humans. A decision tree is more or less a white box (as opposed to methods like neural nets, which are black boxes – we just can’t make sense of the weights and layers). A useful way to understand a decision tree is to visualize it. To do that, we have to install the graphviz package. Let’s do that first.

Install the graphviz software on your Windows or Mac machine and add it to your PATH if necessary; you will also need the graphviz Python package (e.g., via pip).

import graphviz

# export the dot data file into a variable
dot_data = tree.export_graphviz(classifier, out_file=None,
                                feature_names=iris.feature_names[0:2],  
                                class_names=iris.target_names[0:2],  
                                filled=True, rounded=True,  
                                special_characters=True) 

# create the graph
graph = graphviz.Source(dot_data) 

# and display the graph
graph

There we go – graphviz has created the graph of how the decision tree algorithm has actually solved the problem – step by step.
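
If installing graphviz turns out to be a hassle, newer versions of scikit-learn also ship a matplotlib-based alternative, tree.plot_tree – a quick sketch, assuming the model fitted above:

import matplotlib.pyplot as plt
from sklearn import tree

# draws the same tree using plain matplotlib - no graphviz installation needed
tree.plot_tree(model, feature_names=iris.feature_names[0:2],
               class_names=list(iris.target_names[0:2]),
               filled=True, rounded=True)
plt.show()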


Gini Index

Let’s analyze how the decision tree algorithm has made these decisions. First, let’s create a scatter plot of our data.

import matplotlib.pyplot as plt
%matplotlib inline

plt.scatter(iris_sepal_data[0:50,0],iris_sepal_data[0:50,1], color="red",alpha=0.5,label="setosa")
plt.scatter(iris_sepal_data[50:100,0],iris_sepal_data[50:100,1], color="green",alpha=0.5,label="versicolor")
plt.legend()
plt.xlabel("Sepal Length")
plt.ylabel("Sepal Width")
# plt.axvline(6.0,color="red",linewidth=5,alpha=0.5)
# plt.axhline(2.8,color="red",linewidth=4,alpha=0.5)
# plt.axvline(5.35,color="red",linewidth=3,alpha=0.5)
# plt.axhline(3.2,color="red",linewidth=2,alpha=0.5)

A decision tree bases its cut-offs on one of the following impurity measures

  • gini index
  • entropy

By default DecisionTreeClassifier uses the gini index to calculate the cut-offs. Let’s focus on the gini index cost function.

Let’s look at the first cut,

sepal length (cm) <= 5.45

Let’s do some calculations by hand. It will give us a better understanding of what is going on under the hood.

The formula for the Gini index is

Gini = 1 − Σᵢ pᵢ²

where pᵢ is the probability of occurrence of the i-th class. In our case, we have just 2 classes:

  • setosa
  • versicolor

Let’s walk through the calculations.

Initial gini index – with 50 samples of each species, both class probabilities are 0.5:

Gini = 1 − 0.5² − 0.5² = 0.5

Gini indices after the first cut has been made – the left node holds 45 setosa and 6 versicolor (51 samples), the right node holds 5 setosa and 44 versicolor (49 samples):

Gini(left)  = 1 − (45/51)² − (6/51)²  ≈ 0.208
Gini(right) = 1 − (5/49)²  − (44/49)² ≈ 0.183

So, the gini index after the split – the weighted average of the two nodes – is

(51/100) × 0.208 + (49/100) × 0.183 ≈ 0.196

which is well below the original gini index that we started with – 0.5.
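
These numbers are easy to verify in code. Here is a small sketch – the gini helper below is our own, not something from scikit-learn:

import numpy as np

def gini(labels):
    # gini = 1 - sum of squared class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

left  = iris_sepal_target[iris_sepal_data[:, 0] <= 5.45]   # 45 setosa, 6 versicolor
right = iris_sepal_target[iris_sepal_data[:, 0] >  5.45]   # 5 setosa, 44 versicolor

print(gini(iris_sepal_target))   # 0.5
print(gini(left), gini(right))   # ~0.208, ~0.183
# weighted average impurity after the split
print(len(left)/100 * gini(left) + len(right)/100 * gini(right))   # ~0.196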

Now the question arises: why did the first cut happen at sepal length <= 5.45? Why not at 6.0? To understand this, let’s actually make a cut at sepal length <= 6.0 and re-calculate the gini indices.

  • Identify the target counts by class
iris_sepal_target[iris_sepal_data[:,0] <=6.0]
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1])
import numpy as np

np.unique(iris_sepal_target[iris_sepal_data[:,0] <=6.0], 
          return_counts = True)
(array([0, 1]), array([50, 30]))
  • Calculate the Gini Index

The gini index of the left node at the new cut-off sepal length <= 6.0 is about 0.469 – barely better than the 0.5 we started with – and the weighted impurity after the split works out to about 0.375, clearly worse than the ~0.196 achieved at 5.45. That is why the classifier cut at 5.45. By now, you should be able to understand the reasons behind the classifier’s decision points.
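
As a quick check, using the gini helper from the earlier sketch:

left  = iris_sepal_target[iris_sepal_data[:, 0] <= 6.0]   # 50 setosa, 30 versicolor
right = iris_sepal_target[iris_sepal_data[:, 0] >  6.0]   # 20 versicolor only

print(gini(left), gini(right))   # ~0.469, 0.0
# weighted average impurity after this split
print(len(left)/100 * gini(left) + len(right)/100 * gini(right))   # ~0.375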


Challenge

Try to calculate the gini index by hand (like above) for the cut sepal width <= 2.75.


Here is a visual of how the decision tree algorithm eventually solved the problem.

Now that we understand how decision trees work, let’s try and predict some data. Let’s first split our data into train/test datasets.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split (iris_sepal_data, iris_sepal_target)

from sklearn import tree

classifier = tree.DecisionTreeClassifier()
model      = classifier.fit(X_train, y_train)
y_predict = model.predict(X_test)
from sklearn.metrics import confusion_matrix
from sklearn.metrics import accuracy_score

print ( "confusion matrix = \n" , confusion_matrix(y_test, y_predict) )
print ( "accuracy score = ",accuracy_score(y_test,y_predict) )

confusion matrix = 
 [[14  0]
 [ 1 10]]
accuracy score =  0.96

That’s an accuracy score of 96%. Pretty decent.


Decision Trees for Regression

Although decision trees are mostly used for classification problems, you can use them for regression as well.

Let’s try to fit the Boston Housing dataset with decision trees. Just to keep things simple, let’s use a single predictor – LSTAT (% lower-status population) – to predict the target (median house value).

from sklearn import datasets
from sklearn.model_selection import train_test_split

# note: load_boston was removed in recent scikit-learn versions
boston        = datasets.load_boston()
# LSTAT is the feature at column index 12
# Use boston.feature_names to list all the features
boston_data   = boston.data[:,12]

boston_target = boston.target

# Split the data into train/test datasets
X_train, X_test, y_train, y_test = train_test_split(boston_data, boston_target, test_size=0.2, random_state=0)
from sklearn.tree import DecisionTreeRegressor

regressor = DecisionTreeRegressor(random_state=0)
model     = regressor.fit(X_train.reshape(-1,1),y_train)

y_predict = regressor.predict(X_test.reshape(-1,1))
# Use the in-built "score" function of the regressor to calculate the R-squared
model.score(X_test.reshape(-1,1),y_test)
0.42164272169877925

Let’s plot the predictions on top of the data to see visually how well the prediction compares with the actual data.

import matplotlib.pyplot as plt
%matplotlib inline

plt.scatter(X_test,y_test,alpha=0.5, label = "actual")
plt.scatter(X_test,y_predict,label="predicted")
plt.legend()

Just for fun, let’s compare the r-square score of the decision tree regressor against linear regression.

from sklearn.linear_model import LinearRegression

model   = LinearRegression().fit(X_train.reshape(-1,1),y_train)
model.score(X_test.reshape(-1,1),y_test)
0.43095672846187605

Model – R² score

  • Decision Tree Regressor – 0.42
  • Linear Regression – 0.43

Not too far off, right?


Overfitting

One of the main drawbacks of decision trees is overfitting. We can observe that the model fits the training data 100%, but on test data there can be a large variation from run to run. Such large variation is something you do NOT observe in other models – say, linear regression. Let’s try to fit the iris data again (this time with all 3 species).

from sklearn import datasets

dataset        = datasets.load_iris()
from sklearn.model_selection import train_test_split
from sklearn import tree

test_score     = []
train_score = []

for i in range(1,100) :
    X_train, X_test, y_train, y_test = train_test_split (dataset.data, dataset.target)

    classifier = tree.DecisionTreeClassifier()
    model      = classifier.fit(X_train, y_train)
    
    test_score.append(model.score(X_test,y_test))
    train_score.append(model.score(X_train,y_train))
import matplotlib.pyplot as plt
%matplotlib inline

plt.scatter(list(range(1,100)), train_score, label="training scores")
plt.scatter(list(range(1,100)), test_score, label="test scores")
plt.legend()

As you can see, the training data is always a 100% fit while the test score varies all over the place. In this case, the model is not all that complicated, so the variation is not huge. However, when the data is large and there are many borderline cases, the variation can be much bigger. Essentially, the model is learning all the noise in the training data and so cannot predict the test data well.


Pruning – Solution to Overfitting

Overfitting is basically the model trying to fit all of the training data. Since there are many borderline cases, it is not practical for any ML model to fit every data point. To avoid this, we prune the tree (cut off some of its branches) so that it generalizes better, rather than fitting the training data 100%. There are 2 ways to prune a tree.

  • Pre-Pruning – Prune the decision tree while it is being created.
  • Post-Pruning – Prune the decision tree after the entire tree has been created.

Scikit-learn’s built-in controls are pre-pruning parameters (recent versions have also added cost-complexity post-pruning via the ccp_alpha parameter), so we will focus on pre-pruning – some of the parameters involved are previewed in the sketch below.
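
For reference, here is a sketch of some of the pre-pruning parameters that DecisionTreeClassifier exposes (the values are just placeholders – the sections below explore each one):

from sklearn import tree

# each of these limits tree growth while the tree is being built (pre-pruning)
classifier = tree.DecisionTreeClassifier(
    max_depth             = 3,     # grow at most 3 levels deep
    min_samples_split     = 5,     # a node needs at least 5 samples to be split
    min_impurity_decrease = 0.1)   # a split must reduce impurity by at least 0.1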


Tree Depth

import graphviz

# export the dot data file into a variable
dot_data = tree.export_graphviz(classifier, out_file=None,
                                feature_names=dataset.feature_names,  
                                class_names=dataset.target_names,  
                                filled=True, rounded=True,  
                                special_characters=True) 

# create the graph
graph = graphviz.Source(dot_data) 

# and display the graph
graph

In this decision tree, after a depth of 3 there is no real value being added. The tree is basically nitpicking at small numbers – and that’s exactly what leads to overfitting. What if we could restrict the tree to just 3 levels of depth?

The DecisionTreeClassifier has a parameter called max_depth. Set it to 3.

Now, let’s run the same test and see how different the training and the test data looks.

from sklearn.model_selection import train_test_split
from sklearn import tree

test_score     = []
train_score = []

for i in range(1,100) :
    X_train, X_test, y_train, y_test = train_test_split (dataset.data, dataset.target)

    classifier = tree.DecisionTreeClassifier(max_depth = 3)
    model      = classifier.fit(X_train, y_train)

    test_score.append(model.score(X_test,y_test))
    train_score.append(model.score(X_train,y_train))
import matplotlib.pyplot as plt
%matplotlib inline

plt.scatter(list(range(1,100)), train_score, label="training scores")
plt.scatter(list(range(1,100)), test_score, label="test scores")
plt.legend()

Optimum Tree Depth

In the previous case, we just arbitrarily cut off the tree at a depth of 3 levels. Typically, for simple trees, looking at the tree tells you roughly at what depth the predictive power starts to wear off. If you have a complex tree, though, you might want to determine this systematically: draw a curve and find out where exactly the accuracy on the test data starts to drop off.

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn import tree

# this section uses the wine dataset
wine = datasets.load_wine()

X_train, X_test, y_train, y_test = train_test_split (wine.data, wine.target)

training_score = []
test_score     = []

for i in range(1,20) :
    classifier = tree.DecisionTreeClassifier(max_depth=i)
    model      = classifier.fit(X_train, y_train)
    training_score.append(model.score(X_train,y_train) )
    test_score.append(model.score(X_test,y_test) )
import matplotlib.pyplot as plt
%matplotlib inline

plt.scatter(list(range(1,20)), training_score, label="training scores")
plt.scatter(list(range(1,20)), test_score,label="test scores")
plt.legend()

Roughly at around a depth of 3, the test scores start to drop off.


Minimum Samples per Split

Another pre-pruning parameter is min_samples_split – the minimum number of samples a node must hold before it can be split further. Let’s sweep over it the same way.

from sklearn.model_selection import train_test_split
from sklearn import tree

X_train, X_test, y_train, y_test = train_test_split (wine.data, wine.target)

training_score = []
test_score     = []

for i in range(5,50) :
    classifier = tree.DecisionTreeClassifier(min_samples_split=i)
    model      = classifier.fit(X_train, y_train)
    training_score.append(model.score(X_train,y_train) )
    test_score.append(model.score(X_test,y_test) )
import matplotlib.pyplot as plt
%matplotlib inline

plt.scatter(list(range(5,50)), training_score, label="training scores")
plt.scatter(list(range(5,50)), test_score,label="test scores")
plt.legend()
Let’s visualize the resulting tree.

import graphviz

# export the dot data file into a variable
dot_data = tree.export_graphviz(classifier, out_file=None,
                                feature_names=wine.feature_names,  
                                class_names=wine.target_names,  
                                filled=True, rounded=True,  
                                special_characters=True) 

# create the graph
graph = graphviz.Source(dot_data) 

# and display the graph
graph

Impurity

Another parameter that can be used to control tree growth is the minimum decrease in impurity, min_impurity_decrease. In the examples above, we used the gini index to measure impurity. Say we set this limit to 0.1: a further split is then allowed only if it brings the (weighted) impurity down by at least 0.1. This too can greatly limit the tree size.

from sklearn.model_selection import train_test_split
from sklearn import tree

test_score     = []
train_score = []

for i in range(1,100) :
    X_train, X_test, y_train, y_test = train_test_split (dataset.data, dataset.target)

    classifier = tree.DecisionTreeClassifier(min_impurity_decrease = 0.1)
    model      = classifier.fit(X_train, y_train)
    
    test_score.append(model.score(X_test,y_test))
    train_score.append(model.score(X_train,y_train))
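
As before, let’s plot the train and test scores side by side:

import matplotlib.pyplot as plt
%matplotlib inline

plt.scatter(list(range(1,100)), train_score, label="training scores")
plt.scatter(list(range(1,100)), test_score, label="test scores")
plt.legend()

And to see what the pruned tree itself looks like, let’s re-fit one model and visualize it.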
X_train, X_test, y_train, y_test = train_test_split (dataset.data, dataset.target)

classifier = tree.DecisionTreeClassifier(min_impurity_decrease = 0.1)
model      = classifier.fit(X_train, y_train)

import graphviz

# export the dot data file into a variable
dot_data = tree.export_graphviz(classifier, out_file=None,
                                feature_names=dataset.feature_names,  
                                class_names=dataset.target_names,  
                                filled=True, rounded=True,  
                                special_characters=True) 

# create the graph
graph = graphviz.Source(dot_data) 

# and display the graph
graph