Ensemble Methods
What is an Ensemble
Here is one of the meanings of the word Ensemble:
a set of things that go together to form a whole
In the context of Machine Learning, an Ensemble is a technique in which a group of models is trained on a dataset and their predictions are combined. It’s like taking a second or third opinion from different doctors before settling on a diagnosis.

Expanding on the example above, imagine multiple Machine Learning algorithms acting on the same training dataset and either voting (classification) or averaging (regression) their results.

Essentially, what we are trying to do is reduce variance and produce more robust predictions (avoid overfitting).
Two heads are better than one – in essence, this is the philosophy of Ensemble learning.
It is not necessarily an average of multiple different algorithms, either – it could simply be the same algorithm run multiple times over random subsets of the training data, with the results averaged. Here is an illustration of the same idea with just the Decision Tree algorithm (a minimal sketch follows).
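To make this concrete, here is a minimal hand-rolled sketch of the idea – the same Decision Tree trained on several bootstrap samples, with a majority vote at the end. The variable names here are illustrative, not from any library:
import numpy as np
from sklearn import datasets
from sklearn.tree import DecisionTreeClassifier

iris = datasets.load_iris()
X, y = iris.data, iris.target
rng = np.random.RandomState(0)

trees = []
for _ in range(25):
    # bootstrap sample: same size as the data, drawn with replacement
    idx = rng.choice(len(X), size=len(X), replace=True)
    trees.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

# each tree votes; the most common class wins
votes = np.array([t.predict(X) for t in trees])
y_pred = np.apply_along_axis(lambda v: np.bincount(v).argmax(), 0, votes)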

Types of Ensemble Methods
There are multiple ensemble methods; we will cover just the following two using scikit-learn.
- Bagging
- Boosting
Bagging
Bagging stands for Bootstrap Aggregating. It refers to a family of ensemble methods that train the same model on multiple randomly drawn (bootstrap) samples of the training dataset and aggregate the individual predictions. An example of this is Random Forest, which we will discuss below.
Boosting
Boosting is similar to bagging, except the datasets are not drawn purely at random. Each new dataset is weighted toward the data points that the previous model got wrong. This way, each subsequent model improves significantly by specifically learning from the failed data points (rather than the straightforward ones). Although boosting tends to overfit, if you want the error on a dataset to reduce significantly, you could go for boosting.
Bagging Implementation
Bagging
Let’s implement Bagging on KNN to classify the Iris dataset.
# Load the iris dataset
from sklearn import datasets
iris = datasets.load_iris()
# split into train and test datasets
from sklearn.model_selection import train_test_split
# just use the sepal data
X_train, X_test, y_train, y_test = train_test_split(iris.data[:,0:2],iris.target)
# Model the data set using Bagging Classifier
from sklearn.ensemble import BaggingClassifier
from sklearn.neighbors import KNeighborsClassifier
classifier = BaggingClassifier(base_estimator=KNeighborsClassifier(),
                               max_samples=10,
                               n_estimators=100)
classifier.fit(X_train,y_train)
BaggingClassifier(base_estimator=KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                                                      metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                                                      weights='uniform'),
                  bootstrap=True, bootstrap_features=False, max_features=1.0,
                  max_samples=10, n_estimators=100, n_jobs=None, oob_score=False,
                  random_state=None, verbose=0, warm_start=False)
# Calculate the score
classifier.score(X_test,y_test)
0.7631578947368421
classifier_knn = KNeighborsClassifier()
classifier_knn.fit(X_train,y_train)
classifier_knn.score(X_test,y_test)
0.7631578947368421
import numpy as np

def make_meshgrid(x, y, h=.02):
    # build a grid over the data range, padded by 1 on each side
    x_min, x_max = x.min() - 1, x.max() + 1
    y_min, y_max = y.min() - 1, y.max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    return xx, yy

def plot_contours(ax, clf, xx, yy, **params):
    # predict a class for every grid point and draw filled contours
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])
    Z = Z.reshape(xx.shape)
    out = ax.contourf(xx, yy, Z, **params)
    return out
X0, X1 = iris.data[:,0], iris.data[:, 1]
# Pass the data. make_meshgrid will automatically identify the min and max points to draw the grid
xx, yy = make_meshgrid(X0, X1)
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
mpl.rcParams['figure.dpi'] = 200
# plot the meshgrid for the bagged KNN ensemble
plot_contours(plt, classifier, xx, yy,
              cmap=plt.cm.coolwarm,
              alpha=0.8)
# plot the actual data points
plt.scatter(X0, X1, c=iris.target,
            cmap=plt.cm.coolwarm,
            s=20, edgecolors='k',
            alpha=0.2)

# plot the meshgrid for the single KNN classifier
plot_contours(plt, classifier_knn, xx, yy,
              cmap=plt.cm.coolwarm,
              alpha=0.8)
# plot the actual data points
plt.scatter(X0, X1, c=iris.target,
            cmap=plt.cm.coolwarm,
            s=20, edgecolors='k',
            alpha=0.2)

Here is a visual comparing the ensemble of KNN classifiers with a single KNN classifier. You can see that the overfitting in the single KNN classifier has been greatly reduced in the ensemble model.

# Calculate the score
classifier.score(X_test,y_test)
0.6578947368421053
classifier_knn.score(X_test,y_test)
0.7631578947368421
The accuracy of the standalone KNN model seems to be better than the ensemble’s. That is because we have a small dataset. On a large, real-world dataset, the ensemble performs much better – specifically with respect to variance.
Random Forest
Random Forest is also a bagging-based ensemble model, but one specifically designed for decision trees. The idea is really simple:
- A forest is a bunch of trees
- Each tree (a decision tree) models the training data
- It is called a random forest because each tree is modeled off a random subset of the training data (and a random subset of the features at each split)
from sklearn.ensemble import RandomForestClassifier
classifier_rf = RandomForestClassifier(n_estimators=20)
classifier_rf.fit(X_train,y_train)
RandomForestClassifier(bootstrap=True, class_weight=None, criterion='gini',
                       max_depth=None, max_features='auto', max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=20, n_jobs=None,
                       oob_score=False, random_state=None, verbose=0,
                       warm_start=False)
# plot the meshgrid for the random forest
plot_contours(plt, classifier_rf, xx, yy,
              cmap=plt.cm.coolwarm,
              alpha=0.8)
# plot the actual data points
plt.scatter(X0, X1, c=iris.target,
            cmap=plt.cm.coolwarm,
            s=20, edgecolors='k',
            alpha=0.2)

# fit a standard decision tree for comparison (its score is computed further below)
from sklearn import tree
classifier_dt = tree.DecisionTreeClassifier()
classifier_dt.fit(X_train, y_train)
# plot the meshgrid for the single decision tree
plot_contours(plt, classifier_dt, xx, yy,
              cmap=plt.cm.coolwarm,
              alpha=0.8)
# plot the actual data points
plt.scatter(X0, X1, c=iris.target,
            cmap=plt.cm.coolwarm,
            s=20, edgecolors='k',
            alpha=0.2)

Here is a quick visual of how the standard decision tree compares to the Random Forest (ensemble).

Let’s look at the accuracy score of decision tree vs random forest.
# random forest classifier score
classifier_rf.score(X_test,y_test)
0.6842105263157895
Let’s compare this with a standard decision tree using the default parameters (criterion=”gini”, and no max_depth or min_impurity_decrease set) – the same tree we fitted above for the plot.
# standard decision tree classifier score
classifier_dt.score(X_test,y_test)
0.631578947368421
In this case, the accuracy did increase from the standard decision tree to the Random Forest classifier. However, what matters more is that overfitting is reduced, which results in less variance on the test dataset – and on real-world data, which can be pretty varied.
Boosting Implementation
We will cover two of the most commonly used boosting methods.
- Adaboost
- Gradient Boost
Adaboost
AdaBoost is probably the most primitive of the boosting methods. It is most commonly used with a decision tree as the base algorithm. However, unlike a full tree, AdaBoost uses only stumps (a decision tree with a single split), as we will see below.
In the random forest example above, we let a bunch of trees (the forest) grow to their full extent without pruning them. AdaBoost works differently. The principle is that a bunch of weak learners come together to produce a stronger model. Keeping in line with that approach, instead of full trees, AdaBoost works with stumps. What is a stump? A stump is the smallest tree possible – just one node and two leaves.

A random forest is a forest of fully grown trees, while AdaBoost is a forest of stumps. For example, in the picture above, the stump just divides the dataset into two parts:
- sepal length > 4.5
- sepal length <= 4.5
Your immediate question might be: what good is a stump? It just divides the dataset into two parts – and that’s it. How can a forest of stumps help build a model?
AdaBoost specifically wants the individual learners to be weak learners, because the key principle behind AdaBoost is that weak learners come together to form strong learners. Let’s see how exactly AdaBoost does it.
In a random forest, each tree is made independently of the others. The order of the trees doesn’t matter. Once all the trees are made (each based on a subset of the data), every tree gets a vote (in the case of classification) and the class that most trees in the forest vote for is chosen as the winner.
In contrast, not all stumps in the AdaBoost algorithm get an equal say.

But before we get to that, we have to understand what is meant by “weak learners come together to form strong learners”. Let’s go back to our iris dataset. To start with, all the rows have equal weight. For example, if there are 10 rows, each row starts with a weight of 1/10.

Since there are 4 predictors, there are 4 possible stumps to start with, and each stump does its best to come up with the best possible gini index. However, only one of them wins. In the following case, the stump that splits on petal length wins.
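As an aside, a stump in scikit-learn terms is simply a decision tree with max_depth=1. Here is a minimal sketch of fitting one and checking which feature its single split picks (it reuses the sepal-only X_train from the bagging section, so the winner there can only be one of the two sepal features):
from sklearn.tree import DecisionTreeClassifier

# a stump: a decision tree restricted to a single split
stump = DecisionTreeClassifier(max_depth=1).fit(X_train, y_train)
# which feature was chosen for the root split?
print(iris.feature_names[stump.tree_.feature[0]])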

Obviously, that stump can’t be 100% right, so some of the rows will be wrongly classified.

These wrongly classified rows are given an increased weight. The amount of the increase comes from a specific formula, sketched below.
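For reference, here is the textbook AdaBoost update – a sketch only, since scikit-learn’s SAMME.R variant differs in the details. The stump’s say (alpha) is computed from its weighted error rate, and the weights of the misclassified rows are scaled up:
import numpy as np

w = np.full(10, 1/10)                  # 10 rows, each starting at weight 1/10
misclassified = np.zeros(10, dtype=bool)
misclassified[[2, 7]] = True           # suppose the stump got rows 2 and 7 wrong

err = w[misclassified].sum()           # weighted error rate of the stump
alpha = 0.5 * np.log((1 - err) / err)  # the stump's say in the final vote

w[misclassified] *= np.exp(alpha)      # boost the weights of the wrong rows
w[~misclassified] *= np.exp(-alpha)    # shrink the weights of the correct rows
w /= w.sum()                           # renormalize so the weights sum to 1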

Next, create a new dataset of the same size as the first, but select its rows based on the weights. The rows with higher weights (the ones the first stump got wrong) get more preference, so the new dataset ends up with more of the rows that the first stump misclassified. In code, this is just a weighted draw with replacement, as sketched below.
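Here is a one-line version of that weighted draw (a sketch; w stands in for the updated, normalized row weights from the previous step):
import numpy as np

rng = np.random.RandomState(0)
n = len(X_train)
w = np.full(n, 1/n)  # stand-in for the updated weights from the previous step
# higher-weight rows (the previously misclassified ones) are drawn more often
idx = rng.choice(n, size=n, replace=True, p=w)
X_new, y_new = X_train[idx], y_train[idx]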

Here is the trick – since the new dataset has more of the rows that the first stump could not classify correctly, the second stump (built on this new dataset) will be better able to identify those wrongly classified rows (the gini index, by definition, does a good job on homogeneous bunches of data).
The weights in the new dataset are then reset.

And classification starts again with a bunch of stumps.

This sequence continues until a set of winning stumps has been created, each with a particular weight. The sum of the weighted votes determines the final classification.

Let’s see AdaBoost in action.
from sklearn.ensemble import AdaBoostClassifier
classifier_adb = AdaBoostClassifier(n_estimators=100,random_state=100,learning_rate=0.1)
classifier_adb.fit(X_train,y_train)
AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
                   learning_rate=0.1, n_estimators=100, random_state=100)
X0, X1 = iris.data[:,0], iris.data[:, 1]
# Pass the data. make_meshgrid will automatically identify the min and max points to draw the grid
xx, yy = make_meshgrid(X0, X1)
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
mpl.rcParams['figure.dpi'] = 200
# plot the meshgrid for the AdaBoost classifier
plot_contours(plt, classifier_adb, xx, yy,
              cmap=plt.cm.coolwarm,
              alpha=0.8)
# plot the actual data points
plt.scatter(X0, X1, c=iris.target,
            cmap=plt.cm.coolwarm,
            s=20, edgecolors='k',
            alpha=0.2)

classifier_adb.score(X_test,y_test)
0.6842105263157895
The score of this classifier is the same as the Random Forest’s, and a bit higher than the regular decision tree’s score.
Gradient Boost
Gradient Boost is essentially a combination of decision trees with a gradient-descent-style scheme for minimizing residuals. As long as you understand the concept of residuals and how to minimize them, we are good to go with Gradient Boost.
By construction, Gradient Boost is a good fit for regression problems. However, it can be adapted to classification problems using the logit/expit functions (as used in Logistic Regression). So let’s get started with a regression problem, say the Boston Housing dataset.
Regression

We first start off with an approximation – in the case of regression this is typically the average of the target. Then we calculate the residual of each target value against that average.

Once you have done that, instead of fitting the target variable, fit the residuals. This is much like doing linear regression with gradient descent, except that we are using decision trees to fit the residuals.

We repeat this operation over and over (for a set number of iterations, or until the reduction in residuals is no longer worth it). Let’s go over this step by step.
After running the decision tree to fit the residuals, calculate the new predicted target value.

For example, here is how the first row is predicted.

And here is how a sample of the rows could look.

The reason we are showing all of the values here is to prove a point – the prediction has moved one step closer to the actual values (starting from the average).

The next step is to iterate over the same process again. This time, instead of calculating the residuals against the average (actual – average), we use the new prediction to compute the residuals. If you compare the new predictions and residuals to the old ones, you can see a definite trend: the predictions are moving towards the actual values, and the residuals are slowly decreasing. A minimal sketch of the whole loop follows.
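Putting the walkthrough together, here is a minimal hand-rolled sketch of the loop for squared-error regression. The function names are made up for this sketch, and scikit-learn’s real implementation has many more refinements:
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def gradient_boost_fit(X, y, n_rounds=100, learning_rate=0.05, max_depth=4):
    f0 = y.mean()                # step 1: start from the average of the target
    pred = np.full(len(y), f0)
    trees = []
    for _ in range(n_rounds):
        residuals = y - pred     # what the model has not yet explained
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residuals)
        pred += learning_rate * tree.predict(X)  # take a small step
        trees.append(tree)
    return f0, trees

def gradient_boost_predict(f0, trees, X, learning_rate=0.05):
    # start at the average and add each tree's scaled correction
    return f0 + learning_rate * sum(t.predict(X) for t in trees)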

from sklearn import datasets
boston = datasets.load_boston()
boston.feature_names
array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD',
'TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='<U7')
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(boston.data[:, [0, 4, 5]],
                                                    boston.target, test_size=0.2,
                                                    random_state=100)
from sklearn.ensemble import GradientBoostingRegressor
regressor = GradientBoostingRegressor(n_estimators = 100, max_depth = 4, learning_rate = 0.05)
regressor.fit(X_train,y_train)
GradientBoostingRegressor(alpha=0.9, criterion='friedman_mse', init=None,
                          learning_rate=0.05, loss='ls', max_depth=4, max_features=None,
                          max_leaf_nodes=None, min_impurity_decrease=0.0,
                          min_impurity_split=None, min_samples_leaf=1,
                          min_samples_split=2, min_weight_fraction_leaf=0.0,
                          n_estimators=100, n_iter_no_change=None, presort='auto',
                          random_state=None, subsample=1.0, tol=0.0001,
                          validation_fraction=0.1, verbose=0, warm_start=False)
You can also find out how important the model considers each feature to be. The features we have used in the model are:
- CRIM – Crime Rate
- NOX – Nitric oxide concentration in the air
- RM – Number of rooms
regressor.feature_importances_
array([0.1299425, 0.2058841, 0.6641734])
This is the importance of these features according to the model
- CRIM – 0.12
- NOX – 0.20
- RM – 0.66
Let’s calculate the r2 score of the regressor.
# r2 score
regressor.score(X_test,y_test)
0.8047355573273204
That’s an r2 score of about 0.80. Let’s compare this to linear regression.
from sklearn.linear_model import LinearRegression
regressor_lin = LinearRegression().fit(X_train,y_train)
regressor_lin.score(X_test,y_test)
0.5941434559407395
As you can see, Gradient Tree Boosting is far more powerful than simple linear regression on this dataset.
Classification
What about classification? Since gradient boosting fits residuals (continuous values), classification is done the way Logistic Regression does it – through log odds. To classify the target, we work with the log odds (via the logit function) and then transform the result back to a value between 0 and 1 using the expit function.
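To see the logit/expit pair in action, here is a tiny sketch (assuming scipy is installed):
from scipy.special import logit, expit

p = 0.9                 # probability of belonging to the class
log_odds = logit(p)     # log(p / (1 - p)), about 2.197 - the scale the trees work on
print(expit(log_odds))  # expit(x) = 1 / (1 + exp(-x)) maps back, recovering 0.9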
# Load the iris dataset
from sklearn import datasets
iris = datasets.load_iris()
# split into train and test datasets
from sklearn.model_selection import train_test_split
# just use the sepal data
X_train, X_test, y_train, y_test = train_test_split(iris.data[:,0:2],iris.target)
from sklearn.ensemble import GradientBoostingClassifier
classifier_gb = GradientBoostingClassifier(n_estimators = 200, max_depth = 4, learning_rate = 0.01)
classifier_gb.fit(X_train,y_train)
GradientBoostingClassifier(criterion='friedman_mse', init=None,
                           learning_rate=0.01, loss='deviance', max_depth=4,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=200,
                           n_iter_no_change=None, presort='auto', random_state=None,
                           subsample=1.0, tol=0.0001, validation_fraction=0.1,
                           verbose=0, warm_start=False)
classifier_gb.score(X_test,y_test)
0.7631578947368421
And here is how the classification looks visually.
X0, X1 = iris.data[:,0], iris.data[:, 1]
# Pass the data. make_meshgrid will automatically identify the min and max points to draw the grid
xx, yy = make_meshgrid(X0, X1)
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
mpl.rcParams['figure.dpi'] = 200
# plot the meshgrid for the gradient boosting classifier
plot_contours(plt, classifier_gb, xx, yy,
              cmap=plt.cm.coolwarm,
              alpha=0.8)
# plot the actual data points
plt.scatter(X0, X1, c=iris.target,
            cmap=plt.cm.coolwarm,
            s=20, edgecolors='k',
            alpha=0.2)

classifier_gb.feature_importances_
array([0.68478571, 0.31521429])