# Support Vector Machines

### Contents

- What is SVM
- Support Vectors
- Kernels
- Hyperplane
- Performance Tuning
- SVM for Regression

The name sounds very complicated – and true to its name, the concept is a bit involved ( in terms of its math ). Luckily, the implementation of the algorithm (either in R or in Python) takes care of all the complexity for us. So, all we have to focus is on the basic concepts, what it can do for us and some of the tuning parameters. Let’s get straight into it.

### What is SVM

Let’s try and understand SVM from a layman’s perspective. Imagine two Domino’s Franchises in a city – Say one is in zip code 98001 and the other is in 98201. Say these are the only franchises across all the zip codes starting 98001 to 98300. Practically, I don’t think you can “not find” a Domino’s shop across such large distance, but just play along for a bit.

If a call came in from a zip code say 98002, probably the call should be routed to Franchise 1. Why ? proximity to 98001 zip code. Similarly, if a call came in from the zip code 98202, it should be routed to Franchise 2 – again distance is the key. Essentially, this is how KNN works, right ? If you have not learn’t K Nearest Neighbor ( KNN ) , I suggest you learn KNN in R before you proceed further. If you insist on not doing it, that’s fine too.

What is SVM trying to solve that KNN isn’t ? Well, they can solve the same problems, probably with the same level of accuracy ( depending on the dataset ), but they take very different approaches. For example,

- KNN is a lazy learning algorithm. Meaning – based on the training data, KNN does not determine any parameters. It just uses the data at run time ( when you start giving it the test data ) by computing distances on the training data on the fly. This is time consuming.
- While computing the distances and sorting them, performance is an issue – especially on a large dataset ( Sorting large dataset is mathematically performance intensive )
- KNN is very sensitive to outliers.
- In higher dimensions ( more predictors ), KNN suffers from performance and accuracy issues.

### SVM is eager leaning (not lazy learning)

Unlike KNN which is a lazy learning algorithm, SVM actually LEARNS from the training data and can instantly give answers to the test data. Let’s look at the pizza problem.

What we want to do here is essentially draw a line and say that all zip codes to the top should be routed from Franchise 1 and all zip codes to the bottom should be routed from Franchise 2. This is unlike KNN where the distances are calculated on the fly. In SVM, we are somehow able to draw a line and once that is done, no more distance calculation and sorting are required. But, how does SVM draw the line ? Based on distances of course, but unlike KNN, SVM uses the training data to determine parameters and remember those parameters ( In this case the straight line )

This line is basically based off of “centering” itself between the closest points on either side. Let’s take the iris dataset and code this to get a better picture.

### Case 1 – Linear separation possible in 2D

As you should know by now, the iris dataset comprises of 4 predictors and 1 response variable. Here is a sample.

from sklearn import datasets iris = datasets.load_iris() iris_data = iris.data iris_target = iris.target iris_data[0:5,:]

array([[5.1, 3.5, 1.4, 0.2], [4.9, 3. , 1.4, 0.2], [4.7, 3.2, 1.3, 0.2], [4.6, 3.1, 1.5, 0.2], [5. , 3.6, 1.4, 0.2]])

Just to simplify things, let’s drop the 3rd and 4th columns ( Petals ) and just focus on Petals.

iris_sepal = iris_data[:,0:2] iris_sepal[0:5,]

array([[5.1, 3.5], [4.9, 3. ], [4.7, 3.2], [4.6, 3.1], [5. , 3.6]])

If we plot this, we should be clearly able to separate the setosa species from the others based on a simple straight line.

import matplotlib.pyplot as plt %matplotlib inline plt.scatter(iris_sepal[0:50,0],iris_sepal[0:50,1],color="red",label="setosa") plt.scatter(iris_sepal[50:100,0],iris_sepal[50:100,1],color="green",label="versicolor") plt.scatter(iris_sepal[100:150,0],iris_sepal[100:150,1],color="blue",label="virginica") plt.plot([4.5,6.5],[2.2,4.5]) plt.legend()

<matplotlib.legend.Legend at 0x122b5cf0>

This is very much similar to the pizza example – right ? Let’s make this more simple, by removing one of the species in the bottom half of the picture.

plt.scatter(iris_sepal[0:50,0],iris_sepal[0:50,1],color="red",label="setosa") plt.scatter(iris_sepal[50:100,0],iris_sepal[50:100,1],color="green",label="versicolor") plt.xlabel("sepal length") plt.ylabel("sepal width") plt.plot([4.5,6.5],[2.2,4.5]) plt.legend()

<matplotlib.legend.Legend at 0x140f6090>

from sklearn import svm classifier = svm.SVC(kernel="linear") model = classifier.fit(iris_sepal[0:100,:],iris_target[0:100])

y_predict = model.predict(iris_sepal[0:100,:])

from sklearn.metrics import confusion_matrix from sklearn.metrics import accuracy_score print ( "confusion matrix = \n" , confusion_matrix(iris_target[0:100], y_predict) ) print ( "accuracy score = ",accuracy_score(iris_target[0:100],y_predict) )

confusion matrix = [[50 0] [ 0 50]] accuracy score = 1.0

That’s a 100% accuracy which is to be expected anyway (as we can visually see a straight line clearly separating the data points).

classifier.support_vectors_[0:5,:]

array([[4.9, 3. ], [5.7, 3.8], [5.4, 3.4], [5. , 3. ], [5.4, 3.4]])

plt.scatter(iris_sepal[0:50,0],iris_sepal[0:50,1], color="red",label="setosa",alpha=0.3) plt.scatter(iris_sepal[50:100,0],iris_sepal[50:100,1], color="green",label="versicolor",alpha=0.3) plt.plot([4.5,6.5],[2.2,4.5]) plt.scatter(classifier.support_vectors_[:,0], classifier.support_vectors_[:,1], marker="+",color="purple",label="support vectors") plt.legend()

<matplotlib.legend.Legend at 0x12f94250>

Ths straight line is basically centering itself based on the points that the kernel has identified as **support vectors** – shown in the plot above with **+** marks.

classifier.kernel

'linear'

### Case 2 – Clear Linear Separation not possible in 2D

The previous example was very simple. Let’s take a more complicated example. One where SVM shines.

Say each point is identified by a GPS location. Now, given a new location, can you classify it as urban or sub-urban ? Of course, you can use KNN for this. However, there is going to be a bit of an overlap – and of course the issues that we discussed earlier. SVM could potentially be an alternative.Let’s see how.

Of course in this case linear modeling does not work – for the simple reason that the boundary is non-linear. So, how does SVM solve the problem ? Here goes.. Don’t be scared.

SVM projects the data into higher dimensional space to achieve linear separation

I know, that’s a lot to grasp. Let’s break it down visually.

Let’s imagine an A4 sheet of paper. and start to pull the corners of the paper upward. Imagine all the GPS data points on the A4 sheet of paper. and when we pull the corners upward and look at the whole paper sideways, what you see is a visual like this.

The corners are being pulled in the “z” direction. The urban homes are in the center of the paper and the sub-urban homes are on the edges. Since you are pulling the edges on the z-axis, the sub-urban homes are higher up on the z axis than the urban homes.

Now, we can just draw a plan that clearly separates the sub-urban homes from the urban homes as shown below.

If you look at this picture from the top ( looking in the z-axis direction from the top ), you would be able to visualize a picture like this.

What is more important is the nature of the plane. Although the separating boundary between the points is a jagged circle, the separating plane is just a square plane that achieves linear separation between the two datasets based on the z- distance. Essentially, a non-linearly separable problem just became linearly separable ( in the z-axis ) by projecting the points into the 3rd dimension ( z-axis ). This is called as a Hyper plane – which we are going to talk about next.

### Significant overlap in the 2D space

Some times even if we project the data into a 3rd dimension ( or higher dimension ) there might not be a clear separation of data. We can very well see it in our simple iris datase for the other species – Virginica and Versicolor – These are very interspersed.

plt.scatter(iris_sepal[50:100,0],iris_sepal[50:100,1],color="red",label="versicolor") plt.scatter(iris_sepal[100:150,0],iris_sepal[100:150,1],color="green",label="virginica") plt.xlabel("sepal length") plt.ylabel("sepal width") plt.legend()

<matplotlib.legend.Legend at 0x1406c210>

SVM will still be able to do a decent job with the separation. Let’s code it.

from sklearn import svm classifier = svm.SVC(kernel="linear") model = classifier.fit(iris_sepal[50:150,:],iris_target[50:150])

y_predict = model.predict(iris_sepal[50:150,:])

from sklearn.metrics import confusion_matrix from sklearn.metrics import accuracy_score print ( "confusion matrix = \n" , confusion_matrix(iris_target[50:150], y_predict) ) print ( "accuracy score = ",accuracy_score(iris_target[50:150],y_predict) )

confusion matrix = [[38 12] [15 35]] accuracy score = 0.73

That’s an accuracy score of 73% – not bad, given the overlap. To plot the decision line/curve though, we need to take the help of NumPy’s meshgrid. Readymade examples of these are available on Scikit Learn’s website –

**Make a function for the meshgrid**

import numpy as np

def make_meshgrid(x, y, h=.02): x_min, x_max = x.min() - 1, x.max() + 1 y_min, y_max = y.min() - 1, y.max() + 1 xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h)) return xx, yy

**Plot the entire meshgrid**

def plot_contours(ax, clf, xx, yy, **params): Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]) Z = Z.reshape(xx.shape) out = ax.contourf(xx, yy, Z, **params) return out

**Identify the data**

X0, X1 = iris_sepal[50:150, 0], iris_sepal[50:150, 1] # Pass the data. make_meshgrid will automatically identify the min and max points to draw the grid xx, yy = make_meshgrid(X0, X1)

**Plot the meshgrid points, actual data points and support vectors**

# plot the meshgrid plot_contours(plt, classifier, xx, yy, cmap=plt.cm.coolwarm, alpha=0.8) # plot the actual data points for versicolor and virginica plt.scatter(X0, X1, c=iris_target[50:150], cmap=plt.cm.coolwarm, s=20, edgecolors='k', alpha=0.2) # plot the support vectors to identify how the linear separator works plt.scatter(classifier.support_vectors_[:,0], classifier.support_vectors_[:,1], marker="+",color="purple",label="support vectors")

<matplotlib.collections.PathCollection at 0x154f6610>

This is based on a **linear kernel**. There are a couple more options for specifying the kernel. Scikit learn by default provides the following kernels in SVM.

- linear
- rbf – Radial Basis Function
- poly – Polynomial

Let’s actually try all of these to see how well the separator works.

classifier = svm.SVC(kernel="rbf", gamma="auto") model = classifier.fit(iris_sepal[50:150,:],iris_target[50:150]) y_predict = model.predict(iris_sepal[50:150,:]) from sklearn.metrics import confusion_matrix from sklearn.metrics import accuracy_score print ( "confusion matrix = \n" , confusion_matrix(iris_target[50:150], y_predict) ) print ( "accuracy score = ",accuracy_score(iris_target[50:150],y_predict) )

confusion matrix = [[38 12] [14 36]] accuracy score = 0.74

# plot the meshgrid plot_contours(plt, classifier, xx, yy, cmap=plt.cm.coolwarm, alpha=0.8) # plot the actual data points for versicolor and virginica plt.scatter(X0, X1, c=iris_target[50:150], cmap=plt.cm.coolwarm, s=20, edgecolors='k', alpha=0.2) # plot the support vectors to identify how the linear separator works plt.scatter(classifier.support_vectors_[:,0], classifier.support_vectors_[:,1], marker="+",color="purple",label="support vectors")

<matplotlib.collections.PathCollection at 0x155caa30>

classifier = svm.SVC(kernel='poly', degree=3, gamma='auto') model = classifier.fit(iris_sepal[50:150,:],iris_target[50:150]) y_predict = model.predict(iris_sepal[50:150,:]) from sklearn.metrics import confusion_matrix from sklearn.metrics import accuracy_score print ( "confusion matrix = \n" , confusion_matrix(iris_target[50:150], y_predict) ) print ( "accuracy score = ",accuracy_score(iris_target[50:150],y_predict) )

confusion matrix = [[38 12] [16 34]] accuracy score = 0.72

# plot the meshgrid plot_contours(plt, classifier, xx, yy, cmap=plt.cm.coolwarm, alpha=0.8) # plot the actual data points for versicolor and virginica plt.scatter(X0, X1, c=iris_target[50:150], cmap=plt.cm.coolwarm, s=20, edgecolors='k', alpha=0.2) # plot the support vectors to identify how the linear separator works plt.scatter(classifier.support_vectors_[:,0], classifier.support_vectors_[:,1], marker="+",color="purple",label="support vectors")

<matplotlib.collections.PathCollection at 0x15653c70>

Here is how the different SVM kernels compare.

As you can see, there are a lot of support vectors this time ( + symbols in the plot above ) as the separating plane is pretty non-linear even in higher dimensional space. How did SVM do this ?

### Hyperplane

To understand how SVM has done this, we have to first understand what a Hyperplane is. You see this word thrown out a lot in SVM. We cannot visualize hyperplane beyond 3 dimensions. However, a hyperplane is a (n-1) dimensional space in a n dimensional space. For example,

- a line ( 1-D ) is a hyperplane in a 2-D space.
- a 2-D plane is a hyperplane in a 3-D space.

It basically separates the space into 2 halves. I am over-simplifying it, but you get the idea.

By projecting the data into a higher dimensional space SVM is able to derive a hyper plane that does a clear separation of the data points. This is the crux of the math behind SVM.

### Performance Tuning

There are basically 3 parameters for performance tuning. We have already seen one of them – the **kernel**. We will talk about the remaining two now.

### Cost

This parameter controls the balance between Accuracy and Stability. Outliers are good examples to illustrate this. Look at the picture below.

Because of an outlier in the setosa species, SVM is trying to fit it to the hyper plane. In all probability it is an outlier. This “over-fit” could result in the model de-stabilizing to make the model more accurate. Basically, you have to tell the model to go easy on data points like that. This is where the cost function comes in. What does a cost function do ? If we lower the costs function, more support vectors are included, resulting in a stable fit.

**Low Cost** – A lower cost value ( say c = 0.1 ) will result in more support vectors being included in determining the line. So, the outlier ( in green ) is basically ignored. This is also called “soft margin” and this avoids over-fitting.

**High Cost** – A higher cost value ( say c = 1 or 10 ) will try to accomodate the outlier ( bold, green dot in the picture above ) and it adjusts the hyper plane accordingly. This is also called “hard margin” and results in over-fit ( crisper and narrower fit ).

A more visual example would illustrate this better.

Here is an implementation of the same iris dataset with different Cost function (C value) for Radial Basis function kernel.

### Kernel

This is the heart of the SVM algorithm. Always think of SVM as a linear classification in higher dimensional space. When data is linearly separable, we don’t even need SVM – Simple linear regression would do. The “kernel” is a mathematical function that is used to project the data to higher dimensional space. Mathematically, this is done using dot products and there are many types of kernel functions – polynomial, linear, radial, sigmoid etc. Luckily, we wouldn’t have to learn all that math. The SVM algorithm chooses the kernel for us automatically. If you look at the model itself, you will understand what type of kernel was used in doing the classification.

This is a parameter that is rarely touched. But since this is the core of the hyperplane transformation, we have to understand the type of kernel that is being used.

### Gamma – γ

Gamma is a parameter that is used in non-linear kernel transformations. A large gamma (γ) will result in high bias and low variance and a small value will result in low bias and high variance. If you understand the gamma parameter, you would have to understand the formula used in a non-linear transform like a radial basis function. You can leave it for now.

These parameters ( cost, kernel, gamma ) are called Hyper Parameters.

### SVM for Regression

Just with most classification models, SVM can also be used for regression. Let’s quickly do a comparison between a regular linear model and SVM. This is to just prove that

- SVM can do regression ( as well as classification )
- SVM can do a better job than linear modeling in most cases.

Let’s take the Boston Housing data set.

from sklearn import datasets boston = datasets.load_boston() boston_data = boston.data boston_target = boston.target print ( boston.feature_names)

['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO' 'B' 'LSTAT']

We know from Introduction to Linear Regression in Python that one of the variables LSTAT has a linear relationship with the target (predictor) – MEDV (Median house price).

plt.scatter(boston_data[:,12],boston_target) plt.xlabel("LSTAT - Lower Stata Population") plt.ylabel("MEDV - Median Housing Price")

Text(0, 0.5, 'MEDV - Median Housing Price')

In this case the correlation is negative, but that’s fine. Let’s try to fit a linear model first and then compare it with SVM.

from sklearn.linear_model import LinearRegression model_linear = LinearRegression() model_linear.fit(boston_data[:,12].reshape(-1,1),boston_target) y_predict_linear = model_linear.predict(boston_data[:,12].reshape(-1,1))

regressor = svm.SVR(gamma='auto', C=20) model_svm = regressor.fit(boston_data[:,12].reshape(-1,1),boston_target) y_predict_svm = model_svm.predict(boston_data[:,12].reshape(-1,1))

plt.scatter(boston_data[:,12],boston_target,alpha=0.3) plt.scatter(boston_data[:,12],y_predict_linear,color="red") plt.scatter(boston_data[:,12],y_predict_svm,color="purple") plt.xlabel("LSTAT - Lower Stata Population") plt.ylabel("MEDV - Median Housing Price")

Text(0, 0.5, 'MEDV - Median Housing Price')

Try varying the Cost function (C =