Classification


  Machine Learning in Python

What is Classification

Imagine a city like this. It has urban areas in the middle ( downtown, mid-town ) and sub-urban areas on the outskirts.

Each of the dwellings ( houses, businesses ) has a GPS location.

Given the co-ordinates of a place, can you figure out if the place is urban or sub-urban ? You might think this is a silly question, since you can tell whether a dwelling is urban or sub-urban just by looking at it. We have taken this scenario only to keep things simple. You can easily imagine more complicated scenarios, like mapping the terrain of an alien planet or mapping friendly vs hostile neighborhoods in a war zone. In such cases, visual clues might not be good enough.

As with any Machine Learning algorithm, we need data points to start with. We need a bunch of data points clearly marked as urban or sub-urban so that we can train the algorithm first.

We list the data points in a grid like so.

Once we have the data, our task is to train the algorithm so that any arbitrary point can be classified as either urban or sub-urban based only on its location. This is an example of classification.

We can take a similar example, but instead of classifying the result as either urban or sub-urban, we could be predicting a value – say the value of a house or a piece of real estate.

Point being, we can use KNN ( K Nearest Neighbors ) for both classification and regression. Let’s check out how KNN works.

How does KNN work

Let’s go back to the same urban/sub-urban problem. What we have is a list of data points – specifically GPS co-ordinates ( latitude and longitude ).

Each of these data points has been classified properly in the training data set. Now, given a new point ( GPS location ), what our algorithm has to do is to classify it as urban or sub-urban.

How does KNN do it ? It calculates the distances from the new point to the known points and picks the closest ones. These closest points are what KNN refers to as neighbors.

For example, suppose we calculate the distance between the new point and all other known locations ( training data ) and sort the distances in increasing order.

We see that the top 4 closest points are green ( sub-urban ) and the 5th closest point is yellow ( urban ). So, since the majority of the top 5 closest points is sub-urban, in all probability the new data point is sub-urban as well. Let’s check out the math behind the calculation of distances.

Math behind KNN

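Here is a simple formula ( Euclidean distance ) for calculating the distance between 2 points ( x1, y1 ) and ( x2, y2 ) in a 2-D space.

d = sqrt( (x2 - x1)^2 + (y2 - y1)^2 )

By definition, you can extend the same to 3-D space by adding a term for the third co-ordinate.

d = sqrt( (x2 - x1)^2 + (y2 - y1)^2 + (z2 - z1)^2 )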

Euclidean is the most popular formula to measure distance between 2 points in n-dimensional space. However, there are other types of measurement as well.
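As a rough illustration, the sketch below computes the Euclidean distance for points of any dimension, alongside the Manhattan distance as one example of the other types of measurement. The function names are made up for this example and are not from any library.

```python
import math

def euclidean(p, q):
    # Square root of the summed squared differences, one term per dimension
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    # Sum of the absolute differences, one term per dimension
    return sum(abs(a - b) for a, b in zip(p, q))

print(euclidean((0, 0), (3, 4)))        # 2-D points
print(euclidean((0, 0, 0), (2, 3, 6)))  # 3-D points
print(manhattan((0, 0), (3, 4)))
```

The same function works unchanged for 2, 3 or n dimensions, because it simply adds one squared term per co-ordinate.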

Once the distances are calculated, all we have to do is to sort the distances in increasing order and take the first “k” points. That is where K-NN gets its name from – “k” nearest neighbors. If the majority of these “k” nearest points belong to a particular category, you classify the new data point as belonging to that particular category.
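The steps above can be sketched in a few lines of plain Python. This is only a minimal illustration on made-up co-ordinates ( the function name knn_classify and the points are assumptions for this example ), not how sklearn implements KNN internally.

```python
import math
from collections import Counter

def knn_classify(train_points, train_labels, new_point, k):
    # Distance from the new point to every training point
    distances = [
        (math.dist(point, new_point), label)
        for point, label in zip(train_points, train_labels)
    ]
    # Sort in increasing order of distance and keep the first k
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    # Majority vote among the labels of the k nearest points
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Made-up co-ordinates: urban points near the centre, sub-urban further out
points = [(0, 0), (1, 0), (0, 1), (5, 5), (6, 5), (5, 6), (6, 6)]
labels = ["urban", "urban", "urban",
          "sub-urban", "sub-urban", "sub-urban", "sub-urban"]

print(knn_classify(points, labels, (5.5, 5.2), k=5))
```

With k = 5, the four closest points are sub-urban and the fifth is urban, so the majority vote classifies the new point as sub-urban.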

Iris dataset

A classic dataset used to teach classification is the iris dataset. It was first used by the statistician Ronald Fisher, and it records the sepal and petal measurements ( length and width ) of flowers from three species of Iris. The dataset comprises 50 entries for each of the 3 species

  • setosa
  • versicolor
  • virginica

The data has 4 features

  • Sepal Length
  • Sepal Width
  • Petal Length
  • Petal Width

and a target

  • Species of flower

Let’s get the dataset first.

from sklearn import datasets

iris = datasets.load_iris()

iris.data[0:5,:]
array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2]])
iris.feature_names

['sepal length (cm)',
 'sepal width (cm)',
 'petal length (cm)',
 'petal width (cm)']
iris.target
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

What do these values mean ? The DESCR attribute shows us all the necessary metadata.

print ( iris.DESCR )

.. _iris_dataset:

Iris plants dataset
--------------------

**Data Set Characteristics:**

    :Number of Instances: 150 (50 in each of three classes)
    :Number of Attributes: 4 numeric, predictive attributes and the class
    :Attribute Information:
        - sepal length in cm
        - sepal width in cm
        - petal length in cm
        - petal width in cm
        - class:
                - Iris-Setosa
                - Iris-Versicolour
                - Iris-Virginica
                
    :Summary Statistics:

    ============== ==== ==== ======= ===== ====================
                    Min  Max   Mean    SD   Class Correlation
    ============== ==== ==== ======= ===== ====================
    sepal length:   4.3  7.9   5.84   0.83    0.7826
    sepal width:    2.0  4.4   3.05   0.43   -0.4194
    petal length:   1.0  6.9   3.76   1.76    0.9490  (high!)
    petal width:    0.1  2.5   1.20   0.76    0.9565  (high!)
    ============== ==== ==== ======= ===== ====================

    :Missing Attribute Values: None
    :Class Distribution: 33.3% for each of 3 classes.
    :Creator: R.A. Fisher
    :Donor: Michael Marshall (MARSHALL%PLU@io.arc.nasa.gov)
    :Date: July, 1988

The famous Iris database, first used by Sir R.A. Fisher. The dataset is taken
from Fisher's paper. Note that it's the same as in R, but not as in the UCI
Machine Learning Repository, which has two wrong data points.

This is perhaps the best known database to be found in the
pattern recognition literature.  Fisher's paper is a classic in the field and
is referenced frequently to this day.  (See Duda & Hart, for example.)  The
data set contains 3 classes of 50 instances each, where each class refers to a
type of iris plant.  One class is linearly separable from the other 2; the
latter are NOT linearly separable from each other.

.. topic:: References

   - Fisher, R.A. "The use of multiple measurements in taxonomic problems"
     Annual Eugenics, 7, Part II, 179-188 (1936); also in "Contributions to
     Mathematical Statistics" (John Wiley, NY, 1950).
   - Duda, R.O., & Hart, P.E. (1973) Pattern Classification and Scene Analysis.
     (Q327.D83) John Wiley & Sons.  ISBN 0-471-22361-1.  See page 218.
   - Dasarathy, B.V. (1980) "Nosing Around the Neighborhood: A New System
     Structure and Classification Rule for Recognition in Partially Exposed
     Environments".  IEEE Transactions on Pattern Analysis and Machine
     Intelligence, Vol. PAMI-2, No. 1, 67-71.
   - Gates, G.W. (1972) "The Reduced Nearest Neighbor Rule".  IEEE Transactions
     on Information Theory, May 1972, 431-433.
   - See also: 1988 MLC Proceedings, 54-64.  Cheeseman et al"s AUTOCLASS II
     conceptual clustering system finds 3 classes in the data.
   - Many, many more ...

Exploratory Data Analysis

Before we dig deeper into KNN, let’s explore the data visually ( just a bit ).

Scatter Plot – Shows visually how the target is distributed vis-a-vis the predictors. For classification problems, start by looking for clusters in the scatter plot. The more cleanly the classes cluster, the better the chance of classifying them correctly. Try it with different combinations of features.

import matplotlib.pyplot as plt

scatter = plt.scatter(iris.data[:,0], iris.data[:, 1], c=iris.target)

plt.xlabel(iris.feature_names[0])
plt.ylabel(iris.feature_names[1])

Text(0, 0.5, 'sepal width (cm)')
scatter = plt.scatter(iris.data[:,2], iris.data[:, 3], c=iris.target)

plt.xlabel(iris.feature_names[2])
plt.ylabel(iris.feature_names[3])
Text(0, 0.5, 'petal width (cm)')

There are many more plots that can give a better understanding of the data, but for now, let’s get back to KNN.

KNN by hand

Before we do KNN in Python, it is a good idea to try this distance calculation by hand in Excel. This will give us a good idea of what is going on under the hood. Let’s do it on the iris data set as it is a good example of a categorical response variable.

Step 1 : Load the iris dataset from sklearn’s datasets module

from sklearn import datasets

iris = datasets.load_iris()

Step 2 : Write out the iris data to a csv file that Excel can open

import numpy as np

# Prepare the full dataset by appending the target as the last column
data = np.concatenate((iris.data,iris.target.reshape(-1,1)),axis=1)

# Write out the data as a csv. comments="" stops NumPy from
# prefixing the header line with "# ".
np.savetxt("../data/iris.csv", data,
           fmt="%1.2f",
           delimiter=",",
           comments="",
           header="Sepal Length,Sepal Width,Petal Length,Petal Width,Species")

You can find the Excel sheet used to calculate the distances by hand, using simple formulas, below.

Once you do the ranking, all you have to do is pick the top 5 or 10 ( closest ) and the species that most of these top entries belong to is the one that the test entry should belong to.
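If you would rather not open Excel, the same ranking can be reproduced with NumPy. This is a sketch of the by-hand procedure ( distance, rank, pick the top 5, majority vote ), using the first iris row as the test entry; the variable names are just for this example.

```python
import numpy as np
from sklearn import datasets

iris = datasets.load_iris()

# Test entry: the measurements of the first iris row
test = np.array([5.1, 3.5, 1.4, 0.2])

# Euclidean distance from the test entry to every row
distances = np.sqrt(((iris.data - test) ** 2).sum(axis=1))

# Rank the rows by distance and keep the 5 closest
top5 = np.argsort(distances)[:5]

# Count how often each species occurs among the 5 closest rows
votes = np.bincount(iris.target[top5], minlength=3)
predicted = int(votes.argmax())

print(iris.target_names[predicted])
```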

KNN in Python

from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier


iris = datasets.load_iris()

model_knn = KNeighborsClassifier(n_neighbors=3).fit(iris.data,iris.target)

Pick, say, the first row from the iris dataset. It should be of species “Setosa”.

Verify it by predicting using the KNN model.

test = np.array([5.1,3.5,1.4,0.2]).reshape(-1,4)
model_knn.predict(test)

array([0])

Let’s pick another row , say the 51st row ( row 52 in the csv, counting the header line ).

This one should be a Versicolor species.

# The 51st row ( index 50 ) from the iris dataset
test = np.array([7.0,3.2,4.7,1.4]).reshape(-1,4)
model_knn.predict(test)

array([1])

Looks accurate, right ?

Let’s split up the data into training and test datasets and check the accuracy of prediction.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.2)

print ( "X_train size", X_train.shape)
print ( "X_test size", X_test.shape)
print ( "y_train size", y_train.shape)
print ( "y_test size", y_test.shape)

X_train size (120, 4)
X_test size (30, 4)
y_train size (120,)
y_test size (30,)

Let’s now create the model with the training data.

model_knn = KNeighborsClassifier(n_neighbors=3).fit(X_train,y_train)

And let’s predict on the test data using the model.

y_pred = model_knn.predict(X_test)

Confusion Matrix

In Linear Regression, we have seen how to check the accuracy of prediction using r2. That was for continuous target variables. In the case of the iris dataset, the target is a categorical variable, so r2 does not apply here. A simple way to check the accuracy of a classification is via a confusion matrix, where each row is a true class and each column is a predicted class.

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)
print ( cm )

[[11  0  0]
 [ 0  9  2]
 [ 0  0  8]]
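To read the matrix, remember that the diagonal holds the correctly classified rows. The short sketch below does the arithmetic by hand on the matrix printed above ( your own numbers will differ, since the train/test split is random ).

```python
# Confusion matrix from the run above: rows are true classes, columns are predictions
cm = [[11, 0, 0],
      [0, 9, 2],
      [0, 0, 8]]

total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(len(cm)))

# Overall accuracy: correctly classified rows divided by all rows
accuracy = correct / total

# Per-class recall: diagonal entry divided by its row total
recall = [cm[i][i] / sum(cm[i]) for i in range(len(cm))]

print(round(accuracy, 4))
print([round(r, 2) for r in recall])
```

Here 28 of the 30 test rows sit on the diagonal. The two misses are Versicolor flowers predicted as Virginica.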

To plot the confusion matrix, you can use the following helper function from sklearn’s documentation examples.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.utils.multiclass import unique_labels

def plot_confusion_matrix(y_true, y_pred, classes,
                          normalize=False,
                          title=None,
                          cmap=plt.cm.Blues):
    """
    This function prints and plots the confusion matrix.
    Normalization can be applied by setting `normalize=True`.
    """
    if not title:
        if normalize:
            title = 'Normalized confusion matrix'
        else:
            title = 'Confusion matrix, without normalization'

    # Compute confusion matrix
    cm = confusion_matrix(y_true, y_pred)
    # Only use the labels that appear in the data
    classes = classes[unique_labels(y_true, y_pred)]
    if normalize:
        cm = cm.astype('float') / cm.sum(axis=1)[:, np.newaxis]
        print("Normalized confusion matrix")
    else:
        print('Confusion matrix, without normalization')

    print(cm)

    fig, ax = plt.subplots()
    im = ax.imshow(cm, interpolation='nearest', cmap=cmap)
    ax.figure.colorbar(im, ax=ax)
    # We want to show all ticks...
    ax.set(xticks=np.arange(cm.shape[1]),
           yticks=np.arange(cm.shape[0]),
           # ... and label them with the respective list entries
           xticklabels=classes, yticklabels=classes,
           title=title,
           ylabel='True label',
           xlabel='Predicted label')

    # Rotate the tick labels and set their alignment.
    plt.setp(ax.get_xticklabels(), rotation=45, ha="right",
             rotation_mode="anchor")

    # Loop over data dimensions and create text annotations.
    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i in range(cm.shape[0]):
        for j in range(cm.shape[1]):
            ax.text(j, i, format(cm[i, j], fmt),
                    ha="center", va="center",
                    color="white" if cm[i, j] > thresh else "black")
    fig.tight_layout()
    return ax


np.set_printoptions(precision=2)

# Plot non-normalized confusion matrix
plot_confusion_matrix(y_test, y_pred, normalize    = False,
                      classes = iris.target_names , 
                      title='Confusion matrix, without normalization')

# Plot normalized confusion matrix
plot_confusion_matrix(y_test, y_pred, classes= iris.target_names , normalize=True,
                      title='Normalized confusion matrix')

plt.show()
Confusion matrix, without normalization
[[11  0  0]
 [ 0  9  2]
 [ 0  0  8]]
Normalized confusion matrix
[[1.   0.   0.  ]
 [0.   0.82 0.18]
 [0.   0.   1.  ]]

To get the overall accuracy ( the fraction of test rows classified correctly ), use the score method of the KNeighborsClassifier class.

model_knn.score(X_test,y_test)

0.9333333333333333

For classifiers, score is essentially a shortcut for sklearn’s accuracy_score, which compares the predicted values with the real values of categorical data.

from sklearn.metrics import accuracy_score

accuracy_score(y_test,y_pred)
0.9333333333333333
import matplotlib.pyplot as plt
%matplotlib inline

plt.matshow(cm)

Visualizing Classification Results

Visualizing classification results is not straightforward, for the simple reason that it is not done all that often. However, since we are just learning classification, it makes sense to visualize the results. Unfortunately, there are no standard plots available, so we have to build something on our own. There are 4 steps to this

  • Step 1 – Model the Classifier
  • Step 2 – Use NumPy’s meshgrid to generate grid co-ordinates
  • Step 3 – Use the classifier to predict all the grid co-ordinates and plot them as a scatter plot.
  • Step 4 – Finally, scatter plot the actual data points over Step 3’s scatter plot.

Note – If you are learning classification for the first time, feel free to skip to the results and don’t worry about the steps.

Step 1 – Model the Classifier

from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier


iris = datasets.load_iris()

model_knn = KNeighborsClassifier(n_neighbors=20).fit(iris.data[:,[0,1]],iris.target)

Step 2 – Use NumPy’s meshgrid to generate grid co-ordinates

import numpy as np

# Range of each feature ( min and max were swapped in the earlier version )
x_min = iris.data[:,0].min()
x_max = iris.data[:,0].max()

y_min = iris.data[:,1].min()
y_max = iris.data[:,1].max()

x = np.linspace(x_min,x_max,1000)
y = np.linspace(y_min,y_max,1000)

xx,yy = np.meshgrid(x,y)

grid_data =  np.column_stack((xx.reshape(-1,1),yy.reshape(-1,1)))

Step 3 – Use the classifier to predict all the grid co-ordinates and plot them as a scatter plot.

z = model_knn.predict(grid_data)
z = z.reshape(xx.shape)
z.shape
(1000, 1000)
z
array([[0, 0, 0, ..., 2, 2, 2],
       [0, 0, 0, ..., 2, 2, 2],
       [0, 0, 0, ..., 2, 2, 2],
       ...,
       [0, 0, 0, ..., 2, 2, 2],
       [0, 0, 0, ..., 2, 2, 2],
       [0, 0, 0, ..., 2, 2, 2]])

Step 4 – Finally, scatter plot the actual data points over Step 3’s scatter plot.

import matplotlib.pyplot as plt
%matplotlib inline
plt.contourf(xx,yy,z,alpha=0.1)
plt.scatter(iris.data[:,0],iris.data[:,1],c=iris.target)
plt.xlabel("Sepal Length")
plt.ylabel("Sepal Width")
Text(0, 0.5, 'Sepal Width')

Each of the colors ( purple, green, yellow ) represents one of the three species ( “Setosa”, “Versicolor”, “Virginica” ). As you can see, there is some overlap between Versicolor and Virginica , while Setosa is a straightforward case.

KNN for Regression

Although KNN is an algorithm best known for classification problems, we can very well use it for regression. In classification, we choose the majority class among the “k” nearest neighbors. In regression, we choose the average of the target values of the “k” nearest neighbors.
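To make the averaging concrete, here is a tiny hand-rolled sketch. The function name knn_regress and the housing numbers are made up purely for illustration.

```python
import math

def knn_regress(train_points, train_values, new_point, k):
    # Pair every training point's distance to the new point with its value
    distances = [
        (math.dist(point, new_point), value)
        for point, value in zip(train_points, train_values)
    ]
    # Keep the k closest points and average their values
    nearest = sorted(distances, key=lambda pair: pair[0])[:k]
    return sum(value for _, value in nearest) / k

# Made-up example: ( rooms, age in years ) -> price
houses = [(3, 10), (4, 12), (3, 8), (8, 40), (9, 45)]
prices = [200.0, 230.0, 210.0, 500.0, 540.0]

# The 3 nearest houses are the first three, so the result is their mean price
print(knn_regress(houses, prices, (3, 11), k=3))
```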

Doing this with scikit-learn is just as easy. You just have to use a different class for regression. Since the response variable in iris is a categorical variable, let’s use another dataset – Boston Housing, which has a continuous response variable. Note that load_boston was removed in scikit-learn 1.2, so the code below needs an older version.

Step 1 – Get the Boston Housing dataset

from sklearn import datasets

boston = datasets.load_boston()

Step 2 – Split the data into training and test data

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(boston.data, boston.target, test_size=0.2)

print ( "X_train size", X_train.shape)
print ( "X_test size", X_test.shape)
print ( "y_train size", y_train.shape)
print ( "y_test size", y_test.shape)

X_train size (404, 13)
X_test size (102, 13)
y_train size (404,)
y_test size (102,)

Step 3 – Model the data using KNeighborsRegressor Class

from sklearn.neighbors import KNeighborsRegressor

model = KNeighborsRegressor(n_neighbors=3)
model.fit(X_train, y_train)

KNeighborsRegressor(algorithm='auto', leaf_size=30, metric='minkowski',
          metric_params=None, n_jobs=None, n_neighbors=3, p=2,
          weights='uniform')

Step 4 – Predict the Median House Prices based on the model

y_pred = model.predict(X_test)

Step 5 – Check the accuracy of prediction

r2_score = model.score(X_test, y_test)
print (r2_score)

0.48813113665058205
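For regressors, the score method returns R2 ( the coefficient of determination ). As a rough sketch of what that number means, here is the formula written out in plain Python on made-up values; the function name r2 is just for this example.

```python
def r2(y_true, y_pred):
    mean_true = sum(y_true) / len(y_true)
    # Residual sum of squares: error left over after prediction
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    # Total sum of squares: error of always predicting the mean
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

actual = [10.0, 20.0, 30.0, 40.0]

print(r2(actual, actual))                    # perfect predictions
print(r2(actual, [25.0, 25.0, 25.0, 25.0]))  # always predicting the mean
```

A score of 1 means perfect predictions and a score of 0 means the model is no better than always predicting the mean. A score around 0.49 therefore sits roughly halfway between the two.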

If you want to check the prediction visually, you can use a scatter plot. The more scattered the points are from the 45° line, the more inaccurate the prediction is.

import matplotlib.pyplot as plt
%matplotlib inline

plt.scatter(y_test,y_pred)

We have used all the predictors in the Boston Housing data to predict the house value. We haven’t used feature selection at all. That is a bad idea, and that’s why we get such a poor R2 score. But that’s ok for now. We just want to see that KNN can do regression as well, and not just classification.

Feature Scaling

Feature scaling is an important step in most machine learning algorithms – and so it is with KNN as well. We can very well illustrate this with the iris dataset.

from sklearn import datasets

iris = datasets.load_iris()
iris.data[0:5,:]

array([[5.1, 3.5, 1.4, 0.2],
       [4.9, 3. , 1.4, 0.2],
       [4.7, 3.2, 1.3, 0.2],
       [4.6, 3.1, 1.5, 0.2],
       [5. , 3.6, 1.4, 0.2]])

If you look at the iris dataset, all the measurements ( sepal, petal ) are in centimeters. Say we convert one of the predictors ( say Sepal Length – the first column ) to millimeters ( mm ). How would it affect the accuracy ?

import numpy as np

iris_unscaled = np.copy(iris.data)
iris_unscaled[:,0] = iris_unscaled[:,0] * 10

iris_unscaled[0:5,:]
array([[51. ,  3.5,  1.4,  0.2],
       [49. ,  3. ,  1.4,  0.2],
       [47. ,  3.2,  1.3,  0.2],
       [46. ,  3.1,  1.5,  0.2],
       [50. ,  3.6,  1.4,  0.2]])

Let’s fit the model again and see if the accuracy of prediction is affected this time with unscaled data.

from sklearn.neighbors import KNeighborsClassifier

model_knn_unscaled = KNeighborsClassifier(n_neighbors=3).fit(iris_unscaled,iris.target)

model_knn_unscaled.score(iris_unscaled,iris.target)

0.92
from sklearn.neighbors import KNeighborsClassifier

model_knn_scaled = KNeighborsClassifier(n_neighbors=3).fit(iris.data,iris.target)
model_knn_scaled.score(iris.data,iris.target)

0.96
Model             Score
Scaled Model      0.96
Un-scaled Model   0.92

It is too small a dataset to conclude much about the effect of wrongly scaled features, but it does give you an idea of how the accuracy can differ due to feature scaling.

Since KNN is based on distances between data points, always make sure that the data is scaled properly.
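One common way to do this is to standardize every column before running KNN. The sketch below uses sklearn’s StandardScaler inside a pipeline. This is one option among several ( min-max scaling is another ), and the variable names are just for this example.

```python
import numpy as np
from sklearn import datasets
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

iris = datasets.load_iris()

# Same data, but with sepal length converted from cm to mm
iris_mm = iris.data.copy()
iris_mm[:, 0] = iris_mm[:, 0] * 10

# Standardizing each column removes the effect of the unit change
scaled_cm = StandardScaler().fit_transform(iris.data)
scaled_mm = StandardScaler().fit_transform(iris_mm)
print(np.allclose(scaled_cm, scaled_mm))

# A pipeline applies the scaler before KNN at both fit and predict time
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=3))
model.fit(iris_mm, iris.target)
print(model.score(iris_mm, iris.target))
```

After standardization the cm and mm versions of the data are numerically the same, so the choice of unit no longer influences the distances.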

Question – Does increasing the multiplier on the first column ( Sepal Length ) from x 10 ( millimeters ) to x 100 affect the score further ?
True
False

Effect of Outliers

What is an Outlier

Just like most ML algorithms, KNN is affected by outliers. Let’s inject some outliers into the iris dataset and see if the results are affected. One good plot to detect outliers is the box and whisker plot.
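A box and whisker plot typically marks points lying more than 1.5 times the interquartile range ( IQR ) beyond the quartiles. The sketch below applies that rule to a short, made-up list of sepal lengths. The function name and the data are illustrative only, and the quartiles from the statistics module may differ slightly from the ones matplotlib computes.

```python
import statistics

def iqr_outliers(values, whis=1.5):
    # First and third quartiles of the data
    q1, _, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - whis * iqr, q3 + whis * iqr
    # Anything outside the two fences is flagged as an outlier
    return [v for v in values if v < lower or v > upper]

# Made-up sepal lengths with one injected outlier
sepal_lengths = [4.3, 4.9, 5.0, 5.1, 5.4, 5.8, 6.1, 6.4, 220.0]
print(iqr_outliers(sepal_lengths))
```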

plt.boxplot(iris.data[:,0])
iris_data_new = iris.data.copy()
iris_data_new[0,:]

array([4.3, 3.5, 1.4, 0.2])
iris_data_new[[1,20,40,60,80,100,120,140],:] = iris_data_new[[1,20,40,60,80,100,120,140],:] * 50

np.set_printoptions(precision=1, suppress=True, linewidth=250)

iris_data_new

array([[  4.3,   3.5,   1.4,   0.2],
       [220. , 150. ,  70. ,  10. ],
       [  6.1,   2.5,   4. ,   1.3],
       [  6.1,   2.6,   4.4,   1.2],
            ......    .......
  
       [  7.7,   2.5,   5. ,   1.9],
       [  7.7,   3. ,   5.2,   2. ],
       [  7.7,   3.4,   5.4,   2.3],
       [  7.9,   3. ,   5.1,   1.8]])
from sklearn.neighbors import KNeighborsClassifier

model_knn_outliers = KNeighborsClassifier(n_neighbors=3).fit(iris_data_new,iris.target)
model_knn_outliers.score(iris_data_new,iris.target)

y_pred = model_knn_outliers.predict(iris_data_new)

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(iris.target, y_pred)
print ( cm )

[[50  0  0]
 [ 0 50  0]
 [ 0  1 49]]

Parameter Tuning

What is the ideal value of k ( n_neighbors ) in K Nearest neighbors ? Should it be 3 or 10 or 15 ?

model = KNeighborsRegressor(n_neighbors=3)

We don’t have to just pick an arbitrary “k” value – we can do cross validation for different k values and pick the best one. sklearn’s cross_val_score function does the cross validation for us.

from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn import datasets

iris = datasets.load_iris()
knn = KNeighborsClassifier(n_neighbors=3)
scores = cross_val_score(knn, iris.data, iris.target, cv=20, scoring='accuracy')
print(scores)
print("Average score with k = 3 is ",scores.mean())

[1.         1.         0.88888889 1.         1.         1.
 0.77777778 0.88888889 1.         1.         1.         0.83333333
 1.         1.         1.         1.         1.         1.
 1.         1.        ]
Average score with k = 3 is  0.9694444444444444
k_values = range(1,50)

knn_scores = []

for k in k_values:
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, iris.data, iris.target, cv=20, scoring='accuracy')
    knn_scores.append(scores.mean())

for i in k_values :
    print ( "k = ",i," | score = ",knn_scores[i-1])
k =  1  | score =  0.961111111111111
k =  2  | score =  0.9444444444444444
k =  3  | score =  0.9694444444444444
k =  4  | score =  0.9694444444444444
k =  5  | score =  0.975
k =  6  | score =  0.975
k =  7  | score =  0.9694444444444444
k =  8  | score =  0.9777777777777779
k =  9  | score =  0.9722222222222221
k =  10  | score =  0.9722222222222221
k =  11  | score =  0.9722222222222221
k =  12  | score =  0.9666666666666668
k =  13  | score =  0.9694444444444444
k =  14  | score =  0.9722222222222221
k =  15  | score =  0.975
k =  16  | score =  0.9666666666666668
k =  17  | score =  0.9722222222222221
k =  18  | score =  0.9722222222222221
k =  19  | score =  0.9805555555555555
k =  20  | score =  0.9805555555555555
k =  21  | score =  0.975
k =  22  | score =  0.9666666666666668
k =  23  | score =  0.9666666666666668
k =  24  | score =  0.9722222222222221
k =  25  | score =  0.961111111111111
k =  26  | score =  0.9666666666666668
k =  27  | score =  0.961111111111111
k =  28  | score =  0.95
k =  29  | score =  0.9555555555555555
k =  30  | score =  0.9555555555555555
k =  31  | score =  0.9555555555555555
k =  32  | score =  0.9555555555555555
k =  33  | score =  0.9555555555555555
k =  34  | score =  0.9472222222222222
k =  35  | score =  0.961111111111111
k =  36  | score =  0.9472222222222222
k =  37  | score =  0.9472222222222222
k =  38  | score =  0.9472222222222222
k =  39  | score =  0.9472222222222222
k =  40  | score =  0.95
k =  41  | score =  0.95
k =  42  | score =  0.95
k =  43  | score =  0.9555555555555555
k =  44  | score =  0.95
k =  45  | score =  0.9555555555555555
k =  46  | score =  0.9444444444444444
k =  47  | score =  0.9388888888888889
k =  48  | score =  0.9277777777777778
k =  49  | score =  0.9388888888888889

import matplotlib.pyplot as plt
%matplotlib inline

plt.plot(k_values,knn_scores)
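The loop over k values can also be handed to sklearn’s GridSearchCV, which runs the same cross validation for every candidate and remembers the best one. This is a sketch of an alternative, not a replacement for looking at the plot; the exact best k can vary with the sklearn version and with how ties are broken.

```python
from sklearn import datasets
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

iris = datasets.load_iris()

# Try every k from 1 to 49 with 20-fold cross validation
search = GridSearchCV(
    KNeighborsClassifier(),
    param_grid={"n_neighbors": list(range(1, 50))},
    cv=20,
    scoring="accuracy",
)
search.fit(iris.data, iris.target)

print("Best k :", search.best_params_["n_neighbors"])
print("Best cross validated accuracy :", search.best_score_)
```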

KNN Properties

KNN is Non Parametric

KNN is a non-parametric algorithm – For example, compare KNN with linear regression or Naive Bayes which are parametric.

So, with KNN, there are

  • No parameters to learn
  • No assumptions about the underlying data model

In KNN, the Data is the model

KNN is a lazy learning Algorithm

KNN does no real learning up front. Fitting the model essentially just stores the training data, and all the distance calculations happen at prediction time – in other words, no prior modeling is required.

KNN – Bias – variance

The bias-variance trade-off in KNN depends mainly on the hyper parameter k. As k increases, the bias increases and the variance decreases ( and vice-versa ).
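A toy example makes the two extremes visible. In the sketch below ( made-up 1-D data, hypothetical helper knn_classify ), k = 1 reproduces every training label including the odd one out, while k equal to the number of points predicts the overall majority class everywhere.

```python
import math
from collections import Counter

def knn_classify(train_points, train_labels, new_point, k):
    # Sort training points by distance and vote among the k nearest
    ranked = sorted(
        zip(train_points, train_labels),
        key=lambda pair: math.dist(pair[0], new_point),
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Made-up 1-D data with one "b" sitting among the "a" points
points = [(1,), (2,), (3,), (4,), (8,), (9,), (10,)]
labels = ["a", "a", "b", "a", "b", "b", "b"]

for k in (1, 3, len(points)):
    predictions = [knn_classify(points, labels, p, k) for p in points]
    print(k, predictions)
```

With k = 1 the model follows the noise ( low bias, high variance ), with k = 3 the lone “b” is smoothed away, and with k = 7 every prediction collapses to the majority class ( high bias, low variance ).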

How to handle Categorical data in KNN

Can KNN handle categorical data ? Yes, theoretically it can, by using the standard techniques to convert categorical data to numeric data. You can use any of the following methods to handle categorical data in KNN.

  • One hot encoding
  • Combine levels
  • Strings as factors ( label encoding, i.e. mapping each string level to an integer code )
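As an illustration of the first method, here is a minimal one hot encoder in plain Python. The function name and the colors are made up for this example; in practice you would more likely reach for a library encoder.

```python
def one_hot(values):
    # One column per distinct category, in sorted order
    categories = sorted(set(values))
    encoded = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, encoded

# Made-up categorical predictor
colors = ["red", "green", "blue", "green"]
categories, encoded = one_hot(colors)

print(categories)
for row in encoded:
    print(row)
```

Each category becomes its own 0/1 column, so the distance between two rows only grows when their categories differ.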

KNN Data Assumptions

Does KNN need data in a particular way ? Most models have some kind of restrictions on how the data should be – For example,

  • Naive Bayes assumes that the predictors are independent.
  • Linear Regression has many assumptions – like multi-variate normality, no multi-collinearity etc.

The beauty of KNN is that there are no data assumptions at all. The data doesn’t have to be distributed a certain way, nor are correlations among predictors an issue.

KNN does not have any assumptions on the underlying data
