Regression

  Machine Learning in Python

Summary: Regression is a Machine Learning technique in which we estimate something (specifically, numeric values) based on past experience. The purpose of this section is to cover just the basics of regression along with some statistical methods. Consider this an introduction to regression rather than a survey of all the regression techniques in Machine Learning.

We will be learning Simple Linear Regression, Multiple Linear Regression and Polynomial Regression in this section. On top of the techniques themselves, we will also be covering many statistical concepts (like the Null Hypothesis, p-value, r2 and RMSE) and other key concepts in machine learning (like Feature Selection and Training/Test data splits) that will be useful in evaluating models going forward as well.

What is Regression

In Machine Learning, most problems are classified as either supervised or unsupervised learning. We will first focus on supervised learning algorithms and later work on unsupervised learning algorithms. Supervised learning is in turn split into the following 2 groups:

  • Classification
  • Regression

Given a particular height and weight, classify the person as either male or female. This is an example of classification. You are essentially trying to classify the person – in this case – as male or female based on certain characteristics.

In contrast, say you are trying to predict the body fat percentage based on height and weight – this is an example of a regression problem. What is the difference? Body Fat % is a continuous variable – say it starts at a minimum of 2% (really lean) and can go all the way up to, say, 50% (extremely obese) – as opposed to a categorical variable in the example above (Male or Female).


Why Regression

If you are learning how to solve a regression problem for the first time, you probably need to understand why you need regression to start with. Let's start with one of the simplest regression problems and a simple data set – Swedish auto insurance claims. You can google it or get it from Kaggle. It is a very small data set with just 2 variables –

  • Number of Claims
  • Total payment for these claims ( in thousands )
Data Set – Swedish auto insurance claims

Claims come first and the settlement happens much later. Assuming these are claims this company receives per week, is there a way we can predict how much the company will end up paying, just based on the number of claims ?


What value does this bring to the company ?

Being able to predict the payment based on the number of claims gives a very good understanding of the company's expenses well in advance.


Why do you need machine learning for this problem ?

Each claim is different – a fender bender claim costs a thousand dollars, while a total loss could cost 20 grand. The type of claim would make for a good predictor, but let's just assume we don't have that at this point. Even if we had the type of claim, a fender bender can cost anywhere from 300 to 2000 dollars based on the severity of the damage, the arbitration and several environmental factors. Essentially, there is no easy way to correlate the claims to the payment. If we tried to do this using some kind of IF/THEN logic, we would be going around in circles.


Solve Regression in Python

import pandas as pd 

data = pd.read_csv("../../data/insurance.csv", skiprows = 6, header = None, names=["claim_count","claim_value"]) 
data.head() 

Output

  claim_count	claim_value
0   108	         392.5
1   19	         46.2
2   13		 15.7
3   124		 422.2
4   40		 119.4

Looking OK – but since we are reading the data from a file, we have to ensure that Python is not reading the numbers as strings or other object types. Let's quickly verify that the data types are correct.

data.dtypes

Output

claim_count      int64
claim_value    float64
dtype: object

Looking good. Now, onto LinearRegression. Before we do that, we would have to install the Python package – scikit-learn.

 pip install scikit-learn 
# or 
 conda install scikit-learn

The class that we need to solve Linear Regression is LinearRegression from the sklearn.linear_model module.

from sklearn.linear_model import LinearRegression 
model = LinearRegression()
model.fit(data["claim_count"],data["claim_value"]) 
-------------------------------------------------------------------------- ValueError: Expected 2D array, got 1D array instead: array=[108  19  13 124  40  57  23  14  45  10   5  48  11  23   7   2  24   6    3  23   6   9   9   3  29   7   4  20   7   4   0  25   6   5  22  11   61  12   4  16  13  60  41  37  55  41  11  27   8   3  17  13  13  15    8  29  30  24   9  31  14  53  26]. Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
model.fit(data[["claim_count"]],data["claim_value"])

Output

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,          
       normalize=False)

Our model is ready. Let's start predicting claim values based on the count of claims. We used the fit method to fit the model – to predict, we will be using the predict method. But before we do that, let's plot this out to understand what we have done so far.

predict = model.predict(data[["claim_count"]])

slope = model.coef_
slope

Output

array([3.41382356])
intercept   = model.intercept_
intercept

Output

19.99448575911478
import matplotlib.pyplot as plt 
%matplotlib inline
 plt.scatter( data["claim_count"],data["claim_value"]) 

Output

Linear Regression has already solved this problem for us – we just didn't realize it yet. The parameters (also called coefficients)

  • slope
  • intercept

are the solution.

point_1 = slope*0 + intercept 
point_2 = slope*120 + intercept 
print ( point_1, point_2) 
plt.scatter( data["claim_count"],data["claim_value"])
plt.plot([0,120], [point_1,point_2])

Output

[19.99448576] [429.65331297] 

How did we get the straight line ?

A straight line can be defined mathematically using

y = ax + b

where

  • a = slope
  • b = intercept

These are also called coefficients. The fit function of LinearRegression has already arrived at these numbers ( slope and intercept ). It has done so based on the data

slope       = model.coef_
intercept   = model.intercept_

print ( "a (slope)     = ",slope)
print ( "b (intercept) = ", intercept)

Output

a (slope)     =  [3.41382356]
b (intercept) =  19.99448575911478 

What did we achieve ?

What we have essentially done is model the relationship between the number of claims and the total amount paid. For example, what is the total amount expected to be paid when the number of claims is 80?
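For instance, plugging 80 claims into the fitted line gives a quick estimate. Here is a small sketch reusing the slope and intercept computed above:

est_claims = 80
expected_payment = slope * est_claims + intercept   # 3.41 * 80 + 19.99
print ( expected_payment )                          # roughly [293.1] - i.e. 293 ( thousands )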

Easy, right ?


Prediction

We don’t have to draw lines like this every time to predict the value of Y for a value of X. You can use the predict ( ) function.

claim_values = model.predict (data[["claim_count"]]) 

or more specifically,

claim_values = model.predict (pd.DataFrame([10,20,30,40,60]))
claim_values

Output

array([ 54.13272136,  88.27095696, 122.40919256, 156.54742816,
       224.82389936])
plt.scatter( data["claim_count"],data["claim_value"])
plt.plot([0,120], [point_1,point_2])
plt.scatter([10,20,30,40,60], claim_values , marker='o')

Output

What were the original values though ? You can pick them up from the CSV.

Insurance Claims

original_claim_values = [65.3,98.1,194.5,119.4,202.4]

Let's also plot these to compare how well we predicted.

plt.scatter( data["claim_count"],data["claim_value"])
plt.plot([0,120], [point_1,point_2])

plt.scatter([10,20,30,40,60], claim_values , marker='o')
plt.scatter([10,20,30,40,60], original_claim_values , marker='o')

Output

Some values are pretty close, some are a bit off – but nevertheless it's a good prediction for the amount of time we spent doing it.


Simple Linear Regression

What we have seen so far is an example of implementing Simple Linear Regression in Python. What we will see in this section is the math behind it. Give it a try – if it gets tedious, please do a couple of re-reads. It is critical that you understand this section.


How did Linear Regression fit the straight line

The fun starts now. How did the LinearRegression() function fit the straight line? How did it arrive at this equation

y = 3.4 x + 19.99

Obviously, there is no one straight line that will pass through all the points in this case.

If you take these 4 data points, we can eyeball a straight line that goes through the middle of them. The ones marked with question marks are visually not a good fit at all. But the question that the linear model tries to solve is,

What is the “Optimum” straight line that best describes the relationship between X and Y

This is where statistics comes in.


Let’s zoom in

Let’s simplify and make up some numbers ( for easy calculation) of claims vs payments. Say we have a set of 5 data points for claims vs payments and we wish to fit a linear model that can predict further data. This is a very small data set to do anything practical, but there is a reason why we are doing such a small data set as we will see in the coming sections.

Claims Data

Let’s plot these on a graph.

Claim vs Payments

If we were asked to eyeball a straight line that best fits these data points, this is how we would do it.

Approximate Line

How did we do it ? Our eye is essentially trying to minimize the distances from each of these points to the straight line. The best fitting straight line is one which minimizes the distances for all these points.

Linear regression in machine learning does exactly that – Instead of a human eye, machine learning takes the help of statistics to do this approximation. There are a couple of methods to do this in statistics.

  • Ordinary Least Squares
  • Gradient Descent

Let’s explore the first method here.


Residuals

When we tried to minimize the distance of each of these points from the line we are trying to fit, the distances between the points and the straight line ( on the y axis ) are called residuals.

Residuals

Sum of Squares

Warning – Geeky Statistics stuff

To arrive at the best-fit values for the straight line, statisticians have arrived at the following formula, based on the method of least squares. How they derived it is pretty geeky – let's leave that to the statisticians.

b = Σ ( xᵢ - x̄ ) ( yᵢ - ȳ )  /  Σ ( xᵢ - x̄ )²
a = ȳ - b x̄

This equation looks scary, but it is not – I am going to prove it to you in a minute. There are 2 things in this equation that need an explanation.

  • The Σ symbol (the one that looks like a sideways M). It stands for summation.
  • y with a bar on it (or x with a bar). The bar just represents the average – so ȳ is the average of the y values and x̄ is the average of the x values.

Let's take the same numbers that we have above and try to calculate the formula by hand. Excel would make things easy, but let's just do it manually, since the numbers are not all that bad.
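Here is the same arithmetic as a small Python sketch, using the five claim/payment points from the table above (the same points reappear in the validation code below):

x = [20, 40, 60, 80, 100]
y = [40, 60, 80, 80, 90]

x_mean = sum(x) / len(x)    # 60
y_mean = sum(y) / len(y)    # 70

# b = sum( (xi - x_mean)(yi - y_mean) ) / sum( (xi - x_mean)^2 )
numerator   = sum( (xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y) )   # 2400
denominator = sum( (xi - x_mean) ** 2 for xi in x )                          # 4000

b = numerator / denominator   # 0.6  ( slope )
a = y_mean - b * x_mean       # 34.0 ( intercept )
print ( a, b )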

That was a fair bit of arithmetic – but now we have a and b for the line Y = a + bx.


Validation

Let’s cross validate this equation.

sample = pd.DataFrame( {'x' : [20,40,60,80,100],'y':[40,60,80,80,90]}) 
sample 

Output

	x	y
0	20	40
1	40	60
2	60	80
3	80	80
4	100	90

Let’s model this data and plot it

model.fit(sample[["x"]],sample["y"])

Output

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,          normalize=False)
plt.scatter( sample["x"],sample["y"]) 

Output

Now, let’s use our model to visualize the fit.

slope       = model.coef_ 
intercept   = model.intercept_ 

print ( "a (slope)     = ",slope) 
print ( "b (intercept) = ", intercept) 
a (slope)     =  [0.6] 
b (intercept) =  34.0 

These are the slope and intercept values. We can now use these to plot the fit line.

point_1 = slope*0 + intercept 
point_2 = slope*120 + intercept
 
print ( point_1, point_2)

plt.scatter( sample["x"],sample["y"])
plt.plot([0,120], [point_1,point_2]) 
[34.] [106.]

The model object has a whole host of other information that you can use to predict how good of a fit the line is to the data. But first, let’s predict the value in a table.

pred_y = model.predict (sample[["x"]])
pred_y

Output

array([46., 58., 70., 82., 94.])

The differences (residuals) between the actual and the predicted values can be computed directly, as shown below.
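A quick sketch, reusing sample and pred_y from above:

# residuals = actual - predicted
residuals = sample["y"] - pred_y
print ( residuals.values )    # [ -6.   2.  10.  -2.  -4.]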

There are many other parameters (like the p-value, r-squared and adjusted r-squared) and graphs (Residuals vs Fitted, Q-Q plot etc.) that are used to analyze the performance of a model fit. We are going to get to that in the next section. To set the stage for these parameters, let's scale up our data set.


Multilinear Regression

So far we have seen one predictor and one response variable. This is also called simple linear regression. Of course, the real world is not that simple, right? There can be multiple predictors. We are going to look at one such case in a classic example – the Boston Housing dataset.

It has 13 predictors and 1 response variable. So, mathematically, the equation would be

y = b0 + b1·x1 + b2·x2 + … + bn·xn

where the value of n would be 13 in this case. Let's look at the dataset below.


Boston Housing dataset

The price of a house depends on many parameters like the size, neighborhood, crime rate, pollution etc. There is no exact mathematical formula that computes the price of a house from these parameters – that's essentially where ML comes into the picture.

In the earlier example, we just had to predict the value of Y given a value of X. There is just 1 predictor ( X ). However, let’s have a look at the Boston Housing data set – tons of variables. Load the data first.

There are a couple of ways to load the data.


Download and load from file system

import pandas as pd
import numpy as np

boston_housing = pd.read_csv("../../data/boston_housing.csv")
boston_housing.head()

Output

The attribute names are a bit cryptic. So, here are the descriptions.

  • CRIM  per capita crime rate by town
  • ZN  proportion of residential land zoned for lots over 25,000 sq.ft.
  • INDUS  proportion of non-retail business acres per town
  • CHAS  Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
  • NOX  nitric oxides concentration (parts per 10 million)
  • RM  average number of rooms per dwelling
  • AGE  proportion of owner-occupied units built prior to 1940
  • DIS  weighted distances to five Boston employment centres
  • RAD  index of accessibility to radial highways
  • TAX  full-value property-tax rate per $10,000
  • PTRATIO pupil-teacher ratio by town
  • B  1000(Bk – 0.63)^2 where Bk is the proportion of blacks by town
  • LSTAT % lower status of the population
  • MEDV Median value of owner-occupied homes in USD 1000’s

Load it directly from scikit learn module

from sklearn.datasets  import load_boston 
boston = load_boston() 
type(boston) 

Output

sklearn.utils.Bunch

Bunch is a scikit-learn object used for loading built-in data packages – like Boston Housing or Iris. To view the data, just do

np.set_printoptions(precision=1, suppress=True, linewidth=250)
 
# Show just the top 10 rows in the dataset
boston.data[1:10,:] 

Output

array([[  0. ,   0. ,   7.1,   0. ,   0.5,   6.4,  78.9,   5. ,   2. , 242. ,  17.8, 396.9,   9.1],
       [  0. ,   0. ,   7.1,   0. ,   0.5,   7.2,  61.1,   5. ,   2. , 242. ,  17.8, 392.8,   4. ],
       [  0. ,   0. ,   2.2,   0. ,   0.5,   7. ,  45.8,   6.1,   3. , 222. ,  18.7, 394.6,   2.9],
       [  0.1,   0. ,   2.2,   0. ,   0.5,   7.1,  54.2,   6.1,   3. , 222. ,  18.7, 396.9,   5.3],
       [  0. ,   0. ,   2.2,   0. ,   0.5,   6.4,  58.7,   6.1,   3. , 222. ,  18.7, 394.1,   5.2],
       [  0.1,  12.5,   7.9,   0. ,   0.5,   6. ,  66.6,   5.6,   5. , 311. ,  15.2, 395.6,  12.4],
       [  0.1,  12.5,   7.9,   0. ,   0.5,   6.2,  96.1,   6. ,   5. , 311. ,  15.2, 396.9,  19.1],
       [  0.2,  12.5,   7.9,   0. ,   0.5,   5.6, 100. ,   6.1,   5. , 311. ,  15.2, 386.6,  29.9],
       [  0.2,  12.5,   7.9,   0. ,   0.5,   6. ,  85.9,   6.6,   5. , 311. ,  15.2, 386.7,  17.1]])

Since all of the data is numeric in nature, it is set up as a NumPy array. What else is available? The Bunch object is more or less like a dictionary, so you can use the keys() method to get all the keys.

Input

boston.keys() 

Output

dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])

Now, let’s see the features

Input

boston.feature_names 

Output

array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='<U7')

How large is the dataset ?

Input

boston.data.shape 

Output

(506, 13)

So, it is 506 rows and 13 columns.

And here is the description of what these columns represent.

Input

print ( boston.DESCR )
# The reason why we are doing print is because the DESCR value is a formatted string ( with new line characters like \n etc)
# otherwise, you get unformatted text
.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**  

    :Number of Instances: 506 

    :Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.

    :Attribute Information (in order):
        - CRIM     per capita crime rate by town
        - ZN       proportion of residential land zoned for lots over 25,000 sq.ft.
        - INDUS    proportion of non-retail business acres per town
        - CHAS     Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
        - NOX      nitric oxides concentration (parts per 10 million)
        - RM       average number of rooms per dwelling
        - AGE      proportion of owner-occupied units built prior to 1940
        - DIS      weighted distances to five Boston employment centres
        - RAD      index of accessibility to radial highways
        - TAX      full-value property-tax rate per $10,000
        - PTRATIO  pupil-teacher ratio by town
        - B        1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
        - LSTAT    % lower status of the population
        - MEDV     Median value of owner-occupied homes in $1000's  

If you observe closely, the actual price column ( MEDV ) that you see in the loaded file ( from pandas.read_csv() ) previously is missing in the scikit learn data. That is because it is available as a separate key in the scikit learn Bunch called target. Let’s see that as well.

Looking back at our equation, this target array is the response variable y.

Input

# Just show the top 10 target values
boston.target[1:10]

Output

array([21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9])

Let’s create a Pandas dataframe from this data as well for further analysis.

Input

boston_housing = pd.DataFrame(boston.data)

# set the columns
boston_housing.columns = boston.feature_names

# add the price
boston_housing["MEDV"] = boston.target

# now, look at the head
boston_housing.head()

Output

Alright, now that we know how to get the dataframe ( via file upload or scikit learn library itself ), let’s quickly do a linear regression on this data.


Response Variable vs Predictor

What is the problem we are trying to solve? We are trying to predict the median house price based on a variety of parameters – like the crime rate, nitric oxide content in the air, age of the house etc. Each of these parameters has a "say" in the price of the house. The target variable is called the Response Variable and the parameters that have a "say" in determining the target variable are called the Predictors.

In our case, there are 13 predictors (CRIM, ZN, INDUS … LSTAT) and 1 response variable (MEDV). Just for simplicity's sake, let's pick one predictor – RM – the number of rooms in the house. Let's see how well we can predict the price of the house based on just this one parameter.

from sklearn.linear_model import LinearRegression

model = LinearRegression()

model.fit( boston_housing[["RM"]], boston_housing["MEDV"])

Output

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

Let's plot our prediction.
import matplotlib.pyplot as plt
%matplotlib inline

plt.scatter(boston_housing["RM"], boston_housing["MEDV"])

Output

Now, let's draw the straight line that our model has predicted. The maximum number of rooms in our data is around 9, so let's draw the fit line from 0 to 9 rooms and determine the co-ordinates accordingly.

slope       = model.coef_
intercept   = model.intercept_

point_1 = slope*0 + intercept
point_2 = slope*9 + intercept

plt.scatter(boston_housing["RM"], boston_housing["MEDV"])
plt.plot([0,9], [point_1,point_2],color="r")

Output

Looks like a decent enough fit. Let's do another one – LSTAT – the lower-status population percentage.

Step 1 – Model the data

model.fit( boston_housing[["LSTAT"]], boston_housing["MEDV"])

Output

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)

Step 2 – Determine the slope and intercept from the model

slope       = model.coef_
intercept   = model.intercept_

point_1 = slope*0 + intercept
point_2 = slope*9 + intercept

Step 3 – Plot the data and fit line

plt.scatter(boston_housing["LSTAT"], boston_housing["MEDV"])
plt.plot([0,9], [point_1,point_2],color="r")

Output

The line seems off, right? Why is that? How did we determine the points? Based on the older data – the number of rooms. The range of LSTAT, on the other hand, starts at 0 and ends at around 40. So, let's re-determine the line.

slope       = model.coef_
intercept   = model.intercept_

point_1 = slope*0 + intercept
point_2 = slope*40 + intercept
plt.scatter(boston_housing["LSTAT"], boston_housing["MEDV"])
plt.plot([0,40], [point_1,point_2],color="r")

Output

This seems accurate enough. However, if you look at both these graphs, the relationship between the predictors and response variable seems a bit different between the predictor LSTAT vs RM. Let’s put these graphs side by side to understand better.

The first picture is LSTAT vs MEDV and the second is RM vs MEDV.

Also, not all variables might be relevant (irrespective of the direction of the relationship). Let's take the parameter DIS – the weighted distance to the employment centres. Once again, if we try to fit this using our steps

# step 1
model.fit( boston_housing[["DIS"]], boston_housing["MEDV"])

# step 2
slope       = model.coef_
intercept   = model.intercept_

point_1 = slope*0 + intercept
point_2 = slope*12 + intercept

# step 3
plt.scatter(boston_housing["DIS"], boston_housing["MEDV"])
plt.plot([0,12], [point_1,point_2],color="r")

Output

Visually, there is not much of a linear relationship between distance and median value ( although we tried to fit it ). How exactly do we measure the relationship ?


Correlation

This is the simplest measure of the relationship between 2 numeric variables. Luckily, pandas provides a built-in method for calculating correlation – corr(). For example,

boston_housing["MEDV"].corr(boston_housing["DIS"])

Output

0.24992873408590394
boston_housing["MEDV"].corr(boston_housing["LSTAT"])

Output

-0.7376627261740147

This seems in line with our plots above, right?
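Rather than checking one predictor at a time, pandas can also compute the correlation of every column with MEDV in one go. A quick sketch, using the boston_housing DataFrame we built earlier:

# correlation of each predictor with the response variable MEDV
print ( boston_housing.corr()["MEDV"].drop("MEDV").sort_values() )

Predictors like LSTAT and RM show up at the extremes, which matches what we saw in the plots.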

Correlation is NOT causation

However strong the correlation is, it does NOT imply causation. This is a bit tricky to understand. For example, look at the picture below.

import matplotlib.pyplot as plt
%matplotlib inline

plt.scatter(boston_housing["RM"],boston_housing["MEDV"])

There seems to be a significant correlation between the number of rooms ( rm ) and the median price value (medv).

Now, imagine another column – Power Consumption. As the number of rooms increases, the power consumption also tends to increase.

Power Consumption would then also show a decent correlation with the house price. Does that mean power consumption drives the house price? Very unlikely – isn't it? Mathematically, correlation is just a number that signifies how strongly 2 variables move together. It is up to us (or the domain expert) to determine whether the relationship is real or fictitious.

Quiz
Question : The higher the correlation, the stronger the relationship between the variables.
True
False
Question : Correlation is a measure of how strong the relationship is between two variables.
True
False
Question : 0.95 represents a strong positive correlation.
True
False
Question : A correlation value of 0 shows that there is a perfect correlation between the variables
False
True
Question : The picture below shows a strong negative correlation
True
False
x = [10,20,30,40,50,60,70,80,90,100]
y = [100,90,80,70,60,50,40,30,20,10]

import matplotlib.pyplot as plt
%matplotlib inline

plt.scatter(x,y)

p-value

Also called the probability value, the p-value answers the following question –

Is a predictor relevant to predicting the response, and if so, how relevant is it?

We have to understand p-value in the context of Null Hypothesis.


Null Hypothesis

The Null Hypothesis (denoted as H0) assumes that there is no relationship between the predictor and the response variable. For example, if you look at the relationship between the number of rooms (RM) and the median price value of the home (MEDV) in the Boston Housing dataset, the Null Hypothesis says that there is NO relationship between them.

Well, although we can (almost) see a linear relationship visually between those 2 variables, we still start off with the Null Hypothesis, denoted in statistics as H0.

The Alternate Hypothesis indicates that they are related. It is denoted as H1. The p-value indicates how inconsistent the observed data is with the Null Hypothesis. This is a bit cryptic, right? Let's explore this further.

Let’s just consider 2 random variables that are normally distributed. Since they are random variables, there would be no relationship between them, right ? Let’s check it out.

Dataset 1

A normal distribution of 20 values with a mean of 100 and a standard deviation (sd) of 20

import numpy as np

x = np.random.normal(100, 20, 20)
x

Output

array([105.9,  97.7,  97.3, 135.9,  76.2,  86.7, 109.2,  40.8, 117.6, 127. ,  82. , 127.8, 107.3,  83. ,  98.8, 102.4, 116.8,  97.3,  95.2,  73.6])

Dataset 2

Another normal distribution of 20 values with a mean of 100 and an sd of 20

y = np.random.normal(100, 20, 20)
y

Output

array([ 65.2, 137. , 111.1,  93.4,  69.9, 121.5, 126.7,  97.1, 108.9,  83.8, 116.1, 118.3,  92.3, 118.1,  67.3,  89.3, 108.4, 142.4, 136. ,  87.7])

Let’s plot x vs y and see how it looks.

import matplotlib.pyplot as plt
%matplotlib inline

plt.scatter(x,y)

Output

This almost looks like the night sky, doesn't it? Point being, there is no relationship between x and y as they are random variables – that is pretty understandable, right? Let's try to calculate a relationship between these two variables (although there is none) and see how it compares against a situation where there IS actually a relationship.

from sklearn.linear_model import LinearRegression
import pandas as pd

# Step 1 - Model
model = LinearRegression()

model.fit( pd.DataFrame(x), y)

Output

LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
         normalize=False)
# step 2
slope       = model.coef_
intercept   = model.intercept_

point_1 = slope*0 + intercept
point_2 = slope*120 + intercept

# step 3
plt.scatter(x, y)
plt.plot([0,120], [point_1,point_2],color="r")

Output

What is the p-value in this case ?

Unfortunately, the standard scikit learn’s LinearRegression() object does not provide p-value. There are other libraries that can help us with determining the p-value. For example, let’s use the scipy library.

from scipy import stats
import numpy as np

slope, intercept, r_value, p_value, std_err = stats.linregress(x,y)
p_value

Output

0.9469781750196463

And the p-value in this case is about 0.95 (since x and y are random, the exact number will change every time you run this).

A p-value of 0.95 – that's 95%. The p-value is always between 0 and 1. 0.95 is a big number, right? Does it indicate that there is a strong relationship between x and y in this case? On the contrary – a high p-value indicates that there is NO evidence of a relationship between x and y, which is exactly what the Null Hypothesis states.

On the other hand, if we calculate the p-value of the relationship between the number of rooms (“RM”) and median price (“MEDV”), then you get a very small value.

from scipy import stats
import numpy as np

slope, intercept, r_value, p_value, std_err = stats.linregress(boston.data[:,5],boston.target)
p_value

Output

2.4872288710080936e-74

The p-value in this case is 2.487 x 10^-74. That is an extremely low value (0.0000…00248 %).

What it means is that there is only a 0.0000…00248 % chance that the observed relationship is just random noise.

Or in other words, there is a 99.9999…% chance that the data does represent a real relationship between the two variables. Whereas in the case of the random variables, there is roughly a 95% chance (in our run) that the apparent correlation is just random.


Optimum p-value

A p-value below which we can safely conclude that the relationship between 2 variables did not happen by chance – but because there is a true relationship – is called the optimum p-value.

Is there a fixed p-value below which we can safely conclude a real relationship (the Alternate Hypothesis)?

Typically, p <= 0.05 is accepted in most cases (a quick check against this cut-off is sketched after the list below). However, the actual threshold a business is comfortable with depends on a variety of factors, like

  • the type of data
  • the level of confidence the business requires, etc.
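As a quick check, here is a tiny sketch applying the conventional 0.05 cut-off to the two p-values we computed above (the numbers are copied from those runs):

p_random = 0.9469781750196463      # random x vs random y ( from the run above )
p_rm     = 2.4872288710080936e-74  # RM vs MEDV

for name, p in [ ("random x vs y", p_random), ("RM vs MEDV", p_rm) ]:
    verdict = "reject H0 - likely a real relationship" if p <= 0.05 else "fail to reject H0"
    print ( name, "->", verdict )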

Key Model Evaluation Parameters

r-squared – r2

R-squared is a measure of the explanatory power of the model – how well the predictors are able to explain the response variable using the model. A model's r2 varies between 0 and 1. Unlike the p-value, scikit-learn provides r2 values directly (without having to rely on scipy).

Method 1 – Using scipy

from scipy import stats
import numpy as np

slope, intercept, r_value, p_value, std_err = stats.linregress(boston.data[:,5],boston.target)
print ( "r-squared = ",r_value**2)

Output

r-squared =  0.48352545599133373

Method 2 – Using scikit learn

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.datasets import load_boston

boston = load_boston()
print ( boston.feature_names)

X = boston.data
y = boston.target.reshape(-1,1)

# Fit model based on RM - Number of rooms
model   = LinearRegression().fit(X[:,[5]],y)

r2_score = model.score(X[:,[5]],y)
print ( "r-squared = ",r2_score)

Output

['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO' 'B' 'LSTAT'] 
r-squared =  0.4835254559913343 

In the example above, we tried to model the response variable MEDV (median house value) as a function of the number of rooms (RM) – the r2 value is 0.4835. It is a measure of how well the model explains the relationship. A low value of r2 (close to 0) means that the explanatory variables are not able to predict the response variable well.

import numpy as np

x = np.random.normal(100, 20, 20)
y = np.random.normal(100, 20, 20)

slope, intercept, r_value, p_value, std_err = stats.linregress(x,y)
print ( "r-squared = ",r_value**2)

Output

r-squared =  0.3178323469801873

For example, if you look at the random-data example above, the r2 value comes out low (the exact number varies from run to run since the data is random). What this means is that the explanatory variable (x) is not able to explain much of the variance in the response variable (y).

A high value of r2 (r2 = 1) means that the explanatory variables are fully able to predict the response. In our case, the number of rooms (RM) is able to explain around 48% of the variance in the median house value. The remaining 52% of the variance is left unexplained by the explanatory variable.


How is r2 calculated

The formula to calculate r2 is

r2 = 1 - ( Sum of Squares of residuals / Sum of Squares total )
   = 1 - Σ ( yᵢ - ŷᵢ )² / Σ ( yᵢ - ȳ )²

This is not as scary as it looks. The denominator – the Sum of Squares total – is the worst possible score (the error you would make by always predicting the plain average). The numerator – the Sum of Squares of residuals – is how far off the model actually is. So r2 is essentially a normalized score of how well the model predicts the data, in comparison to that worst possible score.

What do we mean by worst possible score? Well, if we know the weights of each of the individuals in a class, without any indicator of whose weight is whose (just a list of weights), what is our best possible prediction of the weight of anybody in the class? We just take the arithmetic average, right? And that's what is used to calculate the Sum of Squares total.

Let’s calculate r2 by hand for our simple dataset.

r-Squared for Simple Data
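Here is the same calculation as a short Python sketch, using the fitted line y = 0.6x + 34 from before:

y_actual = [40, 60, 80, 80, 90]
y_pred   = [46, 58, 70, 82, 94]              # predictions from y = 0.6x + 34
y_mean   = sum(y_actual) / len(y_actual)     # 70

ss_res = sum( (ya - yp) ** 2 for ya, yp in zip(y_actual, y_pred) )   # 160
ss_tot = sum( (ya - y_mean) ** 2 for ya in y_actual )                # 1600

r2 = 1 - ss_res / ss_tot
print ( r2 )   # 0.9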

Let's verify this with scipy as well.

x = [20,40,60,80,100]
y = [40,60,80,80,90 ]
slope, intercept, r_value, p_value, std_err = stats.linregress(x,y)
print ( "r-squared = ",r_value**2)
r-squared =  0.8999999999999999

There you go – our manual calculation verifies with Python’s calculation.

Exercise – Calculate the r2 value of the Boston Housing dataset for the predictor NOX (the level of nitric oxide in the air).


r-squared adjusted

Mathematically, r2 has a peculiar property – adding more predictors almost always increases its value. This is not very intuitive to begin with. Let's try it on our Boston Housing dataset.

In the example above, we modeled MEDV from RM, so the only explanatory variable is RM (the number of rooms). Based on that relationship we have an r2 value of 0.4835. What would happen if we add another explanatory variable – say LSTAT (the percentage of lower-status population)? scipy's stats.linregress only handles a single predictor, but with scikit-learn we can fit multiple predictors and read r2 off model.score. (Later on, for per-predictor statistics such as p-values, we will also use a library called statsmodels – just install it using pip or conda.)

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.datasets import load_boston

boston = load_boston()
print ( boston.feature_names)

X = boston.data
y = boston.target.reshape(-1,1)

# Fit model based on RM - Number of rooms
model   = LinearRegression().fit(X[:,[5]],y)

r2_score = model.score(X[:,[5]],y)
print ( "r-squared = ",r2_score)

Output

['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO' 'B' 'LSTAT']
r-squared =  0.4835254559913343
# Fit model based on 
# 1. RM - Number of rooms
# 2. LSTAT - Percentage of people with lower status

model   = LinearRegression().fit(X[:,[5,12]],y)

r2_score = model.score(X[:,[5,12]],y)
print ( "r-squared = ",r2_score)

Output

r-squared =  0.6385616062603403

See, r2 has increased from 0.48 to 0.63 by including a second predictor. How about a third predictor?

# Fit model based on 
# 1. RM - Number of rooms
# 2. LSTAT - Percentage of people with lower status
# 3. NOX   - Nitric oxide in the air

model   = LinearRegression().fit(X[:,[4,5,12]],y)

r2_score = model.score(X[:,[4,5,12]],y)
print ( "r-squared = ",r2_score)

Output

r-squared =  0.6389103767491082

There is a slight increase in r2 – from 0.6385 to 0.6389.

r2 value – Predictors
0.483525 – number of rooms
0.638561 – number of rooms + lower status population
0.638910 – number of rooms + lower status population + Nitric Oxide in the air

You can keep adding as many predictors as you want and you will observe that the r2 value always increases with every predictor. That doesn't seem right, does it? Let's try something. To the Boston Housing dataset, let's add a column of just random numbers and see if r2 still increases. If it does, then we have a problem.

from sklearn.datasets import load_boston
import numpy as np


boston = load_boston()

# Generate a column of 506 random numbers
x = np.random.normal(100, 20, 506)
x = x.reshape(-1,1)

# and add it to the boston dataset
boston.data = np.append(boston.data,x,axis=1)

# what is the new shape ? It should be 14 columns ( as opposed to the original 13 )
print ( boston.data.shape)

Output

(506, 14)

Now, let’s try the regression with the predictors RM, LSTAT, NOX and the new random variable.

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

X = boston.data
y = boston.target.reshape(-1,1)

model   = LinearRegression().fit(X[:,[4,5,12,13]],y)

r2_score = model.score(X[:,[4,5,12,13]],y)
print ( "r-squared = ",r2_score)

Output

r-squared =  0.6395935817018061

r2 value – Predictors
0.483525 – number of rooms
0.638561 – number of rooms + lower status population
0.638910 – number of rooms + lower status population + Nitric Oxide in the air
0.639007 – number of rooms + lower status population + Nitric Oxide in the air + some random variable

This is crazy, right? Just add any random variable (which is not supposed to have any predictive power) and r2 still increases? You are probably beginning to doubt the usefulness of r2 in the first place. Well, it's not all bad with r2 – it's just that almost any added variable, even a random one, appears to soak up a tiny bit of the variance. To counter this, there is a modified metric called r2 adjusted.

r2 adjusted = 1 - ( (1 - r2) × (n - 1) ) / ( n - p - 1 )

where

  • n = sample size
  • p = number of predictors

Essentially, the new parameter r2 adjusted works by penalizing more parameters. That is why p – the number of predictors is in the denominator.

Unfortunately, scikit learn doesn’t have a dedicated calculation for r2 adjusted. We would have to calculate that on our own. Let’s write a quick function for the same.

def r2_adjusted ( r2, n, p ) :
    r2_adjusted = 1 - ( (1-r2) * (n-1) / (n-p-1))
    return r2_adjusted

Now, let’s rerun those 4 scenarios again.

def r2_adjusted ( r2, n, p ) :
    r2_adjusted = 1 - ( (1-r2) * (n-1) / (n-p-1))
    return r2_adjusted

from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.datasets import load_boston

boston = load_boston()
print ( boston.feature_names)

X = boston.data
y = boston.target.reshape(-1,1)

# Fit model based on RM - Number of rooms
model   = LinearRegression().fit(X[:,[5]],y)

r2_score    = model.score(X[:,[5]],y)
r2_adjusted = r2_adjusted(r2_score,X.shape[0],1)
print ( "r-squared = ",r2_score)
print ( "r-squred adjusted = ", r2_adjusted)

Output

['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO' 'B' 'LSTAT']
r-squared =  0.4835254559913343
r-squared adjusted =  0.48250070491195196
def r2_adjusted ( r2, n, p ) :
    r2_adjusted = 1 - ( (1-r2) * (n-1) / (n-p-1))
    return r2_adjusted

# Fit model based on 
# 1. RM - Number of rooms
# 2. LSTAT - Percentage of people with lower status

model   = LinearRegression().fit(X[:,[5,12]],y)

r2_score = model.score(X[:,[5,12]],y)
r2_adjusted = r2_adjusted(r2_score,X.shape[0],2)
print ( "r-squared = ",r2_score)
print ( "r-waured adjusted = ", r2_adjusted)

Output

r-squared =  0.6385616062603403
r-squared adjusted =  0.6371244754701231
def r2_adjusted ( r2, n, p ) :
    r2_adjusted = 1 - ( (1-r2) * (n-1) / (n-p-1))
    return r2_adjusted

# Fit model based on 
# 1. RM - Number of rooms
# 2. LSTAT - Percentage of people with lower status
# 3. NOX   - Nitric oxide in the air

model   = LinearRegression().fit(X[:,[4,5,12]],y)

r2_score = model.score(X[:,[4,5,12]],y)
r2_adjusted = r2_adjusted(r2_score,X.shape[0],3)
print ( "r-squared = ",r2_score)
print ( "r-squared adjusted = ", r2_adjusted)

Output

r-squared =  0.6389103767491082
r-squared adjusted =  0.6367524706

Now, add the random variable.

from sklearn.datasets import load_boston
import numpy as np


boston = load_boston()

# Generate a column of 506 random numbers
x = np.random.normal(100, 20, 506)
x = x.reshape(-1,1)

# and add it to the boston dataset
boston.data = np.append(boston.data,x,axis=1)

# what is the new shape ? It should be 13 columns ( as opposed to the original 12 )
print ( boston.data.shape)
X = boston.data


Output

(506, 14)
def r2_adjusted ( r2, n, p ) :
    r2_adjusted = 1 - ( (1-r2) * (n-1) / (n-p-1))
    return r2_adjusted

# Fit model based on 
# 1. RM - Number of rooms
# 2. LSTAT - Percentage of people with lower status
# 3. NOX   - Nitric oxide in the air
# 4. Random variable 

model   = LinearRegression().fit(X[:,[4,5,12,13]],y)

r2_score = model.score(X[:,[4,5,12,13]],y)
r2_adjusted = r2_adjusted(r2_score,X.shape[0],4)
print ( "r-squared = ",r2_score)
print ( "r-squared adjusted = ", r2_adjusted)

Output

r-squared =  0.639463625000628
r-squared adjusted =  0.6365850911

r2 value – r2 adjusted – Predictors
0.483525 – 0.482500 – number of rooms
0.638561 – 0.637124 – number of rooms + lower status population
0.638910 – 0.636752 – number of rooms + lower status population + Nitric Oxide in the air
0.639007 – 0.636125 – number of rooms + lower status population + Nitric Oxide in the air + some random variable

What this goes to show is that adding more predictors does not necessarily increase the explanatory power of the model. r2 adjusted accounts for this by incorporating a penalty for the number of predictors (the more the predictors, the larger the penalty).

When you add a predictor, it should add significant value. If it doesn't (for example, adding NOX – Nitric Oxide – as a predictor), r2 adjusted tells you that it is a worthless predictor and you are better off getting rid of it.


RMSE – Root Mean Square error

While r2 is a relative measure of fit, RMSE is an absolute measure of fit – lower values indicate a better fit. Calculating RMSE is quite simple, and it is quite similar to calculating a standard deviation (the square root of a variance). While the standard deviation measures how far the data is spread around its mean, RMSE measures how far the predictions are from the actuals.

So, to calculate the RMSE, you just take the residuals from the model, square them, average them, and take the square root (quite similar to what you do with the differences from the mean when computing a standard deviation).
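For intuition, here is a tiny sketch that computes the RMSE by hand for the 5-point sample from earlier (residuals of -6, 2, 10, -2 and -4):

from math import sqrt

residuals = [-6, 2, 10, -2, -4]                            # actual - predicted
mse  = sum( r ** 2 for r in residuals ) / len(residuals)   # 160 / 5 = 32
rmse = sqrt(mse)
print ( rmse )   # ~5.66

On the Boston Housing data, sklearn's mean_squared_error does the same thing for us: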

# from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_squared_error, r2_score
from math import sqrt
from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_boston

boston = load_boston()
print ( boston.feature_names)

X = boston.data
y = boston.target.reshape(-1,1)

# Fit model based on RM - Number of rooms
model   = LinearRegression().fit(X[:,[5]],y)

r2_score = model.score(X[:,[5]],y)
print ( "r-squared = ",r2_score)

y_predict = model.predict(X[:,[5]])
mse       = mean_squared_error(y,y_predict)
rmse      = sqrt(mse)
print ( "RMSE = ",rmse)

# rms = sqrt(mean_squared_error(y, y_predict))

Output

['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO' 'B' 'LSTAT']
r-squared =  0.4835254559913343
RMSE =  6.603071389222561

Now, you might have a question here – when you already have RMSE to measure the fit, why do you need another metric like r2? Well, you can't compare the growth of an elephant and a horse in absolute terms. RMSE is in the units of the actual response variable, while r2 is on a uniform scale (0 to 1).


Feature Selection

In the Boston Housing dataset example, there are 13 predictors and 1 response variable.

from sklearn.datasets import load_boston

boston = load_boston()

print ( boston.feature_names )
['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO' 'B' 'LSTAT'] 

And the response variable is MEDV – Median house value.

Previously, we have seen some examples of predicting the house value based on a random set of predictors –

  • RM – Number of rooms
  • LSTAT – Percentage of people with lower status
  • NOX – Nitric oxide content etc.

However, we know that not all these variables have an equal say. So, how do we know which of the predictors have the most impact in predicting the house price ?

The process of identifying the features (predictors) that have the most predictive power is called Feature Selection.

One systematic way of doing this is Stepwise Regression. Let's explore it in the next section.


Stepwise Regression

In stepwise regression, we select a parameter with which to evaluate the fit. Based on how well the model fits, we add or remove predictors (features) until we reach a point where we can no longer improve the model based on the selected parameter.

What kind of parameters can be used to evaluate the model ? There are many choices like

  • p-value
  • r-squared
  • r-squared (adjusted )
  • F-tests or T-tests etc

For now, since we are already familiar with the p-value, let's choose it as our selection criterion.

A better library for showing p-values of individual features (predictors) is statsmodels. Let's install it.

 pip install statsmodels

import statsmodels.api as sm
import numpy as np
from sklearn.datasets import load_boston

boston = load_boston()

X = boston.data
y = boston.target

# precision = 3 - only show until 3 decimals
# suppress = True - suppresses scientific notation ( e-00x )
# linewidth - expands the default width ( 75 ) of printing each line

np.set_printoptions(precision=3, suppress=True, linewidth=150)
print(boston.data[0:5,:])

Output

[[  0.006  18.      2.31    0.      0.538   6.575  65.2     4.09    1.    296.     15.3   396.9     4.98 ]
 [  0.027   0.      7.07    0.      0.469   6.421  78.9     4.967   2.    242.     17.8   396.9     9.14 ]
 [  0.027   0.      7.07    0.      0.469   7.185  61.1     4.967   2.    242.     17.8   392.83    4.03 ]
 [  0.032   0.      2.18    0.      0.458   6.998  45.8     6.062   3.    222.     18.7   394.63    2.94 ]
 [  0.069   0.      2.18    0.      0.458   7.147  54.2     6.062   3.    222.     18.7   396.9     5.33 ]]
model = sm.OLS(y,X).fit()
model.summary()

Output

OLS Regression Results


Dep. Variable:	y	R-squared:	0.959
Model:	OLS	Adj. R-squared:	0.958
Method:	Least Squares	F-statistic:	891.3
Date:	Wed, 01 May 2019	Prob (F-statistic):	0.00
Time:	19:32:44	Log-Likelihood:	-1523.8
No. Observations:	506	AIC:	3074.
Df Residuals:	493	BIC:	3128.
Df Model:	13		
Covariance Type:	nonrobust

		
coef	std err	t	P>|t|	[0.025	0.975]
x1	-0.0929	0.034	-2.699	0.007	-0.161	-0.025
x2	0.0487	0.014	3.382	0.001	0.020	0.077
x3	-0.0041	0.064	-0.063	0.950	-0.131	0.123
x4	2.8540	0.904	3.157	0.002	1.078	4.630
x5	-2.8684	3.359	-0.854	0.394	-9.468	3.731
x6	5.9281	0.309	19.178	0.000	5.321	6.535
x7	-0.0073	0.014	-0.526	0.599	-0.034	0.020
x8	-0.9685	0.196	-4.951	0.000	-1.353	-0.584
x9	0.1712	0.067	2.564	0.011	0.040	0.302
x10	-0.0094	0.004	-2.395	0.017	-0.017	-0.002
x11	-0.3922	0.110	-3.570	0.000	-0.608	-0.176
x12	0.0149	0.003	5.528	0.000	0.010	0.020
x13	-0.4163	0.051	-8.197	0.000	-0.516	-0.317



      Omnibus:	 204.082	Durbin-Watson:	   0.999
Prob(Omnibus):	 0.000	        Jarque-Bera (JB):  1374.225
         Skew:	 1.609	        Prob(JB):	   3.90e-299
     Kurtosis:	 10.404	        Cond. No.	   8.50e+03

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 8.5e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Since we have chosen the p-value as our key assessment criterion, we are interested in the P>|t| column of the table above.

Irrespective of the selection criteria, there are 2 basic methods in stepwise regression.

  • Backward Elimination
  • Forward Selection

Backward Elimination

As we can see from the table above, not all predictors have the same p-value. So, in backward elimination, we start by eliminating the predictor with the worst value of the chosen parameter (here, the highest p-value) and re-evaluate the model.

# Eliminate INDUS predictor. 
X = boston.data[:,[0,1,3,4,5,6,7,8,9,10,11,12]]
model = sm.OLS(y,X).fit()
model.summary()
OLS Regression Results
Dep. Variable:	y	R-squared:	0.959
Model:	OLS	Adj. R-squared:	0.958
Method:	Least Squares	F-statistic:	967.5
Date:	Wed, 01 May 2019	Prob (F-statistic):	0.00
Time:	19:32:51	Log-Likelihood:	-1523.8
No. Observations:	506	AIC:	3072.
Df Residuals:	494	BIC:	3122.
Df Model:	12		
Covariance Type:	nonrobust
		
      coef	std err	t	P>|t|	[0.025	0.975]
x1	-0.0928	0.034	-2.701	0.007	-0.160	-0.025
x2	0.0488	0.014	3.412	0.001	0.021	0.077
x3	2.8482	0.898	3.171	0.002	1.083	4.613
x4	-2.9275	3.222	-0.909	0.364	-9.258	3.403
x5	5.9318	0.303	19.555	0.000	5.336	6.528
x6	-0.0073	0.014	-0.527	0.598	-0.034	0.020
x7	-0.9655	0.189	-5.099	0.000	-1.337	-0.593
x8	0.1723	0.064	2.687	0.007	0.046	0.298
x9	-0.0095	0.004	-2.693	0.007	-0.016	-0.003
x10	-0.3930	0.109	-3.607	0.000	-0.607	-0.179
x11	0.0149	0.003	5.544	0.000	0.010	0.020
x12	-0.4165	0.051	-8.225	0.000	-0.516	-0.317



      Omnibus:	204.123	   Durbin-Watson:	0.999
Prob(Omnibus):	0.000	Jarque-Bera (JB):	1374.966
         Skew:	1.609	        Prob(JB):	2.69e-299
     Kurtosis:	10.406	        Cond. No.	8.16e+03

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 8.16e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
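This eliminate-and-refit cycle can be repeated until every remaining predictor clears the chosen threshold. Here is a minimal sketch of how such a loop might look with statsmodels, assuming a 0.05 cut-off (the surviving set depends on the threshold you pick; the rest of this section continues with the hand-picked set shown next):

import numpy as np
import statsmodels.api as sm
from sklearn.datasets import load_boston

boston = load_boston()
X = boston.data
y = boston.target

# start with all 13 predictors ( no intercept, mirroring the OLS calls above )
remaining = list(range(X.shape[1]))

while True:
    ols   = sm.OLS(y, X[:, remaining]).fit()
    worst = int(np.argmax(ols.pvalues))          # position of the highest p-value
    if ols.pvalues[worst] <= 0.05:               # everything left is significant - stop
        break
    del remaining[worst]                         # drop the worst predictor and refit

print ( [ boston.feature_names[i] for i in remaining ] )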

After eliminating features in this step-wise fashion, we are left with the following predictors, all with low enough p-values.

# Include only CHAS, RM, DIS, PTRATIO, B, LSTAT
X = boston.data[:,[3,5,7,10,11,12]]

model = sm.OLS(y,X).fit()
model.summary()

Output

OLS Regression Results

Dep. Variable:	y	R-squared:	0.957
Model:	OLS	Adj. R-squared:	0.956
Method:	Least Squares	F-statistic:	1853.
Date:	Wed, 01 May 2019	Prob (F-statistic):	0.00
Time:	19:32:55	Log-Likelihood:	-1537.2
No. Observations:	506	AIC:	3086.
Df Residuals:	500	BIC:	3112.
Df Model:	6		
Covariance Type:	nonrobust
		
    coef	std err	t	P>|t|	[0.025	0.975]
x1	2.9135	0.907	3.211	0.001	1.131	4.696
x2	5.7809	0.237	24.361	0.000	5.315	6.247
x3	-0.4197	0.123	-3.416	0.001	-0.661	-0.178
x4	-0.6254	0.091	-6.855	0.000	-0.805	-0.446
x5	0.0153	0.003	6.023	0.000	0.010	0.020
x6	-0.5028	0.041	-12.398	0.000	-0.583	-0.423


Omnibus:	185.681	Durbin-Watson:	        1.014
Prob(Omnibus):	0.000	Jarque-Bera (JB):	1105.939
Skew:	        1.479	Prob(JB):	        7.05e-241
Kurtosis:	9.611	Cond. No.	        1.48e+03

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.48e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

After having eliminated parameters with high p-value, we have narrowed it down to the following parameters.

  • CHAS – Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
  • RM – average number of rooms per dwelling
  • DIS – weighted distances to five Boston employment centres
  • PTRATIO – pupil-teacher ratio by town
  • B – 1000(Bk – 0.63)^2 where Bk is the proportion of blacks by town
  • LSTAT – lower status of the population

Now, let’s try and fit the linear regression with just these parameters.

# Fit model based on 
# 1. RM - Number of rooms
# 2. LSTAT - Percentage of people with lower status
# 3. DIS   -  weighted distances to five Boston employment centres
# 4. PRATIO - pupil-teacher ratio by town 
# 5. B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
# 6. CHAS - Charles river dummy variable

from sklearn.metrics import mean_squared_error, r2_score
from math import sqrt
from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_boston
import statsmodels.api as sm

boston = load_boston()

X = boston.data
y = boston.target

X_reduced = X[:,[3,5,7,10,11,12]]
X_reduced = sm.add_constant(X_reduced)
model   = LinearRegression().fit(X_reduced,y)

r2_score = model.score(X_reduced,y)
print ( "r-squared = ",r2_score)

Output

r-squared =  0.7074867589886451

In case you are wondering why the r2 value is different from the Statsmodels’ OLS method (0.957) vs the sklearn’s LinearRegression method (0.707), it is because statsmodels by default does not include the intercept in determining the linear regression equation. You can see that in the help section of OLS documentation.

You can easily fix this by including a constant in the data itself. Statsmodels actually provides a method for that add_constant.

# Include only CHAS, RM, DIS, PTRATIO, B, LSTAT
X = boston.data
X_reduced = X[:,[3,5,7,10,11,12]]
X_reduced = sm.add_constant(X_reduced)

model = sm.OLS(y,X_reduced).fit()
model.summary()

Output

OLS Regression Results
Dep. Variable:	y	R-squared:	0.707
Model:	OLS	Adj. R-squared:	0.704
Method:	Least Squares	F-statistic:	201.2
Date:	Wed, 01 May 2019	Prob (F-statistic):	1.01e-129
Time:	19:33:07	Log-Likelihood:	-1529.2
No. Observations:	506	AIC:	3072.
Df Residuals:	499	BIC:	3102.
Df Model:	6		
Covariance Type:	nonrobust
		
 
         Coef  std err	t       P>|t|	[0.025	0.975]
const	17.0082	4.255	3.997	0.000	8.648	25.368
x1	2.7209	0.895	3.039	0.003	0.962	4.480
x2	4.4045	0.416	10.582	0.000	3.587	5.222
x3	-0.5574	0.126	-4.428	0.000	-0.805	-0.310
x4	-0.9049	0.114	-7.946	0.000	-1.129	-0.681
x5	0.0115	0.003	4.316	0.000	0.006	0.017
x6	-0.6039	0.047	-12.769	0.000	-0.697	-0.511


Omnibus:	171.907	  Durbin-Watson:	1.039
Prob(Omnibus):	0.000	  Jarque-Bera (JB):	785.723
Skew:	        1.447	  Prob(JB):	        2.41e-171
Kurtosis:	8.375	  Cond. No.	        7.08e+03

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.08e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

Now you see that the r2 value matches with sklearn’s LinearRegression method.

Stepwise regression with backward elimination (based on p-values) is just one of the methods. There are many other methods that we will be learning as we progress through the course.


Accuracy of the model

We have seen enough statistics for now. But we haven’t actually predicted anything yet, right ? Using the 5 parameters that we have narrowed down to, let’s start predicting median house prices and see how accurate our model is.

In order to predict data, we first need to train the model ( which we have done already ). However, instead of training the model on the entire dataset, we split the data into training and test data sets. This process is also called Train/Test split. And as you might have guessed already, sklearn already has methods for doing this.


Training and Test datasets

How many rows of data do we have in Boston Housing dataset ?

from sklearn.datasets import load_boston

boston = load_boston()

boston.data.shape

Output

(506, 13)

506 rows. That’s all the data we have. Now, let’s split this data into training and test datasets. What is the ratio of the split ?

Typically an 80-20% split should do. You can also do a 75-25% split. The exact split is mostly based on the size of the dataset and a bit of trial and error.

Essentially, this is what we are trying to do.

sklearn’s train_test_split will ensure that the split is made according to the percentage specified and that the rows are random.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(boston.data, boston.target, test_size=0.2)

print ( "X_train size", X_train.shape)
print ( "X_test size", X_test.shape)
print ( "y_train size", y_train.shape)
print ( "y_train size", y_test.shape)

print ( "ratio = ", X_test.shape[0]/boston.data.shape[0])

Output

X_train size (404, 13)
X_test size (102, 13)
y_train size (404,)
y_test size (102,)
ratio =  0.2015810276679842

train_test_split has done a good split @ 20% test data. Now, that we know how to do the split, let’s do it on just the predictors that we need.

# Fit model based on 
# 1. RM - Number of rooms
# 2. LSTAT - Percentage of people with lower status
# 3. DIS   -  weighted distances to five Boston employment centres
# 4. PRATIO - pupil-teacher ratio by town 
# 5. B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town

from sklearn.metrics import mean_squared_error, r2_score
from math import sqrt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_boston

boston = load_boston()

# Use only the predictors specified above. 
X = boston.data[:,[5,7,10,11,12]]

X_train, X_test, y_train, y_test = train_test_split(X, boston.target, test_size=0.2)

model   = LinearRegression().fit(X_train,y_train)

Now that we have trained the model, let’s test it.

y_pred = model.predict(X_test)

Now, let’s find out how accurate our predictions are. A quick and dirty way to visually see this would be to plot the predicted vs actuals on a scatterplot

import matplotlib.pyplot as plt
%matplotlib inline

plt.scatter ( y_pred, y_test)
plt.xlabel  ("Predictions")
plt.ylabel  ("Actuals" )

Output

Text(0, 0.5, 'Actuals')

If the predictions were 100% correct, we would get a perfect 45-degree line. Let's draw that as a baseline to compare the predictions against.

import matplotlib.pyplot as plt
%matplotlib inline

plt.scatter ( y_pred, y_test)
plt.xlabel  ("Predictions")
plt.ylabel  ("Actuals" )

x = [0,50]
y = [0,50]

plt.plot(x,y,color="r")

Output

That’s not a bad fit. Let’s calculate the r2 to have a numeric estimate of how good of a fit we have.

model.score(X_test,y_test)


Output
0.6780106182290351

That's an r2 of about 0.68 (the exact number will vary a bit with the random train/test split). So, we can say that we are roughly 68% accurate in our prediction.


Polynomial Regression

So far, we have seen examples of Linear Regression (both simple and multiple). However, not all types of data can be fit with Linear Regression. For example, population growth is a good example of non-linear data – probably a better phrase for it is exponential growth. Let's take a simple example – the population of India and how it is projected to grow in this century. (This is real data taken from the United Nations at https://population.un.org/wpp/Download/Standard/Population/ )

import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

india = pd.read_csv("../../data/india_population.csv")
india.head()

Output

        year	population
0	1950	376325200
1	1951	382245303
2	1952	388538620
3	1953	395160091
4	1954	402077026

Let’s plot these 150 years of data ( actuals from 1950 to 2015 and projections up to 2100 ).

plt.plot(india["year"],india["population"])
plt.xlabel("Year")
plt.ylabel("Population - in Billions")

Output

Text(0, 0.5, 'Population - in Billions')

Can you imagine trying to fit this data using Linear Regression ? It is going to be a rough approximation after – say – 2040, where the curve stops looking linear. Let’s try it and see how well it fits.

from sklearn.linear_model import LinearRegression

model   = LinearRegression().fit(india[["year"]],india["population"])

# extract the slope and intercept of the fitted straight line
slope       = model.coef_
intercept   = model.intercept_

# evaluate the fitted line at the two ends of the date range
point_1 = slope*1950 + intercept
point_2 = slope*2100 + intercept

# plot the data along with the fitted straight line
plt.plot(india["year"],india["population"])
plt.plot([1950,2100], [point_1,point_2],color="r")

Output

Well, it does an OK job given that the model is linear, but can we account for the curve somehow ?

This is exactly where polynomials come in. sklearn has a built-in transformer for this – PolynomialFeatures. Let’s use it to fit the data.

from sklearn.preprocessing import PolynomialFeatures
import numpy as np

poly_2   = PolynomialFeatures(degree=2)
poly_2_X = poly_2.fit_transform(india[["year"]])
np.set_printoptions(precision=3, suppress=True, linewidth=150)
print ( poly_2_X[0:5,:] )

Output

[[      1.    1950. 3802500.]
 [      1.    1951. 3806401.]
 [      1.    1952. 3810304.]
 [      1.    1953. 3814209.]
 [      1.    1954. 3818116.]]

What is really happening here ? A single column ( year ) has now become 3 columns – a constant term ( year^0 ), the year itself ( year^1 ) and the year squared ( year^2 ).
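To convince ourselves of that, we could rebuild the first transformed row by hand with plain numpy – a minimal sketch.

import numpy as np

year = 1950
# constant term, the year itself and the year squared - matches the first row printed above
print(np.array([year**0, year**1, year**2]))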

Nothing has been predicted yet – we have just transformed the predictors. Now, let’s try to fit the new predictors to the response variable.

model_linear   = LinearRegression().fit(india[["year"]],india["population"])
model_2_poly   = LinearRegression().fit(poly_2_X,india["population"])

data,       = plt.plot(india["year"], india["population"], color="blue", label="data")
linear_fit, = plt.plot(india["year"], model_linear.predict(india[["year"]]), color = 'red', label="linear fit")
poly_2_fit,   = plt.plot(india["year"], model_2_poly.predict(poly_2_X), color = 'green', label="polynomial fit")

plt.legend((data, linear_fit, poly_2_fit), ('Data', 'Linear Fit', 'Polynomial fit (degree = 2)'))

Output

Visually, you can see the green line (polynomial fit) seems to follow the blue line(data) much better than the red line(linear fit), right ? How about we go one step further and see if increasing the polynomial degree would make this any better.

from sklearn.preprocessing import PolynomialFeatures
import numpy as np

poly_3   = PolynomialFeatures(degree=3)
poly_3_X = poly_3.fit_transform(india[["year"]])
np.set_printoptions(precision=3, suppress=True, linewidth=150)
print ( poly_3_X[0:5,:] )

Output

[[1.000e+00 1.950e+03 3.802e+06 7.415e+09]
 [1.000e+00 1.951e+03 3.806e+06 7.426e+09]
 [1.000e+00 1.952e+03 3.810e+06 7.438e+09]
 [1.000e+00 1.953e+03 3.814e+06 7.449e+09]
 [1.000e+00 1.954e+03 3.818e+06 7.461e+09]]

model_3_poly   = LinearRegression().fit(poly_3_X,india["population"])

data,         = plt.plot(india["year"], india["population"], color="blue")
linear_fit,   = plt.plot(india["year"], model_linear.predict(india[["year"]]), color = 'red')
poly_2_fit,   = plt.plot(india["year"], model_2_poly.predict(poly_2_X), color = 'green')
poly_3_fit,   = plt.plot(india["year"], model_3_poly.predict(poly_3_X), color = 'orange')

plt.legend((data, linear_fit, poly_2_fit, poly_3_fit), ('Data', 'Linear Fit', 
                                                        'Poly fit (degree = 2)', 'Poly fit (degree = 3)'))

The orange line (polynomial fit, degree = 3) is hugging the actual data curve (blue) much closer than the green line ( polynomial fit , degree = 2 ), right ? Let’s do the numbers ( r2 score ) as well, to really know the degree of fit.

r2_linear = model_linear.score(india[["year"]],india["population"])
r2_poly_2 = model_2_poly.score(poly_2_X,india["population"])
r2_poly_3 = model_3_poly.score(poly_3_X,india["population"])

print ( r2_linear, " --> R squared - Linear fit")
print ( r2_poly_2, " --> R squared - polynomial fit ( degree = 2)")
print ( r2_poly_3, " --> R squared - polynomial fit ( degree = 3)")

Output

0.8434443860759384  --> R squared - Linear fit
0.9776830799010624  --> R squared - polynomial fit ( degree = 2)
0.9953977657287687  --> R squared - polynomial fit ( degree = 3)

Challenges

Challenge 1

Use polynomial regression on the Boston Housing dataset to model the relationship between the following variables.

  • LSTAT ( Lower Status Population percentage )
  • MEDV ( Median house value )

After modeling,

  • Plot both linear and polynomial model ( degree = 2 ) to visually show how they perform
  • Compare the r2 values between linear and polynomial model
  • Verify if increasing the degree of the polynomial regression ( say degree = 3 or 4 ) increases the performance.

Solution

from sklearn.datasets import load_boston
boston = load_boston()
import matplotlib.pyplot as plt
%matplotlib inline

plt.scatter(boston.data[:,12],boston.target)
plt.xlabel("LSTAT - Lower Status Population ")
plt.ylabel("MEDV  - Median house price")

Output

Text(0, 0.5, 'MEDV  - Median house price')

Let’s first try to fit this data with a linear regression model.

# Linear Regression
from sklearn.linear_model import LinearRegression

model_linear   = LinearRegression().fit(boston.data[:,12].reshape(-1,1),boston.target)
plt.scatter(boston.data[:,12],boston.target)
plt.scatter(boston.data[:,12],model_linear.predict(boston.data[:,12].reshape(-1,1)))
plt.xlabel("LSTAT - Lower Status Population ")
plt.ylabel("MEDV  - Median house price")

Output

Text(0, 0.5, 'MEDV  - Median house price')

This dataset doesn’t look all that linear, right ? Let’s see how the r2 score compares.

model_linear.score(boston.data[:,12].reshape(-1,1),boston.target)

#Output
0.5441462975864797

That’s an r2 of about 0.54. Now, let’s try a polynomial fit – say, second degree.

from sklearn.preprocessing import PolynomialFeatures

# Transform linear to polynomial features
poly_2   = PolynomialFeatures(degree=2)
poly_2_X = poly_2.fit_transform(boston.data[:,12].reshape(-1,1))

model_2_poly   = LinearRegression().fit(poly_2_X,boston.target)
# Plot the actual data along with the fit
plt.scatter(boston.data[:,12],boston.target,alpha=0.25)

linear_fit    = plt.scatter(boston.data[:,12], model_linear.predict(boston.data[:,12].reshape(-1,1)), color = 'red')
poly_2_fit    = plt.scatter(boston.data[:,12], model_2_poly.predict(poly_2_X), color = 'green')
model_2_poly.score(poly_2_X,boston.target)

#Output
0.6407168971636612

0.544 – r2 score with linear modeling
0.640 – r2 score with polynomial modeling

That seems to be some improvement, right ?

from sklearn.preprocessing import PolynomialFeatures

# Transform linear to polynomial features
poly_3   = PolynomialFeatures(degree=3)
poly_3_X = poly_3.fit_transform(boston.data[:,12].reshape(-1,1))

model_3_poly   = LinearRegression().fit(poly_3_X,boston.target)
# Plot the actual data along with the fit
plt.scatter(boston.data[:,12],boston.target,alpha=0.25)

linear_fit   = plt.scatter(boston.data[:,12], model_linear.predict(boston.data[:,12].reshape(-1,1)), color = 'red')
poly_2_fit   = plt.scatter(boston.data[:,12], model_2_poly.predict(poly_2_X), color = 'green')
poly_3_fit   = plt.scatter(boston.data[:,12], model_3_poly.predict(poly_3_X), color = 'cyan')
model_3_poly.score(poly_3_X,boston.target)

#Output
0.6578476405895719

0.544 – r2 score with linear modeling 
0.640 – r2 score with polynomial modeling ( degree = 2 ) 
0.657 – r2 score with polynomial modeling ( degree = 3 )

That’s just a marginal improvement over second degree polynomial regression. So, you don’t need to go beyond second degree.

How about a 10th degree fit ?

from sklearn.preprocessing import PolynomialFeatures

# Transform linear to polynomial features
poly_10   = PolynomialFeatures(degree=10)
poly_10_X = poly_10.fit_transform(boston.data[:,12].reshape(-1,1))

model_10_poly   = LinearRegression().fit(poly_10_X,boston.target)
# Plot the actual data along with the fit
plt.scatter(boston.data[:,12],boston.target,alpha=0.25)

linear_fit    = plt.scatter(boston.data[:,12], model_linear.predict(boston.data[:,12].reshape(-1,1)), color = 'red')
poly_2_fit    = plt.scatter(boston.data[:,12], model_2_poly.predict(poly_2_X), color = 'green')
poly_10_fit   = plt.scatter(boston.data[:,12], model_10_poly.predict(poly_10_X), color = 'cyan')
model_10_poly.score(poly_10_X,boston.target)

#Output
0.6844527059749335

0.544 – r2 score with linear modeling 
0.640 – r2 score with polynomial modeling ( degree = 2 ) 
0.657 – r2 score with polynomial modeling ( degree = 3 ) 
0.684 – r2 score with polynomial modeling ( degree = 10 ) 

See, even if you go to a 10th degree polynomial fitting, the improvement in r2 is just about 0.04 from a second degree polynomial fitting.
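To see the diminishing returns more systematically, we could loop over a range of degrees and print the r2 for each. A minimal sketch on the same LSTAT column – the scores for degrees 2, 3 and 10 should line up with the ones computed above.

from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

boston = load_boston()
X = boston.data[:, 12].reshape(-1, 1)   # LSTAT
y = boston.target                       # MEDV

# fit one polynomial regression per degree and print its r2 on the full data
for degree in range(1, 11):
    poly_X = PolynomialFeatures(degree=degree).fit_transform(X)
    score  = LinearRegression().fit(poly_X, y).score(poly_X, y)
    print(degree, "-->", round(score, 3))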


Overfitting

Sometimes, when there is not enough data, the model might tend to overfit. Look at the example data below. We are simulating a sine wave.

import numpy as np
import math

x = np.linspace(10,100,num=20)
y = np.sin(x*math.pi/180)    # x is in degrees - convert to radians for np.sin

Let’s plot it to see how it looks.

import matplotlib.pyplot as plt
%matplotlib inline

plt.scatter(x,y)

Output

Let’s introduce a bit of random noise ( variance ) to make the data a bit more realistic.

variance = np.random.rand(20) / 2

y = y + variance
import matplotlib.pyplot as plt
%matplotlib inline

plt.scatter(x,y)

Output

Now, let’s see how a 2nd degree polynomial regression fits.

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Transform linear to polynomial features
poly_2   = PolynomialFeatures(degree=2)
poly_2_X = poly_2.fit_transform(x.reshape(-1,1))

model_2_poly   = LinearRegression().fit(poly_2_X,y)
# Plot the actual data along with the fit
plt.scatter(x,y)

poly_2_fit   = plt.plot(x, model_2_poly.predict(poly_2_X), color = 'cyan')
plt.title("Good fit")

Output

Text(0.5, 1.0, 'Good fit')

This looks to be a good fit, right ? Now, let’s try a much higher order polynomial fit ( say degree = 15 ).

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Transform linear to polynomial features
poly_10   = PolynomialFeatures(degree=15)
poly_10_X = poly_10.fit_transform(x.reshape(-1,1))

model_10_poly   = LinearRegression().fit(poly_10_X,y)
# Plot the actual data along with the fit
plt.scatter(x,y)

poly_10_fit   = plt.plot(x, model_10_poly.predict(poly_10_X), color = 'cyan')
plt.title("Over fit")

Output

Text(0.5, 1.0, 'Over fit')

Let’s put these two figures together to compare.

Overfitting typically happens when the model tries to work too hard for the data. And why is it a problem ? An overfit model follows the training data too closely and hence will not work well on new data. Think of overfitting as localizing the solution to the training dataset – the model is more or less memorizing the data, not generalizing a solution for it. Obviously, it will not perform as well when it is used on real data. We will see more examples of this when we look at other machine learning algorithms down the line.

If you are wondering why the simple linear regression learns the model just enough while the higher degree polynomial regression over-learns it, that is because the higher order polynomial has the flexibility to learn more ( compared to a linear or second order polynomial ). This is actually good, except that it is not able to discern noise from data. Let’s increase the dataset size to see if the same 15th degree polynomial regression performs better than a second order one.

Increase sample size to 1000
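A minimal sketch of this step – regenerate the sine data with 1000 points, refit the 15th degree polynomial and plot it ( the plotting details here are just illustrative ).

import math
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# regenerate the noisy sine wave with 1000 points instead of 20
x = np.linspace(10, 100, num=1000)
y = np.sin(x * math.pi / 180) + np.random.rand(1000) / 2

# refit the 15th degree polynomial on the larger sample and plot it
poly_15_X = PolynomialFeatures(degree=15).fit_transform(x.reshape(-1, 1))
model_15  = LinearRegression().fit(poly_15_X, y)

plt.scatter(x, y, alpha=0.25)
plt.plot(x, model_15.predict(poly_15_X), color="cyan")
plt.title("15th degree fit, 1000 points")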

You see, data size did not matter – the 15th order polynomial regression still overfits the data. The reason is that, for this amount of data and noise, a second or third degree polynomial already has enough power to capture the complexity of the data. For more complicated data sets, an increase in degree might capture the complexity better.


Detect Overfitting

If you look at the pictures above, we were able to clearly see an overfit. This is because it is a 2 dimensional dataset. Most real-life data is multi-dimensional, in which case we will not be able to identify an overfit just by looking at a plot. There are basically 2 methods to identify an overfit.

1. Choose a simpler model.

Fit both a simpler and a more complex model and look at a measure of the score ( r2 ) for each – if there is no significant difference, go for the simpler model.

import numpy as np
import math

x = np.linspace(10,100,num=1000)
y = np.sin(x*math.pi/180)

variance = np.random.rand(1000) / 2
y = y + variance

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# 2nd Degree polynomial fit
poly_2   = PolynomialFeatures(degree=2)
poly_2_X = poly_2.fit_transform(x.reshape(-1,1))
model_2_poly   = LinearRegression().fit(poly_2_X,y)

# 15th degree polynomial fit
poly_10   = PolynomialFeatures(degree=15)
poly_10_X = poly_10.fit_transform(x.reshape(-1,1))

model_10_poly   = LinearRegression().fit(poly_10_X,y)
print ( model_2_poly.score(poly_2_X,y) )
print ( model_10_poly.score(poly_10_X,y) )


#Output
0.7528691833579731
0.6915335671211775

The r2 score of the 2nd degree fit is better than the 15th degree polynomial model. So, the simpler model ( 2nd degree model) wins.

2. Check the model performance across Training and Test datasets.

Another method to identify an overfit is by validating how well the model is performing against the training dataset vs the test dataset.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2)

print ( "X_train size", X_train.shape)
print ( "X_test size", X_test.shape)
print ( "y_train size", y_train.shape)
print ( "y_train size", y_test.shape)

print ( "ratio = ", X_test.shape[0]/x.shape[0])


###Output###
X_train size (800,)
X_test size (200,)
y_train size (800,)
y_train size (200,)
ratio =  0.2

Fit a 2nd degree polynomial regression

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Transform linear to polynomial features
poly_2   = PolynomialFeatures(degree=2)
poly_2_X = poly_2.fit_transform(X_train.reshape(-1,1))

model_2_poly   = LinearRegression().fit(poly_2_X,y_train)

Fit a 15th degree polynomial regression

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Transform linear to polynomial features
poly_10   = PolynomialFeatures(degree=15)
poly_10_X = poly_10.fit_transform(X_train.reshape(-1,1))

model_10_poly   = LinearRegression().fit(poly_10_X,y_train)

Compare the r2 scores of both models on the training dataset.

print ( model_2_poly.score(poly_2_X,y_train), " -->r-squared - for 2nd Degree Polynomial regression on training dataset" )


##Output##
0.7527035537094962  -->r-squared - for 2nd Degree Polynomial regression on training dataset

print ( model_10_poly.score(poly_10_X,y_train),  " -->r-squared - for 15th Degree Polynomial regression on training dataset" )


##Output##
0.6899011370541518  -->r-squared - for 15th Degree Polynomial regression on training dataset

poly_2_X_test  = poly_2.transform(X_test.reshape(-1,1))
poly_10_X_test = poly_10.transform(X_test.reshape(-1,1))
print ( model_2_poly.score(poly_2_X_test,y_test),   " -->r-squared - for 2nd Degree Polynomial regression on Test dataset" )
print ( model_10_poly.score(poly_10_X_test,y_test), " -->r-squared - for 15th Degree Polynomial regression on Test dataset" )


##Output##
0.7531367987161364  -->r-squared - for 2nd Degree Polynomial regression on Test dataset
0.6959846261157662  -->r-squared - for 15th Degree Polynomial regression on Test dataset

The thing to look at is how much the r-squared score drops when we move from the training dataset to the test dataset. In the run summarized below, the drop for the 2nd degree model is negligible, while the 15th degree model loses noticeably more.

r-squared score between training and test datasets

Model          r2 score (Train)    r2 score (Test)
2nd Degree     0.761               0.757
15th Degree    0.707               0.661

This is actually not a bad fit for the 2nd degree model, given that there is just a slight decrease between the training and test datasets. However, if there is a large difference between the two scores – as starts to happen with the 15th degree model – the model is overfitting and you are better off discarding it.
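One way to make that check mechanical is to loop over a few degrees and print the training and test r2 side by side – the degree where the test score starts falling away from the training score is where overfitting begins. A minimal sketch, reusing X_train, X_test, y_train and y_test from above ( the list of degrees is just illustrative ).

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

for degree in (1, 2, 5, 10, 15):
    poly         = PolynomialFeatures(degree=degree)
    poly_X_train = poly.fit_transform(X_train.reshape(-1, 1))
    poly_X_test  = poly.transform(X_test.reshape(-1, 1))

    model = LinearRegression().fit(poly_X_train, y_train)
    print(degree,
          "train r2 =", round(model.score(poly_X_train, y_train), 3),
          "  test r2 =", round(model.score(poly_X_test, y_test), 3))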


Prevent Overfitting

Cross Validation
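Cross validation takes the training/test split idea one step further – instead of a single split, the data is divided into k folds, the model is trained on k-1 folds and scored on the held-out fold, and the k scores are averaged. A model that overfits will show a noticeably lower cross-validated score. A minimal sketch using sklearn’s cross_val_score on the sine data from above – the 5-fold, shuffled KFold setup is just an illustrative choice, not part of the original text.

from sklearn.model_selection import cross_val_score, KFold
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# reuse the noisy sine data (x, y) from above
poly_2_X  = PolynomialFeatures(degree=2).fit_transform(x.reshape(-1, 1))
poly_15_X = PolynomialFeatures(degree=15).fit_transform(x.reshape(-1, 1))

# 5-fold cross validation - each fold is scored on data the model did not see
cv = KFold(n_splits=5, shuffle=True, random_state=0)
print("2nd degree  :", cross_val_score(LinearRegression(), poly_2_X,  y, cv=cv).mean())
print("15th degree :", cross_val_score(LinearRegression(), poly_15_X, y, cv=cv).mean())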


Linear Regression Assumptions

Homoskedasticity

In plain English, what this means is that the spread of the errors ( predicted vs actual values ) should be uniform across the entire range of predictions.

Homoskedasticity = Uniform Variance

Let’s take the relationship between the number of rooms (“RM”) and the median price value (“MEDV”) and try to fit a linear regression model.

from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_boston
boston = load_boston()

# Linear regression model
model_linear   = LinearRegression().fit(boston.data[:,5].reshape(-1,1),boston.target)

Let’s try to predict the house price now that our model is ready.

y_pred = model_linear.predict(boston.data[:,5].reshape(-1,1))

Now that we have predicted the housing prices, let’s see how they compare to the actual house prices. The differences between the actual and predicted values are called residuals.

residuals_rm = boston.target - y_pred

Let’s plot the residuals vs the predicted values.

import matplotlib.pyplot as plt
%matplotlib inline

plt.scatter(y_pred,residuals_rm)
plt.xlabel("Predicted Median House Price")
plt.ylabel("Residuals")
plt.title("Fitted vs Residuals")

Output

Text(0.5, 1.0, 'Fitted vs Residuals')

What this plot shows is that the variance ( the spread of the residuals ) is not uniform across the spectrum of house prices. For example, the variance is high between 10K and 35K – but that’s probably because the bulk of the data lies there. You could strip away the outliers, restrict the dataset to that range and try again ( see the sketch below ).
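A minimal sketch of that experiment – the 10K and 35K cut-offs are just eyeballed from the plot above, not values from the original analysis.

from sklearn.datasets import load_boston
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

boston = load_boston()

# keep only the rows whose actual price falls in the 10K - 35K band
mask  = (boston.target >= 10) & (boston.target <= 35)
X_rm  = boston.data[mask, 5].reshape(-1, 1)    # RM - number of rooms
y_med = boston.target[mask]

model_trim     = LinearRegression().fit(X_rm, y_med)
y_pred_trim    = model_trim.predict(X_rm)
residuals_trim = y_med - y_pred_trim

plt.scatter(y_pred_trim, residuals_trim)
plt.xlabel("Predicted Median House Price")
plt.ylabel("Residuals")
plt.title("Fitted vs Residuals ( 10K - 35K only )")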

However, what is the ideal Fitted vs Residuals plot that makes the data homoskedastic ( uniform variance ) ?

import numpy as np
import math

# Generate 100 random integers between 0 and 100.
x = np.random.randint(0,100,100)

variance = np.random.randint(-10,10,100)
y = x + variance
import matplotlib.pyplot as plt
%matplotlib inline

plt.scatter(x,y,alpha=0.1)

Output

from sklearn.linear_model import LinearRegression

# Linear regression model
model_linear   = LinearRegression().fit(x.reshape(-1,1),y)

# predict y
y_pred = model_linear.predict(x.reshape(-1,1))

#residuals
residuals_rm = y - y_pred
import matplotlib.pyplot as plt
%matplotlib inline

plt.scatter(y_pred,residuals_rm)
plt.xlabel("Predicted Median House Price")
plt.ylabel("Residuals")
plt.title("Fitted vs Residuals")

# draw a horizontal line at 0 to show the horizon.
plt.axhline(color='r', linestyle="-")

Output

This is the ideal plot that confirms homoskedasticity – also called the “night sky” plot. If you look at the residuals, they are uniformly spread across the entire range of predicted values. What this means is that we can predict with the same level of accuracy across the entire spectrum.

Now, compare it to the previous Fitted vs Residuals plot to understand it better.


Normal Distribution of Residuals

Another assumption is that the residuals should be normally distributed. You can quickly verify that using a histogram.

plt.hist(residuals_rm,bins=20)


##Output###
(array([ 6.,  9.,  9., 10.,  7.,  2.,  0.,  6.,  6.,  4.,  4.,  4.,  6.,  3.,  5.,  5.,  4.,  4.,  3.,  3.]),
 array([-8.055, -7.111, -6.168, -5.224, -4.28 , -3.337, -2.393, -1.45 , -0.506,  0.437,  1.381,  2.324,  3.268,  4.212,  5.155,  6.099,  7.042,
         7.986,  8.929,  9.873, 10.816]),
 <a list of 20 Patch objects>)

Or using a Q-Q ( Quantile-Quantile ) plot. It is drawn by plotting the quantiles of the sample against the quantiles of a normal distribution.

# First sort the residuals ( sample )
residuals_rm.sort()

# How big is the dataset ?
len(residuals_rm)

# create a normal distribution as big as the residuals dataset
norm=np.random.normal(0,2,len(residuals_rm))

# and sort it
norm.sort()

# Scatter plot of normal distribution vs residuals
plt.scatter(norm,residuals_rm)
plt.title("Quantile - Quantile plot ( Q-Q )")
plt.xlabel("Normal Distribution")
plt.ylabel("Residuals")

Output

Text(0, 0.5, 'Residuals')

What this plot shows is that the residuals are almost normally distributed, except at the fringes. So, if you identify the outliers and trim them, this dataset satisfies the normal distribution of residuals assumption of Linear Regression.
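If scipy is available, the same check can be done in a single call with scipy.stats.probplot, which plots the sample quantiles against theoretical normal quantiles and overlays a reference line – a minimal sketch, reusing residuals_rm from above.

import matplotlib.pyplot as plt
from scipy import stats

# sample quantiles vs theoretical normal quantiles, with a fitted reference line
stats.probplot(residuals_rm, dist="norm", plot=plt)
plt.title("Q-Q plot via scipy.stats.probplot")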


No Multicollinearity

Another big word – but the explanation is a bit tricky as well. Let’s create a dummy dataset to understand this better.

import numpy as np

x = np.random.randint(1,100,100).reshape(-1,1)
y = np.random.randint(1,100,100).reshape(-1,1)
z = x + y

X = np.concatenate((x,y,z),axis=1)
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# statsmodels' VIF calculation expects a constant (intercept) column in the design matrix
data = add_constant(X)

Collinearity is measured using a parameter called VIF ( Variance Inflation Factor ), the formula for which is

VIF = 1 / ( 1 - Rsq2 )

where Rsq2 is the usual r2 of the individual predictor when it is regressed (predicted) against the rest of the predictors. So, in the example above, since z is an exact combination of x and y, its Rsq2 is 1 and the VIF of z becomes 1 / ( 1 - 1 ) = infinity.
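To see where that infinity comes from, we could compute the VIF of z by hand – regress z against x and y, take the r2 and plug it into the formula. A minimal sketch, reusing x, y and z from the block above.

import numpy as np
from sklearn.linear_model import LinearRegression

# regress z against the remaining predictors ( x and y ) and take its r2
xy   = np.concatenate((x, y), axis=1)
r2_z = LinearRegression().fit(xy, z).score(xy, z)

print("r2 of z vs (x, y) =", r2_z)
# VIF = 1 / (1 - r2); guard against dividing by zero when r2 is exactly 1
print("VIF of z          =", 1 / (1 - r2_z) if r2_z < 1 else float("inf"))

statsmodels’ variance_inflation_factor does the same computation for each column.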

for i in range(1,4) :
    vif = variance_inflation_factor(data,i)
    print ( "vif", " = ", vif )

Output

vif  =  inf
vif  =  inf
vif  =  inf

Try this with other combinations of the predictors as well – a sketch of one variation follows.
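For example, we could make z only roughly equal to x + y and recompute the VIFs – a minimal sketch, reusing x and y from above; the noise level is just illustrative.

import numpy as np
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# z is now only approximately a combination of x and y
z_noisy = x + y + np.random.randint(-10, 10, 100).reshape(-1, 1)

data_noisy = add_constant(np.concatenate((x, y, z_noisy), axis=1))
for i in range(1, 4):
    print("vif =", variance_inflation_factor(data_noisy, i))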

The point is, as long as there is a strong relationship between a variable and a combination of the others, it shows up as a VIF ( Variance Inflation Factor ) greater than 2.

Collinearity and correlation are different though. Look at the picture below. There is some correlation between

  • Total Power Consumption <–> Number of rooms
  • Total Power Consumption <–> Power consumption per room

But there is a perfect collinearity between

  • Total Power Consumption <–> Number of rooms & Power consumption/room
Multicollinearity

Correlation is between 2 variables, while (multi)collinearity is between a single variable and a linear combination of other variables.
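A quick numeric illustration on the dummy data above – the pairwise correlation between x and z is strong but well short of 1, yet z is an exact linear combination of x and y. A minimal sketch, reusing x, y and z from the block above.

import numpy as np

# pairwise correlation between x and z - strong, but not perfect
print("corr(x, z) =", np.corrcoef(x.ravel(), z.ravel())[0, 1])

# yet z is exactly x + y, so its collinearity with the pair (x, y) is perfect
print("max |z - (x + y)| =", np.abs(z - (x + y)).max())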

Now that we understand collinearity, let’s find out how many variables have a collinear relationship in the Boston Housing dataset.

from sklearn.datasets import load_boston
boston = load_boston()

data = add_constant(boston.data)
for i in range(1,14) :
    vif = variance_inflation_factor(data,i)
    print ( vif, " - VIF for ",boston.feature_names[i-1])


##Output##
1.7921915474332406  - VIF for  CRIM
2.2987581787494418  - VIF for  ZN
3.9915964183460297  - VIF for  INDUS
1.0739953275537886  - VIF for  CHAS
4.393719847577493  - VIF for  NOX
1.9337444357832574  - VIF for  RM
3.1008255128153372  - VIF for  AGE
3.9559449063727263  - VIF for  DIS
7.484496335274472  - VIF for  RAD
9.00855394759707  - VIF for  TAX
1.7990840492488978  - VIF for  PTRATIO
1.3485210764063758  - VIF for  B
2.9414910780919366  - VIF for  LSTAT

So, the following variables have relatively high collinearity.

  • 7.484496335274472 – VIF for RAD
  • 9.00855394759707 – VIF for TAX

How to fix Collinearity?

  • Iteratively remove the predictor with the highest VIF. So, in this case, remove either TAX or RAD and do a re-run ( the code below removes RAD – column index 8 ).
  • Use dimensionality reduction ( say PCA ) to reduce the number of predictors ( More on PCA & dimensionality reduction later )

np.set_printoptions(precision=3, suppress=True, linewidth=250)

data = np.delete(boston.data,8,axis=1)
data = add_constant(data)
feature_names = np.delete(boston.feature_names,8)
for i in range(1,13) :
    vif = variance_inflation_factor(data,i)
    print ( vif, " - VIF for ",feature_names[i-1])


##Output##
1.664471491773954  - VIF for  CRIM
2.273018308327798  - VIF for  ZN
3.682264743037106  - VIF for  INDUS
1.0615610963150082  - VIF for  CHAS
4.304929331571053  - VIF for  NOX
1.8854245633324191  - VIF for  RM
3.0830086748912335  - VIF for  AGE
3.954950605693307  - VIF for  DIS
3.4152890003342686  - VIF for  TAX
1.7348731617273456  - VIF for  PTRATIO
1.34145900343612  - VIF for  B
2.9377524102330312  - VIF for  LSTAT

Now, all the variables have a VIF of less than 5 ( Less than 2 or 3 is ideal though )
