Regression
Summary : Regression is a Machine Learning technique in which we estimate something ( specifically, numeric values ) based on past experience. The purpose of this section is to cover the basics of regression along with some statistical methods. Consider this an introduction to regression rather than a catalogue of all the regression techniques in Machine Learning.
We will be learning Simple Linear Regression, Multiple Linear Regression and Polynomial Regression in this section. On top of the techniques themselves, we will also cover several statistical concepts ( like the Null Hypothesis, p-value, r-squared, RMSE ) and other key machine learning concepts like Feature Selection and Training/Test data splits that will be useful for evaluating models going forward as well.
Contents
- What is Regression
- Why Regression
- Solve a Regression problem in Python
- Simple Linear Regression
- Multilinear Regression
- Correlation
- p value
- Key Model Evaluation Parameters
- Feature Selection
- Accuracy of the model
- Polynomial Regression
- Overfitting
- Linear Regression Assumptions
- Challenges
What is Regression
In Machine Learning, most problems are classified as supervised vs unsupervised learning. We will first focus on supervised learning algorithms and later work on unsupervised learning algorithms. Supervised learning is once again split into the following 2 groups
- Classification
- Regression
Given a particular height and weight, classify the person as either male or female. This is an example of classification. You are essentially trying to classify the person – in this case – as male or female based on certain characteristics.
In contrast, say you are trying to predict the body fat percentage based on height and weight – this is an example of a regression problem. What is the difference ? Body Fat % is a continuous variable – say it starts at a minimum of 2% (really lean) and can go all the way up to 50 % say (extremely obese) – as opposed to a categorical variable in the example above ( Male or Female ).
Why Regression
If you are learning how to solve a regression problem for the first time, you probably need to understand why you need regression to start with. This is probably the simplest of the regression problems. Let’s start with a simple data set – Swedish auto insurance claims. You can google it or get it from kaggle. It is a very small data set with just 2 variables –
- Number of Claims
- Total payment for these claims ( in thousands )

Claims come first and the settlement happens much later. Assuming these are claims this company receives per week, is there a way we can predict how much the company will end up paying, just based on the number of claims ?
What value does this bring to the company ?
Being able to predict the payment based on the number of claims gives a very good understanding of the company's expenses well in advance.
Why do you need machine learning for this problem ?
Each claim is different – a fender bender claim might cost a thousand dollars while a totaled car could cost 20 grand. The type of claim would make a good predictor, but let's assume we don't have that at this point. Even if we did, a fender bender can cost anywhere from 300 to 2000 dollars based on the severity of damage, the arbitration and several environmental factors. Essentially, there is no easy way to map the claims to the payment. If we tried to do this using some kind of IF-THEN logic, we would be going around in circles.
Solve Regression in Python
import pandas as pd
data = pd.read_csv("../../data/insurance.csv", skiprows = 6, header = None, names=["claim_count","claim_value"])
data.head()
Output
claim_count claim_value
0 108 392.5
1 19 46.2
2 13 15.7
3 124 422.2
4 40 119.4
Looking OK – but since we are reading data from file, we have to ensure that Python is not reading integers as strings or other object types. Let’s quickly verify if the data types are correct.
data.dtypes
Output
claim_count int64
claim_value float64
dtype: object
Looking good. Now, onto LinearRegression. Before we do that, we would have to install the Python package – scikit-learn.
pip install scikit-learn
# or
conda install scikit-learn
The estimator we need for Linear Regression is LinearRegression from sklearn's linear_model module.
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit(data["claim_count"],data["claim_value"])
-------------------------------------------------------------------------- ValueError: Expected 2D array, got 1D array instead: array=[108 19 13 124 40 57 23 14 45 10 5 48 11 23 7 2 24 6 3 23 6 9 9 3 29 7 4 20 7 4 0 25 6 5 22 11 61 12 4 16 13 60 41 37 55 41 11 27 8 3 17 13 13 15 8 29 30 24 9 31 14 53 26]. Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.
As the error message says, scikit-learn expects the predictors as a 2-dimensional array ( rows x columns ). Passing a single-column dataframe – data[["claim_count"]], with double brackets – instead of a 1-D series fixes it.
model.fit(data[["claim_count"]],data["claim_value"])
Output
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
normalize=False)
Our model is ready. Let’s start predicting claims based on the count of claims. We have used the fit method to fit the model – to predict, we will be using the predict method. But before we do that, let’s plot this out to understand what we have done so far.
predict = model.predict(data[["claim_count"]])
slope = model.coef_
slope
Output
array([3.41382356])
intercept = model.intercept_
intercept
Output
19.99448575911478
import matplotlib.pyplot as plt
%matplotlib inline
plt.scatter( data["claim_count"],data["claim_value"])
Output

Linear Regression has already solved this problem for us – we just didn't realize it yet. The parameters ( also called coefficients )
- slope
- intercept
are the solution.
point_1 = slope*0 + intercept
point_2 = slope*120 + intercept
print ( point_1, point_2)
plt.scatter( data["claim_count"],data["claim_value"])
plt.plot([0,120], [point_1,point_2])
Output
[19.99448576] [429.65331297]

How did we get the straight line ?

A straight line can be defined mathematically using
y = ax + b
where
- a = slope
- b = intercept
These are also called coefficients. The fit function of LinearRegression has already arrived at these numbers ( slope and intercept ) based on the data we gave it.
slope = model.coef_
intercept = model.intercept_
print ( "a (slope) = ",slope)
print ( "b (intercept) = ", intercept)
Output
a (slope) = [3.41382356]
b (intercept) = 19.99448575911478
What did we achieve ?
What we have essentially done is predicted a relationship between the number of claims and the total amount paid. For example, what is the total amount expected to be paid when the number of claims is 80 ?

Easy, right ?
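If you would rather compute that number than read it off the chart, it is just the straight line equation evaluated at x = 80, using the slope and intercept printed above:
# expected payment ( in thousands ) for 80 claims
expected_payment = slope * 80 + intercept
print ( expected_payment )   # roughly [293.1], i.e. about 293 thousand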
Prediction
We don’t have to draw lines like this every time to predict the value of Y for a value of X. You can use the predict ( ) function.
claim_values = model.predict (data[["claim_count"]])
or more specifically,
claim_values = model.predict (pd.DataFrame([10,20,30,40,60]))
claim_values
Output
array([ 54.13272136, 88.27095696, 122.40919256, 156.54742816,
224.82389936])
plt.scatter( data["claim_count"],data["claim_value"])
plt.plot([0,120], [point_1,point_2])
plt.scatter([10,20,30,40,60], claim_values , marker='o')
Output

What were the original values though ? You can pick them up from the CSV.

original_claim_values = [65.3,98.1,194.5,119.4,202.4]
Let's also plot these to compare how well we predicted.
plt.scatter( data["claim_count"],data["claim_value"])
plt.plot([0,120], [point_1,point_2])
plt.scatter([10,20,30,40,60], claim_values , marker='o')
plt.scatter([10,20,30,40,60], original_claim_values , marker='o')
Output

Some values are pretty close, some are a bit off – but nevertheless it's a good prediction for the amount of time we spent doing it.
Simple Linear Regression
What we have seen so far is an example of implementing Simple Linear Regression in Python. What we will see in this section is the math behind it – Give it a try – if it gets tedious, please do a couple of re-reads. It is critical that you understand this section.
How did Linear Regression fit the straight line
The fun starts now. How did the LinearRegression ( ) function fit the straight line ? How did it arrive at this equation
y = 3.4 x + 19.99
Obviously, there is no one straight line that will pass through all the points in this case.

If you take these 4 data points, we can eyeball a straight line that goes through the middle of it. The ones marked with question marks are visually not a good fit at all. But the question that linear model tries to solve is,
What is the “Optimum” straight line that best describes the relationship between X and Y
This is where statistics comes in.
Let’s zoom in
Let’s simplify and make up some numbers ( for easy calculation) of claims vs payments. Say we have a set of 5 data points for claims vs payments and we wish to fit a linear model that can predict further data. This is a very small data set to do anything practical, but there is a reason why we are doing such a small data set as we will see in the coming sections.

Let’s plot these on a graph.

If we were asked to eyeball a straight line that best fits these data points, this is how we would do it.

How did we do it ? Our eye is essentially trying to minimize the distances from each of these points to the straight line. The best fitting straight line is one which minimizes the distances for all these points.
Linear regression in machine learning does exactly that – Instead of a human eye, machine learning takes the help of statistics to do this approximation. There are a couple of methods to do this in statistics.
- Ordinary Least Squares
- Gradient Descent
Let’s explore the first method here.
Residuals
When we tried to minimize the distance of each of these points from the line we are trying to fit, the distances between the points and the straight line ( on the y axis ) are called residuals.

Sum of Squares
Warning – Geeky Statistics stuff
To arrive at the best fit values for the straight line, statisticians have arrived at the following formula based on the method of least squares. How they arrived at it is pretty geeky and let’s leave that to the statisticians.

This equation sounds scary, but it is not. I am going to prove it to you in a minute down below. There are 2 things in this equation that require an explanation.
- The symbol that looks like a sideways M is the Greek capital sigma ( Σ ). It is used for summation.
- y with a bar on it ( or x with a bar ) represents the average – so y-bar is the average of all the y values and x-bar the average of all the x values.
Let’s take the same numbers that we have above and try to calculate the formula by hand. Excel would make things easy, but let’s just do it manually, since the numbers are not all that bad.

That was huge – now we can easily calculate a and b in y = ax + b.
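If you would rather have NumPy do the arithmetic, here is the same least-squares calculation done in code, using the same 5 data points the table above works through:
import numpy as np
x = np.array([20, 40, 60, 80, 100])
y = np.array([40, 60, 80, 80, 90])
# slope a = sum( (x - x_bar) * (y - y_bar) ) / sum( (x - x_bar)^2 )
a = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
# intercept b = y_bar - a * x_bar
b = y.mean() - a * x.mean()
print ( a, b )   # 0.6 34.0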
Validation
Let’s cross validate this equation.
sample = pd.DataFrame( {'x' : [20,40,60,80,100],'y':[40,60,80,80,90]})
sample
Output
x y
0 20 40
1 40 60
2 60 80
3 80 80
4 100 90
Let’s model this data and plot it
model.fit(sample[["x"]],sample["y"])
Output
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)
plt.scatter( sample["x"],sample["y"])
Output

Now, let’s use our model to visualize the fit.
slope = model.coef_
intercept = model.intercept_
print ( "a (slope) = ",slope)
print ( "b (intercept) = ", intercept)
a (slope) = [0.6]
b (intercept) = 34.0
These are the slope and intercept values. We can now use these to plot the fit line.
point_1 = slope*0 + intercept
point_2 = slope*120 + intercept
print ( point_1, point_2)
plt.scatter( sample["x"],sample["y"])
plt.plot([0,120], [point_1,point_2])
[34.] [106.]

The model object has a whole host of other information that you can use to predict how good of a fit the line is to the data. But first, let’s predict the value in a table.
pred_y = model.predict (sample[["x"]])
pred_y
Output
array([46., 58., 70., 82., 94.])
The differences ( residuals ) are highlighted below.
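If you prefer to compute these rather than read them off the table, a quick check using the sample data and pred_y from the cells above:
residuals = sample["y"] - pred_y
print ( residuals.values )   # [-6.  2. 10. -2. -4.]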

There are so many other parameters ( like the p-value, r-squared, r-squared adjusted ) and graphs ( Residuals vs fitted , Q-Q Plot etc ) that are used to analyze the performance of a model fit. We are going to get to that in the next section. To set the stage for these parameters, let’s scale up our data set.
Multilinear Regression
So far we have seen one predictor and one response variable – this is called simple linear regression. Of course, the real world is not that simple, right ? There can be multiple predictors. We are going to look at one such case with a classic example – the Boston Housing dataset.
It has 13 predictors and 1 response variable. So, the equation for that mathematically would be

the value of n would be 13 in this case. Let’s look at the dataset below.
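Spelled out in the same style as the y = ax + b equation from before, multiple linear regression simply has one coefficient per predictor:
y = b0 + b1*x1 + b2*x2 + ... + bn*xn
where b0 is the intercept and b1 through bn are the coefficients – with n = 13 for this dataset.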
Boston Housing dataset
The price of a house depends on many parameters like the size, neighborhood, crime rate, pollution etc. There is no simple mathematical formula that can predict the price of a house from these parameters – that's exactly where ML comes into the picture.
In the earlier example, we just had to predict the value of Y given a value of X. There is just 1 predictor ( X ). However, let’s have a look at the Boston Housing data set – tons of variables. Load the data first.
There are a couple of ways to load the data.
- Download it from UCI ML repository
- Download it from our github site
- Load it straight from scikit-learn library
Download and load from file system
import pandas as pd
import numpy as np
boston_housing = pd.read_csv("../../data/boston_housing.csv")
boston_housing.head()
Output

The attribute names are a bit cryptic. So, here are the descriptions.
- CRIM per capita crime rate by town
- ZN proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS proportion of non-retail business acres per town
- CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX nitric oxides concentration (parts per 10 million)
- RM average number of rooms per dwelling
- AGE proportion of owner-occupied units built prior to 1940
- DIS weighted distances to five Boston employment centres
- RAD index of accessibility to radial highways
- TAX full-value property-tax rate per USD 10,000
- PTRATIO pupil-teacher ratio by town
- B 1000(Bk – 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT % lower status of the population
- MEDV Median value of owner-occupied homes in USD 1000’s
Load it directly from scikit learn module
from sklearn.datasets import load_boston
boston = load_boston()
type(boston)
Output
sklearn.utils.Bunch
bunch is a scikit learn object used for loading in-built data packages – like Boston Housing or Iris. To view the data, just do
np.set_printoptions(precision=1, suppress=True, linewidth=250)
# Show just the top 10 rows in the dataset
boston.data[1:10,:]
Output
array([[ 0. , 0. , 7.1, 0. , 0.5, 6.4, 78.9, 5. , 2. , 242. , 17.8, 396.9, 9.1],
[ 0. , 0. , 7.1, 0. , 0.5, 7.2, 61.1, 5. , 2. , 242. , 17.8, 392.8, 4. ],
[ 0. , 0. , 2.2, 0. , 0.5, 7. , 45.8, 6.1, 3. , 222. , 18.7, 394.6, 2.9],
[ 0.1, 0. , 2.2, 0. , 0.5, 7.1, 54.2, 6.1, 3. , 222. , 18.7, 396.9, 5.3],
[ 0. , 0. , 2.2, 0. , 0.5, 6.4, 58.7, 6.1, 3. , 222. , 18.7, 394.1, 5.2],
[ 0.1, 12.5, 7.9, 0. , 0.5, 6. , 66.6, 5.6, 5. , 311. , 15.2, 395.6, 12.4],
[ 0.1, 12.5, 7.9, 0. , 0.5, 6.2, 96.1, 6. , 5. , 311. , 15.2, 396.9, 19.1],
[ 0.2, 12.5, 7.9, 0. , 0.5, 5.6, 100. , 6.1, 5. , 311. , 15.2, 386.6, 29.9],
[ 0.2, 12.5, 7.9, 0. , 0.5, 6. , 85.9, 6.6, 5. , 311. , 15.2, 386.7, 17.1]])
Since all of the data is numeric, it is set up as a NumPy array. What else is available ? The Bunch object is more or less like a dictionary, so you can use the keys ( ) method to get all the keys.
Input
boston.keys()
Output
dict_keys(['data', 'target', 'feature_names', 'DESCR', 'filename'])
Now, let’s see the features
Input
boston.feature_names
Output
array(['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT'], dtype='<U7')
How large is the dataset ?
Input
boston.data.shape
Output
(506, 13)
so, it is 506 rows and 13 columns.
And here is the description of what these columns represent.
Input
print ( boston.DESCR )
# The reason why we are doing print is because the DESCR value is a formatted string ( with new line characters like \n etc)
# otherwise, you get unformatted text
.. _boston_dataset:

Boston house prices dataset
---------------------------

**Data Set Characteristics:**

:Number of Instances: 506
:Number of Attributes: 13 numeric/categorical predictive. Median Value (attribute 14) is usually the target.
:Attribute Information (in order):
- CRIM per capita crime rate by town
- ZN proportion of residential land zoned for lots over 25,000 sq.ft.
- INDUS proportion of non-retail business acres per town
- CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- NOX nitric oxides concentration (parts per 10 million)
- RM average number of rooms per dwelling
- AGE proportion of owner-occupied units built prior to 1940
- DIS weighted distances to five Boston employment centres
- RAD index of accessibility to radial highways
- TAX full-value property-tax rate per $10,000
- PTRATIO pupil-teacher ratio by town
- B 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT % lower status of the population
- MEDV Median value of owner-occupied homes in $1000's
If you observe closely, the actual price column ( MEDV ) that you see in the loaded file ( from pandas.read_csv() ) previously is missing in the scikit learn data. That is because it is available as a separate key in the scikit learn Bunch called target. Let’s see that as well.
Looking back at our equation,


Input
# Just show the top 10 target values
boston.target[1:10]
Output
array([21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, 18.9])
Let’s create a Pandas dataframe from this data as well for further analysis.
Input
boston_housing = pd.DataFrame(boston.data)
# set the columns
boston_housing.columns = boston.feature_names
# add the price
boston_housing["MEDV"] = boston.target
# now, look at the head
boston_housing.head()
Output

Alright, now that we know how to get the dataframe ( via file upload or scikit learn library itself ), let’s quickly do a linear regression on this data.
Response Variable vs Predictor
What is the problem we are trying to solve ? We are trying to predict the median house price based on a variety of parameters – like crime rate, nitric oxide content in the air, age of the house etc. Each of these parameters has a "say" in the price of the house. The target variable is called the Response Variable and the parameters that have a "say" in determining it are called the Predictors.
In our case, there are 13 predictors ( CRIM, ZN, INDUS…LSTAT ) and 1 response variable ( MEDV ). Just for simplicity's sake, let's pick one predictor – RM – the number of rooms in the house – and see how well we can predict the price of the house based on just this one parameter.
from sklearn.linear_model import LinearRegression
model = LinearRegression()
model.fit( boston_housing[["RM"]], boston_housing["MEDV"])
Output
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
normalize=False)
Let's plot our prediction
import matplotlib.pyplot as plt
%matplotlib inline
plt.scatter(boston_housing["RM"], boston_housing["MEDV"])
Output

Now, let's draw the straight line that our model has predicted. The number of rooms in our data tops out around 9, so let's determine the co-ordinates to draw our fit line between RM = 0 and RM = 9.
slope = model.coef_
intercept = model.intercept_
point_1 = slope*0 + intercept
point_2 = slope*9 + intercept
plt.scatter(boston_housing["RM"], boston_housing["MEDV"])
plt.plot([0,9], [point_1,point_2],color="r")
Output

Looks like a decent enough fit. Let’s do another – lstat – lower status population percentage.
Step 1 – Model the data
model.fit( boston_housing[["LSTAT"]], boston_housing["MEDV"])
Output
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
normalize=False)
Step 2 – Determine the slope and intercept from the model
slope = model.coef_
intercept = model.intercept_
point_1 = slope*0 + intercept
point_2 = slope*9 + intercept
Step 3 – Plot the data and fit line
plt.scatter(boston_housing["LSTAT"], boston_housing["MEDV"])
plt.plot([0,9], [point_1,point_2],color="r")
Output

The line seems off, right ? Why is that ? Because we determined the end points based on the older predictor's range ( number of rooms, 0 to 9 ). The range of LSTAT, on the other hand, starts at 0 and ends around 40. So, let's re-determine the line.
slope = model.coef_
intercept = model.intercept_
point_1 = slope*0 + intercept
point_2 = slope*40 + intercept
plt.scatter(boston_housing["LSTAT"], boston_housing["MEDV"])
plt.plot([0,40], [point_1,point_2],color="r")
Output

This seems accurate enough. However, if you look at both these graphs, the relationship with the response variable looks quite different for LSTAT than it does for RM. Let's put the graphs side by side to understand this better.


The first picture is lstat vs medv and the second is rm vs medv.

Also, not all variables might be relevant ( irrespective of the direction, decrease or increase ). Let’s take the parameter dis – distance to employment. Once again, if we try to fit this using our steps
# step 1
model.fit( boston_housing[["DIS"]], boston_housing["MEDV"])
# step 2
slope = model.coef_
intercept = model.intercept_
point_1 = slope*0 + intercept
point_2 = slope*12 + intercept
# step 3
plt.scatter(boston_housing["DIS"], boston_housing["MEDV"])
plt.plot([0,12], [point_1,point_2],color="r")
Output

Visually, there is not much of a linear relationship between distance and median value ( although we tried to fit it ). How exactly do we measure the relationship ?
Correlation
This is the simplest measure of relation between 2 numeric variables. Luckily, pandas provides a built-in method for calculating correlation – corr ( ). For example,
boston_housing["MEDV"].corr(boston_housing["DIS"])
Output
0.24992873408590394
boston_housing["MEDV"].corr(boston_housing["LSTAT"])
Output
-0.7376627261740147
This seems in-line with our plots above right ?
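Rather than checking predictors one at a time, you can also ask pandas for the correlation of every column against MEDV in one shot ( using the boston_housing dataframe built earlier ):
# correlation of each column with MEDV, sorted from strongest negative to strongest positive
boston_housing.corr()["MEDV"].sort_values()
You should see LSTAT near -0.74 and DIS near 0.25, matching the individual numbers above.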
Correlation is NOT causation
However strong the correlation is, it does NOT imply causation. This is a bit tricky to understand. For example, look at the picture below.
import matplotlib.pyplot as plt
%matplotlib inline
plt.scatter(boston_housing["RM"],boston_housing["MEDV"])

There seems to be a significant correlation between the number of rooms ( rm ) and the median price value (medv).

Now, imagine another column – Power Consumption . As the number of rooms increase, the power consumption also tends to increase.

Does it mean that the Power Consumption is what drives the house price up ? Very unlikely, isn't it ? Mathematically, correlation is just a number that signifies the strength of association between 2 variables. However, it is up to us ( or the domain expert ) to determine whether that correlation reflects a real relationship or a fictitious one.
Question : True or false – the higher the correlation, the stronger the relationship between the variables. Have a look at the data below before answering.
x = [10,20,30,40,50,60,70,80,90,100]
y = [100,90,80,70,60,50,40,30,20,10]
import matplotlib.pyplot as plt
%matplotlib inline
plt.scatter(x,y)
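While we are at it, we can also print the correlation for this data – it comes out to exactly -1, a perfectly strong relationship in the negative direction. That answers the question: it is the magnitude of the correlation ( how close it is to +1 or -1 ), not its sign, that measures strength.
import numpy as np
print ( np.corrcoef(x, y)[0, 1] )   # -1.0, since y here is exactly 110 - x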

p value
Also called probability value, p-value answers the following question –
Is a predictor relevant in predicting the response – and if yes, how relevant is it ?
We have to understand p-value in the context of Null Hypothesis.
Null Hypothesis
Null hypothesis (denoted as H0) assumes that there is no relationship between the predictor and the response variable. For example, if you look at the relationship between the number of rooms ( rm ) and the median price value of the home ( medv ) in the Boston Housing dataset, the Null hypothesis says that there is NO relationship between them.

Well, although we can see a linear relationship visually ( almost ) between those 2 variables, we start off with the Null Hypothesis. It is indicated in statistics as H0
Alternate hypothesis indicates that they are related. It is indicated as H1 . P-value indicates how much the observed data is inconsistent with the Null Hypothesis. This is a bit cryptic, right ? Let’s explore this further.
Let’s just consider 2 random variables that are normally distributed. Since they are random variables, there would be no relationship between them, right ? Let’s check it out.
Dataset 1
A normal distribution of 20 values with a mean of 100 and a standard deviation of 20
import numpy as np
x = np.random.normal(100, 20, 20)
x
Output
array([105.9, 97.7, 97.3, 135.9, 76.2, 86.7, 109.2, 40.8, 117.6, 127. , 82. , 127.8, 107.3, 83. , 98.8, 102.4, 116.8, 97.3, 95.2, 73.6])
Dataset 2
Another normal distribution of 20 values with a mean of 100 and a standard deviation of 20
y = np.random.normal(100, 20, 20)
y
Output
array([ 65.2, 137. , 111.1, 93.4, 69.9, 121.5, 126.7, 97.1, 108.9, 83.8, 116.1, 118.3, 92.3, 118.1, 67.3, 89.3, 108.4, 142.4, 136. , 87.7])
Let’s plot x vs y and see how it looks.
import matplotlib.pyplot as plt
%matplotlib inline
plt.scatter(x,y)
Output

This almost looks like the night sky, doesn't it ? Point being, there is no relationship between x and y as they are random variables – that is pretty understandable, right ? Let's try to calculate a relationship between these two variables ( although there is none ) and see how it compares against a situation where there IS actually a relationship.
from sklearn.linear_model import LinearRegression
import pandas as pd
# Step 1 - Model
model = LinearRegression()
model.fit( pd.DataFrame(x), y)
Output
LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
normalize=False)
# step 2
slope = model.coef_
intercept = model.intercept_
point_1 = slope*0 + intercept
point_2 = slope*120 + intercept
# step 3
plt.scatter(x, y)
plt.plot([0,120], [point_1,point_2],color="r")
Output

What is the p-value in this case ?
Unfortunately, scikit-learn's standard LinearRegression() object does not provide p-values. There are other libraries that can help us determine the p-value – for example, let's use the scipy library.
from scipy import stats
import numpy as np
slope, intercept, r_value, p_value, std_err = stats.linregress(x,y)
p_value
Output
0.9469781750196463
And the p-value in this case is about 0.95 ( since x and y are freshly generated random numbers, the exact value changes from run to run ).
A p-value is always between 0 and 1, and 0.95 is a big number, right ? Does that indicate a strong relationship between x and y ? On the contrary – a high p-value says there is NO evidence of a relationship between x and y, which is exactly what the Null Hypothesis states.
On the other hand, if we calculate the p-value of the relationship between the number of rooms (“RM”) and median price (“MEDV”), then you get a very small value.
from scipy import stats
import numpy as np
slope, intercept, r_value, p_value, std_err = stats.linregress(boston.data[:,5],boston.target)
p_value
Output
2.4872288710080936e-74
p-value in this case is 2.487 x 10^-74. That is an extremely low value ( 0.0000...00248 ).
What it means is that there is only a 0.0000...00248 % chance that a correlation this strong would show up if there were really no relationship.
Or in other words, the data almost certainly does represent a real relationship between the two variables. Whereas in the case of the random variables above, the high p-value ( ~0.95 ) says the apparent correlation is perfectly consistent with pure chance.
Optimum p-value
A threshold below which we can safely conclude that the relationship between the 2 variables did not happen by chance – but reflects a true underlying relationship – is called the optimum p-value.
Is there a fixed p-value below which, we can safely conclude a strong relationship (Alternate Hypothesis) ?
Typically p <= 0.05 is accepted in most cases ( a quick sketch of that decision in code follows this list ). However, the actual threshold the business is comfortable with depends on a variety of factors like
- type of data
- level of confidence the business requires etc
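In code, the decision is usually just a comparison of the p-value against the chosen threshold – a minimal sketch, using 0.05 and the p_value computed for RM vs MEDV above ( with that p-value, the first message is printed ):
alpha = 0.05   # the chosen significance threshold
if p_value <= alpha:
    print("Reject the null hypothesis - the predictor appears to be relevant")
else:
    print("Fail to reject the null hypothesis - no evidence of a relationship")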
Key Model Evaluation Parameters
r-squared – r2
R-squared is a measure of the explanatory power of the model – how well the predictors are able to explain the response variable. A model's r2 varies between 0 and 1. Unlike p-value, scikit-learn provides r2 values directly ( without having to rely on scipy ).
Method 1 – Using scipy
from scipy import stats
import numpy as np
slope, intercept, r_value, p_value, std_err = stats.linregress(boston.data[:,5],boston.target)
print ( "r-squared = ",r_value**2)
Output
r-squared = 0.48352545599133373
Method 2 – Using scikit learn
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.datasets import load_boston
boston = load_boston()
print ( boston.feature_names)
X = boston.data
y = boston.target.reshape(-1,1)
# Fit model based on RM - Number of rooms
model = LinearRegression().fit(X[:,[5]],y)
r2_score = model.score(X[:,[5]],y)
print ( "r-squared = ",r2_score)
Output
['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO' 'B' 'LSTAT']
r-squared = 0.4835254559913343
In the example above, we modeled the response variable medv ( median house value ) as a function of the number of rooms ( rm ) – the r2 value is 0.4835. It is a measure of how well the model explains the relationship. A low value of r2 ( close to 0 ) means that the explanatory variables are not able to predict the response variable well.
import numpy as np
x = np.random.normal(100, 20, 20)
y = np.random.normal(100, 20, 20)
slope, intercept, r_value, p_value, std_err = stats.linregress(x,y)
print ( "r-squared = ",r_value**2)
Output
r-squared = 0.3178323469801873
For example, if you look at the random data example, the r2 value is small and, more importantly, meaningless – since the data is random, the exact number changes with every draw and reflects nothing but chance. What this means is that the explanatory variable (x) is not able to explain the variance in the response variable (y).
A high value of r2 ( close to 1 ) means that the explanatory variables are able to fully predict the response. In our case the number of rooms ( rm ) is able to explain around 48 % of the variance in the median house value – the remaining 52 % of the variance is left unexplained by this explanatory variable.
How is r2 calculated
The formula to calculate r2 is

The denominator – the total Sum of Squares – is the worst possible score ( the error you would get by always predicting the plain average ). The numerator – the Sum of Squares of the residuals – is how far off the model actually is. So r2 is essentially a normalized score of how well the model predicts the data in comparison to that worst possible baseline.
What do we mean by worst possible score ? Well, if we know the weights of each of the individuals in the class, without any indicator of whose that weight is ( just a list of weights ), what is our best possible prediction of weight for anybody in the class ? We just do an arithmetic average – right ? And that’s what we use to calculate the Sum of Squares total .
Let’s calculate r2 by hand for our simple dataset.

Let’s verify this in Python.
x = [20,40,60,80,100]
y = [40,60,80,80,90 ]
slope, intercept, r_value, p_value, std_err = stats.linregress(x,y)
print ( "r-squared = ",r_value**2)
r-squared = 0.8999999999999999
There you go – our manual calculation verifies with Python’s calculation.
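If you also want to see the intermediate pieces ( the residual and total sums of squares ) in code, here is the same hand calculation in NumPy, using the fitted values 46, 58, 70, 82, 94 computed for this data earlier:
import numpy as np
y_actual = np.array([40, 60, 80, 80, 90])
y_fit = np.array([46, 58, 70, 82, 94])
ss_res = np.sum((y_actual - y_fit) ** 2)             # sum of squared residuals = 160
ss_tot = np.sum((y_actual - y_actual.mean()) ** 2)   # total sum of squares around the mean = 1600
print ( 1 - ss_res / ss_tot )                        # 0.9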
Exercise – Calculate the r2 value of the Boston Housing dataset for the predictor NOX ( the level of nitric oxide in the air ).
r-squared adjusted
Mathematically, r2 has a peculiar property – adding more predictors never decreases its value. This is not very intuitive to begin with. Let's try it on our Boston Housing dataset.
In the example above, we modeled medv from rm, so the only explanatory variable is rm ( number of rooms ). Based on that relationship we got an r2 value of 0.4835. What would happen if we add another explanatory variable – say lstat ( percentage of lower status of the population ) ? scipy's stats.linregress only handles a single predictor, but scikit-learn's LinearRegression works with any number of predictors, as shown below. ( Later on, to get per-predictor statistics like p-values, we will also use a library called statsmodels – just install it using pip or conda. )
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.datasets import load_boston
boston = load_boston()
print ( boston.feature_names)
X = boston.data
y = boston.target.reshape(-1,1)
# Fit model based on RM - Number of rooms
model = LinearRegression().fit(X[:,[5]],y)
r2_score = model.score(X[:,[5]],y)
print ( "r-squared = ",r2_score)
Output
['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO' 'B' 'LSTAT']
r-squared = 0.4835254559913343
# Fit model based on
# 1. RM - Number of rooms
# 2. LSTAT - Percentage of people with lower status
model = LinearRegression().fit(X[:,[5,12]],y)
r2_score = model.score(X[:,[5,12]],y)
print ( "r-squared = ",r2_score)
Output
r-squared = 0.6385616062603403
See, the r2 has increased from 0.48 to 0.63 by including a second predictor. How about a third predictor ?
# Fit model based on
# 1. RM - Number of rooms
# 2. LSTAT - Percentage of people with lower status
# 3. NOX - Nitric oxide in the air
model = LinearRegression().fit(X[:,[4,5,12]],y)
r2_score = model.score(X[:,[4,5,12]],y)
print ( "r-squared = ",r2_score)
Output
r-squared = 0.6389103767491082
There is a slight increase in r2 – from 0.6385 to 0.6389.
r2 value – Predictors
0.483525 – number of rooms
0.638561 – number of rooms + lower status population
0.638910 – number of rooms + lower status population + Nitric Oxide in the air
You can keep adding as many predictors as you want and you will observe that the r2 value creeps up with every predictor. That doesn't seem right, does it ? Let's try something – to the Boston Housing dataset, let's add a column of pure random numbers and see if r2 still increases. If it does, then we have a problem.
from sklearn.datasets import load_boston
import numpy as np
boston = load_boston()
# Generate a column of 506 random numbers
x = np.random.normal(100, 20, 506)
x = x.reshape(-1,1)
# and add it to the boston dataset
boston.data = np.append(boston.data,x,axis=1)
# what is the new shape ? It should be 14 columns ( as opposed to the original 13 )
print ( boston.data.shape)
Output
(506, 14)
Now, let’s try the regression with the predictors RM, LSTAT, NOX and the new random variable.
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
X = boston.data
y = boston.target.reshape(-1,1)
model = LinearRegression().fit(X[:,[4,5,12,13]],y)
r2_score = model.score(X[:,[4,5,12,13]],y)
print ( "r-squared = ",r2_score)
Output
r-squared = 0.6395935817018061
r2 value – Predictors
0.483525 – number of rooms
0.638561 – number of rooms + lower status population
0.638910 – number of rooms + lower status population + Nitric Oxide in the air
0.639594 – number of rooms + lower status population + Nitric Oxide in the air + a random variable ( the exact value varies from run to run since the column is random )
This is crazy, right ? Just add any random variable ( which is not supposed to have any predictive power ) and r2 still increases ? You are probably beginning to doubt the usefulness of r2 in the first place. Well, it's not all bad with r2 – it's just that any column, even a random one, soaks up a little bit of the variance by chance. To counter this, there is a modified metric called r2 adjusted.

where
- n = sample size
- p = number of predictors
Essentially, the new metric r2 adjusted works by penalizing additional predictors – that is why p, the number of predictors, appears in the ( n – p – 1 ) term of the denominator.
Unfortunately, scikit learn doesn’t have a dedicated calculation for r2 adjusted. We would have to calculate that on our own. Let’s write a quick function for the same.
def r2_adjusted ( r2, n, p ) :
    # r2 : the plain r-squared, n : number of rows, p : number of predictors
    r2_adjusted = 1 - ( (1-r2) * (n-1) / (n-p-1))
    return r2_adjusted
Now, let's rerun those 4 scenarios again. ( The helper is re-defined before each run because the code below re-uses the names r2_adjusted and r2_score for the computed values, which overwrites the functions. )
def r2_adjusted ( r2, n, p ) :
    r2_adjusted = 1 - ( (1-r2) * (n-1) / (n-p-1))
    return r2_adjusted
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.datasets import load_boston
boston = load_boston()
print ( boston.feature_names)
X = boston.data
y = boston.target.reshape(-1,1)
# Fit model based on RM - Number of rooms
model = LinearRegression().fit(X[:,[5]],y)
r2_score = model.score(X[:,[5]],y)
r2_adjusted = r2_adjusted(r2_score,X.shape[0],1)
print ( "r-squared = ",r2_score)
print ( "r-squared adjusted = ", r2_adjusted)
Output
['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO' 'B' 'LSTAT']
r-squared = 0.4835254559913343
r-squared adjusted =  0.48250070491195196
def r2_adjusted ( r2, n, p ) :
    r2_adjusted = 1 - ( (1-r2) * (n-1) / (n-p-1))
    return r2_adjusted
# Fit model based on
# 1. RM - Number of rooms
# 2. LSTAT - Percentage of people with lower status
model = LinearRegression().fit(X[:,[5,12]],y)
r2_score = model.score(X[:,[5,12]],y)
r2_adjusted = r2_adjusted(r2_score,X.shape[0],2)
print ( "r-squared = ",r2_score)
print ( "r-squared adjusted = ", r2_adjusted)
Output
r-squared = 0.6385616062603403
r-squared adjusted =  0.6371244754701231
def r2_adjusted ( r2, n, p ) :
    r2_adjusted = 1 - ( (1-r2) * (n-1) / (n-p-1))
    return r2_adjusted
# Fit model based on
# 1. RM - Number of rooms
# 2. LSTAT - Percentage of people with lower status
# 3. NOX - Nitric oxide in the air
model = LinearRegression().fit(X[:,[4,5,12]],y)
r2_score = model.score(X[:,[4,5,12]],y)
r2_adjusted = r2_adjusted(r2_score,X.shape[0],3)
print ( "r-squared = ",r2_score)
print ( "r-squared adjusted = ", r2_adjusted)
Output
r-squared = 0.6389103767491082
r-squared adjusted =  0.6367524706
Now, add the random variable.
from sklearn.datasets import load_boston
import numpy as np
boston = load_boston()
# Generate a column of 506 random numbers
x = np.random.normal(100, 20, 506)
x = x.reshape(-1,1)
# and add it to the boston dataset
boston.data = np.append(boston.data,x,axis=1)
# what is the new shape ? It should be 14 columns ( as opposed to the original 13 )
print ( boston.data.shape)
X = boston.data
Output
(506, 14)
def r2_adjusted ( r2, n, p ) :
    r2_adjusted = 1 - ( (1-r2) * (n-1) / (n-p-1))
    return r2_adjusted
# Fit model based on
# 1. RM - Number of rooms
# 2. LSTAT - Percentage of people with lower status
# 3. NOX - Nitric oxide in the air
# 4. Random variable
model = LinearRegression().fit(X[:,[4,5,12,13]],y)
r2_score = model.score(X[:,[4,5,12,13]],y)
r2_adjusted = r2_adjusted(r2_score,X.shape[0],4)
print ( "r-squared = ",r2_score)
print ( "r-squared adjusted = ", r2_adjusted)
Output
r-squared = 0.639463625000628
r-squared adjusted =  0.6365850911
r2 value – r2-adjusted – Predictors
0.483525 – 0.482500 – number of rooms
0.638561 – 0.637124 – number of rooms + lower status population
0.638910 – 0.636752 – number of rooms + lower status population + Nitric Oxide in the air
0.639464 – 0.636585 – number of rooms + lower status population + Nitric Oxide in the air + a random variable
What this goes to show is that adding more predictors does not necessarily increase the explanatory power of the model. r2 adjusted accounts for this by incorporating a penalty for the number of predictors ( the more predictors, the bigger the penalty ).
When you add a predictor, it should add significant value. If it doesn't ( for example, adding NOX – Nitric Oxide – as a predictor ), r2 adjusted tells you that it is a worthless predictor and you are better off getting rid of it.
RMSE – Root Mean Square error
While r2 is a relative measure of fit, RMSE is an absolute measure of fit – lower values indicate a better fit, higher values a worse one. Calculating RMSE is quite simple and very similar to calculating the standard deviation ( the square root of variance ). While the standard deviation measures how far the data points are from the mean, RMSE measures how far the predictions are from the actuals.

So, to calculate the RMSE, you just take the residuals from the model, square them, average them and take the square root – quite similar to what the standard deviation does with the differences from the mean.
# from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_squared_error, r2_score
from math import sqrt
from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_boston
boston = load_boston()
print ( boston.feature_names)
X = boston.data
y = boston.target.reshape(-1,1)
# Fit model based on RM - Number of rooms
model = LinearRegression().fit(X[:,[5]],y)
r2_score = model.score(X[:,[5]],y)
print ( "r-squared = ",r2_score)
y_predict = model.predict(X[:,[5]])
mse = mean_squared_error(y,y_predict)
rmse = sqrt(mse)
print ( "RMSE = ",rmse)
# rms = sqrt(mean_squared_error(y, y_predict))
Output
['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO' 'B' 'LSTAT']
r-squared = 0.4835254559913343
RMSE = 6.603071389222561
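As a quick sanity check, RMSE is literally the square root of the mean of the squared residuals, so you can also compute it directly from y and y_predict without the sklearn helper:
import numpy as np
rmse_check = np.sqrt(np.mean((y - y_predict) ** 2))
print ( "RMSE (by hand) = ", rmse_check )   # matches the 6.603... above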
Now, you might have a question here – when you already have RMSE to measure the fit, why do you need another metric like r2 ? Well, you can't compare the growth rates of an elephant and a horse in absolute terms. RMSE is in the units of the actual response variable, while r2 is on a uniform scale ( 0 to 1 ).
Feature Selection
In the Boston Housing dataset example, there are 13 predictors and 1 response variable.
from sklearn.datasets import load_boston
boston = load_boston()
print ( boston.feature_names )
['CRIM' 'ZN' 'INDUS' 'CHAS' 'NOX' 'RM' 'AGE' 'DIS' 'RAD' 'TAX' 'PTRATIO' 'B' 'LSTAT']
And the response variable is MEDV – Median house value.
Previously, we have seen some examples of predicting the house value based on a random set of predictors –
- RM – Number of rooms
- LSTAT – Percentage of people with lower status
- NOX – Nitric oxide content etc.
However, we know that not all these variables have an equal say. So, how do we know which of the predictors have the most impact in predicting the house price ?
The process of identifying the features ( predictors ) that have the most predictive power is called Feature Selection
One systematic way of doing this is called Stepwise Regression. Let's explore it in the next section.
Stepwise Regression
In stepwise regression, we pick a criterion to evaluate the fit. Based on how well the model fits, we keep adding or removing predictors ( features ) until we reach a point where we can no longer improve the model on the selected criterion.
What kind of parameters can be used to evaluate the model ? There are many choices like
- p-value
- r-squared
- r-squared (adjusted )
- F-tests or T-tests etc
For now, since we are already aware of p-value let’s choose it as our selection criteria parameter.
A better library to show p-values for individual features(predictors) is statsmodels . Let’s install it.
pip install statsmodels
import statsmodels.api as sm
import numpy as np
from sklearn.datasets import load_boston
boston = load_boston()
X = boston.data
y = boston.target
# precision = 3 - only show until 3 decimals
# suppress = True - suppresses scientific notation ( e-00x )
# linewidth - expands the default width ( 75 ) of printing each line
np.set_printoptions(precision=3, suppress=True, linewidth=150)
print(boston.data[0:5,:])
Output
[[ 0.006 18. 2.31 0. 0.538 6.575 65.2 4.09 1. 296. 15.3 396.9 4.98 ]
[ 0.027 0. 7.07 0. 0.469 6.421 78.9 4.967 2. 242. 17.8 396.9 9.14 ]
[ 0.027 0. 7.07 0. 0.469 7.185 61.1 4.967 2. 242. 17.8 392.83 4.03 ]
[ 0.032 0. 2.18 0. 0.458 6.998 45.8 6.062 3. 222. 18.7 394.63 2.94 ]
[ 0.069 0. 2.18 0. 0.458 7.147 54.2 6.062 3. 222. 18.7 396.9 5.33 ]]
model = sm.OLS(y,X).fit()
model.summary()
Output
OLS Regression Results
Dep. Variable: y R-squared: 0.959
Model: OLS Adj. R-squared: 0.958
Method: Least Squares F-statistic: 891.3
Date: Wed, 01 May 2019 Prob (F-statistic): 0.00
Time: 19:32:44 Log-Likelihood: -1523.8
No. Observations: 506 AIC: 3074.
Df Residuals: 493 BIC: 3128.
Df Model: 13
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
x1 -0.0929 0.034 -2.699 0.007 -0.161 -0.025
x2 0.0487 0.014 3.382 0.001 0.020 0.077
x3 -0.0041 0.064 -0.063 0.950 -0.131 0.123
x4 2.8540 0.904 3.157 0.002 1.078 4.630
x5 -2.8684 3.359 -0.854 0.394 -9.468 3.731
x6 5.9281 0.309 19.178 0.000 5.321 6.535
x7 -0.0073 0.014 -0.526 0.599 -0.034 0.020
x8 -0.9685 0.196 -4.951 0.000 -1.353 -0.584
x9 0.1712 0.067 2.564 0.011 0.040 0.302
x10 -0.0094 0.004 -2.395 0.017 -0.017 -0.002
x11 -0.3922 0.110 -3.570 0.000 -0.608 -0.176
x12 0.0149 0.003 5.528 0.000 0.010 0.020
x13 -0.4163 0.051 -8.197 0.000 -0.516 -0.317
Omnibus: 204.082 Durbin-Watson: 0.999
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1374.225
Skew: 1.609 Prob(JB): 3.90e-299
Kurtosis: 10.404 Cond. No. 8.50e+03
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 8.5e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
Since we have chosen p-value as our key assessment criterion, we are interested in this part of the table.

Irrespective of the selection criteria, there are 2 basic methods in stepwise regression.
- Backward Elimination
- Forward Selection
Backward Elimination
As we can see from the table above, not all predictors have the same p-value. So, in backward elimination, we start by eliminating the predictor with the worst value of the chosen criterion ( the highest p-value in this case ) and re-evaluate the model.

# Eliminate INDUS predictor.
X = boston.data[:,[0,1,3,4,5,6,7,8,9,10,11,12]]
model = sm.OLS(y,X).fit()
model.summary()
OLS Regression Results
Dep. Variable: y R-squared: 0.959
Model: OLS Adj. R-squared: 0.958
Method: Least Squares F-statistic: 967.5
Date: Wed, 01 May 2019 Prob (F-statistic): 0.00
Time: 19:32:51 Log-Likelihood: -1523.8
No. Observations: 506 AIC: 3072.
Df Residuals: 494 BIC: 3122.
Df Model: 12
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
x1 -0.0928 0.034 -2.701 0.007 -0.160 -0.025
x2 0.0488 0.014 3.412 0.001 0.021 0.077
x3 2.8482 0.898 3.171 0.002 1.083 4.613
x4 -2.9275 3.222 -0.909 0.364 -9.258 3.403
x5 5.9318 0.303 19.555 0.000 5.336 6.528
x6 -0.0073 0.014 -0.527 0.598 -0.034 0.020
x7 -0.9655 0.189 -5.099 0.000 -1.337 -0.593
x8 0.1723 0.064 2.687 0.007 0.046 0.298
x9 -0.0095 0.004 -2.693 0.007 -0.016 -0.003
x10 -0.3930 0.109 -3.607 0.000 -0.607 -0.179
x11 0.0149 0.003 5.544 0.000 0.010 0.020
x12 -0.4165 0.051 -8.225 0.000 -0.516 -0.317
Omnibus: 204.123 Durbin-Watson: 0.999
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1374.966
Skew: 1.609 Prob(JB): 2.69e-299
Kurtosis: 10.406 Cond. No. 8.16e+03
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 8.16e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
After eliminating features in this step-wise fashion, we are left with the following predictors, all with low enough p-values.
# Include only CHAS,RM,DIS,PRATIO,B,LSTAT
X = boston.data[:,[3,5,7,10,11,12]]
model = sm.OLS(y,X).fit()
model.summary()
Output
OLS Regression Results
Dep. Variable: y R-squared: 0.957
Model: OLS Adj. R-squared: 0.956
Method: Least Squares F-statistic: 1853.
Date: Wed, 01 May 2019 Prob (F-statistic): 0.00
Time: 19:32:55 Log-Likelihood: -1537.2
No. Observations: 506 AIC: 3086.
Df Residuals: 500 BIC: 3112.
Df Model: 6
Covariance Type: nonrobust
coef std err t P>|t| [0.025 0.975]
x1 2.9135 0.907 3.211 0.001 1.131 4.696
x2 5.7809 0.237 24.361 0.000 5.315 6.247
x3 -0.4197 0.123 -3.416 0.001 -0.661 -0.178
x4 -0.6254 0.091 -6.855 0.000 -0.805 -0.446
x5 0.0153 0.003 6.023 0.000 0.010 0.020
x6 -0.5028 0.041 -12.398 0.000 -0.583 -0.423
Omnibus: 185.681 Durbin-Watson: 1.014
Prob(Omnibus): 0.000 Jarque-Bera (JB): 1105.939
Skew: 1.479 Prob(JB): 7.05e-241
Kurtosis: 9.611 Cond. No. 1.48e+03
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.48e+03. This might indicate that there are
strong multicollinearity or other numerical problems.

After having eliminated parameters with high p-value, we have narrowed it down to the following parameters.
- CHAS – Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- RM – average number of rooms per dwelling
- DIS – weighted distances to five Boston employment centres
- PTRATIO – pupil-teacher ratio by town
- B – 1000(Bk – 0.63)^2 where Bk is the proportion of blacks by town
- LSTAT – lower status of the population
Now, let’s try and fit the linear regression with just these parameters.
# Fit model based on
# 1. RM - Number of rooms
# 2. LSTAT - Percentage of people with lower status
# 3. DIS - weighted distances to five Boston employment centres
# 4. PRATIO - pupil-teacher ratio by town
# 5. B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
# 6. CHAS - Charles river dummy variable
from sklearn.metrics import mean_squared_error, r2_score
from math import sqrt
from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_boston
boston = load_boston()
X = boston.data
X_reduced = X[:,[3,5,7,10,11,12]]
X_reduced = sm.add_constant(X_reduced)
model = LinearRegression().fit(X_reduced,y)
r2_score = model.score(X_reduced,y)
print ( "r-squared = ",r2_score)
Output
r-squared = 0.7074867589886451
In case you are wondering why the r2 value differs between statsmodels' OLS method ( 0.957 ) and sklearn's LinearRegression method ( 0.707 ), it is because statsmodels by default does not include an intercept when fitting the linear regression equation. You can see that in the help section of the OLS documentation.


You can easily fix this by including a constant in the data itself. Statsmodels actually provides a method for that add_constant.
# Include only CHAS,RM,DIS,PRATIO,B,LSTAT
X = boston.data
X_reduced = X[:,[3,5,7,10,11,12]]
X_reduced = sm.add_constant(X_reduced)
model = sm.OLS(y,X_reduced).fit()
model.summary()
Output
OLS Regression Results
Dep. Variable: y R-squared: 0.707
Model: OLS Adj. R-squared: 0.704
Method: Least Squares F-statistic: 201.2
Date: Wed, 01 May 2019 Prob (F-statistic): 1.01e-129
Time: 19:33:07 Log-Likelihood: -1529.2
No. Observations: 506 AIC: 3072.
Df Residuals: 499 BIC: 3102.
Df Model: 6
Covariance Type: nonrobust
Coef std err t P>|t| [0.025 0.975]
const 17.0082 4.255 3.997 0.000 8.648 25.368
x1 2.7209 0.895 3.039 0.003 0.962 4.480
x2 4.4045 0.416 10.582 0.000 3.587 5.222
x3 -0.5574 0.126 -4.428 0.000 -0.805 -0.310
x4 -0.9049 0.114 -7.946 0.000 -1.129 -0.681
x5 0.0115 0.003 4.316 0.000 0.006 0.017
x6 -0.6039 0.047 -12.769 0.000 -0.697 -0.511
Omnibus: 171.907 Durbin-Watson: 1.039
Prob(Omnibus): 0.000 Jarque-Bera (JB): 785.723
Skew: 1.447 Prob(JB): 2.41e-171
Kurtosis: 8.375 Cond. No. 7.08e+03
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 7.08e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
Now you see that the r2 value matches sklearn's LinearRegression method.
Stepwise regression with backward elimination ( based on p-value ) is just one of the methods. There are many other feature selection methods that we will be learning as we progress through the course.
Accuracy of the model
We have seen enough statistics for now. But we haven't actually predicted anything yet, right ? Using the predictors that we have narrowed down to, let's start predicting median house prices and see how accurate our model is.
In order to predict data, we first need to train the model ( which we have done already ). However, instead of training the model on the entire dataset, we split the data into training and test data sets. This process is also called Train/Test split. And as you might have guessed already, sklearn already has methods for doing this.
Training and Test datasets
How many rows of data do we have in Boston Housing dataset ?
from sklearn.datasets import load_boston
boston = load_boston()
boston.data.shape
Output
(506, 13)
506 rows. That’s all the data we have. Now, let’s split this data into training and test datasets. What is the ratio of the split ?

Typically an 80-20 % split should do. You can also do a 75-25 % split. The exact percentage is based mostly on how much data you have and a bit of trial and error.

Essentially, this is what we are trying to do.
sklearn’s train_test_split will ensure that the split is made according to the percentage specified and that the rows are random.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(boston.data, boston.target, test_size=0.2)
print ( "X_train size", X_train.shape)
print ( "X_test size", X_test.shape)
print ( "y_train size", y_train.shape)
print ( "y_test size", y_test.shape)
print ( "ratio = ", X_test.shape[0]/boston.data.shape[0])
Output
X_train size (404, 13)
X_test size (102, 13)
y_train size (404,)
y_test size (102,)
ratio = 0.2015810276679842
train_test_split has done a good split @ 20% test data. Now, that we know how to do the split, let’s do it on just the predictors that we need.
# Fit model based on
# 1. RM - Number of rooms
# 2. LSTAT - Percentage of people with lower status
# 3. DIS - weighted distances to five Boston employment centres
# 4. PRATIO - pupil-teacher ratio by town
# 5. B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town
from sklearn.metrics import mean_squared_error, r2_score
from math import sqrt
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_boston
boston = load_boston()
# Use only the predictors specified above.
X = boston.data[:,[5,7,10,11,12]]
X_train, X_test, y_train, y_test = train_test_split(X, boston.target, test_size=0.2)
model = LinearRegression().fit(X_train,y_train)
Now that we have trained the model, let’s test it.
y_pred = model.predict(X_test)
Now, let’s find out how accurate our predictions are. A quick and dirty way to visually see this would be to plot the predicted vs actuals on a scatterplot
import matplotlib.pyplot as plt
%matplotlib inline
plt.scatter ( y_pred, y_test)
plt.xlabel ("Predictions")
plt.ylabel ("Actuals" )
Output
Text(0, 0.5, 'Actuals')

If the predictions were 100% correct, we would get a perfect 45° line. Let's draw that as a baseline to compare the predictions against.
import matplotlib.pyplot as plt
%matplotlib inline
plt.scatter ( y_pred, y_test)
plt.xlabel ("Predictions")
plt.ylabel ("Actuals" )
x = [0,50]
y = [0,50]
plt.plot(x,y,color="r")
Output

That’s not a bad fit. Let’s calculate the r2 to have a numeric estimate of how good of a fit we have.
model.score(X_test,y_test)
##Output##
0.6780106182290351
That's an r2 of about 0.68 – so, loosely speaking, we are about 68 % accurate in our prediction. ( Since the train/test split is random, the exact number will vary from run to run. )
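Besides r2, it is often useful to report the RMSE on the test set, since that is in the same units as the response variable ( thousands of dollars here ). A quick way, using y_test and y_pred from above – the exact number will vary with the random split:
from math import sqrt
from sklearn.metrics import mean_squared_error
rmse_test = sqrt(mean_squared_error(y_test, y_pred))
print ( "Test RMSE = ", rmse_test )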
Polynomial Regression
So far, we have seen examples of Linear Regression ( both simple and multiple ). However, not all data can be fit with Linear Regression – population growth is a good example of non-linear ( in fact, roughly exponential ) data. Let's take a simple example – the population of India and how it is projected to grow this century. ( This is real data taken from the United Nations at https://population.un.org/wpp/Download/Standard/Population/ )
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline
india = pd.read_csv("../../data/india_population.csv")
india.head()
Output
year population
0 1950 376325200
1 1951 382245303
2 1952 388538620
3 1953 395160091
4 1954 402077026
Let's plot this 150 years of data ( actuals from 1950 to 2015 and projections up to 2100 ).
plt.plot(india["year"],india["population"])
plt.xlabel("Year")
plt.ylabel("Population - in Billions")
Output
Text(0, 0.5, 'Population - in Billions')

Can you imagine trying to fit this data using Linear Regression ? It is going to be a rough approximation after – say – 2040, where the linearity ends. Let's try it and see how well it fits.
from sklearn.linear_model import LinearRegression
model = LinearRegression().fit(india[["year"]],india["population"])
# step 2
slope = model.coef_
intercept = model.intercept_
point_1 = slope*1950 + intercept
point_2 = slope*2100 + intercept
# step 3
plt.plot(india["year"],india["population"])
plt.plot([1950,2100], [point_1,point_2],color="r")
Output

Well, it does an OK job given that the model is linear, but can we account for the curve somehow ?
This is exactly where polynomials come in. sklearn has a built-in transformer for this – PolynomialFeatures. Let's use it to fit the data.
from sklearn.preprocessing import PolynomialFeatures
import numpy as np
poly_2 = PolynomialFeatures(degree=2)
poly_2_X = poly_2.fit_transform(india[["year"]])
np.set_printoptions(precision=3, suppress=True, linewidth=150)
print ( poly_2_X[0:5,:] )
Output
[[ 1. 1950. 3802500.]
[ 1. 1951. 3806401.]
[ 1. 1952. 3810304.]
[ 1. 1953. 3814209.]
[ 1. 1954. 3818116.]]
What is really happening here ? A single column ( year ) has become 3 columns – a constant ( 1 ), the year itself, and the year squared.


Nothing has been predicted yet – we have just transformed the predictors. Now, let’s try to fit the new predictors to the response variable.
model_linear = LinearRegression().fit(india[["year"]],india["population"])
model_2_poly = LinearRegression().fit(poly_2_X,india["population"])
data, = plt.plot(india["year"], india["population"], color="blue", label="data")
linear_fit, = plt.plot(india["year"], model_linear.predict(india[["year"]]), color = 'red', label="linear fit")
poly_2_fit, = plt.plot(india["year"], model_2_poly.predict(poly_2_X), color = 'green', label="polynomial fit")
plt.legend((data, linear_fit, poly_2_fit), ('Data', 'Linear Fit', 'Polynomial fit (degree = 2)'))
Output

Visually, you can see the green line (polynomial fit) seems to follow the blue line (data) much better than the red line (linear fit), right ? How about we go one step further and see if increasing the polynomial degree makes this any better.
from sklearn.preprocessing import PolynomialFeatures
import numpy as np
poly_3 = PolynomialFeatures(degree=3)
poly_3_X = poly_3.fit_transform(india[["year"]])
np.set_printoptions(precision=3, suppress=True, linewidth=150)
print ( poly_3_X[0:5,:] )
Output
[[1.000e+00 1.950e+03 3.802e+06 7.415e+09]
[1.000e+00 1.951e+03 3.806e+06 7.426e+09]
[1.000e+00 1.952e+03 3.810e+06 7.438e+09]
[1.000e+00 1.953e+03 3.814e+06 7.449e+09]
[1.000e+00 1.954e+03 3.818e+06 7.461e+09]]

model_3_poly = LinearRegression().fit(poly_3_X,india["population"])
data, = plt.plot(india["year"], india["population"], color="blue")
linear_fit, = plt.plot(india["year"], model_linear.predict(india[["year"]]), color = 'red')
poly_2_fit, = plt.plot(india["year"], model_2_poly.predict(poly_2_X), color = 'green')
poly_3_fit, = plt.plot(india["year"], model_3_poly.predict(poly_3_X), color = 'orange')
plt.legend((data, linear_fit, poly_2_fit, poly_3_fit), ('Data', 'Linear Fit',
'Poly fit (degree = 2)', 'Poly fit (degree = 3)'))

The orange line (polynomial fit, degree = 3) is hugging the actual data curve (blue) much closer than the green line ( polynomial fit , degree = 2 ), right ? Let’s do the numbers ( r2 score ) as well, to really know the degree of fit.
r2_linear = model_linear.score(india[["year"]],india[["population"]])
r2_poly_2 = model_2_poly.score(poly_2_X,india[["population"]])
r2_poly_3 = model_3_poly.score(poly_3_X,india[["population"]])
print ( r2_linear, " --> R squared - Linear fit")
print ( r2_poly_2, " --> R squared - polynomial fit ( degree = 2)")
print ( r2_poly_3, " --> R squared - polynomial fit ( degree = 3)")
Output
0.8434443860759384 --> R squared - Linear fit
0.9776830799010624 --> R squared - polynomial fit ( degree = 2)
0.9953977657287687 --> R squared - polynomial fit ( degree = 3)
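As a quick usage sketch (reusing poly_3 and model_3_poly from above), this is how you would read off the model’s estimate for any given year – keep in mind these are the fitted model’s estimates, not the UN projections themselves:
# transform the years of interest the same way the training data was transformed
future = pd.DataFrame({"year": [2050, 2075]})
print ( model_3_poly.predict(poly_3.transform(future)) )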
Challenges
Challenge 1
Use polynomial regression to model the relationship between the following variables of the Boston Housing dataset.
- LSTAT ( Lower Status Population percentage )
- MEDV ( Median house value )
After modeling ,
- Plot both linear and polynomial model ( degree = 2 ) to visually show how they perform
- Compare the r2 values between linear and polynomial model
- Verify if increasing the degree of the polynomial regression ( say degree = 3 or 4 ) increases the performance.
Solution
from sklearn.datasets import load_boston
boston = load_boston()
import matplotlib.pyplot as plt
%matplotlib inline
plt.scatter(boston.data[:,12],boston.target)
plt.xlabel("LSTAT - Lower Status Population ")
plt.ylabel("MEDV - Median house price")
Output
Text(0, 0.5, 'MEDV - Median house price')

Let’s first try to fit this data with a linear regression model.
# Linear Regression
from sklearn.linear_model import LinearRegression
model_linear = LinearRegression().fit(boston.data[:,12].reshape(-1,1),boston.target)
plt.scatter(boston.data[:,12],boston.target)
plt.scatter(boston.data[:,12],model_linear.predict(boston.data[:,12].reshape(-1,1)))
plt.xlabel("LSTAT - Lower Status Population ")
plt.ylabel("MEDV - Median house price")
Output
Text(0, 0.5, 'MEDV - Median house price')

This relationship doesn’t look all that linear, right ? Let’s see what the r2 score of the linear fit looks like.
model_linear.score(boston.data[:,12].reshape(-1,1),boston.target)
#Output
0.5441462975864797
That’s an r2 of about 0.54. Now, let’s try a polynomial fit – say, second degree.
from sklearn.preprocessing import PolynomialFeatures
# Transform the single predictor into polynomial features
poly_2 = PolynomialFeatures(degree=2)
poly_2_X = poly_2.fit_transform(boston.data[:,12].reshape(-1,1))
model_2_poly = LinearRegression().fit(poly_2_X,boston.target)
# Plot the actual data along with the fit
plt.scatter(boston.data[:,12],boston.target,alpha=0.25)
linear_fit = plt.scatter(boston.data[:,12], model_linear.predict(boston.data[:,12].reshape(-1,1)), color = 'red')
poly_2_fit = plt.scatter(boston.data[:,12], model_2_poly.predict(poly_2_X), color = 'green')

model_2_poly.score(poly_2_X,boston.target)
#Output
0.6407168971636612
0.544 – r2 score with linear modeling
0.640 – r2 score with polynomial modeling
That seems to be some improvement, right ?
from sklearn.preprocessing import PolynomialFeatures
# Transform the single predictor into polynomial features
poly_3 = PolynomialFeatures(degree=3)
poly_3_X = poly_3.fit_transform(boston.data[:,12].reshape(-1,1))
model_3_poly = LinearRegression().fit(poly_3_X,boston.target)
# Plot the actual data along with the fit
plt.scatter(boston.data[:,12],boston.target,alpha=0.25)
linear_fit = plt.scatter(boston.data[:,12], model_linear.predict(boston.data[:,12].reshape(-1,1)), color = 'red')
poly_2_fit = plt.scatter(boston.data[:,12], model_2_poly.predict(poly_2_X), color = 'green')
poly_3_fit = plt.scatter(boston.data[:,12], model_3_poly.predict(poly_3_X), color = 'cyan')

model_3_poly.score(poly_3_X,boston.target)
#Output
0.6578476405895719
0.544 – r2 score with linear modeling
0.640 – r2 score with polynomial modeling ( degree = 2 )
0.657 – r2 score with polynomial modeling ( degree = 3 )
That’s just a marginal improvement over the second degree polynomial regression. So, there is no real need to go beyond the second degree here.
How about a 10th degree fit ?
from sklearn.preprocessing import PolynomialFeatures
# Transform the single predictor into polynomial features
poly_10 = PolynomialFeatures(degree=10)
poly_10_X = poly_10.fit_transform(boston.data[:,12].reshape(-1,1))
model_10_poly = LinearRegression().fit(poly_10_X,boston.target)
# Plot the actual data along with the fit
plt.scatter(boston.data[:,12],boston.target,alpha=0.25)
linear_fit = plt.scatter(boston.data[:,12], model_linear.predict(boston.data[:,12].reshape(-1,1)), color = 'red')
poly_2_fit = plt.scatter(boston.data[:,12], model_2_poly.predict(poly_2_X), color = 'green')
poly_10_fit = plt.scatter(boston.data[:,12], model_10_poly.predict(poly_10_X), color = 'cyan')

model_10_poly.score(poly_10_X,boston.target)
#Output
0.6844527059749335
0.544 – r2 score with linear modeling
0.640 – r2 score with polynomial modeling ( degree = 2 )
0.657 – r2 score with polynomial modeling ( degree = 3 )
0.684 – r2 score with polynomial modeling ( degree = 10 )
See, even if you go all the way to a 10th degree polynomial fit, the improvement in r2 over the second degree fit is only about 0.04.
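To see the diminishing returns at a glance, here is a small sketch that loops over a few degrees (reusing the boston dataset and the imports from above) and prints the training r2 for each – the improvement flattens out quickly after degree 2:
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
for degree in (1, 2, 3, 4, 10):
    poly = PolynomialFeatures(degree=degree)
    X_poly = poly.fit_transform(boston.data[:, 12].reshape(-1, 1))
    score = LinearRegression().fit(X_poly, boston.target).score(X_poly, boston.target)
    print ( "degree =", degree, " r2 =", round(score, 3) )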
Overfitting
Sometimes, when there is not enough data relative to the flexibility of the model, the model tends to overfit. Look at the example data below – we are simulating a sine wave.
import numpy as np
import math
x = np.linspace(10,100,num=20)
y = np.sin(x*math.pi/180)
Let’s plot it to see how it looks.
import matplotlib.pyplot as plt
%matplotlib inline
plt.scatter(x,y)
Output

Let’s introduce a bit of random noise ( which we will call variance ) to make the data a bit more realistic.
variance = np.random.rand(20) / 2
y = y + variance
import matplotlib.pyplot as plt
%matplotlib inline
plt.scatter(x,y)
Output

Now, let’s see how a 2nd degree polynomial regression fits.
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
# Transform the single predictor into polynomial features
poly_2 = PolynomialFeatures(degree=2)
poly_2_X = poly_2.fit_transform(x.reshape(-1,1))
model_2_poly = LinearRegression().fit(poly_2_X,y)
# Plot the actual data along with the fit
plt.scatter(x,y)
poly_2_fit = plt.plot(x, model_2_poly.predict(poly_2_X), color = 'cyan')
plt.title("Good fit")
Output
Text(0.5, 1.0, 'Good fit')

This looks to be a good fit, right ? Now, let’s try a much higher order polynomial fit ( say degree = 15 ).
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
# Transform the single predictor into polynomial features
poly_10 = PolynomialFeatures(degree=15)
poly_10_X = poly_10.fit_transform(x.reshape(-1,1))
model_10_poly = LinearRegression().fit(poly_10_X,y)
# Plot the actual data along with the fit
plt.scatter(x,y)
poly_10_fit = plt.plot(x, model_10_poly.predict(poly_10_X), color = 'cyan')
plt.title("Over fit")
Output
Text(0.5, 1.0, 'Over fit')

Let’s put these two figures together to compare.


Overfitting typically happens when the model is working too hard to fit the data – it ends up fitting the noise along with the signal. And why is it a problem ? An overfit model will not work well on new datasets. Think of overfitting as localizing the solution to the training dataset – the model is more or less memorizing the data rather than generalizing from it. Obviously, it will not work as well when the model is used on real, unseen data. We will see more examples of this when we look at other machine learning algorithms down the line.
If you are wondering why the second degree polynomial learns the model just enough while the higher degree polynomial over-learns it, that is because the higher order polynomial has far more flexibility ( more parameters to bend the curve with ) than a linear or second order fit. That flexibility is good in itself, except that the model cannot discern noise from data. Let’s increase the dataset size to see if the same 15th degree polynomial regression performs better than a second order fit.
Increase sample size to 1000


You see, the data size did not matter – the 15th degree polynomial regression still overfits the data ( the code for this 1000-point experiment is shown in the next section ). The reason is that, for this amount of data and noise, a second or third degree polynomial already has enough power to capture the complexity of the data. For more complicated datasets, an increase in degree might capture the complexity better.
Detect Overfitting
If you look at the pictures above, we were able to clearly see the overfit because it is a 2-dimensional dataset. Most real-life data is multi-dimensional, in which case we cannot identify an overfit just by looking at a plot. There are basically 2 methods to identify an overfit.
1. Choose a simpler model.
Compare a measure of fit ( say, r2 ) between a simpler and a more complex model – if the more complex model does not score significantly better, go with the simpler model.
import numpy as np
import math
x = np.linspace(10,100,num=1000)
y = np.sin(x*math.pi/180)
variance = np.random.rand(1000) / 2
y = y + variance
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
# 2nd Degree polynomial fit
poly_2 = PolynomialFeatures(degree=2)
poly_2_X = poly_2.fit_transform(x.reshape(-1,1))
model_2_poly = LinearRegression().fit(poly_2_X,y)
# 15th degree polynomial fit
poly_10 = PolynomialFeatures(degree=15)
poly_10_X = poly_10.fit_transform(x.reshape(-1,1))
model_10_poly = LinearRegression().fit(poly_10_X,y)
print ( model_2_poly.score(poly_2_X,y) )
print ( model_10_poly.score(poly_10_X,y) )
#Output
0.7528691833579731
0.6915335671211775

The r2 score of the 2nd degree fit is better than the 15th degree polynomial model. So, the simpler model ( 2nd degree model) wins.
2. Check the model performance across Training and Test datasets.
Another method to identify an overfit is by validating how well the model is performing against the training dataset vs the test dataset.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.2)
print ( "X_train size", X_train.shape)
print ( "X_test size", X_test.shape)
print ( "y_train size", y_train.shape)
print ( "y_train size", y_test.shape)
print ( "ratio = ", X_test.shape[0]/x.shape[0])
###Output###
X_train size (800,)
X_test size (200,)
y_train size (800,)
y_test size (200,)
ratio = 0.2
Fit a 2nd degree polynomial regression
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
# Transform the single predictor into polynomial features
poly_2 = PolynomialFeatures(degree=2)
poly_2_X = poly_2.fit_transform(X_train.reshape(-1,1))
model_2_poly = LinearRegression().fit(poly_2_X,y_train)
Fit a 15th degree polynomial regression
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures
# Transform the single predictor into polynomial features
poly_10 = PolynomialFeatures(degree=15)
poly_10_X = poly_10.fit_transform(X_train.reshape(-1,1))
model_10_poly = LinearRegression().fit(poly_10_X,y_train)
Compare the r2 scores of the two models on the training dataset.
print ( model_2_poly.score(poly_2_X,y_train), " -->r-squared - for 2nd Degree Polynomial regression on training dataset" )
##Output##
0.7527035537094962 -->r-squared - for 2nd Degree Polynomial regression on training dataset
print ( model_10_poly.score(poly_10_X,y_train), " -->r-squared - for 15th Degree Polynomial regression on training dataset" )
##Output##
0.6899011370541518 -->r-squared - for 15th Degree Polynomial regression on training dataset
# transform the test predictors the same way ( no need to re-fit the transformer )
poly_2_X_test = poly_2.transform(X_test.reshape(-1,1))
poly_10_X_test = poly_10.transform(X_test.reshape(-1,1))
print ( model_2_poly.score(poly_2_X_test,y_test), " -->r-squared - for 2nd Degree Polynomial regression on Test dataset" )
print ( model_10_poly.score(poly_10_X_test,y_test), " -->r-squared - for 15th Degree Polynomial regression on Test dataset" )
##Output##
0.7531367987161364 -->r-squared - for 2nd Degree Polynomial regression on Test dataset
0.6959846261157662 -->r-squared - for 15th Degree Polynomial regression on Test dataset
Compare the r-square scores between the training and test datasets ( the exact numbers will vary from run to run, since both the noise and the train/test split are random ).
r-square score between training and test datasets
Model | r2 score (Train) | r2 score (Test) |
---|---|---|
2nd Degree | 0.753 | 0.753 |
15th Degree | 0.690 | 0.696 |
In this particular run neither model shows a meaningful drop from training to test, so the train/test comparison does not flag an overfit here – the 15th degree model simply scores lower across the board, which again favors the simpler model. The warning sign to look for is a model that scores well on the training data but noticeably worse on the test data; if there is a large difference, you had better discard the model.
Prevent Overfitting
Cross Validation
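A common way to guard against overfitting is k-fold cross validation: instead of relying on a single train/test split, the data is split into k folds and the model is trained and scored k times, each time holding out a different fold. An overfit model tends to hold up worse on the held-out folds than a simpler one. Here is a minimal sketch, reusing the 1000-point x and y arrays generated above ( the shuffled KFold and the pipeline are additions of this sketch, not something used earlier in this section ):
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
# 5-fold cross validation, comparing the 2nd and 15th degree models
cv = KFold(n_splits=5, shuffle=True, random_state=0)
for degree in (2, 15):
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    scores = cross_val_score(model, x.reshape(-1,1), y, cv=cv)   # default scoring is r2
    print ( "degree =", degree, " average r2 across folds =", round(scores.mean(), 3) )
Whichever model holds up better across the folds is the one more likely to generalize.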
Linear Regression Assumptions
Homoskedasticity
In plain English, what this means is that the spread of the residuals ( the differences between the predicted and the actual values ) should be uniform across the whole range of predictions.
Homoskedasticity = Uniform Variance
Let’s take the relationship between the number of rooms (“RM”) and the median price value (“MEDV”) and try to fit a linear regression model.
from sklearn.linear_model import LinearRegression
from sklearn.datasets import load_boston
boston = load_boston()
# Linear regression model
model_linear = LinearRegression().fit(boston.data[:,5].reshape(-1,1),boston.target)
Let’s try to predict the house price now that our model is ready.
y_pred = model_linear.predict(boston.data[:,5].reshape(-1,1))
Now that we have predicted the housing prices, let’s see how they compare to the actual house prices. The differences between the actual and predicted values are called residuals.
residuals_rm = boston.target - y_pred
Let’s plot the residuals vs the predicted values.
import matplotlib.pyplot as plt
%matplotlib inline
plt.scatter(y_pred,residuals_rm)
plt.xlabel("Predicted Median House Price")
plt.ylabel("Residuals")
plt.title("Fitted vs Residuals")
Output
Text(0.5, 1.0, 'Fitted vs Residuals')

What this plot shows is that the variance of the residuals is not uniform across the spectrum of house prices. For example, the spread is high between 10K and 35K – but that is probably because the bulk of the data lies there. You could strip away the outliers, restrict the dataset to that range and try again.
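Here is a minimal sketch of that suggestion – the 10 and 35 cutoffs are just the rough range eyeballed from the plot above, not anything principled:
# keep only the houses with MEDV between 10 and 35 and refit the RM model
mask = (boston.target > 10) & (boston.target < 35)
X_trim = boston.data[mask, 5].reshape(-1, 1)
y_trim = boston.target[mask]
model_trim = LinearRegression().fit(X_trim, y_trim)
residuals_trim = y_trim - model_trim.predict(X_trim)
plt.scatter(model_trim.predict(X_trim), residuals_trim)
plt.xlabel("Predicted Median House Price")
plt.ylabel("Residuals")
plt.title("Fitted vs Residuals ( trimmed data )")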
However, what is the ideal Fitted vs Residuals plot that makes the data homoskedastic ( uniform variance ) ?
import numpy as np
import math
# Generate 100 random integers between 0 and 100.
x = np.random.randint(0,100,100)
variance = np.random.randint(-10,10,100)
y = x + variance
import matplotlib.pyplot as plt
%matplotlib inline
plt.scatter(x,y,alpha=0.1)
Output

from sklearn.linear_model import LinearRegression
# Linear regression model
model_linear = LinearRegression().fit(x.reshape(-1,1),y)
# predict y
y_pred = model_linear.predict(x.reshape(-1,1))
#residuals
residuals_rm = y - y_pred
import matplotlib.pyplot as plt
%matplotlib inline
plt.scatter(y_pred,residuals_rm)
plt.xlabel("Predicted Median House Price")
plt.ylabel("Residuals")
plt.title("Fitted vs Residuals")
# draw a horizontal line at 0 to show the horizon.
plt.axhline(color='r', linestyle="-")
Output

This is the ideal plot that confirms homoskedasticity – sometimes called the “night sky” plot. The residuals are uniformly spread across the entire range of predictions. What this means is that we can predict with the same level of accuracy across the entire spectrum.
Now, compare it to the previous Fitted vs Residuals plot to understand it better.


Normal Distribution of Residuals
Another assumption is that the residuals should be normally distributed. You can quickly verify that using a histogram.
plt.hist(residuals_rm,bins=20)
##Output###
(array([ 6., 9., 9., 10., 7., 2., 0., 6., 6., 4., 4., 4., 6., 3., 5., 5., 4., 4., 3., 3.]),
array([-8.055, -7.111, -6.168, -5.224, -4.28 , -3.337, -2.393, -1.45 , -0.506, 0.437, 1.381, 2.324, 3.268, 4.212, 5.155, 6.099, 7.042,
7.986, 8.929, 9.873, 10.816]),
<a list of 20 Patch objects>)

Or using a Q-Q ( Quantile-Quantile ) plot. It is built by plotting the quantiles of the sample against the quantiles of a normal distribution.
# First sort the residuals ( sample )
residuals_rm.sort()
# How big is the dataset ?
len(residuals_rm)
# create a normal distribution as big as the residuals dataset
norm=np.random.normal(0,2,len(residuals_rm))
# and sort it
norm.sort()
# Scatter plot of normal distribution vs residuals
plt.scatter(norm,residuals_rm)
plt.title("Quantile - Quantile plot ( Q-Q )")
plt.xlabel("Normal Distribution")
plt.ylabel("Residuals")
Output
Text(0, 0.5, 'Residuals')

What this plot goes to show is that the residuals are almost normally distributed except at the fringes. So, if you can identify the outliers and trim them, this dataset satisfies the Normal Distribution of Residuals assumption of Linear Regression.
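By the way, you don’t have to build the Q-Q plot by hand – statsmodels ( which we will use below for VIF anyway ) has a ready-made version. A one-line sketch:
import statsmodels.api as sm
# "s" draws a standardized reference line through the sample quantiles
sm.qqplot(residuals_rm, line="s")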
No Multicollinearity
Another big word – but the explanation is a bit tricky as well. Let’s create a dummy dataset to understand this better.
import numpy as np
x = np.random.randint(1,100,100).reshape(-1,1)
y = np.random.randint(1,100,100).reshape(-1,1)
z = x + y
X = np.concatenate((x,y,z),axis=1)
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant
# statsmodels always requires the constant to be added as the first column
data = add_constant(X)
Collinearity is detected using a measure called VIF ( Variance Inflation Factor ), the formula for which is
VIF_i = 1 / (1 - Rsq_i)
where Rsq_i is the r2 of the i-th predictor when it is regressed (predicted) against the rest of the predictors. So, in the example above, the VIF value of z becomes
VIF_z = 1 / (1 - 1), which is infinite
because z = x + y exactly, so x and y together predict z perfectly ( Rsq_z = 1 ).
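To ground the formula, here is a quick manual check for z ( a small sketch reusing the x, y and z arrays created above ):
import numpy as np
from sklearn.linear_model import LinearRegression
# regress z against the other two predictors (x and y) and look at the r2
xy = np.concatenate((x, y), axis=1)
r2_z = LinearRegression().fit(xy, z).score(xy, z)
print ( r2_z )   # 1.0, because z = x + y exactly - so VIF_z = 1/(1 - 1) blows up to infinity
statsmodels gives us the same answer without the manual work, via variance_inflation_factor: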
for i in range(1,4) :
    vif = variance_inflation_factor(data,i)
    print ( "vif", " = ", vif )
##Output##
vif = inf
vif = inf
vif = inf
Try this with other combinations of the variables ( for example, drop z, or generate z independently of x and y ) and watch how the VIF values change.
Point being, as long as a variable can be explained by a linear combination of the other variables, that shows up as a high VIF ( Variance Inflation Factor ) – typically anything greater than 2.
Collinearity and correlation are different though. Take a power consumption example. There is some correlation between
- Total Power Consumption <–> Number of rooms
- Total Power Consumption <–> Power consumption per room
But there is a perfect collinearity between
- Total Power Consumption <–> Number of rooms & Power consumption/room

Correlation is a pairwise relationship between 2 variables, while (multi)collinearity is a relationship between a single variable and a (linear) combination of other variables.
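To see the difference numerically on the dummy x, y, z data from above ( a quick sketch ): the pairwise correlation of z with x and y individually is only moderate – roughly 0.7 each when x and y are independent – even though z is perfectly collinear with x and y taken together.
# 3 x 3 pairwise correlation matrix of x, y and z ( the columns of X )
print ( np.round(np.corrcoef(X, rowvar=False), 2) )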
Now that we understand collinearity, let’s find out how many variables have a collinear relationship in the Boston Housing dataset.
from sklearn.datasets import load_boston
boston = load_boston()
data = add_constant(boston.data)
for i in range(1,14) :
    vif = variance_inflation_factor(data,i)
    print ( vif, " - VIF for ",boston.feature_names[i-1])
##Output##
1.7921915474332406 - VIF for CRIM
2.2987581787494418 - VIF for ZN
3.9915964183460297 - VIF for INDUS
1.0739953275537886 - VIF for CHAS
4.393719847577493 - VIF for NOX
1.9337444357832574 - VIF for RM
3.1008255128153372 - VIF for AGE
3.9559449063727263 - VIF for DIS
7.484496335274472 - VIF for RAD
9.00855394759707 - VIF for TAX
1.7990840492488978 - VIF for PTRATIO
1.3485210764063758 - VIF for B
2.9414910780919366 - VIF for LSTAT
So, all of the following variables have a high collinearity.
- 7.484496335274472 – VIF for RAD
- 9.00855394759707 – VIF for TAX
How to fix Collinearity?
- Iteratively remove the predictor with the highest VIF and re-run. In this case, remove either RAD or TAX ( we remove RAD below ) and do a re-run.
- Use dimensionality reduction ( say PCA ) to reduce the number of predictors ( more on PCA & dimensionality reduction later ).
np.set_printoptions(precision=3, suppress=True, linewidth=250)
# drop the RAD column ( index 8 ) and re-compute the VIFs
data = np.delete(boston.data,8,axis=1)
data = add_constant(data)
feature_names = np.delete(boston.feature_names,8)
for i in range(1,13) :
    vif = variance_inflation_factor(data,i)
    print ( vif, " - VIF for ",feature_names[i-1])
##Output##
1.664471491773954 - VIF for CRIM
2.273018308327798 - VIF for ZN
3.682264743037106 - VIF for INDUS
1.0615610963150082 - VIF for CHAS
4.304929331571053 - VIF for NOX
1.8854245633324191 - VIF for RM
3.0830086748912335 - VIF for AGE
3.954950605693307 - VIF for DIS
3.4152890003342686 - VIF for TAX
1.7348731617273456 - VIF for PTRATIO
1.34145900343612 - VIF for B
2.9377524102330312 - VIF for LSTAT
Now, all the variables have a VIF of less than 5 ( a VIF under 2 or 3 is ideal, though ).