Logistic Regression

Logistic Regression


  Data visualization

Contents

  • What is Logistic Regression
  • Math
    • logit function
    • sigmoid function
  • Implementation
    • Dataset
    • Modeling
    • Visualization
    • Basic Evaluation
  • Optimization
  • Evaluation
    • ROC Curve
    • Area under ROC Curve
  • References

What is Logistic Regression

Logistic regression is a type of linear regression. However, it is used for classification only. Huh.. that’s confusing, right ? Let’s dive in.

Let’s take the simple iris dataset. The target variable as you know by now ( from day 9 – Introduction to Classification in Python, where we discussed classification using K Nearest neighbours ) is categorical in nature. Let’s load the data first.

head(iris)
Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
<dbl&gt;	<dbl&gt;	<dbl&gt;	<dbl&gt;	<fct&gt;
1	5.1	3.5	1.4	0.2	setosa
2	4.9	3.0	1.4	0.2	setosa
3	4.7	3.2	1.3	0.2	setosa
4	4.6	3.1	1.5	0.2	setosa
5	5.0	3.6	1.4	0.2	setosa
6	5.4	3.9	1.7	0.4	setosa

Let’s simplify the dataset to just 2 species –

  • 0 – setosa
  • 1 – versi-color

Let’s just take data for 2 of the species ( say setosa and versi-color ) with just the sepal data ( sepal length and sepal width ) and plot it.



iris_data   = iris[0:100,]

In [9]:

iris_data = iris[0:100,]
plot(iris_data$Sepal.Length, iris_data$Sepal.Width, 
     col = iris_data$Species, pch = 19, 
     xlab = "Sepal Length",
     ylab = "Sepal Width")

Let’s simplify this further – say, we wanted to predict the species based on a single parameter – Sepal Length. Let’s first plot it.In [14]:

plot ( iris_data[,1], iris_data[,5],
       pch = 19, col = "blue",
       xlab = "Sepal Length",
       ylab = "Setosa or Versicolor")

We know that regression is used to predict a continous variable. What about a categorical variable like this ? (species). If we can draw a curve like this,

and for all target values predicted with value > 0.5 put it in one category, and for all target values less than 0.5, put it in the other category – like this.

A linear regression (multilinear in this case) equation looks like this.

Logistic regression is almost similar to linear regression. The difference lies in how the predictor is calculated. Let’s see it in the next section.


Math

The name logistic regression is derived from the logit function. This function is based on odds.

logit function

Let’s take an example. A standard dice roll has 6 outcomes. So, what is the probability of landing a 4 ?

Now, what about odds ? The odds of landing a 4 is

$ where \hspace{0.5cm} P_4 = \frac {1}{6} $

So, when we substitute p into the odds equation, it becomes

OK. Now that we understand Probability and Odds, let’s get to the log of odds.

How exactly is the logistic regression similar to linear regression ? Like so.

Where the predictor ( log of odds ) varies between ( -∞ to +∞ ).

To understand this better, let’s plot the log of odds between a probabilty value of 0 and 1.In [29]:

x = runif ( 100, min = 0.1, max = 0.9)
x = sort ( x )
y = log ( x / (1-x))

In [30]:

# plt.plot(x,y)
# plt.grid()

plot ( x, y , type="l")
grid(5, 5, lwd = 2)

This is the logistic regression curve. It maps a probability value ( 0 to 1 ) to a number ( -∞ to +∞ ). However, we are not looking for a continous variable, right ? The predictor we are looking for is a categorical variable – in our case, we said we would be able to predict this based on probability.

  • p >= 0.5 – Category 1
  • p < 0.5 – Category 2

In order to calculate those probabilities, we would have to calculate the inverse function of the logit function.


sigmoid function

The inverse of the logit curve is the inverse-logit or sigmoid function ( or expit function). The sigmoid function transforms the numbers ( -∞ to +∞ ) back to values between 0 and 1. Here is the formula for the sigmoid function.

In [34]:

x_new = y
y_new = exp(x_new) / (1 + exp(x_new))

plot (x_new, y_new, type="l")
grid(5, 5, lwd = 2)

Essentially, if we flip the logit function 900, you get the sigmoid function.

Here is the trick – As long as we are able to find a curve like the one below, although the target (predictor) is a value between 0 and 1 ( probabilities), we can say that all values below 0.5 ( half way mark ) belongs to one category and the remaining ( values above 0.5 ) belong to the next category. This is the essence of logistic regression.


Implementation

Let’s try to implement the logistic regression function in R step by step.

Data & Modeling

Just to keep the same example going, let’s try to fit the sepal length data to try and predict the species as either Setosa or Versicolor.In [81]:

model = glm(Species ~ Sepal.Length + Sepal.Width, 
            data=iris_data, 
            family=binomial(link="logit"))
Warning message:
"glm.fit: algorithm did not converge"
Warning message:
"glm.fit: fitted probabilities numerically 0 or 1 occurred"

In [82]:

y_pred = predict(model, iris_data, type="response")

In [83]:

y_pred

12.22044604925031e-1622.22044604925031e-1632.22044604925031e-1642.22044604925031e-1652.22044604925031e-1662.22044604925031e-1672.22044604925031e-1682.22044604925031e-1692.22044604925031e-16102.22044604925031e-16112.22044604925031e-16122.22044604925031e-16132.22044604925031e-16142.22044604925031e-16152.22044604925031e-16162.22044604925031e-16172.22044604925031e-16182.22044604925031e-16197.94264732123352e-13202.22044604925031e-16217.2365355113674e-11222.22044604925031e-16232.22044604925031e-16242.22044604925031e-16252.22044604925031e-16262.22044604925031e-16272.22044604925031e-16282.22044604925031e-16292.22044604925031e-16302.22044604925031e-16312.22044604925031e-16327.2365355113674e-11332.22044604925031e-16342.22044604925031e-16352.22044604925031e-16362.22044604925031e-16376.31794371395771e-10382.22044604925031e-16392.22044604925031e-16402.22044604925031e-16412.22044604925031e-16429.0266451369787e-10432.22044604925031e-16442.22044604925031e-16452.22044604925031e-16462.22044604925031e-16472.22044604925031e-16482.22044604925031e-16492.22044604925031e-16502.22044604925031e-16511521531541551561571580.999999999144556591600.999999999998715611621631641651661671681691701711721731741751761771781791801811821831841850.9999999989774918618718818919019119219319419519619719819911001

These are probability values. We need to convert them to the actual factors (setosa & versicolor), because, we are dealing with just 2 classes. We can just use a simple ifelse ( ) syntax to convert all values > 0.5 to a versicolor and < 0.5 to a setosa.In [91]:

y_pred_levels = as.factor(ifelse(y_pred>0.5,"versicolor","setosa"))

In [89]:

y_pred_levels

1setosa2setosa3setosa4setosa5setosa6setosa7setosa8setosa9setosa10setosa11setosa12setosa13setosa14setosa15setosa16setosa17setosa18setosa19setosa20setosa21setosa22setosa23setosa24setosa25setosa26setosa27setosa28setosa29setosa30setosa31setosa32setosa33setosa34setosa35setosa36setosa37setosa38setosa39setosa40setosa41setosa42setosa43setosa44setosa45setosa46setosa47setosa48setosa49setosa50setosa51versicolor52versicolor53versicolor54versicolor55versicolor56versicolor57versicolor58versicolor59versicolor60versicolor61versicolor62versicolor63versicolor64versicolor65versicolor66versicolor67versicolor68versicolor69versicolor70versicolor71versicolor72versicolor73versicolor74versicolor75versicolor76versicolor77versicolor78versicolor79versicolor80versicolor81versicolor82versicolor83versicolor84versicolor85versicolor86versicolor87versicolor88versicolor89versicolor90versicolor91versicolor92versicolor93versicolor94versicolor95versicolor96versicolor97versicolor98versicolor99versicolor100versicolorIn [90]:

library(caret)

cm = confusionMatrix(y_pred_levels,iris_data[,5])
cm
Warning message in levels(reference) != levels(data):
"longer object length is not a multiple of shorter object length"
Warning message in confusionMatrix.default(y_pred_levels, iris_data[, 5]):
"Levels are not in the same order for reference and data. Refactoring data to match."
Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         50         0
  virginica       0          0         0

Overall Statistics
                                     
               Accuracy : 1          
                 95% CI : (0.9638, 1)
    No Information Rate : 0.5        
    P-Value [Acc > NIR] : < 2.2e-16  
                                     
                  Kappa : 1          
                                     
 Mcnemar's Test P-Value : NA         

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                    1.0               1.0               NA
Specificity                    1.0               1.0                1
Pos Pred Value                 1.0               1.0               NA
Neg Pred Value                 1.0               1.0               NA
Prevalence                     0.5               0.5                0
Detection Rate                 0.5               0.5                0
Detection Prevalence           0.5               0.5                0
Balanced Accuracy              1.0               1.0               NA

Evaluation

Let’s split up the data into training and test data and model it. This time let’s do the full iris dataset. Since there are 3 Species to be predicted, we cannot use glm with a “binomial” family algorithm. Let’s use another library called nnet. As usual, to evaluate categorical target data, we use a confusion matrix.In [3]:

library(nnet)

In [4]:

index = sample(1:nrow(iris),nrow(iris)*.8)
train = iris[index,]
test  = iris[-index,]

In [5]:

model = multinom(Species~.,data = train)
# weights:  18 (10 variable)
initial  value 131.833475 
iter  10 value 11.516467
iter  20 value 4.881298
iter  30 value 4.469920
iter  40 value 4.263054
iter  50 value 3.911756
iter  60 value 3.823284
iter  70 value 3.598069
iter  80 value 3.591202
iter  90 value 3.570975
iter 100 value 3.570835
final  value 3.570835 
stopped after 100 iterations

In [6]:

pred = predict(model,test)

As usual, to evaluate categorical target data, we use a confusion matrix.In [9]:

library(caret)

cm = confusionMatrix(pred, as.factor(test$Species))
cm
Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa         16          0         0
  versicolor      0          8         1
  virginica       0          2         3

Overall Statistics
                                          
               Accuracy : 0.9             
                 95% CI : (0.7347, 0.9789)
    No Information Rate : 0.5333          
    P-Value [Acc > NIR] : 1.989e-05       
                                          
                  Kappa : 0.8315          
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            0.8000           0.7500
Specificity                 1.0000            0.9500           0.9231
Pos Pred Value              1.0000            0.8889           0.6000
Neg Pred Value              1.0000            0.9048           0.9600
Prevalence                  0.5333            0.3333           0.1333
Detection Rate              0.5333            0.2667           0.1000
Detection Prevalence        0.5333            0.3000           0.1667
Balanced Accuracy           1.0000            0.8750           0.8365

That’s a 84% score.


Optimization

Let’s plot the logistic regression curve for the test data set.

Step 1 – Get the dataIn [97]:

iris_data = iris[51:150,]
iris_data = iris_data[order(iris_data$Sepal.Length),]
head(iris_data)
Sepal.LengthSepal.WidthPetal.LengthPetal.WidthSpecies
<dbl><dbl><dbl><dbl><fct>
584.92.43.31.0versicolor
1074.92.54.51.7virginica
615.02.03.51.0versicolor
945.02.33.31.0versicolor
995.12.53.01.1versicolor
605.22.73.91.4versicolor

Step 2 – Model the data using a classifierIn [98]:

model = glm( Species ~ Sepal.Length, data = iris_data , family = binomial)

Step 3 – Plot the Logit curve.In [100]:

library(pROC)

plot( iris_data$Sepal.Length, iris_data$Species, pch = 19, col = iris_data$Species)

points ( iris_data$Sepal.Length, model$fitted.values + 2, col = "orange", type="l", lty=2, lwd=3)

As you can see, still there are quite a bit of mis-classifications. All the false negatives and false positives in the plot above are examples of mis-classification. Irrespective of the algorithm used to calculate the fit, there is only so much that can be done in increasing the classification accuracy given the data as-is. Other terms for True Positive and True Negative are

  • Sensitivity ( True Positive )
  • Specificity ( True Negative )

There is a specific optimization that can be done – and that is to specifically increase accuracy of one segment of the confusion matrix at the expense of the other segments. For example, if you look at a visual of the confusion matrix for our dataset.

For this dataset, classifying the species as “setosa” is positive and “versi-color” as negative.

  • setosa – positive
  • versi-color – negative

Let’s actuall calculate the accuracy values. Say the confusion matrix looks like this.[[11 1] [ 1 12]]In [109]:

tp = (11) / (11 + 1)
fp = (1)  / (11 + 1)
fn = (1)  / (1 + 12)
tn = (12) / (12 + 1)

cat ("True Positive = ", tp, "\n")
cat ("False Positive = ", fp, "\n")
cat ("True Negative = ", tn, "\n")
cat ("False Negative = ", fn, "\n")
True Positive =  0.9166667 
False Positive =  0.08333333 
True Negative =  0.9230769 
False Negative =  0.07692308 

What if we want to predict 100% of setosa ( or a much more accurate classification than 0.9 ). Of course, like we discussed earlier, it will come at a cost. However, there is a usecase for this scenario. For example, if getting a particular classification right is extremely important, then we focus more on that particular classification than the others. Have you seen the Brad Pitt’s movie “World War Z” ? A plague emerges all around the world and an asylum is set up in Israel with a high wall. However, when you enter the wall, they make absolutely sure that you do not have the plague. Say, if you have the plague and if you call that as positive, then essentially you maximize the green box in the picture above.

Or another example would be, if you were to diagonize cancer patients, you would rather want to increase the odds of predicting a cancer patient if he/she really has it (True positive). Even it it comes at a cost of wrongly classifying a non-cancer patient as positive ( false positive ). The former can save a life while the later will just cost the company a patient.


Evaluation

ROC Curve

Receiver Operating Characteristics – also called ROC Curve is a measure of how good the classification is. Scikit Learn has an easy way to create ROC curve and calculate the area under the ROC curve. First off, let’s start with a classifier like Logistic Regression

Step 1 – Get the dataIn [125]:

iris_data = iris[51:150,]
iris_data = iris_data[order(iris_data$Sepal.Length),]
model = glm( Species ~ Sepal.Length, data = iris_data , family = binomial)
library(pROC)

# iris$Species has 3 classes and hence 3 factors. So, we are converting them to
# 0/1 factor using factor (c(iris_data$Species) - 2). 
# -2 so that 3,2 become 1,0
roc = roc ( factor (c(iris_data$Species) - 2), model$fitted.values, plot = TRUE, legacy.axes = TRUE)
Setting levels: control = 0, case = 1

Setting direction: controls < cases

Area under the curve is an indicator of how accurate our classifier is. You can get it as follows.In [126]:

roc$auc

0.7896

Reference

In [ ]:

Regression

Regression


  Data visualization

Summary : Regression is a basic Machine Learning Technique. We will be learning Simple Linear Regression, Multiple Linear Regression and Polynomial Regression in this section. On top of the basic technique itself, we will also be covering many statistical concepts (like Null Hypothesis, p-value, r-2, RMSE ) and other key concepts in machine learning like Feature Selection, Training and Test data splits that will be useful in evaluating models going forward as well.

Contents

What is Regression

In Machine Learning, most problems are classified as supervised vs unsupervised learning. We will first focus on supervised learning algorithms and later work on unsupervised learning algorithms. Supervised learning is once again split into the following 2 groups

  • Classification
  • Regression

Given a particular height and weight, classify the person as either male or female. This is an example of classification. You are essentially trying to classify the person – in this case – as male or female based on certain characteristics.

In contrast, say you are trying to predict the body fat percentage based on height and weight – this is an example of a regression problem. What is the difference ? Body Fat % is a continuous variable – say it starts at a minimum of 2% (really lean) and can go all the way up to 50 % say (extremely obese) – as opposed to a categorical variable in the example above ( Male or Female ).


Why Regression

If you are learning how to solve a regression problem for the first time, you probably need to understand why you need regression to start with. This is probably the simplest of the regression problems. Let’s start with a simple data set – Swedish auto insurance claims. You can google it or get it from kaggle. It is a very small data set with just 2 variables –

  • Number of Claims
  • Total payment for these claims ( in thousands )

Claims come first and the settlement happens much later. Assuming these are claims this company receives per week, is there a way we can predict how much the company will end up paying, just based on the number of claims ?

What value does this bring to the company ?

Being able to predict the payment based on the number of claims gives a very good understanding of the companies expenses very much in advance.

Why do you need machine learning for this problem ?

Each claims is different – A fender bender claims costs a thousand dollars and a total could cost 20 grand. The type of claim does make for a good predictor, but let’s just assume we don’t have that at this point. Even if we had the type of claim, a fender bender can cost anywhere from 300 to 2000 dollars based on the severity of damage, the arbitration and several environmental factors. Essentially, there is no easy way to correlate the claims to the payment. If we tried to do this using some kind of IF, THEN logic, we would be going around in hoops.


Solve Regression in Python

data = read.csv("./data/insurance.csv", skip= 5, header=TRUE)
head(data)
claims	payment
	<int&gt;	<dbl&gt;
1	108	392.5
2	19	46.2
3	13	15.7
4	124	422.2
5	40	119.4
6	57	170.9

Looking OK – but since we are reading data from file, we have to ensure that Python is not reading integers as strings or other object types. Let’s quickly verify if the data types are correct.

class(data$claims)
 'integer'      
class(data$payment)
'numeric' 

Looking good. Now, onto LinearRegression. We don’t have to install any specific packages to do linear regression in R. As part of base R, we have a function called lm ( ).

model = lm( payment ~ claims, data = data)

Our model is ready. Let’s start predicting claims based on the count of claims. We will be using the predict method. But before we do that, let’s plot this out to understand what we have done so far.

# pch stands for plotting characteristics. 
# A value of 19 means that it is a solid dot (as opposed to a default hollow dot)
plot( data$claim,data$payment,pch = 19, col = "blue")

predict = predict(model, data = data[,1])

Linear Regression has already solve this problem for us – we just didn’t realize it yet. The parameters ( also called co-efficients )

  • slope
  • intercept

are the solution.

intercept = coef(model)[1]
slope     = coef(model)[2]

cat ("slope = ", slope, "\n")
cat ("intercept = ", intercept)
slope =  3.413824 
intercept =  19.99449
plot( data$claim,data$payment,pch = 19, col = "blue")
lines(data$claim, fitted(model), col="red")

In case you are wondering how this line was draw, you can very well use the slope and intercept parameters to draw the same line.

How did we get the straight line ?

A straight line can be defined mathematically using

These are also called coefficients. The fit function of LinearRegression has already arrived at these numbers ( slope and intercept ). It has done so based on the data

plot( data$claim,data$payment,pch = 19, col = "blue")

# abline defines the equation as "y = bx + a" as opposed to our definition as "y = ax + b"
# that is why, we are setting a = intercept and b = slope in the equation.
abline(a=intercept, b=slope, col = "red")

What did we achieve ?

What we have essentially done is predicted a relationship between the number of claims and the total amount paid. For example, what is the total amount expected to be paid when the number of claims is 80 ?

Easy, right ?


Prediction

We don’t have to draw lines like this every time to predict the value of Y for a value of X. You can use the predict ( ) function. Let’s first predict all the original claim values.

pred = predict(model, newdata = data.frame(claims=data$claims))
head(pred)
1388.687430246283284.8571334003758364.37419203997764443.3086072073445156.547428161776214.582428682898 

You don’t have to predict all the original values. Normally, we do a train/test split (which, we will see later) and try to predict the accuracy based on the test dataset. However, we can throw any predictor (claims in this case) value to see how the model predicts the payments for. For example, say we pick 5 different claim values (10,20,30,40,50) and want to find out what our model predicts.

pred = predict(model, newdata = data.frame(claims=c(10,20,30,40,60)))
pred
154.1327213597785288.27095696044223122.4091925611064156.547428161775224.823899363097 

Alright, let’s see how the predicted values compare against the actual values.

What were the original values though ? You can pick them up from the CSV.

original_claim_values = c(65.3,98.1,194.5,119.4,202.4)

Let’s also plot these to compare how well we predicted.

plot(c(10,20,30,40,60), pred ,pch=19, col="blue")
points(c(10,20,30,40,60),original_claim_values, pch=19, col="red"  )

# abline defines the equation as "y = bx + a" as opposed to our definition as "y = ax + b"
# that is why, we are setting a = intercept and b = slope in the equation.
abline(a=intercept, b=slope, col = "black")

Some values are pretty close, some are a bit off – but nevertheless its a good prediction for the amount of time we spent doing it.


Simple Linear Regression

What we have seen so far is an example of implementing Simple Linear Regression in R. What we will see in this section is the math behind it – Give it a try – if it gets tedious, please do a couple of re-reads. It is critical that you understand this section.

How did LinearRegression fit the straight line

The fun starts now. How did the LinearRegression ( ) function fit the straight line ? How did it it arrive at this equation

y=3.4x+19.99

Obviously, there is no one straight line that will pass through all the points in this case.

If you take these 4 data points, we can eyeball a straight line that goes through the middle of it. The ones marked with question marks are visually not a good fit at all. But the question that linear model tries to solve is,

What is the “Optimum” straight line that best describes the relationship between X and Y

This is where statistics comes in.


Let’s zoom in

Let’s simplify and make up some numbers ( for easy calculation) of claims vs payments. Say we have a set of 5 data points for claims vs payments and we wish to fit a linear model that can predict further data. This is a very small data set to do anything practical, but there is a reason why we are doing such a small data set as we will see in the coming sections.

claims 	payments (inthousands
20 	40
40 	60
60 	80
80 	80
100 	90

Let’s plot these on a graph.

If we were asked to eyeball a straight line that best fits these data points, this is how we would do it.

How did we do it ? Our eye is essentially trying to minimize the distances from each of these points to the straight line. The best fitting straight line is one which minimizes the distances for all these points.

Linear regression in machine learning does exactly that – Instead of a human eye, machine learning takes the help of statistics to do this approximation. There are a couple of methods to do this in statistics.

  • Ordinary Least Squares
  • Gradient Descent

Let’s explore the first method here.


Residuals

When we tried to minimize the distance of each of these points from the line we are trying to fit, the distances between the points and the straight line ( on the y axis ) are called residuals.


Sum of Squares

Warning – Geeky Statistics stuff

To arrive at the best fit values for the straight line, statisticians have arrived at the following formula based on the method of least squares. How they arrived at it is pretty geeky and let’s leave that to the statisticians.

This equation sounds scary, but it is not. I am going to prove it to you in a minute down below. There are 2 things in this equation that require an explanation.

  • The weird symbol that looks like a knocked up W . It is used for summation.
  • y with a bar on it ( or x with a bar ). The bar just represents the average. So y with a bar on it represents the average.

Let’s take the same numbers that we have above and try to calculate the formula by hand. Excel would make things easy, but let’s just do it manually, since the numbers are not all that bad.

That was huge – Now we can easily calculate a and b from the Y = a + b X equation. The convention for slope and intercept here is different from what we referred to previously. You can stick to one convention and use it. However, you might see multiple variations of this (like Y = mx + b for example).

Validation

Let’s cross validate this equation.

sample = data.frame( x=c(20,40,60,80,100),y = c(40,60,80,80,90))
sample
x	y
<dbl&gt;	<dbl&gt;
20	40
40	60
60	80
80	80
100	90

Let’s model this data and plot it

model = lm(y~x, data = sample)
plot(sample$x, sample$y, pch=19, col="blue")


Now, let’s use our model to visualize the fit

intercept = coef(model)[1]
slope     = coef(model)[2]

cat ("slope = ", slope, "\n")
cat ("intercept = ", intercept)

slope =  0.6 
intercept =  34

These are the slope and intercept values. We can now use these to plot the fit line.

plot(sample$x,sample$y,pch=19,col="blue")
abline(a=intercept, b=slope, col = "red")

The model object has a whole host of other information that you can use to predict how good of a fit the line is to the data. But first, let’s predict the value in a table.

pred_y = predict (model, newdata = data.frame(x=sample$x))
pred_y
146258370482594 

The differences ( residuals ) are highlighted below.

There are so many other parameters ( like the p-value, r-squared, r-squared adjusted ) and graphs ( Residuals vs fitted , Q-Q Plot etc ) that are used to analyze the performance of a model fit. We are going to get to that in the next section. To set the stage for these parameters, let’s scale up our data set.


Multilinear Regression

So far we have seen one predictor and one response variable. This is also called simple linear regression. Ofcoure, real world is not that simple, right ? There can be multiple predictors. We are going to look at one such example in a classic example – Boston Housing dataset.

It has 13 predictors and 1 response variable. So, the equation for that mathematically would be

the value of n would be 13 in this case. Let’s look at the dataset below.


Boston Housing dataset

Predicting the price of a house is based on many parameters like the size, neighborhood, crime rate, pollution etc. Essentially, there is no mathematical equation that can predict the price of a house based on these parameters – that’s essentially where ML comes into the picture.

In the earlier example, we just had to predict the value of Y given a value of X. There is just 1 predictor ( X ). However, let’s have a look at the Boston Housing data set – tons of variables. Load the data first.

There are a couple of ways to load the data.

Download and load from file system

boston_housing = read.csv("./data/boston_housing.csv")

head(boston_housing,2)
crim	zn	indus	chas	nox	rm	age	dis	rad	tax	ptratio	b	lstat	medv
	<dbl&gt;	<dbl&gt;	<dbl&gt;	<dbl&gt;	<dbl&gt;	<dbl&gt;	<dbl&gt;	<dbl&gt;	<int&gt;	<dbl&gt;	<dbl&gt;	<dbl&gt;	<dbl&gt;	<dbl&gt;
1	0.00632	18	2.31	0	0.538	6.575	65.2	4.0900	1	296	15.3	396.9	4.98	24.0
2	0.02731	0	7.07	0	0.469	6.421	78.9	4.9671	2	242	17.8	396.9	9.14	21.6

The attribute names are a bit cryptic. So, here are the descriptions.

  • CRIM per capita crime rate by town
  • ZN proportion of residential land zoned for lots over 25,000 sq.ft.
  • INDUS proportion of non-retail business acres per town
  • CHAS Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
  • NOX nitric oxides concentration (parts per 10 million)
  • RM average number of rooms per dwelling
  • AGE proportion of owner-occupied units built prior to 1940
  • DIS weighted distances to five Boston employment centres
  • RAD index of accessibility to radial highways
  • TAX full-value property-tax rate per USD – 10,000
  • PTRATIO pupil-teacher ratio by town
  • B 1000(Bk – 0.63)^2 where Bk is the proportion of blacks by town
  • LSTAT % lower status of the population
  • MEDV Median value of owner-occupied homes in USD 1000’s

Load it directly from mlbench library

library(mlbench)
data(BostonHousing)
head(BostonHousing,2)
crim	zn	indus	chas	nox	rm	age	dis	rad	tax	ptratio	b	lstat	medv
	<dbl&gt;	<dbl&gt;	<dbl&gt;	<fct&gt;	<dbl&gt;	<dbl&gt;	<dbl&gt;	<dbl&gt;	<dbl&gt;	<dbl&gt;	<dbl&gt;	<dbl&gt;	<dbl&gt;	<dbl&gt;
1	0.00632	18	2.31	0	0.538	6.575	65.2	4.0900	1	296	15.3	396.9	4.98	24.0
2	0.02731	0	7.07	0	0.469	6.421	78.9	4.9671	2	242	17.8	396.9	9.14	21.6

How large is the dataset ?

dim(boston_housing)
506
14

so, it is 506 rows and 14 columns.

Looking back at our equation,


Response Variable vs Predictor

What is the problem we are trying to solve ? We are actually trying to predict the Median house price based on a variety of parameters – like Crime rate, Nitric Oxide content in the air, age of the house etc. Each of these parameters have a “say” in the price of the house. The target variable is valled the Response Variable and the parameters that have a “say” in determining the target variable are called the Predictors.

In our case, there are 12 predictors ( CRIM, ZN, INDUS…LSTAT) and 1 response variable ( MEDV ). Just for simplicity sake, let’s just pick one parameter – rm – the number of rooms in the house. Let’s see how well we predict the price of the house based on just this one parameter.

model = lm (medv ~ rm, data = boston_housing)
intercept = coef(model)[1]
slope     = coef(model)[2]

cat ("slope = ", slope, "\n")
cat ("intercept = ", intercept)
slope =  9.777871 
intercept =  -38.27585

Let’s plot our predictor vs response variable as a scatter plot and draw the straight line that our model has predicted.

plot(boston_housing$rm, boston_housing$medv, pch=19, col = "blue", xlim=c(4,10)) # try this without setting the range of x values (using xlim)
abline(a=intercept, b=slope, col = "red")


Looks like a decent enough fit. Let’s do another – lstat – lower status population percentage.

Step 1 – Model the data

model = lm (medv ~ lstat, data = boston_housing)
intercept = coef(model)[1]
slope     = coef(model)[2]

cat ("slope = ", slope, "\n")
cat ("intercept = ", intercept)


Step 2 – Determine the slope and intercept from the model

slope =  -1.010507 
intercept =  35.31254

Step 3 – Plot the data and fit line

plot(boston_housing$lstat, boston_housing$medv, pch=19, col = "blue") 
abline(a=intercept, b=slope, col = "red")

This seems accurate enough. However, if you look at both these graphs, the relationship between the predictors and response variable seems a bit different between the predictor LSTAT vs RM. Let’s put these graphs side by side to understand better.

The first picture is lstat vs medv and the second is rm vs medv.

Also, not all variables might be relevant ( irrespective of the direction, decrease or increase ). Let’s take the parameter dis – distance to employment. Once again, if we try to fit this using our steps

# step 1
model = lm (medv ~ dis, data = boston_housing)

# step 2
intercept = coef(model)[1]
slope     = coef(model)[2]

cat ("slope = ", slope, "\n")
cat ("intercept = ", intercept)

# step 3
plot(boston_housing$dis, boston_housing$medv, pch=19, col = "blue",xlim=c(0,10)) 
abline(a=intercept, b=slope, col = "red")

slope =  0.584848 
intercept =  21.38557

Visually, there is not much of a linear relationship between distance and median value ( although we tried to fit it ). How exactly do we measure the relationship ?


Correlation

This is the simplest measure of relation between 2 numeric variables. Luckily, pandas provides a built-in method for calculating correlation – corr ( ). For example,

# use = "pairwise.complete.obs" ensures that NAs are handled 
cor(boston_housing$medv, boston_housing$dis, use = "pairwise.complete.obs")
 0.138798442415068 

Correlation values are calculated to values between 0 and 1. Technically, the values can vary between -1 and +1. 0 being no correlation and 1 being highly correlated ( -1 also signifies a high correlation, just that it is a negative correlation , meaning if the predictor values increases, the response value decreases). In the example above, the relationship between distance and median value is just 13 %. How about others predictors ?

cor(boston_housing$medv, boston_housing$lstat, use = "pairwise.complete.obs")
-0.706255058922179     
cor(boston_housing$rm, boston_housing$medv, use = "pairwise.complete.obs")
0.740180804891272 

This seems in-line with our plots above right ?

Correlation is NOT causation

However strong the correlation is, it does NOT imply causation. This is a bit tricky to understand. For example, look at the picture below.

plot(boston_housing$rm,boston_housing$medv,xlim=c(1,10), pch=19, col="blue")

There seems to be a significant correlation between the number of rooms ( rm ) and the median price value (medv).

Now, imagine another column – Power Consumption . As the number of rooms increase, the power consumption also tends to increase.

Does it mean that Power Consumption has a strong correlation with the house price ? Very unlikely – isn’t it ? Mathematically, correlation is just a tool that signifies the level of correlation between 2 variables. However, it is upto us (or the domain expert) to determine if the correlation is real or ficticious.


Quiz

:The higher the correlation, lower the stronger is the relationship between the variables.
:True
:False
: Correlation is a measure of how strong the relationship is between two variables.
:True
:False
: -0.95 represents a strong positive correlation
:True
:False
: A correlation value of 0 shows that there is a perfect correlation between the variables
:True
:False
: The picture below shows a strong negative correlation
:True
:False
x = c(10,20,30,40,50,60,70,80,90,100)
y = c(100,90,80,70,60,50,40,30,20,10)

plot(x,y, pch=19, col="blue")

p value

While the correlation quantifies the relationship between two variables, it doesn’t tell us if the relationship is statistically significant. The word statistically significant deserves an explanation. If a relationship (between two variables) is NOT caused by chance, then it is statistically significant . That is exactly what p-value answers. Also called probability value, p-value answers the following question –

If a predictor is relevant to predict the response and if yes, how relevant is it ?

We have to understand p-value in the context of Null Hypothesis.

Null Hypothesis

Null hypothesis (denoted as H0 assumes that there is no relationship between the predictor and the response variable. For example, if you look at the relationship between the number of rooms ( rm ) and the median price value of the home ( medv ) in the Boston Housing dataset, Null hypothesis says that there is NO relationship between them.

Well, although we can see a linear relationship visually ( almost ) between those 2 variables, we start off with the Null Hypothesis. It is indicated in statistics as H0

Alternate hypothesis indicates that they are related. It is indicated as H1 . P-value indicates how much the observed data is inconsistent with the Null Hypothesis. This is a bit cryptic, right ? Let’s explore this further.

Let’s just consider 2 random variables that are normally distributed. Since they are random variables, there would be no relationship between them, right ? Let’s check it out.

Dataset 1

A normal distribution of 100 values with a mean of 100 and sd of 20

x = rnorm(n = 100, mean = 100, sd = 20)

Dataset 2

Another normal distribution of 100 values with a mean of 100 and sd of 20

y = rnorm(n = 100, mean = 100, sd = 20)

Let’s plot x vs y and see how it looks.

plot(x,y)

This almost looks like the night sky, isn’t it ? Point being, there is no relationship between x and y as they are random variables – That is pretty understandable right ? Let’s try to calculate a relationship between these two variables ( although there is none ) and see how it compares against another situation where there IS actually a relationship.

# step 1
model = lm (y ~ x)

# step 2
intercept = coef(model)[1]
slope     = coef(model)[2]

cat ("slope = ", slope, "\n")
cat ("intercept = ", intercept)

# step 3
plot(x, y, pch=19, col = "blue") 
abline(a=intercept, b=slope, col = "red")
slope =  0.01513006 
intercept =  98.19433

What is the p-value in this case ? Look at the summary of the linear model. The last column against the predictor (x in this case) is the p-value.

summary(model)
Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max 
-44.483 -14.434  -0.342  11.768  50.386 

Coefficients:
            Estimate Std. Error t value Pr(&gt;|t|)    
(Intercept) 98.19433   10.12278   9.700 5.41e-16 ***
x            0.01513    0.09677   0.156    0.876    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 19.69 on 98 degrees of freedom
Multiple R-squared:  0.0002494, Adjusted R-squared:  -0.009952 
F-statistic: 0.02445 on 1 and 98 DF,  p-value: 0.8761

And the p-value in this case is 0.89.

p-value is 0.89 – that’s 89 %. p-value is always between 0 and 1. 0.89 is a big number right ? Does it indicate that there is a strong relationship between x and y in this case ? On the contrary, a high p-value indicates that the there is NO relationship between x and y – which is what the Null Hypothesis states.

On the other hand, if we calculate the p-value of the relationship between the number of rooms (“RM”) and median price (“MEDV”), then you get a very small value.

model = lm (boston_housing$medv ~ boston_housing$rm)

summary(model)

Call:
lm(formula = boston_housing$medv ~ boston_housing$rm)

Residuals:
    Min      1Q  Median      3Q     Max 
-25.674  -2.488  -0.168   2.411  39.680 

Coefficients:
                  Estimate Std. Error t value Pr(&gt;|t|)    
(Intercept)       -38.2758     2.6708  -14.33   <2e-16 ***
boston_housing$rm   9.7779     0.4187   23.35   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.93 on 450 degrees of freedom
  (54 observations deleted due to missingness)
Multiple R-squared:  0.5479, Adjusted R-squared:  0.5469 
F-statistic: 545.3 on 1 and 450 DF,  p-value: < 2.2e-16


p-value in this case is 2.0 x 10-16. That is an extremely low value ( 0.0000..00248 ) .

What it means is that there is a 0.0000..00248 % chance that the correlation is random.

or in other words, there is a 99.99999 % chance that the data does represent a relationship between the two variables. Whereas in the case of the random variales, there is a 89% chance that the correlation is random


Optimum p-value

A value below which we can safely conclude that the relationship between 2 variables did not happen by chance, but because there is a true causal relationship, is called an optimum p-value.

Is there a fixed p-value below which, we can safely conclude a strong relationship (Alternate Hypothesis) ?

Typically p <= 0.05 is accepted in most cases. However, the actual value at which the business is comfortable is based on a variety of parameters like

  • type of data
  • level of confidence the business requires etc

Key Parameters

r-squared – r2

R-squared is a measure of the explanatory power of the model – How well are the variables able to explain the response variables using the model. A model’s r2 varies between 0 and 1. Applying the summary function on the lm ( ) model should give us the r2 value.

model = lm (boston_housing$medv ~ boston_housing$rm)

summary(model)$r.squared

  0.547867623929491 

In the example above, we tried to model the response variable medv ( median house value ) as a measure of the number of rooms ( rm ) – the r2 value is 0.547 . It is a measure of how well the model explains the relationship. A low value of r2 ( r2 = 0 ) means that the explanatory variables are not able to predict the response variable well

A high value or r2 ( r2= 1 ) means that the explanatory variables are fully able to predict the response – In this case the number of rooms ( rm ) is able to explain the variance in the median house value around 48 %. The remaining 52% variance is unexplained by the explanatory variable.


How is r2 calculated

The formula to calculate r2 is

The denominator Sum of Squares total is the worst possible score. The numerator Sum of Squares residuals is how far the model is performing. So r 2 is essentially a normalized score of how well the model is predicting the data – in comparision to the worst possible score.

What do we mean by worst possible score ? Well, if we know the weights of each of the individuals in the class, without any indicator of whose that weight is ( just a list of weights ), what is our best possible prediction of weight for anybody in the class ? We just do an arithmetic average – right ? And that’s what we use to calculate the Sum of Squares total .

Let’s calculate r2 by hand for our simple dataset.

Let’s verify this in R.

x = c(20,40,60,80,100)
y = c(40,60,80,80,90 )
model = lm (y ~ x)

summary(model)$r.squared
0.9

There you go – our manual calculation verifies with R’s calculation.

Exercise – Calculate the r2 value of of Boston housing dataset for the predictor – NOX ( Level of Nitric Oxide in the air ).


r-squared adjusted

Mathematically, r2 has a peculiar property. Adding more predictors increases the value of r2 . This is not very intuitive to begin with. Let’s try it on our Boston Housing dataset.

In the example above, we have tried to model medv from rm . So, the only explanatory variable is rm ( number of rooms ). Based on that relationship we have an r2 value of 0.54 . What would happen if we add another explanatory variable ? – say lstat ( percentage of lower status of the population ).

# Fit model based on 
# 1. RM - Number of rooms
# 2. LSTAT - Percentage of people with lower status

model = lm (boston_housing$medv ~ boston_housing$rm + boston_housing$lstat)

summary(model)$r.squared
0.652305668932009 

See, the r 2 has increased from 0.48 to 0.63 by including a second predictor. How about a third predictor ?

# Fit model based on 
# 1. RM - Number of rooms
# 2. LSTAT - Percentage of people with lower status
# 3. NOX   - Nitric oxide in the air

model = lm (boston_housing$medv ~ boston_housing$rm + boston_housing$lstat + boston_housing$nox)

summary(model)$r.squared
0.652671556550756 

There is a slight increase in r 2 – from 0.6385 to 0.6389.

r2 value    – Predictors
0.483525 – number of rooms
0.638561 – number of rooms + lower stata population
0.638910 – number of rooms + lower stata population + Nitric Oxide in the air

You can keep adding as many predictors as you want and you can observe that r 2 value always increases with every predictor. That doesn’t seem right. Isn’t it ? Let’s try something. To the boston housing dataset, let’s add a column with just random numbers and see if r 2 still increases. If it does, then we have a problem.

# Generate a column of 506 random numbers
x = rnorm(n = 506, mean = 100, sd = 20)

# and add it to the boston dataset
boston_housing_new = cbind(boston_housing, x)

# what is the new shape ? It should be 13 columns ( as opposed to the original 12 )
dim ( boston_housing_new)

506
15

Now, let’s try the regression with the predictors RM, LSTAT, NOX and the new random variable.

model = lm (boston_housing_new$medv ~ boston_housing_new$rm + boston_housing_new$lstat + boston_housing_new$nox + boston_housing_new$x)

summary(model)$r.squared
0.652671894013658 

r2 value – Predictors
0.483525 – number of rooms
0.638561 – number of rooms + lower stata population
0.638910 – number of rooms + lower stata population + Nitric Oxide in the air
0.639007 – number of rooms + lower stata population + Nitric Oxide in the air + some random variable

This is crazy, right ? Just add any random variable ( which is not supposed to have any predictive power) and still the r 2 increases ? You are probably beginning to doubt the predictive power of r 2 in the first place. Well, it’s not all bad with r 2. Just that every random variable has some predictive power. In order to counter this, there is a new variable called r2 adjusted.

where

  • n = sample size
  • p = number of predictors

Essentially, the new parameter r2 adjusted works by penalizing more parameters. That is why p – the number of predictors is in the denominator.

What it goes to show is that, adding more predictors does not necessarily increase the explanatory power of the model. r2 adjusted accommodates for this by incorporating a penalty for the number of predictors ( more the predictors, lesser the r2 adjusted ).

When you add a predictor it should add significant value. If it is not ( For example, adding NOX – Nitric Oxide as a predictor ) r2-adjusted tells you that it is a worthless predictor and you better get rid of it.

summary(model)$adj.r.squared
    0.649563812528321 
summary(model)$r.squared

 0.652671894013658 

Exercise

As an exercise, please model a multi-linear regression of Boston Housing dataset using

  • Scenario 1 – number of rooms + lower stata population + Nitric Oxide in the air
  • Scenario 2 – number of rooms + lower stata population + Nitric Oxide in the air + random variable

Calculate r2 and r2 adjusted for both the scenarios. Did r2 adjusted decrease or increase in scenario 2 ?


RMSE – Root Mean Square error

While r2 is a relative measure of fit, RMSE is an absolute measure of fit. Lower values indicate a better fit and higher values not. Calculating RMSE is quite simple – it is quite similar to the standard deviation (the square root of variance). While Standard Deviation ( sd ) is a measure of how far the distribution is from the mean, RMSE is a measure of how far the fit is from the actuals.

So, to calculate the RMSE, you just have to borrow the residuals from the model and do a couple of mathematical operations ( quite similar to how you do them to the differences from the mean in the case of mean )

sqrt( mean ( summary(model)$residuals ^2 )  )
5.18556387861621 

Now, you might have a question here. When you already have the RMSE to measure the fit, why do you need another metric – r2 ? Well, you can’t measure the growth rate of an elephant and a horse in absolute terms. RMSE is in the magnitute of the actual response variable, while r2 is on a uniform scale ( 0 to 1 ).


Feature Selection

In the Boston Housing dataset example, there are 13 predictors and 1 response variable. And the response variable is MEDV – Median house value.

Previously, we have seen some examples of predicting the house value based on a random set of predictors –

  • RM – Number of rooms
  • LSTAT – Percentage of people with lower status
  • NOX – Nitric oxide content etc.

However, we know that not all these variables have an equal say. So, how do we know which of the predictors have the most impact in predicting the house price ?

The process of identifying the features ( predictors ) that have the most predictive power is called Feature Selection

This process is called Stepwise Regression. Let’s explore it in the next section.


Stepwise Regression

In stepwise regression, we select a parameter to determine the fit. Based on how well the model is being fit, we add or keep removing predictors ( features ) until we come to a point where we can longer improve the model based on the selected parameter.

What kind of parameters can be used to evaluate the model ? There are many choices like

  • p-value
  • r-squared
  • r-squared (adjusted )
  • F-tests or T-tests etc

For now, since we are already aware of p-value let’s choose it as our selection criteria parameter.

model = lm ( medv ~ . , data= boston_housing)
summary(model)
Call:
lm(formula = medv ~ ., data = boston_housing)

Residuals:
     Min       1Q   Median       3Q      Max 
-19.3991  -2.5377  -0.5848   1.7101  27.9916 

Coefficients:
              Estimate Std. Error t value Pr(&gt;|t|)    
(Intercept)  20.866365   5.477190   3.810 0.000159 ***
crim         -0.238025   0.223853  -1.063 0.288229    
zn            0.038221   0.013372   2.858 0.004462 ** 
indus         0.051356   0.059643   0.861 0.389683    
chas          2.435048   0.829794   2.935 0.003516 ** 
nox         -11.657986   3.928009  -2.968 0.003163 ** 
rm            5.110158   0.454838  11.235  < 2e-16 ***
age          -0.006094   0.013101  -0.465 0.642031    
dis          -1.271514   0.195766  -6.495 2.26e-10 ***
rad           0.294444   0.085387   3.448 0.000619 ***
tax          -0.011360   0.003612  -3.145 0.001776 ** 
ptratio      -0.831030   0.127000  -6.544 1.68e-10 ***
b             0.012314   0.003513   3.505 0.000503 ***
lstat        -0.520753   0.057428  -9.068  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.553 on 438 degrees of freedom
  (54 observations deleted due to missingness)
Multiple R-squared:  0.7405, Adjusted R-squared:  0.7328 
F-statistic: 96.16 on 13 and 438 DF,  p-value: < 2.2e-16

Since we have to chosen p-value as our key assesment criteria, we are interested in this part of the table.

Irrespective of the selection criteria, there are 2 basic methods in stepwise regression.

  • Backward Elimination
  • Forward Selection

Backward Elimination

As we see from the table above, not all predictors have the same p-value. So, in backward elimination, we start by eliminating the predictor with the worst parameter value (p-value in this case) and re-evaluate the model again.

# Eliminate INDUS predictor. 
X = boston_housing[,c(1,2,4,5,6,7,8,9,10,11,12,13,14)]

model = lm(medv ~ . , data = X)
summary(model)

Call:
lm(formula = medv ~ ., data = X)

Residuals:
     Min       1Q   Median       3Q      Max 
-19.3513  -2.5184  -0.5814   1.6283  28.0395 

Coefficients:
              Estimate Std. Error t value Pr(&gt;|t|)    
(Intercept)  20.669661   5.470812   3.778 0.000180 ***
crim         -0.233426   0.223724  -1.043 0.297353    
zn            0.037131   0.013308   2.790 0.005498 ** 
chas          2.508402   0.825166   3.040 0.002508 ** 
nox         -10.830319   3.807459  -2.845 0.004656 ** 
rm            5.075414   0.452912  11.206  < 2e-16 ***
age          -0.006238   0.013096  -0.476 0.634071    
dis          -1.307708   0.191144  -6.841 2.64e-11 ***
rad           0.278620   0.083362   3.342 0.000902 ***
tax          -0.010031   0.003265  -3.072 0.002258 ** 
ptratio      -0.817289   0.125956  -6.489 2.34e-10 ***
b             0.012233   0.003511   3.485 0.000542 ***
lstat        -0.515724   0.057113  -9.030  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.552 on 439 degrees of freedom
  (54 observations deleted due to missingness)
Multiple R-squared:  0.7401, Adjusted R-squared:  0.733 
F-statistic: 104.2 on 12 and 439 DF,  p-value: < 2.2e-16

Eliminate Crime ( crim) next.

Zone (zn) is the next parameter that has a relatively higher value.

# Eliminate CRIM predictor. 
X = boston_housing[,c(2,4,5,6,7,8,9,10,11,12,13,14)]

model = lm(medv ~ . , data = X)
summary(model)

Call:
lm(formula = medv ~ ., data = X)

Residuals:
     Min       1Q   Median       3Q      Max 
-18.7560  -2.4916  -0.5937   1.6188  28.1229 

Coefficients:
              Estimate Std. Error t value Pr(&gt;|t|)    
(Intercept)  21.327935   5.434858   3.924 0.000101 ***
zn            0.035475   0.013214   2.685 0.007535 ** 
chas          2.544654   0.824517   3.086 0.002155 ** 
nox         -11.589297   3.737700  -3.101 0.002055 ** 
rm            5.043166   0.451901  11.160  < 2e-16 ***
age          -0.006310   0.013097  -0.482 0.630189    
dis          -1.297433   0.190909  -6.796 3.51e-11 ***
rad           0.220041   0.061626   3.571 0.000395 ***
tax          -0.010079   0.003265  -3.087 0.002152 ** 
ptratio      -0.815266   0.125954  -6.473 2.57e-10 ***
b             0.012720   0.003480   3.655 0.000288 ***
lstat        -0.527443   0.056004  -9.418  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.552 on 440 degrees of freedom
  (54 observations deleted due to missingness)
Multiple R-squared:  0.7394, Adjusted R-squared:  0.7329 
F-statistic: 113.5 on 11 and 440 DF,  p-value: < 2.2e-16

Next, we eliminate age.

# Eliminate CRIM predictor. 
X = boston_housing[,c(2,4,5,6,8,9,10,11,12,13,14)]

model = lm(medv ~ . , data = X)
summary(model)

Call:
lm(formula = medv ~ ., data = X)

Residuals:
     Min       1Q   Median       3Q      Max 
-18.6405  -2.5670  -0.5505   1.6077  27.8556 

Coefficients:
              Estimate Std. Error t value Pr(&gt;|t|)    
(Intercept)  21.679282   5.381020   4.029 6.60e-05 ***
zn            0.036260   0.013102   2.768 0.005885 ** 
chas          2.525351   0.822826   3.069 0.002279 ** 
nox         -12.097475   3.582668  -3.377 0.000799 ***
rm            4.988713   0.437159  11.412  < 2e-16 ***
dis          -1.271884   0.183237  -6.941 1.40e-11 ***
rad           0.222320   0.061391   3.621 0.000327 ***
tax          -0.010122   0.003261  -3.104 0.002034 ** 
ptratio      -0.820426   0.125389  -6.543 1.68e-10 ***
b             0.012598   0.003468   3.633 0.000313 ***
lstat        -0.537831   0.051641 -10.415  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.548 on 441 degrees of freedom
  (54 observations deleted due to missingness)
Multiple R-squared:  0.7393, Adjusted R-squared:  0.7334 
F-statistic: 125.1 on 10 and 441 DF,  p-value: < 2.2e-16

At this point, almost all of the predictors have a significantly lower enough ( < 0.05 ) p-value. If you want to go further, probably you will end up with a set of predictors like this.

Stepwise regression with backward elimination (based on r2) is just one of the methods. There are many other methods that we will be learning as we progress through the course.


Accuracy of the model

We have seen enough statistics for now. But we haven’t actually predicted anything yet, right ? Using the 5 parameters that we have narrowed down to, let’s start predicting median house prices and see how accurate our model is.

In order to predict data, we first need to train the model ( which we have done already ). However, instead of training the model on the entire dataset, we split the data into training and test data sets. This process is also called Train/Test split. And as you might have guessed already, doing this R is pretty easy.

Training and Test datasets

How many rows of data do we have in Boston Housing dataset ?

dim(boston_housing)
506
14

506 rows. That’s all the data we have. Now, let’s split this data into training and test datasets. What is the ratio of the split ?

Typically a 80-20 % split should do. You can also do a 75-25 % split. The exact percentage splits would probably be based mostly on the accuracy of the data and a bit of trail and error.

Essentially, this what we are trying to do.

Use R’s sample ( ) function to sample the indices of 80% of the dataset size.

# Split the data to train and test

# 1. Get the number of rows 
rows = nrow(boston_housing)

# 2. sample out 80% of the numbers between 1 and the number of rows
index = sample(1:rows, rows*0.8)

# 3. get the training data using the indices that correspond to 80% of the dataset
train = boston_housing[index,]

# 4. The remaining part of the data set becomes the test dataset. 
test = boston_housing[-index,]

Now, that we know how to do the split, let’s do it on just the predictors that we need.

# Fit model based on 
# 1. RM - Number of rooms
# 2. LSTAT - Percentage of people with lower status
# 3. DIS   -  weighted distances to five Boston employment centres
# 4. PRATIO - pupil-teacher ratio by town 
# 5. B - 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town

model = lm ( medv ~ rm + lstat + dis + ptratio +  b, data = boston_housing)


Now that we have trained the model, let’s test it.

pred = predict ( model, newdata = test )

Now, let’s find out how accurate our predictions are. A quick and dirty way to visually see this would be to plot the predicted vs actuals on a scatterplot.

plot ( pred, test$medv,
       xlab = "Predictions",
       ylab = "Actuals",
       xlim=c(0,50))  # use the xlimit to avoid some wild predictions

If the prediction were 100% correct, we would get a perfect 450 line. Let’s draw that as a baseline to compare the prediction’s performance.

plot ( pred, test$medv,
       xlab = "Predictions",
       ylab = "Actuals",
       xlim=c(0,50),
       ylim=c(0,50))  # use the xlimit to avoid some wild predictions

abline(0,1)


That’s not a bad fit. Let’s calculate the r2 to have a numeric estimate of how good of a fit we have.

summary(model)$r.squared
0.714425296126006 

That’s an r2 of 0.7. So, we can say that we are 70% accurate in our prediction.


Polynomial Regression

So far, we have seen examples of Linear Regression ( both simple and multi). However, not all types of data can be fit into a Linear Regression. For example, population growth is a good example of a non-linear data. Probably a better word for it is exponential growth. Let’s take a simple example – The population of India and how it is projected to grow in this century. ( This is real data taken from United Nations at https://population.un.org/wpp/Download/Standard/Population/ )

india = read.csv("./data/india_population.csv")
head(india)
year	population
<int&gt;	<int&gt;
1	1950	376325200
2	1951	382245303
3	1952	388538620
4	1953	395160091
5	1954	402077026
6	1955	409269055

Let’s plot this 150 years of data ( since 1950 until 2015 and projected data until 2100)

plot(india$year, india$population,
     xlab = "year",
     ylab = "Population in Billions",
     type = "n")  # does not produce any points or lines
     
lines(india$year, india$population, type="o")

Can you imagine trying to fit this data using Linear Regression ? Well, it is going to be a huge approximation after – say 2040 where the linearity ends. Let’s try it to see how well it fits.

# build the model
model = lm ( india$population ~ india$year, data = india)

# scatter plot
plot(india$year,india$population)

#get slope and intercept
intercept = coef(model)[1]
slope     = coef(model)[2]

# fit line.
abline(a=intercept, b=slope, col = "red")

Well, it does an OK job given that the model is linear, but can we account for the curve somehow ?

This is exactly where polynomials come in. sklearn has built-in methods for this – PolynomialFeature . Let’s use it to fit the data.

# build the model
model = lm ( india$population ~ india$year + I(india$year ^ 2), data = india)

# fit line.
pred = predict(model, newdata = india)

# scatter plot
plot(india$year,india$population, pch=19, col= "blue")
points (india$year, pred, pch=19, col= "red")
summary(model)
Call:
lm(formula = india$population ~ india$year + I(india$year^2), 
    data = india)

Residuals:
       Min         1Q     Median         3Q        Max 
-101455469  -56743742   -2325066   54151912  214380534 

Coefficients:
                  Estimate Std. Error t value Pr(&gt;|t|)    
(Intercept)     -4.149e+11  1.331e+10  -31.18   <2e-16 ***
india$year       4.017e+08  1.315e+07   30.55   <2e-16 ***
I(india$year^2) -9.685e+04  3.246e+03  -29.84   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 67780000 on 148 degrees of freedom
Multiple R-squared:  0.9777, Adjusted R-squared:  0.9774 
F-statistic:  3242 on 2 and 148 DF,  p-value: < 2.2e-16


What is really happening here ? A single column ( year ) has now become 3 columns.

We can put both these models on the same plot and you can see how well the prediction hugs the data.

# build the models
model_lin    = lm ( india$population ~ india$year )
model_poly_2 = lm ( india$population ~ india$year + I(india$year ^ 2))

# fit line.
pred_lin    = predict(model_lin, newdata = india)
pred_poly_2 = predict(model_poly_2, newdata = india) 

# scatter plot
plot(india$year,india$population, pch=19, col= "blue")
points (india$year, pred_lin, pch=19, col= "red")
points (india$year, pred_poly_2, pch = 19, col = "green")

#put a legend as well. 
legend("topleft", legend=c("data", "Linear Model", "Polynomial model - 2"),
       col=c("blue", "red", "green"), lty=1)

Visually, you can see the green line (polynomial fit) seems to follow the blue line(data) much better than the red line(linear fit), right ? How about we go one step further and see if increasing the polynomial degree would make this any better.

# build the models
model_lin    = lm ( india$population ~ india$year )
model_poly_2 = lm ( india$population ~ india$year + I(india$year ^ 2))
model_poly_3 = lm ( india$population ~ india$year + I(india$year ^ 2) + I(india$year ^ 3))

# fit line.
pred_lin    = predict(model_lin, newdata = india)
pred_poly_2 = predict(model_poly_2, newdata = india) 
pred_poly_3 = predict(model_poly_3, newdata = india) 

# scatter plot
plot(india$year,india$population, pch=19, col= "blue")
points (india$year, pred_lin, pch=19, col= "red")
points (india$year, pred_poly_2, pch = 19, col = "green")
points (india$year, pred_poly_3, pch = 19, col = "orange")

#put a legend as well. 
legend("topleft", legend=c("data", "Linear Model", "Polynomial model - 2", "Polynomial model - 3"),
       col=c("blue", "red", "green", "orange"), lty=1)


Essentiall, we are doing this now.

The orange line (polynomial fit, degree = 3) is hugging the actual data curve (blue) much closer than the green line ( polynomial fit , degree = 2 ), right ? Let’s do the numbers ( r2 score ) as well, to really know the degree of fit.

cat ( summary(model_lin)$r.squared, " --&gt; R squared - Linear fit", "\n")
cat ( summary(model_poly_2)$r.squared, " --&gt; R squared - polynomial fit ( degree = 2)", "\n")
cat ( summary(model_poly_3)$r.squared, " --&gt; R squared - polynomial fit ( degree = 3)", "\n")

0.8434444  --&gt; R squared - Linear fit 
0.9776831  --&gt; R squared - polynomial fit ( degree = 2) 
0.9953978  --&gt; R squared - polynomial fit ( degree = 3) 

Challenge

Use polynomial regression to fit the following parameters on the boston Housing dataset.

  • LSTAT ( Lower Status Population percentage )
  • MEDV ( Median house value )

After modeling ,

  • Plot both linear and polynomial model ( degree = 2 ) to visually show how they perform
  • Compare the r2 values between linear and polynomial model
  • Verify if increasing the degree of the polynomial regression ( say degree = 3 or 4 ) increases the performance.

Solution

# build the models
model_lin    = lm ( medv ~ lstat , data = boston_housing)
model_poly_2 = lm ( medv ~ lstat + I(lstat ^ 2), data = boston_housing)
model_poly_3 = lm ( medv ~ lstat + I(lstat ^ 2) + I(lstat ^ 3), data = boston_housing)

# fit line.
pred_lin    = predict(model_lin, newdata = boston_housing)
pred_poly_2 = predict(model_poly_2, newdata = boston_housing) 
pred_poly_3 = predict(model_poly_3, newdata = boston_housing) 

# scatter plot
lstat = boston_housing$lstat
medv  = boston_housing$medv

plot   (lstat,medv, pch=19, col= "blue")
points (lstat, pred_lin,  , pch=19, col= "red")
points (lstat, pred_poly_2, pch = 19, col = "green")
points (lstat, pred_poly_3, pch = 19, col = "orange")

#put a legend as well. 
legend("topleft", legend=c("data", "Linear Model", "Polynomial model - 2", "Polynomial model - 3"),
       col=c("blue", "red", "green", "orange"), lty=1)

0.544 – r2 score with linear modeling
0.640 – r2 score with polynomial modeling ( degree = 2 )
0.657 – r2 score with polynomial modeling ( degree = 3 )

That’s just a marginal improvement over second degree polynomial regression. So, you don’t need to go beyond second degree.

How about a 10th degree fit ?

# build the models
model_lin    = lm ( medv ~ lstat , data = boston_housing)
model_poly_2 = lm ( medv ~ lstat + I(lstat ^ 2), data = boston_housing)
model_poly_3 = lm ( medv ~ lstat + I(lstat ^ 2) + I(lstat ^ 3), data = boston_housing)
# notice how we are specifying the polynomial fit
model_poly_10 = lm ( medv ~ poly(lstat,10), data = boston_housing)

# fit line.
pred_lin    = predict(model_lin, newdata = boston_housing)
pred_poly_2 = predict(model_poly_2, newdata = boston_housing) 
pred_poly_3 = predict(model_poly_3, newdata = boston_housing) 
pred_poly_10 = predict(model_poly_3, newdata = boston_housing) 

# scatter plot
lstat = boston_housing$lstat
medv  = boston_housing$medv

plot   (lstat,medv, pch=19, col= "blue")
points (lstat, pred_lin,  , pch=19, col= "red")
points (lstat, pred_poly_2, pch = 19, col = "green")
points (lstat, pred_poly_3, pch = 19, col = "orange")
points (lstat, pred_poly_10, pch = 19, col = "cyan")

#put a legend as well. 
legend("topleft", legend=c("data", "Linear Model", "Polynomial model - 2", "Polynomial model - 3"),
       col=c("blue", "red", "green", "orange"), lty=1)


0.544 – r2 score with linear modeling
0.640 – r2 score with polynomial modeling ( degree = 2 )
0.657 – r2 score with polynomial modeling ( degree = 3 )
0.684 – r2 score with polynomial modeling ( degree = 10 )

See, even if you go to a 10th degree polynomial fitting, the improvement in r2 is just about 0.04 from a second degree polynomial fitting.


Overfitting

Sometimes, where there is not enough data, the model might tend to overfit. Look at the example data below. We are simulating a sin wave.

x = seq( from = 10, to = 100, by=5)
y = sin(x*pi/180)


Let’s plot it to see how it it looks

plot(x,y, pch = 19, col = "blue")

Let’s introduce a bit of variance to make the data a bit realistic.

variance = runif(length(x), min=0.1, max=0.3)
y = y + variance

plot(x,y, pch = 19, col = "blue")

Now, let’s see how a 2nd degree polynomial regression fits.

# build the models
model_poly_2 = lm ( y ~ x + I(x ^ 2))

# fit line.
pred_poly_2 = predict(model_poly_2, newdata = data.frame(x,y)) 

# scatter plot
plot   (x,y, pch=19, col= "blue")
lines (x, pred_poly_2, pch = 19, col = "green", lty=1, lwd = 3)

#put a legend as well. 
legend("topleft", legend=c("data", "Polynomial model - 2"),
       col=c("blue", "green"), lty=1)

This looks to be a good fit, right ? Now, let’s try a higher order polynomial fit ( say degree = 10 )

# build the models
model_poly_2 = lm ( y ~ x + I(x ^ 2))
model_poly_10 = lm ( y ~ poly(x,10))

# fit line.
pred_poly_2 = predict(model_poly_2, newdata = data.frame(x,y)) 
pred_poly_10 = predict(model_poly_10, newdata = data.frame(x,y)) 

# scatter plot
plot   (x,y, pch=19, col= "blue")
lines (x, pred_poly_2, pch = 19, col = "green", lty=1, lwd = 3)
lines (x, pred_poly_10, pch = 19, col = "red", lty=1, lwd = 3)

#put a legend as well. 
legend("topleft", legend=c("data", "Polynomial model - 2", "Polynomial model - 10"),
       col=c("blue", "green", "red"), lty=1)

Overfitting typically happens when the model is trying to work too hard for the data. And why is it a problem ? Overfitting tries to fit the data too much and hence will not work well for new datasets. Think of overfitting as localizing the solution for the test datasets – it is more or less memorizing the data, not generalizing a solution for the dataset. Obviously, it will not work as well when model is used on real data set. We will see more examples of these when we see other machine learning algorithms down the line.

If you are wondering why the simple linear regression is able to learn the model just enough, but the higher degree polynomial regression is over learning it, that is because the higher order polynomial regression has the flexibility to learn more ( as compared to a linear or second order polynomial regression ). This is actually, good, except that it is not able to discern noise from data. Let’s increase the dataset size to see if the same 15 degree polynomial regression peforms better than a second order.

Increase sample size to 1000

x = seq( from = 10, to = 100, by=0.1)
y = sin(x*pi/180)

variance = runif(length(x), min=0.1, max=0.3)
y = y + variance

plot(x,y, pch = 19, col = "blue")
# build the models
model_poly_2 = lm ( y ~ x + I(x ^ 2))
model_poly_15 = lm ( y ~ poly(x,15))

# fit line.
pred_poly_2 = predict(model_poly_2, newdata = data.frame(x,y)) 
pred_poly_15 = predict(model_poly_10, newdata = data.frame(x,y)) 

# scatter plot
plot   (x,y, pch=19, col= "blue")
lines (x, pred_poly_2, pch = 19, col = "green", lty=1, lwd = 3)
lines (x, pred_poly_15, pch = 19, col = "red", lty=1, lwd = 3)

#put a legend as well. 
legend("topleft", legend=c("data", "Polynomial model - 2", "Polynomial model - 15"),
       col=c("blue", "green", "red"), lty=1)


You see, data size did not matter, still the 15th order polynomial regression still overfits the data. The reason is that for the amount of data and noise, a second or 3rd degree polynomial has enough power to capture the complexity of this data. Probably for more complicated data sets, an increase in degree might capture the complexity better.


Detect Overfitting

If you look at the pictures above, we were able to clearly see an overfit. This is because it is a 2 dimensional dataset. Most data is multi-dimensional in real life – in which case we cannot be able to identify an overfit, but just looking at a plot. There are basically 2 methods to identify an overfit.

1. Chose a simpler model.

Look at a measure of the score ( r2) – if there is not a significant difference, then better go for a simpler model.

x = seq( from = 10, to = 100, by=0.1)
y = sin(x*pi/180)

variance = runif(length(x), min=0.1, max=0.3)
y = y + variance

# build the models
model_poly_2 = lm ( y ~ x + I(x ^ 2))
model_poly_15 = lm ( y ~ poly(x,15))

# fit line.
pred_poly_2 = predict(model_poly_2, newdata = data.frame(x,y)) 
pred_poly_15 = predict(model_poly_10, newdata = data.frame(x,y)) 

# print out r-squared
print ( summary(model_poly_2)$r.squared)
print ( summary(model_poly_15)$r.squared)
[1] 0.9485446
[1] 0.9504134
r2 score	2nd Degree Polynomial	15th Degree Polynomial
0.948	0.950

The r2 score of the 2nd degree fit is almost as better as the 15th degree polynomial model. So, the simpler model ( 2nd degree model) wins.


2. Check the model performance across Training and Test datasets.

Another method to identify an overfit is by validating how well the model is performing against the training dataset vs the test dataset.

# Split the data to train and test

# 1. Get the number of rows 
rows = nrow(boston_housing)

# 2. sample out 80% of the numbers between 1 and the number of rows
index = sample(1:rows, rows*0.8)

# 3. get the training data using the indices that correspond to 80% of the dataset
train = boston_housing[index,]

# 4. The remaining part of the data set becomes the test dataset. 
test = boston_housing[-index,]

Fit a 2nd degree and 15th degree polynomial regression

# build the models
model_lin     = lm ( medv ~ lstat , data = train)
model_poly_2  = lm ( medv ~ lstat + I(lstat ^ 2), data = train)
model_poly_15 = lm ( medv ~ poly(lstat,15), data = train)

Compare the accuracy scores across both the scenarios (2nd degree and 15th degree polynomial regression) by predicting the test data set based on modeling the training dataset.

mape_2 = mean ( abs ( ( test$medv - predict(model_poly_2, test) ) / test$medv * 100 ) , na.rm = TRUE)
mape_15 = mean ( abs ( ( test$medv - predict(model_poly_15, test) ) / test$medv * 100 ) , na.rm = TRUE)
cat ( mape_2, mape_15)
17.64013 15.44444

As you can see the r-square score has decreased (slightly though ) for the 15th degree polynomial. That goes to show that the 15th degree polynomial is definitely overfitting.


Linear Regression Assumptions

Homoskedasticity

In plain english, what this means is that, the variance between the predicted and the actual values should be uniform across the spectrum.

Homoskedasticity = Uniform Variance

Let’s take the relationship between the number of rooms (“RM”) and the median price value (“MEDV”) and try to fit a linear regression model.

model_lin     = lm ( medv ~ rm , data = boston_housing)

Let’s try to predict the house price now that our model is ready.

y_pred = predict ( model_lin, newdata = boston_housing)

Now that we have predicted the housing prices, let’s see how they compare to the actual house price. The difference between the predicted and actual values is called residuals.

residuals_rm = y_pred - boston_housing$medv

Let’s plot the residuals vs the predicted values.

plot ( y_pred, residuals_rm, 
       xlim  = c(1,50),
       xlab  = "Predicted Median House Price",
       ylab  = "Residuals",
       main  = "Fitted vs Residuals")


What this plot goes to show is that the variance ( residuals ) is not uniform across the spectrum of house prices. For example, the variance is high between 10K and 35K. But that’s probably because the bulk of the data lies there. You can strip away outliers and restrict our dataset to between those values and try again.

plot ( y_pred, residuals_rm, 
       xlim  = c(1,50),
       ylim  = c(-10,10),
       xlab  = "Predicted Median House Price",
       ylab  = "Residuals",
       main  = "Fitted vs Residuals")

This actually looks ok. However, the ideal scenario is what is called the night sky – like a simulated picture below.

# Generate 10K random numbers between 0 and 100.
x = runif(100, min = 0, max = 100)
y = runif(100, min = 0, max = 100)
plot(x,y)

What homoskedasticity does is ensure that the prediciteve capacity is uniform across all the entire range of the predictors.

This is the ideal plot that confirms homeskedasticity – Also called the Night sky. If you look at the variance(residuals), they are uniformly distributed across the entire spectrum of house prices. What this means is that we can confidently predict the house price across the entire spectrum with the same level of accuracy.

Now, compare it to the previous Fitted vs Residuals plot to understand it better.


Normal Distribution of Residuals

Another assumption is that the residuals should be normally distributed. You can quickly verify that using a histogram,

hist(residuals_rm,breaks=20)

This looks almost normal, but a quick quantitative way to compare how much a distribution adheres to a normal distribution is by using a Q-Q plot. Or using a Q-Q plot. Q-Q plot is an acronym for Quantile-Quantile plot. The way it is calculated is by comparing the quantiles of the sample against the quantiles of a normal distribution.

# First sort the residuals ( sample )
residuals_rm_s = sort(residuals_rm)

# How big is the dataset ?
length(residuals_rm_s)

# create a normal distribution as big as the residuals dataset
norm=rnorm(length(residuals_rm_s), mean = 10, sd= 3)

# and sort it
norm_s = sort(norm)

# Scatter plot of normal distribution vs residuals
plot(norm_s,residuals_rm_s,
     main = "Quantile - Quantile plot ( Q-Q )",
     xlab = "Normal Distribution",
     ylab = "Residuals")
452

What this plot goes to show is that the residuals are almost normally distributed except at the fringes. So, if you can identify the outliers and trim them, this dataset satisfies the Normal Distribution of Residuals assumption of Linear Regression.


No Multicollinearity

Another big word – and the explanation is a bit tricky as well.

Multicollinearity – If we are trying to predict z from predictors x and y (z ~ x + y), then x and y should have orthogonal predictive power. Meaning, each of them should have independent predictive power. However, if x and y are somehow related, then we have a situation called multicollinearity. The predictors should not be able to predict each other.

Let’s create a dummy dataset to understand this better.

x – A normal distribution y – A slight variation of (with a bit of random variation added to x )

z – Another variation of x + y (with a bit of random variation added to x) so that we simulate a relationship between x, y and z

var = rnorm ( 100, mean = 5, sd= 2)

x = rnorm(100, mean = 100, sd = 10)
y = x +var

var = rnorm ( 100, mean = 5, sd= 2)
z = x + var +y

X = data.frame(x,y,z)

model = lm ( z ~ ., data = X )
summary(model)

Call:
lm(formula = z ~ ., data = X)

Residuals:
    Min      1Q  Median      3Q     Max 
-4.9027 -1.2531 -0.0404  1.2897  4.6848 

Coefficients:
            Estimate Std. Error t value Pr(&gt;|t|)    
(Intercept)  9.05493    1.99755   4.533 1.66e-05 ***
x            0.94709    0.09944   9.524 1.43e-15 ***
y            1.01445    0.09687  10.472  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.043 on 97 degrees of freedom
Multiple R-squared:  0.9907, Adjusted R-squared:  0.9905 
F-statistic:  5176 on 2 and 97 DF,  p-value: < 2.2e-16

As we see, we have a pretty good r2 because we have modelled the data almost as z = x + y + a bit of variance. Now, let’s calculate the VIF of the model.

car::vif(model)
x26.1425422526061y26.1425422526061 

These are basically very high values. Any value above 5 is pretty high in terms of collinearity. What this shows is that the predictors x and y are basically highly dependent on each other (as opposed to their capacity to predict the response variable).

Collinearity is predicted using a parameter called VIF, the formula for which is

Point being, as long as there is a relationship between the variables, it is displayed with a VIF or Variance Inflation Factor > 2. Let’s tae another example.

Collinearity and correlation are different though. Look at the picture below. There is some correlation between

  • Total Power Consumption <–> Number of rooms
  • Total Power Consumption <–> Power consumption per room

But there is a perfect collinearity between

  • Total Power Consumption <–> Number of rooms & Power consumption/room

Correlation is between 2 variables, while collinearity is between a single variable and a combination of variables.

Now that we understand collinearity, let’s find out how many variables have a collinear relationship in the Boston Housing dataset.

model = lm (  medv ~ ., data = boston_housing )
summary(model)
Call:
lm(formula = medv ~ ., data = boston_housing)

Residuals:
     Min       1Q   Median       3Q      Max 
-19.3991  -2.5377  -0.5848   1.7101  27.9916 

Coefficients:
              Estimate Std. Error t value Pr(&gt;|t|)    
(Intercept)  20.866365   5.477190   3.810 0.000159 ***
crim         -0.238025   0.223853  -1.063 0.288229    
zn            0.038221   0.013372   2.858 0.004462 ** 
indus         0.051356   0.059643   0.861 0.389683    
chas          2.435048   0.829794   2.935 0.003516 ** 
nox         -11.657986   3.928009  -2.968 0.003163 ** 
rm            5.110158   0.454838  11.235  < 2e-16 ***
age          -0.006094   0.013101  -0.465 0.642031    
dis          -1.271514   0.195766  -6.495 2.26e-10 ***
rad           0.294444   0.085387   3.448 0.000619 ***
tax          -0.011360   0.003612  -3.145 0.001776 ** 
ptratio      -0.831030   0.127000  -6.544 1.68e-10 ***
b             0.012314   0.003513   3.505 0.000503 ***
lstat        -0.520753   0.057428  -9.068  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.553 on 438 degrees of freedom
  (54 observations deleted due to missingness)
Multiple R-squared:  0.7405, Adjusted R-squared:  0.7328 
F-statistic: 96.16 on 13 and 438 DF,  p-value: < 2.2e-16
for ( name in names(car::vif(model)) ) {
    cat ( name ,car::vif(model)[name], "\n")
}

crim 6.791148 
zn 2.30185 
indus 3.575504 
chas 1.07249 
nox 4.348241 
rm 2.001147 
age 2.95385 
dis 3.643646 
rad 9.025998 
tax 6.501212 
ptratio 1.698408 
b 1.261811 
lstat 2.719365 

So, all of the following variables have a high collinearity.

  • 9 – VIF for RAD
  • 6.5 – VIF for TAX

How to fix Collinearity?

  • Iteratively remove the predictor with the highest VIF. So, in this case, remove either TAX or PTRATIO and do a re-run.
  • Use dimentionality reduction ( say PCA ) to reduce the number of predictors ( More on PCA & dimensionality reduction later)
model = lm (  medv ~ crim + zn + indus + chas + nox + rm + age + dis + tax + ptratio + b + lstat, data = boston_housing )
for ( name in names(car::vif(model)) ) {
    cat ( name ,car::vif(model)[name], "\n")
}
crim 3.808979 
zn 2.237287 
indus 3.409872 
chas 1.058646 
nox 4.347899 
rm 1.984223 
age 2.942636 
dis 3.638249 
tax 4.254584 
ptratio 1.651156 
b 1.261027 
lstat 2.683693 

Now, all the variables have a VIF of less than 5 ( Less than 2 or 3 is ideal though )

Python Regular Expressions

Python Regular Expressions


  Data visualization

Regular expressions are a very important tool for a data scientist or a machine learning engineer. Regular expressions is dry and boring topic to learn. But the problems Regular Expressions solve are very real and interesting. So, we will learn Regular Expressions with a problem solving approach. We will define a series of small problems, solve them step by step and with each problem, we will learn some of the aspects of Regular Expressions.

If you are not comfortable with this kind of non-linear approach, this course might not be for you.

Extract phone numbers v1

Problem – Extract all the phone numbers from this text.

numbers = '''There are 3 phone numbers that you need to call in case of medical emergency. 
For casualty, call 408-202-2222. For elderly emergencies, call 408-203-2222 and
for everything else call 408-202-4444
'''


Let’s take a much simpler case – just a list of 3 phone numbers and no text. If the patter is at the beginning of the string, you can use the match ( ) function. Also, match ( ) function only returns the first occurance.

import re  #import for regular expressions

numbers = '''408-202-2222
             408-203-2222
             408-202-4444'''
match = re.match("408-\d\d\d-\d\d\d\d",numbers)
print ( match )


match ( ) function returns a match object. It contains the span (the start and end of the match) and the actual match itself. Use the group ( ), span( ), start( ) and end( ) functions to get the specifics of the match.

print ( "matching text  = ", match.group())
print ( "start position = ", match.start())
print ( "end position   = ",match.end())
matching text  =  408-202-2222
start position =  0
end position   =  12

Let’s try something slightly different with match ( ) function. Will it be able to pick the pattern from this text ?

numbers = '''
             408-202-2222
             408-203-2222
             408-202-4444'''

match = re.match("408-\d\d\d-\d\d\d\d",numbers)
print ( match )
None

No. Why is that ? match ( ) function can only find out the pattern at the beginning of the string. In this case, the first line is a blank line. So, the match ( ) function fails. In cases like this, use the search ( ) function. In contrast to the match ( ) function, the search ( ) function can extract patterns anywhere in the text.

match = re.search("408-\d\d\d-\d\d\d\d",numbers)
print ( match )

But, we are still getting the first match only. We wanted all the matches, right ? Botht he search ( ) and match ( ) functions return the first match only. To get all the matches, we will have to use other functions like findall ( ) or finditer ( ).

matches = re.findall("408-\d\d\d-\d\d\d\d",numbers)

print ( matches )
matches = re.findall("408-\d\d\d-\d\d\d\d",numbers)

print ( matches )
['408-202-2222', '408-203-2222', '408-202-4444']

That’s much better, right ? The findall ( ) function returns all the matches it finds in the text. The other function finditer ( ) just returns the same results in an iterable.

matches = re.finditer("408-\d\d\d-\d\d\d\d",numbers)

for match in matches : 
    print ( match )
<re.Match object; span=(14, 26), match='408-202-2222'&gt;
<re.Match object; span=(40, 52), match='408-203-2222'&gt;
<re.Match object; span=(66, 78), match='408-202-4444'&gt;

If you wanted just the match, use the group () function to extract the matching text.

matches = re.finditer("408-\d\d\d-\d\d\d\d",numbers)

for match in matches : 
    print ( match.group() )

408-202-2222
408-203-2222
408-202-4444


Now, we can solve the problem we started out with.

numbers = '''There are 3 phone numbers that you need to call in case of medical emergency. 
For casualty, call 408-202-2222. For elderly emergencies, call 408-203-2222 and
for everything else call 408-202-4444
'''

matches = re.finditer("408-\d\d\d-\d\d\d\d",numbers)

for match in matches : 
    print ( match.group() )

408-202-2222
408-203-2222
408-202-4444


In fact, even if the starting phone number is not always constant, like a 408 in this case, still we should be able to extract the matches.

numbers = '''There are 3 phone numbers that you need to call in case of medical emergency. 
For casualty, call 408-202-2222. For elderly emergencies, call 408-203-2222 and
for everything else call 800-202-4444
'''
matches = re.finditer("\d\d\d-\d\d\d-\d\d\d\d",numbers)

for match in matches : 
    print ( match.group() )
408-202-2222
408-203-2222
800-202-4444

See, all the numbers have been extracted.

Points to Remember

  • \d represents a single digit
  • match ( ) function returns the first match only, but only start at the beginning of the line.
  • search ( ) function returns the first match only.
  • findall ( ) and finditer ( ) functions return all the matches.

Extract phone numbers v2

Problem – Extract all the phone numbers from this text message.

numbers = '''408-222-2222,
             (408)-333-3333,
             (800)-444-4444'''


Let’s try what we know so far.

match = re.findall("\d\d\d-\d\d\d-\d\d\d\d",numbers)
print ( match )
['408-222-2222']

But this only matches the phone numbers without brackets. What about the ones with paranthesis ? We can try something like this.

match = re.findall("(\d\d\d)-\d\d\d-\d\d\d\d",numbers)
print ( match )

['408']

oops.. it is not working. Why ? Because, paranthesis represents a special character – It is used to make groups out of regular expressions (which, we will see later). To represent an actual paranthesis, escape it with a backslash.

match = re.findall("\(\d\d\d\)-\d\d\d-\d\d\d\d",numbers)
print ( match )
['(408)-333-3333', '(800)-444-4444']

OK. Now, we got the phone numbers with paranthesis, but we missed the ones without paranthesis. We want to capture either of these combinations. That’s when we use the OR operator. In regular expressions, we use the pipe operator (|) to represent either/or type of patterns.

match = re.findall("\(\d\d\d\)-\d\d\d-\d\d\d\d|\d\d\d-\d\d\d-\d\d\d\d",numbers)
print ( match )
['408-222-2222', '(408)-333-3333', '(800)-444-4444']

There we go – we were able to capture both the patterns. However, the \d in the pattern repeats a lot making the string too long. Instead, we can use quantifiers to specify how long a particular sub-pattern can be. For example, the following pattern is exactly equivalent to the pattern above.

match = re.findall("\(\d{3}\)-\d{3}-\d{4}|\d{3}-\d{3}-\d{3}",numbers)
print ( match )

['408-222-222', '(408)-333-3333', '(800)-444-4444']

As you can see, quantifiers make the pattern much more compact in case there is a lot of repetition.

Points to Remember

  • If paranthesis (or ) needs to be used in the pattern, escape them with a backslash ( \ ). This is done because, paranthesis is used to represent groups, which we will look into later.
  • | or pipe character is used to represent a logical OR operator in regular expressions.
  • { } Flower brackets are used to quantify the number of occurrances of a particular part of a regular expression. For example, a{3} is used to indicate that exactly 3 a‘s should be looked for.

Extract phone numbers v3

Problem – Extract all the phone numbers from this text message.

numbers = '''408-222-2222,
             408.333.3333,
             800 444 4444'''

match = re.findall("\d{3}-\d{3}-\d{4}|\d{3}.\d{3}.\d{3}|\d{3}\d{3}\d{3}",numbers)
print ( match )

['408-222-2222', '408.333.333', '800 444 444']


This works. But, can we make it any more concise ? There seems to be a lot of repetition. This is where character sets come in. In this case, the separator between the phone numbers is either a dash or a dot or a blank space. Can we somehow represent all of these characters to be searched for as separators, as opposed to specifying each pattern separately ?

match = re.findall("\d{3}[-.]\d{3}[-.]\d{4}",numbers)
print ( match )
['408-222-2222', '408.333.3333']

But what about phone numbers with spaces ? How do we represent a space in regular expressions ? We use the special character \s.

match = re.findall("\d{3}[-.\s]\d{3}[-.\s]\d{4}",numbers)
print ( match )

['408-222-2222', '408.333.3333', '800 444 4444']

There we go – we are able to capture all of the phone numbers.

Points to Remember

  • Characters enclosed in [] (square brackets) are called character sets. Regular expressions search for any character inside the charater set for matches.
  • \s is used to represent a space or blank character.

Extract phone numbers v4

Problem – Extract all the phone numbers from this text.

numbers = '''   408-222-2222,
             1 408.333.3333,
             1 408-444-4444,
             1 (800) 444 4444'''

match = re.findall("\d{3}[-.\s]\d{3}[-.\s]\d{4}",numbers)
print ( match )
['408-222-2222', '408.333.3333', '408-444-4444']

But, how about the 1 before the numbers ? How do we capture them ? Some phone numbers have it and some don’t. That’s where the ? quantifier comes in. If a pattern needs to be checked for occurance zero or 1 time, use the ? quantifier.

match = re.findall("1?\s\d{3}[-.\s]\d{3}[-.\s]\d{4}",numbers)
print ( match )
[' 408-222-2222', '1 408.333.3333', '1 408-444-4444']

Much better. Now, what about the 800 number with paranthesis ? How do we look for paranthesis ? We have seen previously that paranthesis is a special character and to extract that we need to escape it. Let’s try that.

match = re.findall("1?\s\(?\d{3}\)?[\s]\d{3}[\s]\d{3}",numbers)
print ( match )
['1 (800) 444 444']

Alright, we got that as well. Now, to combine all of these, we can use the OR operator.

match = re.findall("1?\s\(?\d{3}\)?[\s]\d{3}[\s]\d{3}|1?\s\d{3}[-.\s]\d{3}[-.\s]\d{4}",numbers)
print ( match )
[' 408-222-2222', '1 408.333.3333', '1 408-444-4444', '1 (800) 444 444']

Or, we can combine them like this.

match = re.findall("1?\s\(?\d{3}\)?[-.\s]\d{3}[-.\s]\d{3}",numbers)
print ( match )
[' 408-222-222', '1 408.333.333', '1 408-444-444', '1 (800) 444 444']

Learning

  • ? is used to represent a pattern that repeats zero or one time. It is a type of quantifier like {n}

Extract phone numbers v5

Problem – Extract all the phone numbers from this text.

numbers = '''+1  408-222-2222,    
             +91 98989-99898,
             +86 10-1234-5678,
             +263 10-234-5678'''

The first one is a US phone number, the second one is India and the third one is Chinese number. How to extract these. Let’s start with the plus (+) at the beginning of the string. How to extract that ?

match = re.findall("+",numbers)
---------------------------------------------------------------------------
error                                     Traceback (most recent call last)
<ipython-input-164-2fe5ce5c5168&gt; in <module&gt;
----&gt; 1 match = re.findall("+",numbers)

c:\program files\python37\lib\re.py in findall(pattern, string, flags)
    221 
    222     Empty matches are included in the result."""
--&gt; 223     return _compile(pattern, flags).findall(string)
    224 
    225 def finditer(pattern, string, flags=0):

c:\program files\python37\lib\re.py in _compile(pattern, flags)
    284     if not sre_compile.isstring(pattern):
    285         raise TypeError("first argument must be string or compiled pattern")
--&gt; 286     p = sre_compile.compile(pattern, flags)
    287     if not (flags &amp; DEBUG):
    288         if len(_cache) &gt;= _MAXCACHE:

c:\program files\python37\lib\sre_compile.py in compile(p, flags)
    762     if isstring(p):
    763         pattern = p
--&gt; 764         p = sre_parse.parse(p, flags)
    765     else:
    766         pattern = None

c:\program files\python37\lib\sre_parse.py in parse(str, flags, pattern)
    928 
    929     try:
--&gt; 930         p = _parse_sub(source, pattern, flags &amp; SRE_FLAG_VERBOSE, 0)
    931     except Verbose:
    932         # the VERBOSE flag was switched on inside the pattern.  to be

c:\program files\python37\lib\sre_parse.py in _parse_sub(source, state, verbose, nested)
    424     while True:
    425         itemsappend(_parse(source, state, verbose, nested + 1,
--&gt; 426                            not nested and not items))
    427         if not sourcematch("|"):
    428             break

c:\program files\python37\lib\sre_parse.py in _parse(source, state, verbose, nested, first)
    649             if not item or item[0][0] is AT:
    650                 raise source.error("nothing to repeat",
--&gt; 651                                    source.tell() - here + len(this))
    652             if item[0][0] in _REPEATCODES:
    653                 raise source.error("multiple repeat",

error: nothing to repeat at position 0

oops.. doesn’t work. That is because, + is a special character. It is used to represent a quantifer. + means that a patter repeats one or more time. So, to find + as a pattern, you would have to escape it.

match = re.findall("\+", numbers)
print ( match )

OK, Now, we are able to get the + in the string. Let’s extract the country code next. It is the set of numbers right next to the + symbol. It could be 1 (like US ) or 2 (like India, China ), or may be 3 (Zimbabwe is +263 ). We can use the flower brackets to specify a pattern length of 1 to 3 like so –

{1,3}

match = re.findall("\+\d{1,3}", numbers)
print ( match )


Next, we have a set of numbers separated by dashes. However, the count of numbers between the dashes is arbitrary. So, we need some kind of a quantifier again to find out repetitive pattern of count between 1 and n. We could just assume a higher nuber say 5 for n and proceed like so.

match = re.findall("\+\d{1,3}\s{1,3}\d{1,6}-\d{1,6}-\d{1,6}", numbers)
print ( match )

Instead of using the {m,n} quantifier to identify digits that repeat atleast once, you can use the quantifier +.

match = re.findall("\+\d+\s+\d+-\d+-\d+", numbers)
print ( match )


We are still missing another number , +91 98989-99898. This is because, the number is divided into 2 parts (and not 3 parts separated by dashes). So, a simple solution would be to create another pattern and do an OR operation. That should capture all of the possible phone numbers in this case.

match = re.findall("\+\d+\s+\d+-\d+-\d+|\+\d+\s+\d+-\d+", numbers)
print ( match )

Learning

  • {m,n} is used to represent a pattern that repeats m to n number of times. It is a type of quantifier.
  • Since + is a special character (used to identify patterns that repeat 1 or more times), to identify + itself, escape it with a backslash (\)

Extract emails

Problem – Extract all the emails from this text.

text = '''   accounts@boa.com,    
             sales@boa.com,
             cancellations@tesla.com,
             accounts@delloitte.com,
             cancellations@farmers.com,
             accounts@dell@com'''

To solve text based patterns, one of the fundamental character set is \w. It represents any character that can be found in a word – it could be alphabetic or numeric or underscore. These are the only 3 types of characters that \w can find. For example, a single \w on this text, basically captures all the word characters (a to z characters, 0-9 digits and underscore ). You can see that in the output below.

matches = re.findall("\w",text)
print ( matches )

We need to step up from letters, to identify words. A word is just a repetition of a set of letters, numbers and underscores. So, we use a quantifier + to identify a word.

matches = re.findall("\w+",text)
print ( matches )

Now that we have all the words, all we have to do is to put together the pattern that includes the @ symbol and dot.

matches = re.findall("\w+@\w+.\w+",text)
print ( matches )


We are almost there, except the last email – accounts@dell@com. This is not a valid email. So, why is our pattern capturing it ? When we mentioned dot (.) in our pattern (\w+@\w+.\w+), it basically captures any character. So, in order to capture a dot, all we have to do is to escape it – prepend it with a backslash (\)

matches = re.findall("\w+@\w+\.\w+",text)
print ( matches )

There you go, we have succesfully found out all the emails in the text.

Learning

  • \w is used to represent a character in a word – it could be an alphabet (a-z) or a number ( 0-9) or an underscore.
  • \w+ – to identify words, all you have to do is append \w with a plus (+).
  • . (dot) is used to identify ANY character. It is a special character. To actually identify a . (dot) itself, just escape it with a backslash (\).

Extract zip codes

Problem : Say we have a text with US zip codes. The valid format for US zip codes are

  • 99999
  • 99999-9999

where 9 represents any digit. Write a regular expression to extract all zip codes from the text.

text = '''08820, 08820-1245, zip code, 98487, 98487-0000, ABCD  '''

matches = re.findall ( "\d{5}-\d{4}|\d{5}", text)
print ( matches )
html = '''<font size=2&gt;
          <font  size=2&gt;
          <font size = 2&gt;
          < font  size=2 &gt;
          <font     size    =   2   &gt;'''

Quiz Which of the following regular expression captures all of the above combinations. Observe the spaces precisely.

  • “<\s+font

Exercise : Say we have a text with Canadian zip codes. The format for canadian zip codes is

  • A1A A1A

where A represents an alphabet and 1 represents any digit. There is a space at the 4th character.

text = '''M1R 0E9
          M3C 0C1
          M3C 0C2
          M3C 0C3
          M3C 0E3
          M3C 0E4
          M3C 0H9
          M3C 0J1
          1M1 A1A
          11M 1A1
          M11 A1A
          M3C0J1
          M3C JJ1'''

# Test - The last five elements should NOT match

Solution

matches = re.findall ("[A-Z]\d[A-Z] \d[A-Z]\d", text)
print ( matches )

Log file analysis

Problem – Say there is a web server log file, find out how many times the login file was succesfully hit and how many times it failed. For now, we will work with a sample snippet from the file. We will work with the real file in the next challenge.

log = '''
10.128.2.1 [29/Nov/2017:06:58:55 GET /login.php HTTP/1.1 Status Code  - 302
10.128.2.1 [29/Nov/2017:06:59:02 POST /process.php HTTP/1.1 Status Code  - 302
10.128.2.1 [29/Nov/2017:06:59:03 GET /home.php HTTP/1.1 Status Code  - 200
10.131.2.1 [29/Nov/2017:06:59:04 GET /js/vendor/moment.min.js HTTP/1.1 Status Code  - 200
10.130.2.1 [29/Nov/2017:06:59:06 GET /bootstrap-3.3.7/js/bootstrap.js HTTP/1.1 Status Code  - 200
10.130.2.1 [29/Nov/2017:06:59:19 GET /profile.php?user=bala HTTP/1.1 Status Code  - 200
10.128.2.1 [29/Nov/2017:06:59:19 GET /js/jquery.min.js HTTP/1.1 Status Code  - 200
10.131.2.1 [29/Nov/2017:06:59:19 GET /js/chart.min.js HTTP/1.1 Status Code  - 200
10.131.2.1 [29/Nov/2017:06:59:30 GET /edit.php?name=bala HTTP/1.1 Status Code  - 200
10.131.2.1 [29/Nov/2017:06:59:37 GET /logout.php HTTP/1.1 Status Code  - 302
10.131.2.1 [29/Nov/2017:06:59:37 GET /login.php HTTP/1.1 Status Code  - 200
10.130.2.1 [29/Nov/2017:07:00:19 GET /login.php HTTP/1.1 Status Code  - 200
10.130.2.1 [29/Nov/2017:07:00:21 GET /login.php HTTP/1.1 Status Code  - 200
10.130.2.1 [29/Nov/2017:13:31:27 GET / HTTP/1.1 Status Code  - 302
10.130.2.1 [29/Nov/2017:13:31:28 GET /login.php HTTP/1.1 Status Code  - 200
10.129.2.1 [29/Nov/2017:13:38:03 POST /process.php HTTP/1.1 Status Code  - 302
10.131.0.1 [29/Nov/2017:13:38:04 GET /home.php HTTP/1.1 Status Code  - 200'''

solution

pattern = "(\d+\.\d+\.\d+\.\d+).*(login.php)\s(HTTP).*-\s(\d{3})"

matches = re.findall (pattern, log)
print (matches)
count_200     = 0
count_not_200 = 0

for match in matches : 
    if match[3] == "200" :
        count_200 += 1
    else : 
        count_not_200 += 1
        
success_perc = ( count_200 / (count_200 + count_not_200) ) * 100
print ( " login was succesfully hit ", success_perc , "% of time")

Learning

  • (…) is used to represent groups in a regular expression
  • There can be multiple groups in a single regular expression
  • Each of the groups can be extracted out per each match of the regular expression
  • . (dot) represents ANY character

Challenge

Say there is a web server log file, find out how many times the login file was succesfully hit and how many times it failed. The file is available in the data directory. If the HTTP code ( at the end of each line in the log file ) is 200 the page is succesfully rendered. Otherwise, it is a failure.

Solution

# read file

data = [] # will contain the log data as a list
with open ( "./data/log_file.txt", "r") as f : 
    for line in f : 
        data.append(line)
        
# print the read data
for line in data [0:5]: 
    print ( line, end="")
# parse the data using regular expression and find matches for login.php
import re

login_data = []

pattern = "(\d+\.\d+\.\d+\.\d+).*(login.php)\s(HTTP).*-\s(\d{3})"
for line in data : 
    matches = re.findall (pattern, line)
    if len(matches) &gt; 0 : 
        login = []
        login.append(matches[0][0])
        login.append(matches[0][1])
        login.append(matches[0][2])
        login.append(matches[0][3])        
        
        login_data.append(login)
        
# print a sample 
for line in login_data[0:5]:
    print ( line)
# calculate the success ratio
count_200 = 0       # succesful
count_not_200 = 0   # unsuccesful

for element in login_data : 
    if element[3] == "200": 
        count_200 += 1
    else : 
        count_not_200 += 1

percentage_success = ( count_200 / (count_200 + count_not_200) ) * 100
print ( "Login page was succesfully hit ", percentage_success, "% of the time")
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
<ipython-input-192-2b3d022c0101&gt; in <module&gt;
      3 count_not_200 = 0   # unsuccesful
      4 
----&gt; 5 for element in login_data :
      6     if element[3] == "200":
      7         count_200 += 1

NameError: name 'login_data' is not defined
text = '''Data science is an inter-disciplinary field  that uses scientific methods, processes, algorithms and systems to extract knowledge and insights from structured and unstructured data. Data science is related to data mining and big data.

Data science is a "concept to unify statistics, data analysis, machine learning and their related methods" in order to "understand and analyze actual phenomena" with data. It employs techniques and theories drawn from many fields within the context of mathematics, statistics, computer science, and information science. Turing award winner Jim Gray imagined data science as a "fourth paradigm" of science (empirical, theoretical, computational and now data-driven) and asserted that "everything about science is changing because of the impact of information technology" and the data deluge. In 2015, the American Statistical Association identified database management, statistics and machine learning, and distributed and parallel systems as the three emerging foundational professional communities.'''
pattern = re.findall(".{10}\\s{2,}.{10}",text)
print ( pattern )

Quiz – The pattern above can be used to find out

  • words that have 2 or more spaces in between them
  • sentences that have 2 or more space in between them

Credit cards extractor

Problem – Find credit card numbers identified by their category (Visa, Master Card ). These credit card numbers follow a certain pattern. Use the following pattern to identify the category.

Patterns

  • All visa card numbers start with a 4 and are either 13 or 16 numbers
  • All Master card numbers start with 51 through 55 or 2221 through 2720 and have exactly 16 digits
  • All Amex cards start with 34 or 37 and have exactly 15 digits
text = '''4018-2345-0303-0339,
          5335-6092-0182-9739,
          4076-2929-0000-2222,
          3777-5074-1547-439,
          5451-3970-1507-5040,
          3425-2515-2514-202,
          3752-2681-2429-924,
          4004-4759-3761-924,
          2228-2545-5555-2542,
          2296-2542-2587-2555,
          2321-2547-5145-2222,
          2650-2545-2222-5555,
          2706-2546-2589-2515,
          2713-9874-5263-6253,
          2720-2541-3256-6985
'''


Solution

Let’s see if the following solution works for Visa cards.

visa_matches = re.findall("4[0-9]{3}-[0-9]{4}-[0-9]{4}-[0-9][0-9]{3}?", text)
print ( visa_matches )


The last card is not being picked up. Let’s wrap the last 3 digits in a group and try it.

visa_matches = re.findall("4[0-9]{3}-[0-9]{4}-[0-9]{4}-[0-9]([0-9]{3})?", text)
print ( visa_matches )


ooh.. this time it only picks up the group. But we wanted the entire number, right ? There are a couple of options.

  • We can either put all of the remaining pattern also into groups.. like so
visa_matches = re.findall("(4[0-9]{3}-[0-9]{4}-[0-9]{4}-[0-9])([0-9]{3})?", text)
print ( visa_matches )

  • or, we can let the last element in the pattern to not be a non-capturing group – meaning, it will still be a group from a syntax perspective, but will not be captured as a group. To do that, we use ?:.
visa_matches = re.findall("4[0-9]{3}-[0-9]{4}-[0-9]{4}-[0-9](?:[0-9]{3})?", text)
print ( visa_matches )

Whenever we use ?: at the beginning of the group, it will be used to capture the pattern, but will not be captured into the group. Now, let’s work on master card.

Master card represents a different pattern. It has a pretty broad range of numbers – The beginning numbers start with

  • 51 through 55 OR
  • 2221 through 2720

The first one is easy enough. Let’s work on that first.

mc_matches = re.findall("5[1-5][0-9]{2}-[0-9]{4}-[0-9]{4}-[0-9]{4}", text)
print ( mc_matches )


The range 2221-2720 cannot be specified that easily. We need a different strategy for that. We can split this range as follows.

  • 2221-2229
  • 223x-229x ( 2230 to 2299 )
  • 23xx-26xx ( 2300 to 2699 )
  • 270x-2709 ( 2700 to 2709 )
  • 271x-2719 ( 2710 to 2719 )
  • 2720

We need to code all these patters using an OR operator.

mc_matches = re.findall("222[1-9]-[0-9]{4}-[0-9]{4}-[0-9]{4}",text)
print(mc_matches)
mc_matches = re.findall("22[3-9][0-9]-[0-9]{4}-[0-9]{4}-[0-9]{4}",text)
print(mc_matches)

mc_matches = re.findall("2[3-6][0-9]{2}-[0-9]{4}-[0-9]{4}-[0-9]{4}",text)
print(mc_matches)
[]
mc_matches = re.findall("270[0-9]-[0-9]{4}-[0-9]{4}-[0-9]{4}",text)
print(mc_matches)
[]
mc_matches = re.findall("271[0-9]-[0-9]{4}-[0-9]{4}-[0-9]{4}",text)
print(mc_matches)

[]
mc_matches = re.findall("2720-[0-9]{4}-[0-9]{4}-[0-9]{4}",text)
print(mc_matches)
[]

In all these examples, the first four digits are the ones that are different. The pattern for the rest of the 12 numbers remain the same. So, let’s compress all of these into an OR based patter for the first 4 digits and let the remaining 12 digits remain constant.

mc_matches = re.findall ("(?:5[1-5][0-9]{2}|222[1-9]|22[3-9][0-9]|2[3-6][0-9]{2}|270[0-9]|271[0-9]|2720)-[0-9]{4}-[0-9]{4}-[0-9]{4}", text)
print ( mc_matches)

[]

There we go – that should cover all the possible combinations of Master cards. Now, lets move on to Amex. Amex has a really simple pattern –

  • All Amex cards start with 34 or 37 and have exactly 13 digits
amex_matches = re.findall("3[47][0-9]{2}-[0-9]{4}-[0-9]{4}-[0-9]{3}", text)
print ( amex_matches )
[]

Learning

  • (?:…) is used to represent non-capturing groups – meaning, they will be used to identify patterns, but the specific pattern within that paranthesis will not be captured as a group. We have seen how this will be useful in case of Visa pattern.
  • Cycling through a range of numbers. We have seen how to cycle through a large range of numbers when we discussed the pattern for Master card.

Word occurrences

Problem – Find all the occurances of a word in a text and segregate them into 2 categories

  • 1 – standalone occurrence of the word
  • 2 – The word is part of another word.

For example, the word “bat” can occur in isolation , like in the sentence (“His cricket bat is awesome”), or as part of a different word , like in (“Aeriel combat vs land-based combat”).

Solution – Finding the pattern is quite easy. However, the trick is to find out if the word occurred individually or is part of another word. word boundaries can help in this case. \b is used to specify a word boundary.

text = '''Python is a general purpose programming language. Python's design philosopy ...
          Let's pythonify some of the code...
          while python is a high level language...'''
matches = re.findall("\bPython\b", text)
print ( matches )

[]

But it is not working. Why ? That is because \b is a special escape character for backspace. When you specify that in the pattern string, it is not treated literally, but interpreted as a backspace. To avoid confusion, always use raw strings to define patterns. Raw strings can be specified in Python by prepending the string with a r. Let’s try this again.

matches = re.findall(r"\bPython\b", text)
print ( matches )

['Python', 'Python']

That’s better. However, there are 3 occurrences of “Python”. Why is python in the last line not being picked up ? That is because, regular expressions are case sensitive. The “p” in the word “python” in the last line is lower case. If you wanted to do a case sensitive search, use global flags. These can be specified as a third parameter in the findall ( ) function.

matches = re.findall(r"\bPython\b", text, re.IGNORECASE) # you can also use re.I as a shortcut
print ( matches )

['Python', 'Python', 'python']

Learning

  • \b is used to specify a word boundary.
  • Always prepend the pattern string with r to make it a raw string. This way, the characters in the pattern are taken literally.
  • global flags can be used to alter the way regular expressions work. One of the global flags we have seen is re.IGNORECASE. It can be used to do a case insensitive search.

HTML tags

Problem – Find all the tags in a HTML or XML.

For example, here is a small snippet of HTML. There are many tags like , , etc. We have to identify all the tags used in the following HTML.

import re 
text = '''<html&gt;
            <head&gt; 
            <title&gt; What's in a title</title
            </head&gt;
            <body&gt;
                <tr&gt;
                    <td&gt;text one </td&gt;
                    <td&gt;text two </td&gt;
                </tr&gt;
            </body&gt;
          </html&gt;
'''

Solution

matches = re.findall(r"<([^&gt;]*)&gt;", text)
print ( matches )
['html', 'head', 'title', '/title\n            </head', 'body', 'tr', 'td', '/td', 'td', '/td', '/tr', '/body', '/html']

This gives the start and end tags. Say, we don’t want the end tags.

matches = re.findall(r"<([^&gt;/]*)&gt;", text)
print ( matches )
['html', 'head', 'title', 'body', 'tr', 'td', 'td']

Another way to do it is to use non-greedy quantification. When you start a pattern with < and consume any character with ., it consumes it all the way to the end. That is why and + are greedy quantifiers. To negate the effect of it, use the ? operator. That way it allows the * to match the least amount of text before the regular expression is satisfied.

import re

matches = re.findall("<.*?&gt;", text)
print ( matches )
['<html&gt;', '<head&gt;', '<title&gt;', '</head&gt;', '<body&gt;', '<tr&gt;', '<td&gt;', '</td&gt;', '<td&gt;', '</td&gt;', '</tr&gt;', '</body&gt;', '</html&gt;']

Learning

  • and * are greedy quantifier. They consume the most amount of text before a pattern can be satisfied.

title

Problem – as

text = '''
ages = [22, 23, 24]
9_age = [22, 23, 24]
'''

matches = re.findall("[a-zA-Z_][a-zA-Z_0-9]{1,20}", text)
print ( matches)
['ages', '_age']
matches = re.findall("[0-9][a-zA-Z_0-9]{1,20}", text)
print ( matches)
['22', '23', '24', '9_age', '22', '23', '24']

Password verification

Problem – Verify if a password is

  • Has atleast One upper case character
  • Has atleast one digit.
  • Has atleast one special character ( let’s limit special characters to @ , # , $ , % )
text = '''Aw@som$passw0rd
          Awesomepassw0rd
          Awesomepassword
          Aw!som!passw0rd
          aw!som!passw0rd'''

# All combinations except for the first one is valid


Solution

This can be solved easily using regular python lists. However, we wanted a more concise solution using regular expressions. In these kinds of situations, we are looking for some kind of validation. Regular expression’s lookaround function is very useful in these cases. The syntax for that is (?=…) where … represents any regular expression. Let’s start with the first condition


matches = re.findall("[^A-Z]*[A-Z].*", text)
print(matches)
['aWesome']
text = "Awesome1"
matches = re.findall("[^A-Z]*[A-Z]\D*\d.*", text)
print(matches)

['Awesome1']

This works too. Now, lets try a different combimation – put the digits before the letters.

text = "1Awesome"
matches = re.findall("[^A-Z]*[A-Z]\D*\d.*", text)
print(matches)
[]

That failed. why ? Because regular expressions consume text and move forward. So, the expession [^A-Z]*[A-Z] consumed all the text including the 1 at the beginning. And it is now looking for a number at \d, which it cannot find after the capital letter. This is where lookarounds help.

text = "1Awesome"
matches = re.findall("(?=[^A-Z]*[A-Z])(?=\D*\d).*", text)
print(matches)
['1Awesome']

This time it works. The reason is that we have converted the digit search \D*\d into a lookahead(?=\D\d).

An important aspect of lookarounds ( look ahead or look behind ) is that it does not consume any characters. For example, look at the example below. We want to find out all the words that are preceded by a comma, but we don’t want to show the comma.

text = "Hi there, how are you doing ?"

# \b for word boundary
# \w+ for a word
#(?=,) will ensure that the word is followed by a comma

matches = re.findall(r"\b\w+(?=,)", text)
print ( matches )
['there']

See, the comman is not shown in the output. Granted it is not a big deal. We can do that using groups. However, there are many situations (like the password example above) that cannot be achieved using groups. That’s where lookarounds come in. Let’s continue the same example as above and find out all the words, preceded by a comma.

# (?=,\s) =&gt; verify (assert) that before the word, there is a comma followed by a space
# \w+ is a word

matches = re.findall(r"(?<=,\s)\w+", text)
print ( matches )

['how']

Learning

  • There are 2 type of Lookaroundslook ahead and look behind.
  • (?=…) is used to do look ahead search.
  • Lookarounds are also called assertions

Regular expressions cheatsheet

Special CharacterDescription
.Matches any character – except new line
[XYZ]Character set
[^XYZ]Negation of the Character set
[A-Z]Matches any character – except new line
pipeLogical OR
.Matches any character – except new line
\wMatches any word character. Equivalent to [A-Za-z0-9_]
\WNegation of any word character. Equivalent to [^A-Za-z0-9_]
\dMatches any digit. Equivalent to [0-9]
\DMatches any non-digit. Equivalent to [^0-9]
\sMatches any whitespace character ( spaces, tabs or linebreaks )
\SMatches any non-whitespace character
^Matches beginning of line
$Matches end of line
\bWord boundary
\Bnot a word boundary
*Zero or more
+One or more
?Zero or one
(XYZ)Capturing group
(?:XYZ)non-capturing group
(?=XYZ)Positive lookahead
(?!XYZ)Negative lookahead
(?<=XYZ)Positive lookbehind
(?<!XYZ)Negative lookbehind

Challenge

Extract data to JSON

Say, we gave a bunch of cities along with their nick names in the following format in a text file. Extract the city and it nick name in a JSON format with the structure as follows.

cities = '''
1. Paris – The City of Love, The City of Light, La Ville-Lumiere

2. Prague – The City of Hundred Spires, The Golden City, The Mother of Cities

3. New York – The Big Apple

4. Las Vegas – Sin City'''

# required output format
{
    "city_1" : ["nick name 1", "nick name 2"], 
    "city_2" : ["nick name 1", "nick name 2.."]
}

#import the file

with open("./data/cities.txt","r") as f : 
    data = f.read()

import re 

matches = re.findall("\d+\.\s+(\w+\s?\w+)\s+–\s+(.*)", data)
print ( matches[0:5])
import json

city_dict = {}

for city in matches : 
    city_dict[city[0]] = city[1].split(",")
    
city_json = json.dumps(city_dict)

What is a lookup ?

Lookup is one of the most commonly used functions in excel. Let’s understand what it is with an example. 

Say, we have orders data from an e-commerce website like so. 

What we have are the list of products and its prices. Say, somebody gives you a subset of these products and tells you to pick up the corresponding prices. How would you do it ? Of course you can do it manually if you are just talking about a couple of items. However, once you have a handful of items, you need a method for it. Let’s see how we can do it using VLOOKUP formula in excel. 

For example, we want to find out the price of 3 items that you see on the right. If we want to get them using VLOOKUP, this is how we do it. 

The VLOOKUP formula looks like a big deal, but it is actually not. We will look at the actual parameters in the next chapter, but first I want you to understand what VLOOKUP can do. In the example above, we were able to pull the prices of 3 items. What if there were a million items in the lookup data table and you are trying to find out the prices of 2000 items from it ? As you very well know, this cannot be done manually – This is where VLOOKUP works its magic. 

How VLOOKUP works

Now that we understand what VLOOKUP can do for us, let’s understand how VLOOKUP works at a high level. There are 4 parts to the VLOOKUP formula. 

  1. Lookup Data ( or lookup table )
  2. Item or key to be looked up
  3. Column number in the lookup data that needs to be extracted
  4. Exact vs approximate match

The following slideshow highlights each of these with the corresponding visuals. What we are trying to do here is to look up the price of item code 22697. 

What we have actually done is to just get the prices of 3 item codes. It might look trivial, but imagine the following data – A half a million rows of online retail data from a real store. 

Now, say you are given 1000 item codes and asked to look up the price. Imagine what a pain that would be to do manually. VLOOKUP can do this in a split second. 

VLOOKUP across tabs

VLOOKUP not just works within a tab. It can work across tabs. Let’s take the same Online Retail product price list with half a million records. Let’s take a subset of these items, say 500 of them and now, we want to look up the prices of these items. 

We could do the look up in the same tab, but that makes it clumsy. Instead, let’s do it in a separate tab. To reference the lookup data in a separate tab, all you have to do is to prefix the tab name – like so. 

Approximate vs Exact Match

So far, we have been doing Exact matches off the key with the lookup data. Sometimes, we might have to do an approximate match. Approximate match is NOT a wildcard match as you might expect. We will deal with wildcard matches in a seperate section down below. Let’s understand what approximate match does with an example. 

Imagine a sales team with different commision percentages. 

Now, say we have 4 sales persons with different sales volumes and we want to find out the percentage commision each of them should get. 

You see, in this case, the sales volume does not EXACTly match the slabs. Instead, it is a bracket. In cases like this, you can use the approximate match feature of VLOOKUP.

The moment you set the match option to TRUE ( which also happens to be the default option ), an approximate match happens (as opposed to an exact match). You can see the results for yourself. Although the sales numbers on the left don’t exactly match with the slabs on the right, VLOOKUP is smart enough to identify which slab each of the sales volumes fall under and matches the percentage accordingly.

HLOOKUP

While VLOOKUP is the most used function (among VLOOKUP & HLOOKUP), Excel gives you the option to do HLOOKUP or Horizontal lookup as well. Look at the same Retail example flipped horizontally below. 

The data is exactly the same, except it is set up horizontally – hence HLOOKUP. 

SQL Equivalent of VLOOKUP

If you have some exposure to SQL, VLOOKUP works like a SELECT statement with a LIKE clause. The equivalent of the above VLOOKUP in SQL would be something like this

SELECT item_desc, price FROM <lookup table> 

WHERE item_number IN (<list of item numbers>) 

#N/A error

What if the item that is looked up doesn’t exist ? Excel throws a #N/A error. 

Number as Text

Sometimes, when the key in the lookup data is formatted as text, the lookup does not work and would result in an error. 

Select the data, right click on it and click on “Convert to Number”. That should fix the problem. 

VLOOKUP works Left to Right only

VLOOKUP always works Left to Right – meaning, the key should always be the first element in the lookup data. 

Wildcard or Partial Matches

Sometimes, you might do text based matches. For example, in the example below, you might want to extract the prices using item description, which is a text field ( as opposed to item code, which is numeric). In cases like this, you might not get the text exactly right. 

That’s when you can use wild card search. Just append the search key with a “*” (asterisk) and you should be good to go. 

LOOKUP is case insensitive

Text based lookup is case insensitive. You can try it by changing cases in the example sheet above. 

Merge data using VLOOKUP

Merging data from multiple sources ( typically separate tabs or separate sheets) is one of the use cases for VLOOKUP. The only constraint being that the data should be bound by a common key. 

The simple example above shows data being merged from 2 look up tables. You can extend the same idea across multiple tabs or sheets. All you have to remember is that there should be a common key across the data being merged. 

Named Ranges

The way number ranges are specified in excel when VLOOKUP is being used is a bit ugly looking. You can use named ranges when doing lookups. Instead of specifying lookup data using rows and columns, you can very well used named ranges. For example, say we specify a name for the lookup data like this.. 

Now that the range has been given a name, you can use this to do lookups. 

VLOOKUP in Google Sheets

VLOOKUP works perfectly well in Google sheets as well. The procedure is exactly the same. 

Subplots

Subplots


  Data visualization

Contents

Elements of a plot

Figure and Axes

Before getting into the subplots topic, let us first discuss about the two main elements in a plot – Figure and Axes.

Figure

A Figure object can be thought of as a window on which plots are rendered and it contains all the plotting elements.

Axes

The Axes is the actual plotting area contained within the figure object. There can be multiple axes objects in a figure object. The Axes contains the x-axis,y-axis,data points,lines,ticks,text etc.,

Interfaces

The Matplotlib library provides two important interfaces for rendering plots:

1. State-machine interface
2. Object Oriented interface


What is a state-machine?

A finite state machine is an abstract concept wherein a system can exist in a finite number of states. But at any given point of time, it can be in exactly one of these states. Depending on the operations performed on the state machine, it can change from one state to another.

State-machine interface

The figure on which plots are rendered behaves like a state-machine system. The state-machine interface keeps track of the current figure, axes and the current state of the figure and hence is called stateful interface. The state-machine interface makes calls to various methods using the pyplot module and it is also known as Pyplot interface. When a method is called using the pyplot module, a figure and axes objects are created implicitly and changes are made to the figure, hence changing the state of the figure. We do not define any objects or variables when using this interface. We simply call methods defined in the pyplot module and the changes appear in the figure.

The pyplot interface is useful to render simple plots with minimum code or for quickly testing our code, because we can issue a command and immediately see the result. But if we are working on multiple plots it becomes difficult to keep track of the active figure and the plots.

Object-oriented interface

Matplotlib also supports the object oriented interface or the stateless interface. This interface is called stateless because instead of creating a global figure instance, a figure object is created and referenced with a variable. We can then directly call methods on this variable to change the different elements in a plot. Because we are storing a reference to the plotting objects in variables, it is easy to track the objects that we are working on. The object-oriented interface is a more pythonic and recommended way to use Matplotlib.

The pyplot interface is infact built on top of the object-oriented interface. The top layer is the front-end interface that is used by users without much programming experience to test simple plots. Calling methods such as line,scatter,bar,hist etc., from pyplot module, creates a figure instance implicitly and generates the respective plots. The bottom layer can be used to develop complicated visualizations as it provides more control and allows for customization of the plots.

A figure object can be created using pyplot. In the object oriented interface as well, pyplot is used to create the figure and the axes objects. In order to render the plots, we then call the various methods directly on these objects.

Subplots

Subplot method

Up until now, we have seen various plots such as line plots, bar charts, pie charts, scatter plots etc., In many examples we have generated one plot on one figure, we have also seen examples where we generated multiple plots on one figure. What if you are required to have plots side by side or one plot on top of the other in a figure? This can be achieved with subplots. A subplot is a plot that takes up only a part of the figure object. In Matplotlib,a subplot is also referred to as an axes. We will now see how to generate subplots using the object oriented approach.

The subplot method is used to divide a figure into multiple subplots: subplot (m, n, index, kwargs)**

The subplot method returns an axes object and is used to divide a figure into a matrix of m rows and n columns, creating m*n subplots in one figure. The index number provides the location of the current axes for plotting. The index number starts from 1 and the subplots are numbered from left to right beginning from the first row upto the mth row.

from matplotlib import pyplot as plt
import numpy as np
fig1 = plt.figure(num=1,figsize=(10,6))
ax1  = plt.subplot(111)
x = np.arange(1,10)
y = np.arange(1,10)
ax1.plot(x,y,color='c',marker='p')
plt.title('Subplot_1')
plt.show()


Subplots method

In order to plot using the subplot function, the given subplot has to be first set as the current/active axes for plotting by specifying the index number and this becomes tedious if there are too many subplots. Matplotlib also provides the subplots method which returns a figure object and a numpy array of axes objects.

subplots(m=1, n=1, sharex=False, sharey=False, squeeze=True, subplot_kw=None, gridspec_kw=None, fig_kw)**

fig2, ax = plt.subplots(2,1,sharex='col')
x = np.arange(1,11)
y = np.arange(1,11)
z = np.random.randn(10)

# Subplot1
ax[0].plot(x,y,color='g',marker='p',label='axes1')
ax[0].set_title('Axes1')

# Subplot2
ax[1].plot(x,z,marker='o',label='axes2')
ax[1].set_title('Axes2')

plt.tight_layout()
plt.show()
fig2, ax = plt.subplots(2,2,sharex='col',sharey='row')
x = np.arange(1,11)
y = np.arange(1,11)
z = np.random.randn(10)

#Subplot1
ax[0][0].plot(x,y,color='g',marker='p',label='axes1')
ax[0][0].set_title('Axes1')
ax[0][0].legend()

#Subplot2
ax[0][1].scatter(x,y,color='k',marker='^',label='axes2')
ax[0][1].set_title('Axes2')
ax[0][1].legend()

#Subplot3
ax[1][0].plot(x,z,marker='v',label='axes3')
ax[1][0].set_title('Axes3')
ax[1][0].legend()

#Subplot4
ax[1][1].scatter(x,z,marker='o',label='axes4')
ax[1][1].set_title('Axes4')
ax[1][1].legend()

plt.tight_layout()
plt.show()

Add axes method

You can also specify the position and the dimensions of an axes using the add_axes method.

add_axes([left, bottom, width, height])

A list is passed to this method containing the elements:

  • left, bottom (coordinates of lower left corner of the axes).
  • width, height of the axes object.

The values for all the 4 parameters range in between 0 and 1, representing the fraction of the figure that the axes occupies.

fig3 = plt.figure()
ax3 = fig3.add_axes([0.1,0.1,0.8,0.8])
inset_ax = fig3.add_axes([0.15,.6,0.2,0.2])
ax3.plot(x,y)
inset_ax.plot(x,y)
plt.show()

In the above example, we have a subplot enclosed in another subplot.

Histogram

Histogram


  Data visualization

Table of Contents

Histogram

Histogram

A frequency distribution is a table that shows the number of times distinct values in a dataset occur. Histograms are used to evaluate the frequency distribution of a given variable by visually displaying the number of data points occurring in a certain range of values, histograms are useful when there are large datasets to analyse. Similar to a bar graph, data in a histogram are represented using vertical bars or rectangles. So a histogram appears similar to bar graphs, but the bars in a bar graph are usually separated whereas in histograms the bars are adjacent to each other.

Say for example, you are conducting an experiment and you want to visually represent the outcome of the experiment. In this experiment, you are rolling two dice 1000 times, the outcome of each event is recorded by appending the outcome to a list.

outcomes = [1,5,6,3,3,2,4,1,6,11,12,10,7,8,9,12,11,…]

If you want to see the pattern of the outcomes, it is difficult to analyse the list. We can visualize the pattern by generating a histogram showing the frequency of occurrence for the sum of two dice rolls. Histograms are useful for displaying the pattern of your data and getting an idea of the frequency distribution of the variable. To plot a histogram, the entire range of input dataset is split into equal sized groups or bins. A bar is drawn for each bin with the height proportional to the number of values in the input data that fall under the specified bin.

Plot a Histogram with random numbers

The histogram below is plotted with random numbers using the ‘hist’ function defined in the pyplot module. The rand function defined in the numpy library creates an array of specified shape and fills it with random numbers from 0 (included) to 1 (excluded).

%matplotlib inline
import numpy as np
from matplotlib import pyplot as plt
input_data = (np.random.rand(10**3))
plt.hist(input_data,bins=50,color='r',alpha=0.4)
plt.title('Histogram')
plt.xlabel('bins')
plt.ylabel('frequency')
plt.show()

In the above example, the rand function has generated 1000 random numbers and using the ‘hist’ function these random numbers are distributed in 50 different bins. It can be observed from the above histogram that the distribution of random numbers is more in some bins than the other bins. You can generate random numbers to the order of 10^4,10^5,10^6 and see how the values are distributed.

Plot a Histogram to analyze Airline On-time performance

The U.S. Department of Transportation’s (DOT) – Bureau of Transportation Statistics (BTS) releases a summary of statistics and basic analysis on airline performance each month. This dataset is a summary of different air carriers showing their departure delays, arrival delays, scheduled departure, etc. Let us analyse the flight data released by BTS. For this example, I have downloaded data from the following website – (https://transtats.bts.gov/ONTIME/Departures.aspx) into a csv file. This data is collected at JFK International Airport for American Airlines carrier during Jan’19.

Let us plot a histogram which shows the distribution of departure delays(in minutes) of all flights. The delay in departure is calculated as a difference in minutes between scheduled and actual departure time. In the input dataset, early departures are represented as negative numbers and on-time departures are represented with a zero.

import csv
from matplotlib import pyplot as plt
with open (r'C:\Users\Ajay Tech\Documents\training\visualization\Data\flight_delay_american.csv') as input_file1:
    csv_file = csv.reader(input_file1)
    header = next (csv_file)
    delay_min = []
    for row in csv_file:
        delay_min.append(int(row[5]))

bins = [-50,0,50,100,150,200,250,300,350,400,450,500,550,600,650,700,750]
plt.hist(delay_min,bins=bins,log=True,color='c')    
plt.axvline(np.mean(delay_min), color='r', linestyle='dashed', linewidth=1)
plt.title('Histogram of Departure Delays(AA)')
plt.xlabel('Delay(min)')
plt.ylabel('No of flights')
plt.xticks(bins,rotation=30)
plt.show()

In the above script the yscale is set to log scale instead of normal scale because log scale allows us to visualize variations that would otherwise be barely visible. We have marked the average departure delay time on the histogram with a vertical reference line drawn using the axvline function. The axvline function plots a line across the x-axis which can be used to highlight specific points on the histogram. The dotted vertical line on the histogram indicates that on an average, the American Airlines flights departing from JFK airport took off 7 minutes late in Jan’19.

Let us also see the performance of another carrier at JFK airport for the same period.

with open (r'C:\Users\Ajay Tech\Documents\training\visualization\Data\flight_delay_jetblue.csv') as input_file2:
    csv_file = csv.reader(input_file2)
    header = next (csv_file)
    delay_min = []
    for row in csv_file:
        delay_min.append(int(row[5]))

bins = [-50,0,50,100,150,200,250,300,350,400,450,500,550,600,650,700,750]
plt.hist(delay_min,bins=25,log=True,color='b',alpha=0.3)    
plt.axvline(np.mean(delay_min), color='r', linestyle='dashed', linewidth=1)
plt.title('Histogram of Departure Delays(JB)')
plt.xlabel('Delay(min)')
plt.ylabel('No of flights')
plt.xticks(rotation=30)
plt.show()

The vertical line drawn using the axvline function indicates that the average departure delay time for JetBlue Airways flights flying out of JFK is 14 minutes. In fact, JetBlue Airways was named as the most-delayed airline at JFK airport.

Bar Charts

Bar Charts


  Data visualization

Contents

Bar Chart

A bar chart represents data values in the form of vertical bars. Each vertical bar in the graph represents an individual category. The bars are used to compare values in different categories. In a bar chart, the length of a bar is proportional to the value it represents and the width remains same for all bars. One axis of the chart represents categories and the other axis represents the value scale.

Below we see a bar chart plotted using the bar function defined in pyplot module. The bar chart displays cars sales during a ten year period for an automobile company. The first argument to the bar function indicates the position of the bar on the x-axis with the center at the x-tick position. The second argument indicates the height. The width of each bar is 0.8 which is the default setting, and this can be changed using the ‘width’ parameter.

from matplotlib import pyplot as plt
import numpy as np
year = [2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018]
toyota_sales = [1843669,1496211,1488588,1396837,1764802,1893874,2004373,2098545,2106332,2129177,2128201]
x_pos = np.arange(len(year))
plt.bar(x_pos,toyota_sales,color='#623aa2',alpha=0.25,edgecolor='k',label='Toyota')
plt.xticks(x_pos,year,rotation=30)
plt.title('Toyota Car sales')
plt.xlabel('Year')
plt.ylabel('No of units sold')
plt.show()

Clustered Bar Chart

A clustered or a grouped bar chart is used to compare multiple data sets side by side. Say, you want to compare values of multiple datasets that come under the same category, then a clustered bar chart comes in handy. The previous example can be extended to display car sales of different automobile companies.

In the previous example, we used a bar chart to display the sales of an automobile company for a ten year period. Now we would like to compare the sales of three different companies for the same period. So we are going to have three vertical bars under each category, each bar representing a company. In order to differentiate the three datasets we use different colors for the bars.

year = [2008,2009,2010,2011,2012,2013,2014,2015,2016,2017,2018]
hyundai_sales = [401742,435064,538228,645691,703007,720783,725718,761710,768057,664943,667634]
honda_sales = [1284261,1045061,1096874,1023986,1266569,1359876,1373029,1409386,1476582,1486827,1445894]
toyota_sales = [1843669,1496211,1488588,1396837,1764802,1893874,2004373,2098545,2106332,2129177,2128201]
x_pos = np.arange(len(year))
width = 0.25
plt.bar(x_pos-width, toyota_sales, edgecolor='k',color='#E59998',width=width,label='Toyota')
plt.bar(x_pos, honda_sales, edgecolor='k',color='#FAD8AA',width=width,label='Honda',alpha=0.75)
plt.bar(x_pos+width, hyundai_sales, edgecolor='k',color='#538790',width=width,label='Hyundai')
plt.xticks(x_pos,year,rotation=30)
plt.title('Car sales')
plt.xlabel('Year')
plt.ylabel('No of units sold')
plt.legend()
plt.show()

Horizontal Bar Chart

Horizontal Bar Charts represent data in the form of horizontal bars, each bar representing an individual category. The data categories are shown on the y-axis and the data values are shown on the x-axis. The length of a bar is proportional to the value it represents.

The example below demonstrates how to plot a bar chart, the input datasets required for plotting are available in a csv file. We will import the built-in csv module to work with csv files.

import csv
with open (r'C:\Users\Ajay Tech\Desktop\air_pollution_index.csv') as input_file:
    csv_file = csv.reader(input_file,delimiter = ',')
    Header = next(csv_file)
    country = []
    index = []
    for row in csv_file:
        country.append(row[0])
        index.append(row[1])

plt.bar(country,index,color='#ff753e') 
plt.title('Air Pollution Index')
plt.xlabel('Index')
plt.show()

The above graph is plotted using the bar function. As can be observed from the figure the x-axis labels are overlapping with each other because the labels are too long. This problem can be solved using a horizontal bar chart, which makes optimal use of the space available. If the data labels are long or if you have too many data sets to plot, then horizontal bar charts can be used for plotting.

plt.barh(country,index,color='#ff753e') 
plt.title('Air Pollution Index')
plt.xlabel('Index')
plt.show()

Pie Chart

Pie Chart


  Data visualization

Contents

  • Pie Chart

Pie Chart

A pie chart is a circular chart divided into segments. The segments of a pie chart are called wedges. Each wedge represents an individual category. Pie charts display the contribution of each wedge to the total value and the sum of the values of all wedges should be equal to the whole circle. Pie charts are useful when there are few categories to compare(5 or less) else it becomes difficult to interpret the data.

Below we see a pie chart plotted using the pie function defined in pyplot module. The pie chart displays the percentage of marks obtained by a student in four subjects. The circle is divided into four wedges and the area of each wedge is proportionate to the value it represents.

from matplotlib import pyplot as plt
plt.style.use('default')
x = [22,18,13,10]
labels = ['maths','physics','chemistry','english']
colors = ['m','r','y','c']
plt.pie(x,labels=labels,colors=colors,autopct='%.1f%%')
plt.title('Marks obtained in an exam')
plt.show()

The pie chart below displays the market share of mobile phone vendors worldwide.

plt.style.use('default')
x = [33.71,19.37,4.85,3.82,7.42,30.83]
labels = ['Samsung','Apple','Huawei','LG','Unknown','Others']
colors = ['m','r','y','b','g','c']    
plt.pie(x,labels=labels,colors=colors, explode=[0,0,0,.2,0,0],autopct='%1.2f%%',startangle=45,counterclock =True,shadow=True,
        wedgeprops={"edgecolor":"0",'linewidth': 1,'linestyle': '-'})
plt.title ('Smartphone market share')
plt.axis('equal')
plt.legend(loc=2)
plt.show()

x — The first argument passed to the pie function is an array or a list denoting the values for the categories to be compared.

Labels — The labels argument is a list of strings used for labelling each wedge.

Colors — You can define an array/list of colors and then pass it to the pie function that will be applied to each wedge in pie chart in the order specified in the array.

Explode — If you want to highlight or emphasize key data in a pie chart use the explode parameter. The explode parameter explodes/expands a wedge, so the wedge is moved slightly outward from its center. This parameter accepts an array and each element in the array specifies by what fraction of the radius the wedge needs to be exploded. The value has to be defined for all wedges in the pie chart, so the length of the array should be equal to the number of wedges in the pie chart.

Autopct — If you want to label the wedges with their numeric value in a pie chart, then use autopct parameter. This parameter allows us to display the percent value using string formatting. Say for example, the percent value calculated for a segment is 34.678666 and if you want to display the percent value rounded to 1 decimal place then autopct parameter should be assigned the format string ‘%1.1f’ then the wedge will be labelled with the numeric value 34.6. If you want to add a percent sign (%) to the label then use two percent signs(%%) in the format string so that the special character ‘%’ escapes itself.

Startangle — By default the Startangle is zero, which means starting from the positive x-axis the wedges are arranged in the counter clock wise direction. If you specify a different Startangle then the start of the pie chart is shifted by this angle in degrees and then the wedges are arranged in counter clock wise direction from this position.

Counterclock — Specifies the direction in which the wedges are arranged, clockwise or counter clockwise. The default value is True.

Shadow — A shadow effect can be added to the pie chart using the shadow parameter of the pie() function, passing boolean value – True will make a shadow appear below the pie chart. By default shadow is turned off.

Wedgeprops — The wedges of the pie chart can be customized using the wedgeprop parameter. A dictionary with the property name and value as the key, value pairs can be passed as the wedgeprop argument. The wedge properties like edgecolor, linestyle, linewidth can be specified.

Line Plots

Line Plots


  Data visualization

Contents

What is a Line Plot?

The pyplot module in matplotlib supports a variety of plots such as – line plot, pie chart, bar chart, histogram, scatter plot etc., The module defines methods that are used to render various plots. In this tutorial, we will discuss about line plots.

A line plot is created by connecting the values in the input data with straight lines. Line plots are used to determine the relation between two datasets. A dataset is a collection of values. Each dataset is plotted along an axis ie., x and y axis. In order to draw a line plot, we call the plot function defined in pyplot module. We pass two arguments (arrays or lists) to the plot function, the first argument denotes the x-coordinates, second argument denotes the y-coordinates. The plot function plots the data points (x1,y1), (x2,y2) and so on defined in the input datasets and by default, draws a line between these data points. Before drawing a plot, let us the see the components that make up a basic plot.

Components of a basic plot

A basic plot is made up of the following components:

  1. Title – Title describes the information that we want to convey using the graph.
  2. Label – Label is a short description of the datasets being plotted.
  3. Scales – Scales determine the reference points for data displayed on the graph.
  4. Points – Points in a graph represent the input data in the form of x-coordinate and y-coordinate (x1,y1).
  5. Lines – Lines are used to connect points to highlight the change in values.

Plotting a graph

Matplotlib makes extensive use of the Numpy library which contains a number of mathematical functions which can be used to perform various mathematical operations. We need to import Matplotlib and Numpy libraries before making any calls to the routines defined in them. The below example demonstrates creation of a line plot by passing two numpy arrays x and y as arguments to the plot function.

from matplotlib import pyplot as plt
import numpy as np
x = np.linspace(0,10,10)
y = np.linspace(0,10,10)
plt.title('First Plot')
plt.xlabel('x-axis')
plt.ylabel('y-axis')
plt.plot(x,y,marker='o')
plt.show()

In order to render the above plot, we simply passed two arrays x,y to the plot function. We can see that the plot function has:

  1. Drawn the x and y axis.
  2. Marked evenly spaced scales(tick marks) on both the axes.
  3. Plotted the data points.
  4. Connected the data points with lines.
  5. Added the title, xlabel, ylabel.

Before executing the above mentioned steps, the plot function first creates a figure object. Figure objects are the individual windows on the screen in which Matplotlib displays the graphical output. It is a container for the graphical output. In the Jupyter NB, figures rendered by the Matplotlib library, are included inline. The plot function implicitly creates a figure object and then plots the graph. So, we do not have to call any other function to instantiate a figure object to render a plot when using the plot function. The standard size of a figure object is 8 inches wide by 6 inches high.

Say, for example, we have a requirement to create a figure with a specified size (4 inches wide, 4 inches high). For this, we need to call the Figure method defined in the pyplot module explicitly. The ‘figsize’ parameter of this method allows us to specify the width and height of a figure in unit inches and new a figure will be created. In order to render a plot, call the plot function.

The savefig() method saves the figure to a data file with a name specified by the string argument. The filename can be a full path and can also include a file extension if needed.

plt.figure(num=1,figsize=(4,4),dpi=100)
plt.plot(x,y,marker='o')
plt.savefig('second_plot.png')
plt.close()

Creating a line plot by passing a single array

We can pass a single dataset or an array to the plot function as shown in cell below. The plot function uses the values 0, 1, …, N-1 as the x coordinates where ‘N’ is the size of the y array.

plt.title('Second Plot')
plt.xlabel('x-axis')
plt.ylabel('y-axis')
plt.plot(y)
plt.show()

Multiple plots on the same graph

We can plot multiple plots on the same graph by calling the plot function for each dataset pair, this is useful to compare the plots. Each plot is rendered on top of another plot. Notice how Matplotlib applies a different color to each plot. The plots share the figure, x and y axis.

plt.plot(x,y)
plt.plot(x,x**2)
plt.plot(x,x**1/2)
plt.title('Multiple Plots')
plt.xlabel('x-axis')
plt.ylabel('y-axis')
plt.show()

The above code can be re-written as follows, that is by passing each dataset pair in an order as arguments to the plot function.

plt.plot(x,y,x,x**2,x,x**1/2)
plt.title('Multiple Plots2')
plt.xlabel('x-axis')
plt.ylabel('y-axis')
plt.show()

Line Properties

A line drawn on a graph has several properties such as color, width, style, transparency etc., these properties can be customized as per our requirement when we call the plot function.

plt.plot(x,y,color='c',marker='o',linewidth=2,linestyle='--',alpha=0.5,label='line1')
plt.plot(x,x**2,color='#90BC38',marker='D',linewidth=2,alpha=0.5,label='line2')
plt.legend()
plt.show(

Color codes

Colors in data visualization are used to enhance the look of the graphs, communicate the information clearly and to distinguish one set of data from another. The following basic colors are defined in Matplotlib.

Matplotlib also supports HEX colors. Web designers and developers use HEX colors in web designing. A HEX color is represented as a six-digit combination of numbers and letters defined by the amount of red, green and blue (RGB) that makes up the color.

Linestyle

Linestyle specifies whether the line is solid, dashed etc.,

Markers

Markers on a line plot are used to highlight particular data points.

Legend

A legend is a box associating labels(text) with lines on a graph. The legend() method is used to add a legend to the plot. This method can be called in multiple ways:

  1. plt.legend() – When no arguments are passed to the legend() method, the plots to be added in the legend are automatically detected, and the corresponding labels are used in the legend.
  2. plt.legend([‘label1’, ‘label2’, ‘label3’]) – The legend method can also be called by passing a list of string labels, where each string is used as a label for the plots in the order they were created. This method can be used to create a legend for the plots already existing on the axes. Note: This way of using the legend method is often discouraged because you should remember the order in which the plots are created which can be confusing.
  3. plt.legend([plot1,plot2,plot3],[‘label1′,’label2′,’label3’]) – We can explicitly specify the plots and labels by passing the list of plots followed by the list of string labels arranged in order to the legend method.

Introduction to Matplotlib

Introduction to Matplotlib


  Data visualization

Contents

Data visualization

Data visualization is the representation of data in a graphical format. It helps putting data in a visual form. Information, relationships, patterns that might go unnoticed in a text-based format can be easily recognized with data visualization software. This is because the human brain can understand and process visuals such as images, graphs or charts more easily compared to having the data in spreadsheets or in the form of reports. Data visualizations can turn large and small datasets into visuals.

The table below displays an independent variable ‘x’ and three functions a,b and c. Each of these functions are dependent on the variable ‘x’. Let us use the data in the table below to plot these functions.

From the plot, we can observe the relation between the variable and each of the functions and can infer that the cubic function(c) grows much faster compared to the identity function(a) and the square function(b).

In this example, we have a small dataset, so analysing data from the table is easy. But what if we have a dataset with millions of entries or a complex function to be analysed. In that case, having a graphical representation of data would be useful. There are various types of graphs or charts to represent data in a visual form. The type of data given and what we want to convey to the user determine the appropriate graph to be used. Line plots, Pie charts, Bar charts, Histogram, Scatter Plots etc., are few examples of graphs.

Matplotlib

This course will take an in-depth look at the Matplotlib tool for visualization in Python.

Matplotlib is a Python package that is widely used throughout the scientific Python community to create high-quality and publication-ready graphics.

Matplotlib is Python’s alternative to MATLAB and it has the advantage of being free and open-source, whereas MATLAB is expensive and closed source.

The Matplotlib library provides the pyplot module, which contains functions which closely resemble the MATLAB plotting syntax and functionality.

Matplotlib is built on Numpy arrays. It supports a wide range of export formats suitable for both web and print publishing. It supports high-quality output formats such as PNG, PDF, SVG, EPS and PGF.

Installing Matplotlib

Install Matplotlib with pip

Matplotlib can be installed using the Python package manager, pip. To install Matplotlib with pip, open a terminal window and type:

$ pip install matplotlib

This command installs Matplotlib in the current working Python environment.

Install Matplotlib with the Anaconda distribution of Python

The easiest way to install Matplotlib is to download and install the Anaconda distribution of Python. The Anaconda distribution of Python comes with Matplotlib included and no further installation steps are required. You can download the latest version of Anaconda by following this link – https://www.anaconda.com/download/.

Backend

Matplotlib uses a backend to render the plots. Backend is a utility used to create graphs. There are two types of backends, interactive and non-interactive. Interactive backends display the figure in a graphical user interface, which allows us to pan and zoom the figure. Non-interactive backends are used to produce image files. Matplotlib supports the following backends:

Backends: GTKAgg, GTK3Agg, GTK, GTKCairo, GTK3Cairo, WXAgg

The Jupyter Notebook supports the ‘inline’ backend. With this backend, the output of plotting commands is displayed inline, that is directly below the code cell that produced it. The inline backend renders a static or a stand alone plot. The resulting plots will be stored in the notebook document.

The ‘inline’ backend can be invoked using the following command: %matplotlib inline

The Jupyter notebook also supports the ‘notebook’ backend which renders an interactive plot. Just below the plot, we can find a toolbar to switch views, pan, zoom and download options.

The ‘notebook’ backend can be invoked using the following command: %matplotlib notebook

Basic Plotting with Matplotlib

%matplotlib inline
from matplotlib import pyplot as plt
plt.plot([1,2,3,4,5],[1,2,3,4,5])
plt.show()

%matplotlib notebook
from matplotlib import pyplot as plt
plt.plot([1,2,3,4,5],[5,4,3,2,1])
plt.show()