Logistic Regression
Contents
- What is Logistic Regression
- Math
- Implementation
- Dataset
- Modeling
- Visualization
- Basic Evaluation
- Optimization
- Evaluation
- ROC Curve
- Area under ROC Curve
- References
What is Logistic Regression
Logistic regression is closely related to linear regression, but it is used for classification rather than for predicting a continuous value. Confusing, right? Let's dive in.
Let's take the simple iris dataset. The target variable, as you know by now (from day 9 – Introduction to Classification in Python, where we discussed classification using K Nearest Neighbours), is categorical in nature. Let's load the data first.
head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
<dbl> <dbl> <dbl> <dbl> <fct>
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
Let’s simplify the dataset to just 2 species –
- 0 – setosa
- 1 – versicolor
Let's just take the data for 2 of the species (say setosa and versicolor) with just the sepal measurements (sepal length and sepal width) and plot it.
iris_data = iris[1:100,]
plot(iris_data$Sepal.Length, iris_data$Sepal.Width,
col = iris_data$Species, pch = 19,
xlab = "Sepal Length",
ylab = "Sepal Width")

Let's simplify this further – say we want to predict the species based on a single parameter, Sepal Length. Let's first plot it.
plot ( iris_data[,1], iris_data[,5],
pch = 19, col = "blue",
xlab = "Sepal Length",
ylab = "Setosa or Versicolor")

We know that regression is used to predict a continuous variable. What about a categorical variable like this one (species)? Suppose we can draw a curve like this,

and then put every prediction with a value > 0.5 in one category and every prediction with a value < 0.5 in the other category – like this. We would effectively have turned regression into a classifier.

A linear regression equation (multilinear in this case, since there is more than one predictor) looks like this:
y = β0 + β1x1 + β2x2 + … + βnxn
Logistic regression is very similar to linear regression. The difference lies in what that linear equation predicts – not the target itself, but the log of the odds of the target. Let's see how in the next section.
Math
The name logistic regression is derived from the logit function. This function is based on odds.
logit function
Let's take an example. A standard die roll has 6 outcomes. So, what is the probability of landing a 4?
p = 1/6 ≈ 0.167
Now, what about odds? The odds of landing a 4 are the chances of landing a 4 against the chances of not landing a 4 – 1 to 5. So, when we substitute p into the odds equation, it becomes
odds = p / (1 − p) = (1/6) / (5/6) = 1/5 = 0.2
OK. Now that we understand probability and odds, let's get to the log of odds – the logit function:
logit(p) = log( p / (1 − p) )
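To make these numbers concrete, here is a quick sketch in R for the die example above (just an illustration of the formulas, nothing more):
# probability, odds and log-odds of rolling a 4 with a fair die
p        = 1 / 6
odds     = p / (1 - p)    # 0.2, i.e. 1 to 5
log_odds = log(odds)      # about -1.61
cat("p =", p, " odds =", odds, " log(odds) =", log_odds, "\n")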
How exactly is logistic regression similar to linear regression? Like so:
log( p / (1 − p) ) = β0 + β1x1 + β2x2 + … + βnxn
where the quantity being predicted (the log of odds) varies between −∞ and +∞.
To understand this better, let's plot the log of odds for probability values between 0 and 1.
x = runif(100, min = 0.1, max = 0.9)   # 100 random probability values
x = sort(x)
y = log(x / (1 - x))                   # log of odds
plot(x, y, type = "l")
grid(5, 5, lwd = 2)

This is the logit curve. It maps a probability value (0 to 1) to a number (−∞ to +∞). However, we are not looking for a continuous value, right? What we want to predict is a categorical variable – and, as we said, we can get there from a probability:
- p >= 0.5 – Category 1
- p < 0.5 – Category 2
In order to get those probabilities back, we need the inverse of the logit function.
sigmoid function
The inverse of the logit curve is the inverse-logit or sigmoid function (also called the expit function). The sigmoid function transforms numbers in (−∞, +∞) back to values between 0 and 1. Here is the formula for the sigmoid function:
sigmoid(x) = e^x / (1 + e^x) = 1 / (1 + e^(−x))
x_new = y                                # the log-odds values from the previous plot
y_new = exp(x_new) / (1 + exp(x_new))    # sigmoid: back to probabilities
plot(x_new, y_new, type = "l")
grid(5, 5, lwd = 2)
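As a quick sanity check (a small aside, not part of the original walk-through), applying the sigmoid to the log of odds should give us back the probability we started from:
# verify that sigmoid(logit(p)) recovers p for a few probabilities
p        = c(0.1, 0.25, 0.5, 0.75, 0.9)
log_odds = log(p / (1 - p))                      # probability -> log-odds
p_back   = exp(log_odds) / (1 + exp(log_odds))   # log-odds -> probability
all.equal(p, p_back)                             # should be TRUE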

Essentially, if you flip the logit curve over (reflect it across the line y = x – which is exactly what taking an inverse does), you get the sigmoid curve.


Here is the trick – as long as we can find a curve like the one below, even though what it outputs is a value between 0 and 1 (a probability), we can say that all values below 0.5 (the halfway mark) belong to one category and the remaining values (above 0.5) belong to the other category. This is the essence of logistic regression.
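Here is a tiny sketch of that idea with made-up numbers – the scores below are hypothetical values of the linear part, not fitted to any data:
# squash hypothetical linear scores through the sigmoid, then threshold at 0.5
score = seq(-4, 4, by = 1)
prob  = 1 / (1 + exp(-score))
label = ifelse(prob >= 0.5, "category 1", "category 2")
data.frame(score, prob = round(prob, 3), label)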

Implementation
Let’s try to implement the logistic regression function in R step by step.
Data & Modeling
Just to keep the same example going, let's fit the sepal data (length and width) to predict the species as either setosa or versicolor.
model = glm(Species ~ Sepal.Length + Sepal.Width,
data=iris_data,
family=binomial(link="logit"))
Warning message:
"glm.fit: algorithm did not converge"
Warning message:
"glm.fit: fitted probabilities numerically 0 or 1 occurred"
These warnings show up because setosa and versicolor are perfectly separable on the sepal measurements, so the optimizer keeps pushing the fitted probabilities towards exactly 0 and 1. The predictions are still perfectly usable for our purpose.
y_pred = predict(model, iris_data, type="response")
y_pred
(output truncated – 100 fitted probabilities) Observations 1–50, the setosas, get probabilities of essentially 0 (around 2.2e-16), while observations 51–100, the versicolors, get probabilities of essentially 1.
These are probability values. We need to convert them to the actual factor levels (setosa and versicolor). Since we are dealing with just 2 classes, a simple ifelse() will do: every value > 0.5 becomes versicolor and everything else becomes setosa.
y_pred_levels = as.factor(ifelse(y_pred>0.5,"versicolor","setosa"))
y_pred_levels
(output truncated – 100 predicted labels) Observations 1–50 are predicted as setosa and observations 51–100 as versicolor.
library(caret)
cm = confusionMatrix(y_pred_levels,iris_data[,5])
cm
Warning message in levels(reference) != levels(data):
"longer object length is not a multiple of shorter object length"
Warning message in confusionMatrix.default(y_pred_levels, iris_data[, 5]):
"Levels are not in the same order for reference and data. Refactoring data to match."
Confusion Matrix and Statistics
Reference
Prediction setosa versicolor virginica
setosa 50 0 0
versicolor 0 50 0
virginica 0 0 0
Overall Statistics
Accuracy : 1
95% CI : (0.9638, 1)
No Information Rate : 0.5
P-Value [Acc > NIR] : < 2.2e-16
Kappa : 1
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: setosa Class: versicolor Class: virginica
Sensitivity 1.0 1.0 NA
Specificity 1.0 1.0 1
Pos Pred Value 1.0 1.0 NA
Neg Pred Value 1.0 1.0 NA
Prevalence 0.5 0.5 0
Detection Rate 0.5 0.5 0
Detection Prevalence 0.5 0.5 0
Balanced Accuracy 1.0 1.0 NA
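The level-mismatch warnings above appear because iris_data$Species still carries the unused virginica level while our predictions only have two levels. A minimal alternative (same data, same predictions) is to drop the unused level before building the confusion matrix:
# drop the unused "virginica" level so both factors have the same two levels
reference = droplevels(iris_data$Species)
cm = confusionMatrix(y_pred_levels, reference)
cm$table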
Basic Evaluation
Let's split the data into training and test sets and model it. This time let's use the full iris dataset. Since there are 3 species to be predicted, we cannot use glm with the "binomial" family. Let's use the multinom() function from another library, nnet.
library(nnet)
index = sample(1:nrow(iris), nrow(iris) * 0.8)   # random 80/20 train/test split
train = iris[index,]
test = iris[-index,]
model = multinom(Species ~ ., data = train)
# weights: 18 (10 variable)
initial value 131.833475
iter 10 value 11.516467
iter 20 value 4.881298
iter 30 value 4.469920
iter 40 value 4.263054
iter 50 value 3.911756
iter 60 value 3.823284
iter 70 value 3.598069
iter 80 value 3.591202
iter 90 value 3.570975
iter 100 value 3.570835
final value 3.570835
stopped after 100 iterations
pred = predict(model,test)
As usual, to evaluate categorical target data, we use a confusion matrix.
library(caret)
cm = confusionMatrix(pred, as.factor(test$Species))
cm
Confusion Matrix and Statistics
Reference
Prediction setosa versicolor virginica
setosa 16 0 0
versicolor 0 8 1
virginica 0 2 3
Overall Statistics
Accuracy : 0.9
95% CI : (0.7347, 0.9789)
No Information Rate : 0.5333
P-Value [Acc > NIR] : 1.989e-05
Kappa : 0.8315
Mcnemar's Test P-Value : NA
Statistics by Class:
Class: setosa Class: versicolor Class: virginica
Sensitivity 1.0000 0.8000 0.7500
Specificity 1.0000 0.9500 0.9231
Pos Pred Value 1.0000 0.8889 0.6000
Neg Pred Value 1.0000 0.9048 0.9600
Prevalence 0.5333 0.3333 0.1333
Detection Rate 0.5333 0.2667 0.1000
Detection Prevalence 0.5333 0.3000 0.1667
Balanced Accuracy 1.0000 0.8750 0.8365
That's a 90% accuracy score.
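As a small aside, you don't have to read the accuracy off the printed summary – caret stores it on the object returned by confusionMatrix():
# pull the overall accuracy and its 95% confidence interval out of the confusionMatrix object
cm$overall["Accuracy"]
cm$overall[c("AccuracyLower", "AccuracyUpper")]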
Optimization
Let's plot the logistic regression curve for a harder two-class subset of the data – versicolor vs. virginica – using just the sepal length.
Step 1 – Get the data
iris_data = iris[51:150,]
iris_data = iris_data[order(iris_data$Sepal.Length),]
head(iris_data)
    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
           <dbl>       <dbl>        <dbl>       <dbl>      <fct>
58           4.9         2.4          3.3         1.0 versicolor
107          4.9         2.5          4.5         1.7  virginica
61           5.0         2.0          3.5         1.0 versicolor
94           5.0         2.3          3.3         1.0 versicolor
99           5.1         2.5          3.0         1.1 versicolor
60           5.2         2.7          3.9         1.4 versicolor
Step 2 – Model the data using a classifier
model = glm( Species ~ Sepal.Length, data = iris_data , family = binomial)
Step 3 – Plot the Logit curve.
library(pROC)
plot(iris_data$Sepal.Length, iris_data$Species, pch = 19, col = iris_data$Species)
# the fitted probabilities run from 0 to 1; shift them up by 2 so the curve
# overlays the species codes (2 = versicolor, 3 = virginica) on the y-axis
points(iris_data$Sepal.Length, model$fitted.values + 2, col = "orange", type = "l", lty = 2, lwd = 3)


As you can see, there are still quite a few mis-classifications. All the false negatives and false positives in the plot above are examples of mis-classification. Irrespective of the algorithm used to calculate the fit, there is only so much that can be done to increase the classification accuracy given the data as-is. The rates of true positives and true negatives have their own names:
- Sensitivity – the true positive rate
- Specificity – the true negative rate
There is, however, a specific optimization that can be done – increase the accuracy of one segment of the confusion matrix at the expense of the other segments. For example, look at a visual of the confusion matrix for our dataset.

For this dataset, classifying the species as "setosa" is positive and "versicolor" as negative.
- setosa – positive
- versicolor – negative
Let's actually calculate the accuracy values. Say the confusion matrix looks like this:
[[11,  1],
 [ 1, 12]]
# each value below is expressed as a fraction of its row total in the matrix above
tp = (11) / (11 + 1)
fp = (1) / (11 + 1)
fn = (1) / (1 + 12)
tn = (12) / (12 + 1)
cat ("True Positive = ", tp, "\n")
cat ("False Positive = ", fp, "\n")
cat ("True Negative = ", tn, "\n")
cat ("False Negative = ", fn, "\n")
True Positive = 0.9166667
False Positive = 0.08333333
True Negative = 0.9230769
False Negative = 0.07692308

What if we want to predict setosa with 100% accuracy (or at least much better than 0.9)? Of course, as discussed earlier, it will come at a cost. However, there is a use case for this scenario: if getting one particular class right is extremely important, we focus on that class more than on the others. Have you seen the Brad Pitt movie "World War Z"? A plague emerges all around the world and an asylum is set up in Israel behind a high wall. Before you are let inside the wall, they make absolutely sure that you do not have the plague. If having the plague is what we call positive, then they are essentially maximizing the green box in the picture above.
Another example: if you were diagnosing cancer patients, you would rather increase the odds of predicting cancer when the patient really has it (a true positive), even if it comes at the cost of wrongly classifying a non-cancer patient as positive (a false positive). The former can save a life, while the latter will just cost the company a patient.
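To see this trade-off on our own data, here is a sketch using the Sepal.Length model fitted in the Optimization steps above (the 0.3 cutoff is an arbitrary choice for illustration, and here the model's "positive" class is virginica): lowering the cutoff below 0.5 catches more virginica at the price of more false positives.
# compare confusion matrices at the default 0.5 cutoff and at a lower 0.3 cutoff
probs = model$fitted.values              # fitted probabilities of virginica from the glm above
actual = droplevels(iris_data$Species)   # versicolor / virginica
pred_050 = factor(ifelse(probs > 0.5, "virginica", "versicolor"), levels = levels(actual))
pred_030 = factor(ifelse(probs > 0.3, "virginica", "versicolor"), levels = levels(actual))
table(Predicted = pred_050, Actual = actual)
table(Predicted = pred_030, Actual = actual)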
Evaluation
ROC Curve
Receiver Operating Characteristic – also called the ROC curve – is a measure of how good the classification is. The pROC package in R gives us an easy way to create the ROC curve and calculate the area under it. First off, let's start with a classifier like logistic regression.
Step 1 – Get the data
iris_data = iris[51:150,]
iris_data = iris_data[order(iris_data$Sepal.Length),]
model = glm( Species ~ Sepal.Length, data = iris_data , family = binomial)
library(pROC)
# iris$Species has 3 classes and hence 3 factor levels. So we convert it to a
# 0/1 factor using factor(c(iris_data$Species) - 2): c() on a factor returns the
# integer codes (2 = versicolor, 3 = virginica), and subtracting 2 turns them into 0 and 1.
roc = roc ( factor (c(iris_data$Species) - 2), model$fitted.values, plot = TRUE, legacy.axes = TRUE)
Setting levels: control = 0, case = 1
Setting direction: controls < cases

Area under the curve is an indicator of how accurate our classifier is. You can get it as follows.
roc$auc
0.7896
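If you also want a suggested cutoff (an optional extra, using pROC's coords() helper), you can ask for the threshold that maximises sensitivity + specificity:
# threshold maximising sensitivity + specificity (Youden's J), with the rates at that point
coords(roc, "best", ret = c("threshold", "sensitivity", "specificity"))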
References
- Scikit Learn Implementation details of Logistic Regression
- Receiver Operating Characteristics (ROC) Curve from scratch