Logistic Regression
Contents
- What is Logistic Regression
- Math
- Implementation
- Dataset
- Modeling
- Visualization
- Basic Evaluation
- Optimization
- Evaluation
- ROC Curve
- Area under ROC Curve
- References
What is Logistic Regression
Logistic regression is closely related to linear regression. However, it is used for classification only. Huh... that’s confusing, right? Let’s dive in.
Let’s take the simple iris dataset. As you know by now (from day 9 – Introduction to Classification in Python, where we discussed classification using K Nearest Neighbours), the target variable is categorical in nature. Let’s load the data first.
head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
         <dbl>       <dbl>        <dbl>       <dbl> <fct>
1          5.1         3.5          1.4         0.2 setosa
2          4.9         3.0          1.4         0.2 setosa
3          4.7         3.2          1.3         0.2 setosa
4          4.6         3.1          1.5         0.2 setosa
5          5.0         3.6          1.4         0.2 setosa
6          5.4         3.9          1.7         0.4 setosa
Let’s simplify the dataset to just 2 species –
- 0 – setosa
- 1 – versicolor
Let’s just take the data for 2 of the species (say setosa and versicolor) with just the sepal data (sepal length and sepal width) and plot it.
iris_data = iris[1:100,]

plot(iris_data$Sepal.Length, iris_data$Sepal.Width, col = iris_data$Species, pch = 19,
     xlab = "Sepal Length", ylab = "Sepal Width")

Let’s simplify this further – say, we wanted to predict the species based on a single parameter – Sepal Length. Let’s first plot it.
plot(iris_data[,1], iris_data[,5], pch = 19, col = "blue",
     xlab = "Sepal Length", ylab = "Setosa or Versicolor")

We know that regression is used to predict a continuous variable. What about a categorical variable like this one (species)? If we can draw a curve like this,

and put all target values predicted above 0.5 in one category and all target values below 0.5 in the other category – like this.

A linear regression (multilinear in this case) equation looks like this: y = b0 + b1*x1 + b2*x2 + ... + bn*xn
Logistic regression is very similar to linear regression. The difference lies in how the predictor is calculated. Let’s see it in the next section.
Math
The name logistic regression is derived from the logit function. This function is based on odds.
logit function
The logit of a probability p is the log of its odds: logit(p) = log( p / (1 - p) )
Let’s take an example. A standard dice roll has 6 outcomes. So, what is the probability of landing a 4? It is p = 1/6.
Now, what about odds? The odds of landing a 4 are 1 to 5 – one way to land a 4 against five ways not to.
So, when we substitute p into the odds equation, it becomes odds = p / (1 - p) = (1/6) / (5/6) = 1/5.
OK. Now that we understand Probability and Odds, let’s get to the log of odds.
How exactly is logistic regression similar to linear regression? Like so:
log( p / (1 - p) ) = b0 + b1*x1 + b2*x2 + ... + bn*xn
where the predictor (the log of odds) varies from -∞ to +∞.
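To make this concrete, here is a small sketch (a quick sanity check, not part of the original walkthrough) that computes the probability, the odds and the log of odds for the dice example above.

p = 1/6                  # probability of landing a 4
odds = p / (1 - p)       # odds of landing a 4 = 1/5
log_odds = log(odds)     # the logit of p
cat("p =", p, " odds =", odds, " log(odds) =", log_odds, "\n")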
To understand this better, let’s plot the log of odds for probability values between 0 and 1.
x = runif(100, min = 0.1, max = 0.9)
x = sort(x)
y = log(x / (1 - x))

plot(x, y, type = "l")
grid(5, 5, lwd = 2)

This is the logit curve. It maps a probability value (0 to 1) to a number (-∞ to +∞). However, we are not looking for a continuous variable, right? The predictor we are looking for is a categorical variable – in our case, we said we would be able to predict this based on probability.
- p >= 0.5 – Category 1
- p < 0.5 – Category 2
In order to calculate those probabilities, we would have to calculate the inverse function of the logit function.
sigmoid function
The inverse of the logit curve is the inverse-logit or sigmoid function (also called the expit function). The sigmoid function transforms the numbers (-∞ to +∞) back to values between 0 and 1. Here is the formula for the sigmoid function: sigmoid(x) = exp(x) / (1 + exp(x)) = 1 / (1 + exp(-x))
x_new = y
y_new = exp(x_new) / (1 + exp(x_new))

plot(x_new, y_new, type = "l")
grid(5, 5, lwd = 2)

Essentially, if you flip the logit curve by 90° (swap the two axes), you get the sigmoid curve.


Here is the trick – as long as we are able to find a curve like the one below, although the target (predictor) is a value between 0 and 1 (a probability), we can say that all values below 0.5 (the halfway mark) belong to one category and the remaining values (above 0.5) belong to the other category. This is the essence of logistic regression.
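Just to convince ourselves that the two functions really are inverses of each other, here is a small sketch using base R’s built-in qlogis (logit) and plogis (sigmoid) functions – this is only a sanity check and is not required for the modeling below.

p = c(0.2, 0.5, 0.8)
log_odds = qlogis(p)         # logit: maps (0, 1) to (-Inf, +Inf)
p_back = plogis(log_odds)    # sigmoid: maps back to (0, 1)
all.equal(p, p_back)         # TRUE
ifelse(p_back >= 0.5, "Category 1", "Category 2")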

Implementation
Let’s try to implement logistic regression in R step by step.
Data & Modeling
Just to keep the same example going, let’s try to fit the sepal data (sepal length and width) to predict the species as either setosa or versicolor.
model = glm(Species ~ Sepal.Length + Sepal.Width, data=iris_data, family=binomial(link="logit"))
Warning message: "glm.fit: algorithm did not converge"
Warning message: "glm.fit: fitted probabilities numerically 0 or 1 occurred"
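These warnings typically indicate complete (or near-complete) separation – setosa and versicolor are essentially separable on the sepal measurements, so the fitted probabilities are pushed to (numerically) 0 and 1. The model is still usable for this illustration. A quick way to see it:

summary(model$fitted.values)   # probabilities pile up near 0 and near 1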
y_pred = predict(model, iris_data, type="response")
y_pred
1 2.22044604925031e-16 2 2.22044604925031e-16 3 2.22044604925031e-16 4 2.22044604925031e-16 5 2.22044604925031e-16 6 2.22044604925031e-16 7 2.22044604925031e-16 8 2.22044604925031e-16 9 2.22044604925031e-16 10 2.22044604925031e-16 11 2.22044604925031e-16 12 2.22044604925031e-16 13 2.22044604925031e-16 14 2.22044604925031e-16 15 2.22044604925031e-16 16 2.22044604925031e-16 17 2.22044604925031e-16 18 2.22044604925031e-16 19 7.94264732123352e-13 20 2.22044604925031e-16 21 7.2365355113674e-11 22 2.22044604925031e-16 23 2.22044604925031e-16 24 2.22044604925031e-16 25 2.22044604925031e-16 26 2.22044604925031e-16 27 2.22044604925031e-16 28 2.22044604925031e-16 29 2.22044604925031e-16 30 2.22044604925031e-16 31 2.22044604925031e-16 32 7.2365355113674e-11 33 2.22044604925031e-16 34 2.22044604925031e-16 35 2.22044604925031e-16 36 2.22044604925031e-16 37 6.31794371395771e-10 38 2.22044604925031e-16 39 2.22044604925031e-16 40 2.22044604925031e-16 41 2.22044604925031e-16 42 9.0266451369787e-10 43 2.22044604925031e-16 44 2.22044604925031e-16 45 2.22044604925031e-16 46 2.22044604925031e-16 47 2.22044604925031e-16 48 2.22044604925031e-16 49 2.22044604925031e-16 50 2.22044604925031e-16 51 1 52 1 53 1 54 1 55 1 56 1 57 1 58 0.999999999144556 59 1 60 0.999999999998715 61 1 62 1 63 1 64 1 65 1 66 1 67 1 68 1 69 1 70 1 71 1 72 1 73 1 74 1 75 1 76 1 77 1 78 1 79 1 80 1 81 1 82 1 83 1 84 1 85 0.999999998977491 86 1 87 1 88 1 89 1 90 1 91 1 92 1 93 1 94 1 95 1 96 1 97 1 98 1 99 1 100 1
These are probability values. We need to convert them to the actual factors (setosa and versicolor) because we are dealing with just 2 classes. We can use a simple ifelse() to convert all values > 0.5 to versicolor and all values < 0.5 to setosa.
y_pred_levels = as.factor(ifelse(y_pred>0.5,"versicolor","setosa"))
y_pred_levels
1 setosa 2 setosa 3 setosa 4 setosa 5 setosa 6 setosa 7 setosa 8 setosa 9 setosa 10 setosa 11 setosa 12 setosa 13 setosa 14 setosa 15 setosa 16 setosa 17 setosa 18 setosa 19 setosa 20 setosa 21 setosa 22 setosa 23 setosa 24 setosa 25 setosa 26 setosa 27 setosa 28 setosa 29 setosa 30 setosa 31 setosa 32 setosa 33 setosa 34 setosa 35 setosa 36 setosa 37 setosa 38 setosa 39 setosa 40 setosa 41 setosa 42 setosa 43 setosa 44 setosa 45 setosa 46 setosa 47 setosa 48 setosa 49 setosa 50 setosa 51 versicolor 52 versicolor 53 versicolor 54 versicolor 55 versicolor 56 versicolor 57 versicolor 58 versicolor 59 versicolor 60 versicolor 61 versicolor 62 versicolor 63 versicolor 64 versicolor 65 versicolor 66 versicolor 67 versicolor 68 versicolor 69 versicolor 70 versicolor 71 versicolor 72 versicolor 73 versicolor 74 versicolor 75 versicolor 76 versicolor 77 versicolor 78 versicolor 79 versicolor 80 versicolor 81 versicolor 82 versicolor 83 versicolor 84 versicolor 85 versicolor 86 versicolor 87 versicolor 88 versicolor 89 versicolor 90 versicolor 91 versicolor 92 versicolor 93 versicolor 94 versicolor 95 versicolor 96 versicolor 97 versicolor 98 versicolor 99 versicolor 100 versicolor
library(caret)

cm = confusionMatrix(y_pred_levels, iris_data[,5])
cm
Warning message in levels(reference) != levels(data):
"longer object length is not a multiple of shorter object length"
Warning message in confusionMatrix.default(y_pred_levels, iris_data[, 5]):
"Levels are not in the same order for reference and data. Refactoring data to match."
Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         50         0
  virginica       0          0         0

Overall Statistics

               Accuracy : 1
                 95% CI : (0.9638, 1)
    No Information Rate : 0.5
    P-Value [Acc > NIR] : < 2.2e-16

                  Kappa : 1
 Mcnemar's Test P-Value : NA

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                    1.0               1.0               NA
Specificity                    1.0               1.0                1
Pos Pred Value                 1.0               1.0               NA
Neg Pred Value                 1.0               1.0               NA
Prevalence                     0.5               0.5                0
Detection Rate                 0.5               0.5                0
Detection Prevalence           0.5               0.5                0
Balanced Accuracy              1.0               1.0               NA
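The warnings above appear because iris_data$Species still carries the unused "virginica" level. A minimal sketch to avoid them (reusing the y_pred_levels from above) is to drop the unused level from the reference before building the confusion matrix:

y_true = droplevels(iris_data$Species)    # drop the unused "virginica" level
cm = confusionMatrix(y_pred_levels, y_true)
cm$table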
Basic Evaluation
Let’s split the data into training and test sets and model it. This time let’s use the full iris dataset. Since there are 3 species to be predicted, we cannot use glm with the “binomial” family. Let’s use another library called nnet.
library(nnet)
index = sample(1:nrow(iris), nrow(iris) * 0.8)
train = iris[index,]
test  = iris[-index,]
model = multinom(Species~.,data = train)
# weights:  18 (10 variable)
initial  value 131.833475
iter  10 value 11.516467
iter  20 value 4.881298
iter  30 value 4.469920
iter  40 value 4.263054
iter  50 value 3.911756
iter  60 value 3.823284
iter  70 value 3.598069
iter  80 value 3.591202
iter  90 value 3.570975
iter 100 value 3.570835
final  value 3.570835
stopped after 100 iterations
pred = predict(model,test)
As usual, to evaluate categorical target data, we use a confusion matrix.
library(caret)

cm = confusionMatrix(pred, as.factor(test$Species))
cm
Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa         16          0         0
  versicolor      0          8         1
  virginica       0          2         3

Overall Statistics

               Accuracy : 0.9
                 95% CI : (0.7347, 0.9789)
    No Information Rate : 0.5333
    P-Value [Acc > NIR] : 1.989e-05

                  Kappa : 0.8315
 Mcnemar's Test P-Value : NA

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            0.8000           0.7500
Specificity                 1.0000            0.9500           0.9231
Pos Pred Value              1.0000            0.8889           0.6000
Neg Pred Value              1.0000            0.9048           0.9600
Prevalence                  0.5333            0.3333           0.1333
Detection Rate              0.5333            0.2667           0.1000
Detection Prevalence        0.5333            0.3000           0.1667
Balanced Accuracy           1.0000            0.8750           0.8365
That’s a 90% accuracy score.
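If you want to pull the accuracy (and its confidence interval) directly from the caret object instead of reading it off the printout, the overall element of the confusionMatrix result holds them:

cm$overall["Accuracy"]                              # overall accuracy
cm$overall[c("AccuracyLower", "AccuracyUpper")]     # bounds of the 95% CI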
Optimization
Let’s plot the logistic regression curve for a two-class subset of the data (versicolor vs virginica).
Step 1 – Get the data
iris_data = iris[51:150,]
iris_data = iris_data[order(iris_data$Sepal.Length),]
head(iris_data)
    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
           <dbl>       <dbl>        <dbl>       <dbl> <fct>
58           4.9         2.4          3.3         1.0 versicolor
107          4.9         2.5          4.5         1.7 virginica
61           5.0         2.0          3.5         1.0 versicolor
94           5.0         2.3          3.3         1.0 versicolor
99           5.1         2.5          3.0         1.1 versicolor
60           5.2         2.7          3.9         1.4 versicolor
Step 2 – Model the data using a classifier
model = glm( Species ~ Sepal.Length, data = iris_data , family = binomial)
Step 3 – Plot the Logit curve.
library(pROC)

# plot the two classes (factor codes 2 = versicolor, 3 = virginica) against sepal length
plot(iris_data$Sepal.Length, iris_data$Species, pch = 19, col = iris_data$Species)

# overlay the fitted probabilities; "+ 2" shifts the 0-1 curve up to the 2-3 range of the factor codes
points(iris_data$Sepal.Length, model$fitted.values + 2,
       col = "orange", type = "l", lty = 2, lwd = 3)


As you can see, there are still quite a few misclassifications. All the false negatives and false positives in the plot above are examples of misclassification. Irrespective of the algorithm used to calculate the fit, there is only so much that can be done to increase the classification accuracy with the data as-is. The true positive and true negative rates also go by other names (see the short sketch after this list):
- Sensitivity (true positive rate)
- Specificity (true negative rate)
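caret already reports both of these per class. For the 3-class model above, the byClass element of the confusionMatrix result is a matrix with one row per class – a small sketch:

cm$byClass[, c("Sensitivity", "Specificity")]   # one row per class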
There is one specific optimization that can be done – increase the accuracy of one segment of the confusion matrix at the expense of the other segments. For example, look at a visual of the confusion matrix for our dataset.

For this dataset, classifying the species as “setosa” is positive and “versicolor” is negative.
- setosa – positive
- versicolor – negative
Let’s actually calculate these rates. Say the confusion matrix looks like this:
    11   1
     1  12
tp = (11) / (11 + 1)   # true positive rate (sensitivity)
fn = (1) / (11 + 1)    # false negative rate
fp = (1) / (1 + 12)    # false positive rate
tn = (12) / (1 + 12)   # true negative rate

cat("True Positive Rate = ", tp, "\n")
cat("False Positive Rate = ", fp, "\n")
cat("True Negative Rate = ", tn, "\n")
cat("False Negative Rate = ", fn, "\n")

True Positive Rate =  0.9166667
False Positive Rate =  0.07692308
True Negative Rate =  0.9230769
False Negative Rate =  0.08333333

What if we want to predict 100% of setosa (or a much more accurate classification than 0.9)? Of course, as we discussed earlier, it will come at a cost. However, there is a use case for this scenario: if getting one particular classification right is extremely important, then we focus more on that classification than on the others. Have you seen Brad Pitt’s movie “World War Z”? A plague emerges all around the world and an asylum is set up in Israel behind a high wall. However, before letting anyone in through the wall, they make absolutely sure that person does not have the plague. If having the plague is what we call positive, then essentially they are maximizing the green box in the picture above.
Another example: if you were diagnosing cancer patients, you would rather increase the odds of predicting cancer when the patient really has it (a true positive), even if it comes at the cost of wrongly classifying a non-cancer patient as positive (a false positive). The former can save a life, while the latter just costs the company a patient.
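In practice, this trade-off is made by moving the classification threshold away from 0.5. Here is a sketch (the 0.2 cut-off is purely illustrative, and y_pred refers to the predicted probabilities from the earlier glm model): lowering the threshold flags more observations as positive, which raises sensitivity at the cost of more false positives.

# lower cut-off than 0.5: more observations get called "positive"
y_pred_sensitive = ifelse(y_pred > 0.2, "positive", "negative")
table(y_pred_sensitive)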
Evaluation
ROC Curve
Receiver Operating Characteristic – also called the ROC curve – is a measure of how good the classification is. The pROC package in R gives us an easy way to create the ROC curve and calculate the area under it. First off, let’s start with a classifier like logistic regression.
Step 1 – Get the data
iris_data = iris[51:150,]
iris_data = iris_data[order(iris_data$Sepal.Length),]

model = glm(Species ~ Sepal.Length, data = iris_data, family = binomial)

library(pROC)

# iris$Species has 3 classes and hence 3 factor levels. So, we convert the two
# remaining classes to a 0/1 factor: as.numeric() gives the codes 2 and 3,
# and subtracting 2 turns them into 0 and 1.
roc = roc(factor(as.numeric(iris_data$Species) - 2), model$fitted.values,
          plot = TRUE, legacy.axes = TRUE)
Setting levels: control = 0, case = 1
Setting direction: controls < cases

Area under the curve is an indicator of how accurate our classifier is. You can get it as follows.
roc$auc
Area under the curve: 0.7896
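If you also want the threshold that best balances sensitivity and specificity (rather than just the area), pROC can report it. A small sketch using the roc object from above:

coords(roc, "best", ret = c("threshold", "sensitivity", "specificity"))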
References
- Scikit Learn Implementation details of Logistic Regression
- Receiver Operating Characteristics (ROC) Curve from scratch