ROC Curve

ROC Curve


  Machine Learning in Python

Contents

What is ROC Curve

ROC stands for Receiver Operating Characteristic. This is a statistical method developed during World War II to analyze the performance of a Radar Operator. Radar Operators were supposed to look at the Radar screen and analyze if a point represented an enemy plane or just random noise. Much later in the 80’s when drugs were being developed on a massive scale, the same (ROC) was used to measure the performance of a drug to cure diseases. To understand ROC curve better, let’s actually create on in excel.

ROC Curve in Excel

Creating an ROC curve in excel is easy if you have the right tools. However, we are going to do it the hard way – everything from scratch. It feels a bit tiring, but the purpose is to understand the concept of ROC. If you feel this is overwhelming, you can skip to the section where we Interpret the ROC Curve and do the ROC Curve in Python.

Data

We will take the iris dataset to create the ROC curve .

Just to make things simple, we will only use one predictor – Sepal Length.

ROC curves only deal with binary response variables. Since we have 3 species (SetosaVersicolorVirginica), let’s eliminate the 3rd species (Virginica) and just have two of them.

To reduce the data points further, let’s use brackets of sepal length instead of raw data. For example, here we have created 6 brackets (of sepal length) and against each of them, we have counted the occurances of the species. Think of it just as a compressed version of the full data set.

Calculation

At each of the sepal length brackets, start preparing the same table. For example, here is a snapshot of the same table summarized at a cut-off sepal length. The second row ( sepal length > 4.7 ) is a summation of all rows starging with the second sepal length bracket (4.8 – 5.2) till the end (6.8 – 7.0).

In case you are still confused, here is another visual that shows how the summation is done.

We have to understand what we are trying to do here. Logically, what we are saying is that if we set a threshold of 4.7 as the cutoff (for sepal length) and say everything below 4.7 is a setosa and everything after a cut-off of 4.7 (sepal length) is a versicolor. We are just trying to figure out how each of these cut-offs fare.

True Positive Rate & False Positive Rate

But how do we calculate the performance of the classification at each of the cut-off points ? By calculating the percentage/fraction of true numbers (already classified) against the cut-off predictions. For example, when the cut-off is 4.7 (for sepal length) how well is the classification performing. Refer to the Confusion Matrix to understand the following terms.

  • True Positive
  • False Positive
  • True Negative
  • False Negative

Let’s quickly recap two of the key terms for our ROC curve.

P.S. – In our case, positive means setosa and negative means versicolor.

  • True Positive (Sensitivity
    Proportion of Positive values correctly (True) identified as positive values. For example, in our case, the True Positive rate for a cut-off of sepal length < 4.7 is as follows

It is a measure of how accurate we are in precting the True values (setosa) out of the total True values(setosa).

  • False Positive (1 – Specificity
    In simple terms, it is the fraction of the Negative values, falsely classified as positive. Another way to put it is the Proportion of Negative values(versicolor in this case) incorrectly identified as Positive values. In our case, the False positive rate for a cut-off sepal length of 4.7 is as follows.

It is a measure of how wrongly the Negative values are classified (as Positive).

Now that we understand the meaning of True Positive Rate vs False positive rate, let’s go ahead and calculate these fractions at all different cut-off points.

Plot ROC Curve

Tabulate the False Positive Rate and True Positive Rate at each of the cut-off points.

and plot it.

Interpret ROC Curve

Now that we have drawn the ROC curve, what does it really mean ?

In this dataset, we have 2 separate groups – setosa and versi-color. And if there was a clear line separating these two classes (based on the predictor – sepal length), then there is no need for any of this stuff. But that that is rarely the case – there is always a overlap. If you draw a histogram, you can clearly see the overlap.

To create the ROC curve, we have assumed multiple cut-off (threshold) points along the predictor line and see how well the threshold is able to separate the species.

Shape of the curve

The ROC curve is basically an indication of how well the separation(between setosa and versicolor) is taking place at different threshold points. The threshold points themselves are typically not shown in the plot.

The black dotted line connecting the left button to the right top corner represents a random separator – anything below that is worse performing. The more the ROC curve is above that line, the better the performance is. Basically, the more closer the curve is to the upper left corner, the better the performance of the test.

Area under the curve

The area is a good indication of how well the curve hugs the top left corner. More the area, better the test. The maximum value possible is 1.0. Here are the typical values.

Typical values for a good test are as below.

  • < 0.5 – Useless
  • 0.5 < x < 0.70 – Good
  • 0.7 < x < 0.90 – Very good
  • 0.9 < x < 1.00 – Excellent

ROC Curve in Python

Scikit Learn has an easy way to create ROC curve and calculate the area under the ROC curve. First off, let’s start with a classifier like Logistic Regression and let it predict all the probabilities (thresholds).

Step 1 – Get the data

from sklearn import datasets

iris = datasets.load_iris()

# Get only the setosa and versicolor data
iris_data = iris.data[0:100,:]
iris_target = iris.target[0:100]

Step 2 – Model the data using a classifier

# split the data into train and test datasets
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split (iris_data[:,0], iris_target)

# Model the data using Logistic Regression
from sklearn import linear_model

model = linear_model.LogisticRegression(C=1e5, solver='lbfgs')
model.fit(iris_data[:,0].reshape(-1,1), iris_target)
LogisticRegression(C=100000.0, class_weight=None, dual=False,
          fit_intercept=True, intercept_scaling=1, max_iter=100,
          multi_class='warn', n_jobs=None, penalty='l2', random_state=None,
          solver='lbfgs', tol=0.0001, verbose=0, warm_start=False)

Step 3 – Use roc_curve function to create the True Positive Rate and False positive Rate.

from sklearn.metrics import roc_curve, auc

probabilities = model.predict_proba(X_test.reshape(-1,1))[:,1]
fpr, tpr, thresholds = roc_curve(y_test, probabilities)

Step 4 – Plot the ROC curve

import matplotlib.pyplot as plt
%matplotlib inline

plt.plot(fpr,tpr)
plt.plot([0,1],[0,1],color="black",linestyle="--")

plt.xlabel("False Positive Rate - FPR")
plt.ylabel("True Positive Rate - TPR ")
plt.title("Receiver Operating Characteristics - ROC Curve")
plt.text(0.6,0.5,"Baseline")
plt.text(0.3,0.8,"ROC Curve")
Text(0.3, 0.8, 'ROC Curve')

Step 5 – Calculate Area under the Curve

from sklearn.metrics import roc_auc_score

roc_auc_score(y_test, probabilities)
0.9097222222222223
%d bloggers like this: