Box plots


Spread of data and Central tendency

To analyze the distribution of quantitative data, we use two important features: spread of data and central tendency. The spread of data or the variability of the values in the data indicates how spread out a distribution is. Central tendency provides a descriptive summary of a data set through a single value that reflects the center of the data distribution.

Measure the spread of data

The two most commonly used measures of the spread of a distribution are the Range and the Interquartile Range (IQR).

Range: The range is the difference between the highest and lowest values in a data set and is the simplest measure of spread.

Interquartile Range (IQR): The IQR is the difference between the third quartile (Q3) and the first quartile (Q1); it indicates the spread of the middle 50% of the data.

In the above figure, we have two datasets Height_GroupA and Height_GroupC which represent the heights (in centimeters) of students belonging to different groups. Both groups have 15 students each. The data points for both the datasets are displayed on a number line in the figure.

The range of Height_GroupA is 78.6 while the range of Height_GroupC is 28.4. When the variation is high, the values lie far away from each other; when the variation is low, the values lie close together. It is also evident from the above figure that the values in Height_GroupA are more dispersed or spread out than the values in Height_GroupC.

Measure the Central tendency

The three most common measures of central tendency are the mean, median, and mode.

Mean: The mean is the sum of the values in a set of data divided by the number of values in the data set.

Median: The median is the middle value in the data set when the values are arranged in ascending or descending order. 

Mode: The mode is the most frequently occurring value in the dataset.

The values in Height_GroupC range from 78.4 to 106.8 and the mean is 96.1, which can reasonably represent the values in the data set. Now let us modify the last value in this data set to 300. Can the new mean, which is 108.9, still represent all the values in this group? The new mean is nowhere close to most of the values in the data set and does not reflect the typical height of a student. The mean is not a robust measure because it is strongly influenced by very low or very high values in the data.

For distributions that have extreme values we can use the median instead of the mean to represent the entire data set with a single value. The median is the middle value of an ordered data set: half of the data lies below the median and the other half above it. The median is far less influenced by very high or very low values in the data set. In the above figure, the number enclosed in a circle is the median of the data set.
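
To see this numerically, here is a minimal sketch using hypothetical heights (the exact values from the figure are not reproduced here):

import numpy as np

# Hypothetical heights (in cm) for a group of 15 students -- illustrative values only
heights = np.array([78.4, 82.1, 85.6, 88.9, 90.9, 92.3, 94.8, 96.5,
                    98.2, 100.1, 102.4, 104.7, 105.3, 106.0, 106.8])

print(np.mean(heights), np.median(heights))   # both summaries sit near the middle of the data

heights[-1] = 300                             # introduce a single extreme value
print(np.mean(heights), np.median(heights))   # the mean jumps sharply, the median barely moves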

Introduction to Box plots and five-number summary

The spread or variability of a distribution can be graphically represented using box plots. Box or Box-and-Whisker plots are used to visually represent the distribution of numerical data. Box plots also display the spread or range of the data and show any signs of skewness. Skewness in statistics represents an imbalance; it measures the lack of symmetry in a data distribution.

Box plots are constructed from five key values, collectively called the five-number summary. The five-number summary indicates where the data in a data set is clustered and where it is scattered.

A five-number summary consists of the below numbers:

Minimum: The lowest value in the data, excluding outliers.

First Quartile (Q1): Twenty-five percent of values in the data fall below the first quartile value.

Median (Q2): When the data values are arranged in increasing order, the median is the value located at the center of these values. It is often preferred over the mean for skewed distributions.

Third Quartile (Q3): Seventy-five percent of the values fall below the third quartile.

Maximum: The highest value in the data, excluding outliers.

The first quartile, the median and the third quartile divide the data into four parts, each containing approximately the same number of values.
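
As a rough sketch, the five-number summary, range and IQR can be computed with NumPy. The values below are illustrative (not the exact data from the figures), and different quartile conventions can give slightly different Q1/Q3 than the numbers quoted in this section:

import numpy as np

data = np.array([90.6, 95.2, 99.9, 104.3, 110.7, 118.4, 125.0,
                 131.6, 138.2, 144.8, 151.3, 159.5, 163.0, 166.1, 169.2])

minimum = data.min()
q1 = np.percentile(data, 25)       # first quartile
median = np.percentile(data, 50)   # second quartile (the median)
q3 = np.percentile(data, 75)       # third quartile
maximum = data.max()

print(minimum, q1, median, q3, maximum)
print('Range:', maximum - minimum)
print('IQR:', q3 - q1)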

As the name indicates, a box plot has two parts: a box and a set of whiskers. Whiskers are the lines that extend from the box out to the lowest and highest values that are not outliers. Outliers are data points that lie far beyond the whiskers; they are unusually large or small values in the data and can distort summary statistics such as the mean. In a box plot, outliers are usually shown as individual dots beyond either whisker, away from the rest of the data.

Box plot for symmetric distribution


Observe the above box plot for the dataset Height_GroupA. The lower edge of the rectangle denotes the First quartile, the upper edge denotes the Third quartile and the line at the centre is the median. The median is located at the center of the box, and it divides the dataset into two equal parts. When the median is at the center of the box, and the whiskers are about the same distance from the median on both sides of the box, then the distribution is said to be symmetric.

  • Range –> 169.2 – 90.6 = 78.6
  • IQR –> 159.5 – 99.9 = 59.6

The Range and IQR have high values which means the spread of values in the dataset is large.

Box plot for negatively skewed distribution


The median divides the dataset into two equal parts. In the above box plot, the median is not located at the center of the box but is drawn closer to the third quartile (Q3). The median is closer to the upper end of the box because it gets pulled in the direction of the cluster of values. Observe the values to the left of the median: they are more spread out or scattered. In this case, the distribution is said to be negatively skewed (skewed left). A distribution that has most of the values clustered on the right side and has a long left tail is said to be negatively skewed.

  • Range –> 169.2-90.6 = 78.6
  • IQR –> 159.5-99.8 = 59.7

The Range and IQR have high values which means the spread of values in the dataset is large.

Box plot for negatively skewed distribution with less spread


The median for the above box plot is closer to the third quartile(Q3) for the same reason as explained in the earlier example. In this example, the whisker on the lower end of the box is longer. If one whisker is longer than the other, the distribution of the data is skewed in the direction of the longer whisker. In this example, the distribution is negatively skewed (skewed left).

  • Range –> 106.8-78.4 = 28.4
  • IQR –> 104.7-90.9 = 13.8

Now that we have a basic understanding of box plots, let us graph some box plots using the Seaborn library. The boxplot() function in Seaborn is used to visualize the distribution of numeric data and to compare different categories or groups.

Creating plots using boxplot() function

Box plot for single distribution

Box plots can be used to visualize single or multiple distributions. Let us look at an example that visualizes a single distribution. For our analysis we will load the ‘tips’ dataset, which consists of the tips received by a waiter in a restaurant over a period of time. The input data can be passed directly to the x, y, and/or hue parameters of the boxplot() function. Let us visualize the quantitative data in the column ‘total_bill’, which is passed to the parameter ‘x’.

import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
import numpy as np

tips = sns.load_dataset('tips')
tips.head()

total_bill	tip	sex	smoker	day	time	size
0	16.99	1.01	Female	No	Sun	Dinner	2
1	10.34	1.66	Male	No	Sun	Dinner	3
2	21.01	3.50	Male	No	Sun	Dinner	3
3	23.68	3.31	Male	No	Sun	Dinner	2
4	24.59	3.61	Female	No	Sun	Dinner	4
sns.boxplot(data=tips,x='total_bill')
plt.title('Total bill')
plt.show()

Single distribution

The box plot is placed in the x-y plane and displays the distribution of the quantitative variable using the five-number summary plotted along the x-axis. From the above box plot, we see that the amount spent by customers in the restaurant ranges between about $3 and $41, excluding the outliers. The median is somewhere around 17 and is closer to the lower end of the box, which means the data points on the left side of the median are clustered. The median is far away from the upper end of the box, which means the data points on the right side of the median are scattered.

Notice that the whisker on the upper end of the box is longer. If one whisker is longer than the other, the distribution of the data is skewed in the direction of the longer whisker. Overall the distribution is positively skewed(skewed right). A distribution that has most of the values clustered on the left side of the distribution and has a long right tail is said to be positively skewed.

The dots located outside the right whisker are the outliers or the unusually large values in the data.

The above box plot is oriented horizontally; the orientation can be changed by explicitly passing ‘v’ or ‘h’ to the ‘orient’ parameter, as sketched below.
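
For example, a minimal sketch of the same distribution drawn vertically (assuming the tips DataFrame loaded above):

# Mapping the variable to y (and passing orient='v') flips the orientation
sns.boxplot(data=tips, y='total_bill', orient='v')
plt.title('Total bill (vertical orientation)')
plt.show()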

Compare box plots of different categories – 1

In the next example, we will compare the tips received from two groups – male and female customers. Let us plot the data by passing the ‘sex’ and ‘tip’ columns to the boxplot() function.

sns.boxplot(data=tips,x='sex',y='tip')
plt.title('Tips received from male and female customers')
plt.show()

Compare different categories

In the tips dataset, ‘sex’ is a categorical grouping variable that splits the data into two groups – male and female customers – so one box plot is generated for each category. The categorical variable ‘sex’ is mapped to the x-axis and the quantitative variable ‘tip’ is mapped to the y-axis.

Let us compare the two box plots – Male and Female:

  1. The two box plots have roughly identical medians.
  2. Male is more spread out compared to Female.
  3. Male has multiple outliers whereas Female has one outlier.
  4. Both the distributions are positively skewed.

Overall, it appears that females pay less tips compared to the males.

Compare box plots of different categories – 2

In this example, let us analyse the effects of the day of week on the bill amount at the restaurant. Let us pass the data in columns ‘day’ and ‘total_bill’ to the arguments ‘x’ and ‘y’ of the boxplot() function.

sns.boxplot(data=tips,x='day',y='total_bill',palette='Set1')
plt.title('Sales Analysis')
plt.show()

Compare different categories

Looking at the four box plots – Thur, Fri, Sat and Sun – we can say that customer spending is higher during the weekends.

Side-by-side box plots

Previously we have split the data into groups based on the categorical variable. Let us further split these groups into sub-groups by introducing a third variable.

In the next example, the data in the column ‘smoker’ is passed to the ‘hue’ parameter, which further categorizes the data and creates two side-by-side box plots for each group, Male and Female. A separately colored box plot represents each sub-group, and a legend is added to identify each sub-category. Using these box plots, let us see who tips better.

sns.boxplot(data=tips,x='sex',y='tip',hue='smoker')
plt.title('Grouped Boxplots')
plt.show()

Side-by-side box plots

It is evident from the box plots that Males tip more than Females. If we compare the sub-groups (Smoker/Non-Smoker), Smokers tip better than Non-Smokers.

KDE plots


Histogram for the sum of two dice

Let us begin by setting up the input data. Suppose we are conducting an experiment by rolling two dice multiple times; every outcome of the experiment – the sum of the two faces – is recorded. We will visualize the distribution of this data using a histogram.

We will call the randint() function defined in the NumPy library to generate an array of random integers between 2 (the smallest possible sum) and 12 (the largest possible sum); note that the upper bound passed to randint() is exclusive, which is why the call below uses 13. This array is then passed to the hist() function in the Matplotlib library to generate a histogram. Initially the sample data contains 10 observations.

import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
import numpy as np
x = np.random.randint(2,13,10)

x
array([ 2,  7,  6,  6,  4,  3,  3, 12,  2,  9])
plt.hist(x)
plt.show()

The hist() function automatically calculates the size of each bin of the histogram. The entire range of the input data is split into equal-sized bins, and the number of data points that fall into each bin is counted. A histogram displays the distribution of discrete data and gives an estimate of where the values are concentrated. From the above histogram, we observe that the outcomes 2, 3 and 6 each occurred 2 times while the other outcomes occurred once.
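
If you want to inspect the bin counts directly, np.histogram performs the same binning that plt.hist relies on; a small sketch using the array x from above:

# Returns the count of data points per bin and the bin edges
counts, bin_edges = np.histogram(x, bins=10)
print(counts)
print(bin_edges)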

Histogram for the sum of three dice

In the next example, we will see the distribution of the data when 3 dice are rolled; this time the sample data contains 50 observations.

y = np.random.randint(3,19,50)
y
array([ 7,  9,  6, 13, 12, 11, 16, 13, 17,  6, 16, 10, 14, 18,  7, 11,  6,
       15, 15, 13, 10, 16, 15, 14, 17,  4,  9,  4, 12,  4, 13, 17,  7,  6,
        8, 13,  9, 13,  9, 17,  5, 17,  7,  7, 15,  8, 15, 17, 10, 17])
plt.hist(y)
plt.show()

The above histogram groups the data into bins that each span two numbers: 4-6, 6-8, 8-10 and so on. Suppose the default number of bins does not provide sufficient detail about our distribution. Let us change a few parameters of the histogram – the number of bins and the lower range of the bins.

plt.hist(y,bins=24,range=(0,20))
plt.show()


As you can see, changing the number of bins and the range affects the appearance of the histogram. When we change the number of bins, the data points are grouped differently, and the appearance of the histogram changes with them. Because a histogram can change markedly with different choices of bin count and end points, the same data can lead to different interpretations. An alternative is to use a kernel density plot, which removes the dependency on the end points of the bins.

Kernel Density Estimate (KDE)

In the previous examples, we used a histogram to estimate the distribution of data. Kernel density estimation (KDE) is another widely used technique for estimating the distribution of data. In a histogram, each value in the dataset is represented by a rectangular block, and the blocks are piled on top of each other within the bins to show the number of values in each bin range. In a KDE plot, each data point is instead represented by a small continuous curve (called a kernel), which can take different shapes such as a box, a triangle or a Gaussian curve, and each data point contributes a small area around its true value. A KDE plot is produced by drawing one such kernel for every individual data point along the axis and then adding all of these curves together to obtain a single smooth density estimate. Unlike a histogram, KDE therefore produces a smooth estimate, and because the individual kernels are allowed to overlap, the dependency on the end points of bins is removed.
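
The following sketch illustrates the idea by building a KDE by hand with Gaussian kernels; the grid, bandwidth and sample below are illustrative choices, not part of Seaborn's implementation:

import numpy as np
from matplotlib import pyplot as plt

x = np.random.randint(2, 13, 10)         # the same kind of sample used earlier
grid = np.linspace(-2, 16, 200)          # points at which the density is evaluated
bw = 1.0                                 # bandwidth: the width of each kernel

def gaussian_kernel(grid, center, bw):
    # A Gaussian curve centred on a single data point
    return np.exp(-0.5 * ((grid - center) / bw) ** 2) / (bw * np.sqrt(2 * np.pi))

# One small curve per data point ...
kernels = np.array([gaussian_kernel(grid, xi, bw) for xi in x])

# ... summed (and divided by the number of points) to give the density estimate
density = kernels.sum(axis=0) / len(x)

for k in kernels:
    plt.plot(grid, k / len(x), color='grey', alpha=0.4)  # the individual kernels
plt.plot(grid, density, color='red')                     # their sum: the KDE
plt.show()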

Visualizing data using a histogram and a KDE plot

The kernel width or bandwidth controls the smoothness of the resulting density curve. If the bandwidth is too small, the density estimate has too many peaks, making the distribution difficult to interpret. On the other hand, if the bandwidth is too large, the real distribution and the subtle features of the data under analysis will be obscured. The bandwidth has to be chosen appropriately so that it highlights all the important features while maintaining smoothness.

KDE plot with low bandwidth
KDE plot with high bandwidth

Seaborn provides the kdeplot() function to plot a univariate or bivariate kernel density estimate. Let us generate a KDE plot using the dataset ‘x’ created above. The bandwidth of the kernel can be adjusted using the ‘bw’ argument.

Univariate Kernel Density Estimate

A Univariate plot is based on a single variable.

sns.kdeplot(x)
plt.title('KDE Plot')
plt.show()


Above we see a KDE plot for the dataset ‘x’. Note that the y-axis shows an estimated probability density, not the number of data points falling into a range. The density curve has one distinct peak, indicating the distribution is unimodal.

KDE plot with low bandwidth

sns.kdeplot(x,bw=0.1)
plt.title('KDE Plot with low bw')
plt.show()


We see that when the bandwidth is too low, the density curve has too many peaks and appears to be multimodal (having multiple peaks).

KDE plot with high bandwidth

sns.kdeplot(x,bw=1)
plt.title('KDE Plot with high bw')
plt.show()


If the bandwidth is too large, smaller features in the distribution of the data may disappear. In the above plot, the smaller bumps have been smoothed out, obscuring important information. This is also known as over-smoothing of the curve.

Customizing the KDE plot

sns.kdeplot(y,shade=True)
plt.show()


By setting the parameter ‘shade’ to True, the area under the density curve is filled with a color.

sns.kdeplot(y,vertical=True)
plt.show()


If the parameter ‘vertical’ is set to True, density is plotted on the y-axis.

Bivariate Kernel Density Estimate

While a univariate KDE is based on one random variable, a bivariate KDE is based on two random variables. The kdeplot() function in Seaborn can be used to generate a bivariate KDE, which reveals the relationship between the two variables. Plotted as a surface, a bivariate KDE has a three-dimensional, bell-shaped appearance. Even though 3D plots are visually appealing, they are difficult to read because some parts of the plot are blocked by other parts, and not all applications support rotating 3D plots.

Bivariate Normal density
image source: https://online.stat.psu.edu/

One common way of displaying information about a 3D surface in only two dimensions is to use level curves or contour lines. In the next example we will use contour plots to illustrate bivariate KDEs. Contour plots represent data for three variables in two dimensions. Each contour line is drawn in the xy-plane by varying the x and y values while keeping the third variable constant; in other words, each line joins points that have the same density. Let us now plot a bivariate KDE by passing the arguments data and data2 to the kdeplot() function, which specify the x-coordinates and y-coordinates of the points to be plotted.

z = np.random.randint(3,19,50)

z
array([ 6,  9, 13,  5,  7,  3,  5,  8,  3,  7,  8, 16,  3,  8, 16, 18, 14,
       18,  5,  6,  7, 10, 17, 17, 15,  8, 12,  5, 14,  7,  6, 14,  5,  5,
        6,  4, 13, 18,  5, 16,  9, 11,  9,  4,  9,  3, 16, 17, 15, 14])
sns.kdeplot(data=y,data2=z)
plt.show()


Here is a two-dimensional kernel density estimate shown using a contour plot. It displays the joint distribution of the random variables ‘y’ and ‘z’. From the plot it is not clear which regions have high density and which have low density, so let us customize the density plot by color coding the contour lines.

Customizing the Bivariate density plot

sns.kdeplot(data=y,data2=z,cmap='Greens',n_levels=20)
plt.show()


Contour plots can use a colormap to color the different levels. We have applied “Greens” colormap to the above plot. “Greens” is a sequential colormap which is best to visualize numeric data that progresses from low to high by gradually increasing darkness and saturation of the color. By default, darker colors represent higher density values.

We can also specify the number of levels that we want to see in the contour plot. If there are too few levels in the map, important details may be lost, while too many levels makes the plot look cluttered.

sns.kdeplot(data=y,data2=z,cmap='Blues',shade=True)
plt.show()


By setting the parameter ‘shade’ to True, the area between adjacent contour lines can be filled with varying shades of a color. The darker the shade of the color, the higher is the density.

sns.kdeplot(data=y,data2=z,cmap='Blues',cbar=True,shade=True)
plt.show()


You can add a vertical colorbar to the right side of the plot by setting the parameter ‘cbar’ to True.

Modules vs Packages vs Libraries vs Frameworks


Python Module

A Python module is just a Python file with a .py extension. It can contain variables or functions – elements that can be referenced from another Python file using an import.

For example, let’s create a simple python file – user.py in a directory.

It just has a variable, name, and a function, nameCapitalize(name). These can be referenced in another file using the standard import process that we already know; a sketch of user.py is shown below.
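
A minimal sketch of what user.py might contain (the variable value is a made-up placeholder):

# user.py -- a sketch of the module described above
name = "ajay"

def nameCapitalize(name):
    return name.capitalize()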

Say, we create another file test.py and then import user.py using

import user


That’s it, the variable name inside the user.py module can be referenced using the familiar dot (.) notation.

user.name


You can even reference the functions inside the referenced module.

user.nameCapitalize(user.name)


Python Package

A Python package is just a collection of Python files (or modules) grouped in a folder. Earlier, we imported a single Python file. Now suppose we create a folder called customers and inside it another Python file (or module) called customer.py.

You can have any number of levels of hierarchy inside a package by using nested folders of Python modules, as the sketch below shows.
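
Assuming the folder layout described above (a customers folder containing customer.py), the nested module can be imported like this:

# customers/
#     customer.py
import customers.customer

# or, equivalently:
from customers import customer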

Python Library

A Python library is a collection of Python packages. The Python standard library, for example, is a collection of Python packages. In practice there is no strict technical difference between a Python module and a Python package; typically a Python library is simply a collection of modules or packages. The requests library is one well-known example you can browse.

Python Framework

A framework is a collection of python libraries. Some examples of popular python frameworks are

  • django – used to build web apps
  • flask – used to build APIs
  • FastAPI – used to build APIs

Heatmap


What is a Heatmap?

A heatmap is a graphical representation of data in which the data values are represented using colors. Heatmaps make it easy to visualize complex data and are useful for identifying patterns and areas of concentration. They are used to show the relation between two variables plotted on the x and y axes. The variables plotted on each axis can be of any type – categorical labels or numerical values. When plotting categorical variables on an axis, it is good to order the categories based on some criterion in order to reveal the patterns in the data.

We can use categorical color palettes to represent categorical data, while numerical data requires a color scale that blends from one color to another in order to represent high and low values.

from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns

x = np.random.rand(5,5)
sample_data = pd.DataFrame(x)
sample_data
0	1	2	3	4
0	0.104282	0.520084	0.586270	0.421520	0.924707
1	0.264854	0.518364	0.965355	0.545920	0.975186
2	0.546176	0.509287	0.142528	0.992683	0.136078
3	0.349071	0.260509	0.267000	0.135422	0.025028
4	0.845259	0.869380	0.178070	0.917876	0.344369

Consider the DataFrame sample_data, which holds data in tabular format. The table above has 5 rows and 5 columns. Let us visualize this data using a heatmap, which means we have to transform the values in this DataFrame into different colors. Seaborn provides the heatmap() function for this purpose.

sns.heatmap(data=sample_data)
plt.xlabel('column names-->')
plt.ylabel('index values-->')
plt.title('Heatmap with random values')
plt.show()


The Pandas DataFrame is passed to the heatmap() function and the plot is displayed using the Matplotlib show() function.

The heatmap generated above represents the numerical values in sample_data with different colors. The DataFrame index values are used as y-tick labels and the column names are used as x-tick labels. This heatmap uses dark colors to display low values and light colors to display high values.

A color bar which represents the relationship between colors and values is displayed to the right hand side of the figure. The color bar has ticks positioned at 0.2, 0.4, 0.6, 0.8. The tick positions are calculated depending on the minimum and maximum data values in the input data. The colormap used here has dark colors at one extreme and light colors at the other extreme.

Let us look into another example, by loading the built-in ‘flights’ dataset.

flights_dataset = sns.load_dataset('flights')
flights_dataset.head(15)
year	month	passengers
0	1949	January	112
1	1949	February	118
2	1949	March	132
3	1949	April	129
4	1949	May	121
5	1949	June	135
6	1949	July	148
7	1949	August	148
8	1949	September	136
9	1949	October	119
10	1949	November	104
11	1949	December	118
12	1950	January	115
13	1950	February	126
14	1950	March	141

The data in the ‘flights_dataset’ DataFrame is in long form; we will reorganize it into wide-form data using the Pandas pivot_table() function.

flights= pd.pivot_table(data=flights_dataset,index='month',columns='year',values='passengers')
flights
year	1949	1950	1951	1952	1953	1954	1955	1956	1957	1958	1959	1960
month												
January	112	115	145	171	196	204	242	284	315	340	360	417
February	118	126	150	180	196	188	233	277	301	318	342	391
March	132	141	178	193	236	235	267	317	356	362	406	419
April	129	135	163	181	235	227	269	313	348	348	396	461
May	121	125	172	183	229	234	270	318	355	363	420	472
June	135	149	178	218	243	264	315	374	422	435	472	535
July	148	170	199	230	264	302	364	413	465	491	548	622
August	148	170	199	242	272	293	347	405	467	505	559	606
September	136	158	184	209	237	259	312	355	404	404	463	508
October	119	133	162	191	211	229	274	306	347	359	407	461
November	104	114	146	172	180	203	237	271	305	310	362	390
December	118	140	166	194	201	229	278	306	336	337	405	432

The data is now in a format that is useful for our analysis, so let us plot a heatmap.

sns.heatmap(data=flights)
plt.title('flights data')
plt.xlabel('year')
plt.ylabel('month')
plt.show()

The heatmap above has transformed the numerical values in the ‘flights’ DataFrame to different colors. The index values(‘month’) of the ‘flights’ DataFrame are used as y-tick labels and column names(‘year’) are used as x-tick labels. The heatmap uses dark colors to display low values and light colors to display high values.

In this example, the heatmap is generated using the default parameter values. Let us customize the appearance of the heatmap by changing the default settings.

Customize your heatmap

Example 1

sns.heatmap(data=flights,vmin=100,vmax=630,annot=True,fmt='d',linewidth=0.3,cbar=True)
plt.title('flights data')
plt.xlabel('year')
plt.ylabel('month')
plt.show()


In the above heatmap, we have set the lower and upper bounds for the color bar, displayed numerical values in each cell and added borders to the cells. These parameters are defined below:

Parameters:

  • vmin, vmax : values that set the colorbar range; vmin is the lower bound and vmax is the upper bound for color scaling. If no values are specified, the limits are inferred from the minimum and maximum values in the input data.
  • annot : If True, the data values corresponding to each cell are displayed.
  • fmt : String formatting code to use when adding annotations.
  • linewidth : separate each cell of the heatmap using the specified linewidth.
  • cbar : can be used to display or remove colorbar from the plot, default value is True.

Example 2

sns.heatmap(data=flights,center=flights.loc['July',1957])
plt.title('flights data')
plt.xlabel('year')
plt.ylabel('month')
plt.show()


The above heatmap is plotted by setting the parameter ‘center’ to a numerical value and this value is used as the center of the colormap when plotting data.

In this example, the value in the cell indicated by (July,1957) is the center of the colormap. If you notice carefully, this is actually the new midpoint of the data and the cells to the left of the midpoint are indicated with a color scheme that gradually goes from blue to light green. The cells to the right of the midpoint are indicated with a color scheme that gradually goes from black to red. So a divergent color scheme is applied to the heatmap.

Example 3

sns.heatmap(data=flights,center=flights.loc['July',1960])
plt.title('flights data')
plt.xlabel('year')
plt.ylabel('month')
plt.show()


In the above heatmap, the center of the colormap is set to the value in the cell (July, 1960). This heatmap uses light colors to display low values and dark colors to display high values. As you go from the left to the right of the heatmap, the colors change from light to dark shades, which clearly shows an increase in the number of passengers with each year. Also, the number of passengers is higher in July and August than in the other months of any given year; this is evident from the cells for July and August, which are darker than the cells above and below them.

Example 4

sns.heatmap(data=sample_data,square=True,xticklabels=2, yticklabels=2,cbar_kws={'orientation':'horizontal'})
plt.show()


By passing True to the parameter ‘square’ you can set the shape of the cells to a square.

By default, the heatmap takes the xtick and ytick labels from the index and column names of the DataFrame. This can be changed using the parameters xticklabels and yticklabels. These two parameters can take any of the following values – “auto”, bool, list, int.

You can specify whether the colorbar is displayed horizontally or vertically by using the color bar keyword arguments(cbar_kws).

Example 5 – Masking

Say, for example, you want to display only the cells above or below the diagonal of the heatmap. This can be achieved using masking. Let us display the cells above the diagonal of the heatmap using the input DataFrame sample_data.

np.tril() is a function in NumPy that returns the lower triangle of the array passed to it (with all other elements set to zero), while np.triu() returns the upper triangle. Let us pass the DataFrame sample_data to np.tril().

lower_triangle  = np.tril(sample_data)
lower_triangle
array([[0.10428215, 0.        , 0.        , 0.        , 0.        ],
       [0.26485421, 0.51836401, 0.        , 0.        , 0.        ],
       [0.54617601, 0.50928675, 0.14252839, 0.        , 0.        ],
       [0.34907137, 0.26050913, 0.26700028, 0.13542169, 0.        ],
       [0.84525918, 0.86937953, 0.17806983, 0.91787589, 0.34436948]])

lower_triangle is the new array formed by extracting the lower (tril) triangular part of sample_data, and setting all other elements to zero.

sns.heatmap(data=sample_data,mask=lower_triangle)
plt.show()


The parameter ‘mask’ accepts an array or DataFrame. To the heatmap() function above we passed sample_data as the data and lower_triangle as the mask. Cells where the mask is True (here, any non-zero value) are hidden, so values are plotted only for cells where the mask is False – that is, where lower_triangle is 0.

Below is another example, which displays the cells below the diagonal.

upper_triangle = np.triu(sample_data)
upper_triangle
array([[0.10428215, 0.52008394, 0.58626989, 0.42152029, 0.92470701],
       [0.        , 0.51836401, 0.96535456, 0.54591957, 0.97518597],
       [0.        , 0.        , 0.14252839, 0.99268308, 0.13607794],
       [0.        , 0.        , 0.        , 0.13542169, 0.02502764],
       [0.        , 0.        , 0.        , 0.        , 0.34436948]])
sns.heatmap(data=sample_data,mask=upper_triangle)
plt.show()

Read and plot data from a csv file

The csv file contains data on the total number of road accidents and their time of occurrence in different Indian states and union territories in 2017. The file used in this example is available at this url: https://data.gov.in/resources/stateut-wise-road-accidents-time-occurance-during-2017

Let us read the data in the csv file into a DataFrame using the Pandas read_csv() function.

input_file = pd.read_csv(r'C:\Users\Ajay Tech\Documents\training\visualization\Data\Road_Accidents_2017.csv')

input_file.tail()
States/UTs	06-09hrs (Day)	09-12hrs (Day)	12-15hrs (Day)	15-18hrs (Day)	18-21hrs (Night)	21-24hrs (Night)	00-03hrs (Night)	03-06hrs (Night)	Total Accidents
32	Daman & Diu	5	7	15	16	17	13	6	0	79
33	Delhi	747	858	828	807	1008	1159	714	552	6673
34	Lakshadweep	0	0	0	1	0	0	0	0	1
35	Puducherry	250	256	250	257	257	216	118	89	1693
36	Total	51551	71426	71594	82456	85686	49567	25050	27580	464910

The last row and the last column of the DataFrame contain the totals of the numerical values in each column and each row respectively. These are not required for our analysis, so let us delete the last row and column.

input_file.drop(36, axis=0,inplace=True)
input_file.drop('Total Accidents', axis=1,inplace=True)
input_file.tail()

States/UTs	06-09hrs (Day)	09-12hrs (Day)	12-15hrs (Day)	15-18hrs (Day)	18-21hrs (Night)	21-24hrs (Night)	00-03hrs (Night)	03-06hrs (Night)
31	D & N Haveli	7	6	12	14	14	5	5	4
32	Daman & Diu	5	7	15	16	17	13	6	0
33	Delhi	747	858	828	807	1008	1159	714	552
34	Lakshadweep	0	0	0	1	0	0	0	0
35	Puducherry	250	256	250	257	257	216	118	89

Next, we will reorganize the data in the DataFrame into the format required for our analysis using the pivot_table() function.

input_data = input_file.pivot_table(index='States/UTs')
input_data.head()
00-03hrs (Night)	03-06hrs (Night)	06-09hrs (Day)	09-12hrs (Day)	12-15hrs (Day)	15-18hrs (Day)	18-21hrs (Night)	21-24hrs (Night)
States/UTs								
A & N Islands	3	7	20	27	33	41	39	19
Andhra Pradesh	1406	1648	2808	3581	3765	4484	5265	2770
Arunachal Pradesh	42	39	40	24	25	24	23	24
Assam	393	391	749	1206	1262	1340	1121	708
Bihar	495	755	1344	1550	1292	1462	1312	645

Let us generate a heatmap for the first 20 rows in the data.

sns.heatmap(data = input_data.head(20),cmap='YlOrRd',linewidths=0.3,xticklabels=['00-03','03-06','06-09','09-12','12-15','15-18','18-21','21-24'])
plt.xticks(rotation=20)
plt.xlabel('Time of occurence(hrs)')
plt.title('No. of road accidents/time interval of day')
plt.show()

The heatmap uses light colors for low values and dark colors for high values. A few states in the heatmap show that the number of accidents is higher during the 9 AM – 9 PM interval; for example, for the last state shown (Madhya Pradesh), dark colors are used for the cells in this interval. For most states the maximum number of road accidents occurs in the 6-9 PM interval, which can be inferred from the color of the cells in that interval.

Colors in Visualization


Data visualization is the graphical representation of data. One important step in the data visualization process is the mapping of numbers to colors. When mapping numbers to colors, choosing the right color combinations is essential: charts whose colors go well with each other are easier to read and understand, and the viewer can more easily perform the reverse mapping from colors back to values.

Basic Terms

Palette: In computer graphics, a palette is the set of available colors.

Hue: Hues refer to the set of pure colors within a color space. Hue defines pure colors as one of the six Primary and Secondary colors.

Saturation: Saturation refers to the intensity of color in an image, that is the strength or weakness of a color.

Lightness: Lightness of a color specifies how “bright” the color should be. 0% means the brightness is 0 and the color is black. 100% means maximum brightness and the color is white.

Color wheel: A color wheel is a visual organizational tool of color hues around a circle that help make the basic categories of color easier to understand. It shows the relationships between primary colors, secondary colors, and tertiary colors. It can be used as a point of reference when creating customized palettes.

Color model: Color models provide various methods to define colors, each model defining colors through the use of specific color components.

Source : http://warrenmars.com/visual_art/theory/colour_wheel/martian_colour_wheel_24_hue_r.png

color_palette() function

The color_palette function in Seaborn returns a list of colors defined in a color palette and these colors can be used in a plot. When the function is called without passing any arguments, it returns the current Matplotlib color cycle. The function can also be used in a with statement to temporarily set the color cycle for a plot or set of plots. The colors defined in a color palette can be displayed using the palplot() function, which plots the color palette in a horizontal array.

import seaborn as sns
current_palette = sns.color_palette()  ## The color_palette function returns a list of RGB tuples.
current_palette

[(0.4541176470588235, 0.7066666666666666, 0.6270588235294118),
 (0.8976470588235295, 0.5929411764705883, 0.4749019607843137),
 (0.5894117647058825, 0.6415686274509804, 0.7596078431372548),
 (0.8511764705882354, 0.5958823529411763, 0.7523529411764707),
 (0.632156862745098, 0.7694117647058825, 0.4070588235294117),
 (0.8776470588235292, 0.7733333333333332, 0.30666666666666687)]
sns.palplot(sns.color_palette(current_palette)) 

## palplot() function displays the colors in the current matplotlib color cycle.
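
As mentioned above, color_palette() can also be used in a with statement to change the color cycle for a single plot. A minimal sketch (assuming the ‘tips’ dataset used elsewhere in this guide):

from matplotlib import pyplot as plt

tips = sns.load_dataset('tips')

# The palette is only in effect inside the with block; the previous
# color cycle is restored afterwards.
with sns.color_palette('husl', 8):
    sns.barplot(x='day', y='total_bill', data=tips)
plt.show()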

The type of color palette to use in a visualization depends on the nature of the input data. Primarily three types of color palettes exist for data visualization:

  • Qualitative palettes
  • Sequential palettes
  • Diverging palettes

Let us discuss about these palettes.

Qualitative color palettes

Qualitative or Categorical palettes can be used to represent categorical data, where there is no particular ordering of categories. Each color in the palette represents a distinct category. If the number of categories in the input data is more than the number of colors in the palette, the same set of colors will be looped over.

There are six variations of the default color cycle – deep, muted, pastel, bright, dark, and colorblind. These six palettes have varying degrees of saturation and luminance.

sns.palplot(sns.color_palette(palette='deep'))
sns.palplot(sns.color_palette(palette='muted'))

sns.palplot(sns.color_palette(palette='pastel'))


Circular color systems

hsl color model

HSL is short form for Hue, Saturation and Lightness. The HSL color model defines a given color according to its hue, saturation and lightness components.

The hls_palette function in Seaborn can be used to create a palette with evenly spaced colors in the HLS hue space. The values of the parameters h, l, and s range between 0 and 1. By default, 6 colors in the HLS color model are returned.

The hls_palette() call below returns evenly spaced colors beginning from hue=0 (red), keeping the saturation and lightness at 0.5.

sns.palplot(sns.hls_palette(n_colors=6, h=0, l=0.5, s=0.5))


The call below returns a fully saturated set of colors by setting s=1.

sns.palplot(sns.hls_palette(n_colors=6, h=0.5, l=0.3, s=1))

Look at the colors below, the brightness of these colors is 0.8.

sns.palplot(sns.hls_palette(n_colors=10, h=0.2, l=0.8, s=0.65))

husl color model

The HSLuv color space has perceptually uniform lightness, which means every color with the same lightness value is perceived as equally bright by humans; it is a human-friendly alternative to the HSL color space. The husl_palette function in Seaborn can be used to create a palette with evenly spaced colors in HUSL hue space. The values of the parameters h, l, and s range between 0 and 1. By default, 6 colors in the HUSL color model are returned.

In the below example, the husl_palette returns colors beginning from h=0, with the saturation and luminosity set at 0.6.

sns.palplot(sns.husl_palette(s=0.6,l=0.6)) 

Darker shades of the above set of colors are displayed below by setting luminosity to 0.3

sns.palplot(sns.husl_palette(s=0.7,l=0.3))

Categorical Color Brewer palettes

There are many tools available online that help us to choose and create color palettes that best suit our needs. ColorBrewer is one such tool that offers a number of interesting color palettes of each type.

The choose_colorbrewer_palette() function in Seaborn helps us choose palettes from the ColorBrewer library. This function, which must be used in a Jupyter notebook, launches an interactive IPython widget to choose and customize a color palette. We need to pass the type of color palette that we want to visualize to choose_colorbrewer_palette(). Using the interactive widget we can browse through the available palettes and adjust the saturation parameter.

This function, by default, returns the list of colors defined in a color palette, but if you need a Matplotlib colormap that can be used with Matplotlib plotting functions you can set the argument as_cmap to True.

sns.choose_colorbrewer_palette(data_type='s')
interactive(children=(Dropdown(description='name', options=('Greys', 'Reds', 'Greens', 'Blues', 'Oranges', 'Pu…
[(0.9575547866205305, 0.9575547866205305, 0.9575547866205305),
 (0.9012072279892349, 0.9012072279892349, 0.9012072279892349),
 (0.8328950403690888, 0.8328950403690888, 0.8328950403690888),
 (0.7502191464821223, 0.7502191464821223, 0.7502191464821223),
 (0.6434140715109573, 0.6434140715109573, 0.6434140715109573),
 (0.5387158785082661, 0.5387158785082661, 0.5387158785082661),
 (0.440322952710496, 0.440322952710496, 0.440322952710496),
 (0.342883506343714, 0.342883506343714, 0.342883506343714),
 (0.22329873125720878, 0.22329873125720878, 0.22329873125720878),
 (0.10469819300269129, 0.10469819300269129, 0.10469819300269129)]

Sequential color palettes

Sequential color palettes are best for visualizing numeric data that progresses from low to high (or vice versa) with light to dark colors. You can use a sequential color palette with a single hue while varying the saturation and lightness; this draws attention to the data with larger values. Examples of single-hue color palettes are Blues, Greens, Greys and Oranges.

sns.palplot(sns.color_palette('Blues'))

You can also reverse the order of colors in the palette by adding a suffix ‘_r’ to the palette name.

sns.palplot(sns.color_palette('Blues_r'))

sns.palplot(sns.color_palette('Greens_r'))


If you want darker shades of these colors, you can add the suffix ‘_d’ to the palette name.

sns.palplot(sns.color_palette('Greens_d'))

Multi-hue sequential color palettes provide better color contrast, so the colors are easier to differentiate. When multiple hues are used it is good to choose colors that are next to each other on the color wheel because they blend well together; the saturation and lightness parameters should be adjusted accordingly.

sns.palplot(sns.color_palette('BuGn'))
sns.palplot(sns.color_palette('PuBuGn'))

Cubehelix palette

The Cubehelix system was designed to produce attractive palettes with a huge choice of hue, saturation and brightness. Images using these palettes show a monotonic increase in perceived brightness, both to the human eye and when printed in black and white.

The cubehelix palettes are a form of the sequential color palettes in which the hue is slightly changed and the brightness is increased linearly.

As shown below, you can pass the palette name and the number of colors n to the color_palette() function to display a cubehelix palette.

sns.palplot(sns.color_palette("cubehelix", 8))


You can call the cubehelix_palette() function through the cubehelix interface provided by Seaborn. This function can be used to generate a sequential palette.

sns.palplot(sns.cubehelix_palette())


Parameters:

  • start — hue to start from
  • rotation — number of rotations the helix makes
  • hue — saturation of the color
  • gamma — Gamma factor to emphasize low or high intensity values
  • light — Intensity of the lightest color in the palette
  • dark – Intensity of the darkest color in the palette
  • as_cmap – If True, return a matplotlib colormap instead of a list of colors
sns.palplot(sns.cubehelix_palette(start=0,rot=2,hue=0.5,gamma=0.3))
sns.palplot(sns.cubehelix_palette(start=0.5,rot=3,hue=1,gamma=1,dark=1,light=0))


It is also possible to generate a cubehelix palette in seaborn using a string-shorthand as shown below:

sns.palplot(sns.color_palette("ch:1.5,r=.4,l=.6"))

Custom sequential palettes

Seaborn also provides custom sequential palettes such as light_palette and dark_palette. You can generate a palette using the light_palette function by choosing a base color; the palette blends from a light shade to that color. You can use the interactive IPython widget with the choose_light_palette() function to select light palettes.

sns.palplot(sns.light_palette(color='#800080',n_colors=7,reverse=False,input='RGB'))


Parameters:

  • color – HEX code or HTML color name of the chosen color
  • reverse – If True, the order of colors is reversed
  • as_cmap – If True, return a matplotlib colormap instead of a list of colors
  • input – color space to which the color values refer: {‘rgb’, ‘hls’, ‘husl’, ‘xkcd’}

Using the dark_palette function, you can generate a palette by choosing a base color; the palette blends from a dark shade to that color. You can use the interactive IPython widget with the choose_dark_palette() function to select dark palettes.

sns.palplot(sns.dark_palette(color='#b60c26',n_colors=6,reverse=False,input='hsl'))


Diverging color palettes

Diverging color schemes are best used to highlight both high and low extreme values. This color scheme is usually used for data that has a natural mid-point such as a zero. A diverging scheme shows all values below the mid-point as a sequential color scheme using one hue, and all values above the mid-point in a sequential scheme using a different hue. The hue at one extreme gradually tones down to a neutral color at the midpoint, and then the saturation of the second hue gradually increases to the other extreme.

Below we see a few diverging palettes.

sns.palplot(sns.color_palette('RdBu',7))
sns.palplot(sns.color_palette('BrBG',7))


Custom diverging palettes

You can generate custom diverging palettes using the diverging_palette() function.

sns.palplot(sns.diverging_palette(h_neg=260,h_pos=350,n=7,s=75,l=50,sep=10,center='light'))


Parameters:

  • h_neg,h_pos- hues for the left and right extremes of the palette, range between 0-359.
  • s – saturation for both the extremes
  • l – lightness for both the extremes
  • sep – size of the intermediate region
  • center – the center of the palette can be ‘light’ or ‘dark’

The set_palette function is used to set the matplotlib color cycle using a seaborn palette.

sns.set_palette(palette=sns.color_palette('Set2'), n_colors=6, desat=0.7, color_codes=False)

from matplotlib import pyplot as plt
tips = sns.load_dataset('tips')
sns.barplot('sex','tip',data=tips)
plt.show()


Plot Styling and Scaling of Elements


Plot styling

Seaborn splits Matplotlib parameters into two independent groups: the first group sets the aesthetic style of the plot and the second group scales various elements of the figure.

Let us first see how to customize the look and style of the plots. Seaborn has five built-in themes to style its plots – darkgrid, whitegrid, dark, white, and ticks. The darkgrid is the default theme, but we can change this style to suit our requirements.

We can customize the styles such as background color, color of tick marks, text color, font type etc., using the functions axes_style() and set_style(). Both these functions take same set of arguments.

The axes_style() function defines and returns a dictionary of rc parameters related to the styling of the plots. This function returns an object that can be used in a with statement to temporarily change the style parameters.

The set_style() function is used to set the aesthetic style of the plots, the rc parameters can be customized using this function.

from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns


Below we see the rc parameters returned by the axes_style function. The default settings of the parameters can be changed to fine tune the look of the plots.

sns.axes_style()

{'axes.facecolor': 'white',
 'axes.edgecolor': 'black',
 'axes.grid': False,
 'axes.axisbelow': 'line',
 'axes.labelcolor': 'black',
 'figure.facecolor': (1, 1, 1, 0),
 'grid.color': '#b0b0b0',
 'grid.linestyle': '-',
 'text.color': 'black',
 'xtick.color': 'black',
 'ytick.color': 'black',
 'xtick.direction': 'out',
 'ytick.direction': 'out',
 'lines.solid_capstyle': 'projecting',
 'patch.edgecolor': 'black',
 'image.cmap': 'viridis',
 'font.family': ['sans-serif'],
 'font.sans-serif': ['DejaVu Sans',
  'Bitstream Vera Sans',
  'Computer Modern Sans Serif',
  'Lucida Grande',
  'Verdana',
  'Geneva',
  'Lucid',
  'Arial',
  'Helvetica',
  'Avant Garde',
  'sans-serif'],
 'patch.force_edgecolor': False,
 'xtick.bottom': True,
 'xtick.top': False,
 'ytick.left': True,
 'ytick.right': False,
 'axes.spines.left': True,
 'axes.spines.bottom': True,
 'axes.spines.right': True,
 'axes.spines.top': True}

Scatter Plot

A scatterplot is used to display the correlation between two numerical variables. The values of both the variables are displayed with dots. Each dot on the scatterplot represents one observation from the data set.

We will plot a scatter plot that uses the ‘tips’ dataset to display the tips received against the total_bill, both of which are quantitative variables. The scatter plot is rendered in the default theme, which is the darkgrid style.

tips = sns.load_dataset('tips')
sns.scatterplot('total_bill','tip',data=tips)
plt.title('Default theme - Darkgrid') 
plt.show()

As you can see, the default theme has a light grey background with white gridlines.

Let us look at another example. In the scatter plot below, we have passed the ‘ticks’ theme to the style parameter; the ‘ticks’ theme allows the colors of the dataset to stand out more visibly. Apart from this, we have also changed the axes edge color, text color and tick color in the plot by changing the default rc parameter values.

sns.set_style(style='ticks',rc={'axes.edgecolor': 'b','text.color': 'r','xtick.color': 'r','ytick.color': 'r'})
sns.scatterplot('total_bill','tip',data=tips)
plt.title('Ticks theme')
plt.show()


We can temporarily change the style parameters of a plot by using the axes_style function in a with statement as shown in the example below.

with sns.axes_style(style='whitegrid',rc={'font.family': 'serif','font.serif':'Times'}):
    sns.scatterplot('total_bill','tip',data=tips)
plt.title('Whitegrid theme')
plt.show()


Scaling of plot elements

Next we will see how to scale the various elements in the plot. Seaborn has four preset contexts which set the size of the plot and allow us to customize the plot depending on how it will be presented. The four preset contexts, in order of relative size are – paper, notebook, talk and poster. The notebook style is the default context, which can be changed depending on our requirement.

We can customize the size of plot elements such as labels, ticks, markers and line widths using the functions plotting_context() and set_context(). Both of these functions take the same set of arguments.

The plotting_context() function defines and returns a dictionary of rc parameters related to plot elements such as label size, tick size and marker size. This function returns an object that can be used in a with statement to temporarily change the context parameters.

The set_context() function is used to set the plotting context parameters.

Below we see the rc parameters returned by the plotting_context function. The default settings of the parameters can be changed to scale plot elements.

sns.plotting_context()
{'font.size': 12.0,
 'axes.labelsize': 12.0,
 'axes.titlesize': 12.0,
 'xtick.labelsize': 11.0,
 'ytick.labelsize': 11.0,
 'legend.fontsize': 11.0,
 'axes.linewidth': 1.25,
 'grid.linewidth': 1.0,
 'lines.linewidth': 1.5,
 'lines.markersize': 6.0,
 'patch.linewidth': 1.0,
 'xtick.major.width': 1.25,
 'ytick.major.width': 1.25,
 'xtick.minor.width': 1.0,
 'ytick.minor.width': 1.0,
 'xtick.major.size': 6.0,
 'ytick.major.size': 6.0,
 'xtick.minor.size': 4.0,
 'ytick.minor.size': 4.0}

The scatter plot below uses the ‘tips’ dataset to display the tips received on the total_bill. The scatter plot is rendered in the notebook context which is the default context.

sns.set()
sns.scatterplot('total_bill','tip',data=tips)
plt.title('Notebook context')
plt.show()


In the example below, we have passed ‘talk’ to the context parameter. Apart from this, we have also changed the label size, title size, grid linewidth of the plot by changing the default rc parameter values.

sns.set_context(context='talk',rc={'axes.labelsize': 20.0,'axes.titlesize': 20.0,'grid.linewidth': 2.5})
sns.scatterplot('total_bill','tip',data=tips)
plt.title('Talk context')
plt.show()

We can temporarily change the context parameters of a plot by using the plotting_context function in a with statement as shown in the example below.

with sns.plotting_context(context='poster'):
    sns.scatterplot('total_bill','tip',data=tips)
plt.title('Poster context')
plt.show()


set() Function

The set() function in Seaborn sets the style and context parameters in one step, see example below.

sns.set(style='white',context='paper',rc={'axes.edgecolor': 'b','axes.titlesize': 20.0})
sns.scatterplot('total_bill','tip',data=tips)
plt.title('Paper context')
plt.show()


If you want to switch to Seaborn default settings, then call the set() function without passing any arguments.

Default settings:

  • context=’notebook’
  • style=’darkgrid’
  • palette=’deep’
  • font=’sans-serif’
  • font_scale=1
  • color_codes=True
sns.set()
sns.scatterplot('total_bill','tip',data=tips)
plt.title('Plot with default settings')
plt.show()

despine() Function

Spines are the borders on the sides of a graph or plot. By default, a plot has four spines. The despine() function can be used to remove spines from the plot; by default, the top and right spines are removed by this function.

sns.set(style='ticks')
sns.scatterplot('total_bill','tip',data=tips)
plt.title('Plot with four spines/borders')
plt.show()
sns.set(style='ticks')
sns.scatterplot('total_bill','tip',data=tips)
sns.despine()
plt.title('Plot with top and right spines removed')
plt.show()


You can choose to remove all the spines if you think they are unnecessary and distracting, see example below.

sns.set(style='white')
sns.scatterplot('total_bill','tip',data=tips)
sns.despine(left=True,bottom=True)
plt.title('Plot with all spines removed')
plt.show()

Introduction to Seaborn


Introduction to Seaborn

Seaborn is a data visualization library which provides a high-level interface for drawing statistical graphs. It is built on top of Python’s core visualization library, Matplotlib, and extends it to create aesthetically pleasing graphs. Internally Seaborn uses Matplotlib to draw plots, so it complements the Matplotlib library but is not a replacement for it. Matplotlib is highly customizable, but it can be hard to know which settings to tweak to render nice plots. Seaborn comes with a number of customized themes and a high-level interface for controlling the look of Matplotlib figures.

Seaborn comes with preset styles and color palettes which can be used to create aesthetically pleasing charts with few lines of code. It is closely integrated with the Pandas and Numpy library.

Below are the dependencies of the Seaborn library:

  • Python 3.6+
  • numpy (>= 1.13.3)
  • scipy (>= 1.0.1)
  • pandas (>= 0.22.0)
  • matplotlib (>= 2.1.2)

Once the required dependencies are installed, you are ready to install and use Seaborn.

The latest version of Seaborn can be installed using pip with the command — pip install seaborn

You can also install Seaborn using Anaconda prompt with the command — conda install seaborn
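
Once installed, a quick way to check that Seaborn can be imported and to see which version you have:

import seaborn as sns
print(sns.__version__)   # confirm the installed version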

Seaborn is closely integrated with Pandas data structures. The Pandas library has two primary containers of data – DataFrame and Series.

DataFrames – A DataFrame is a collection of data arranged in rows and columns. DataFrames are similar to excel spreadsheets. They are two-dimensional structures, with two axes, the “index” axis and the “columns” axis.

Series – Series is a single column of the DataFrame. So a Pandas DataFrame is a collection of Series objects.
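
A tiny sketch with made-up values showing the relationship between the two containers:

import pandas as pd

# A small DataFrame built from hypothetical values
df = pd.DataFrame({'name': ['A', 'B', 'C'], 'score': [10, 20, 30]})

scores = df['score']        # selecting one column returns a Series

print(type(df))             # <class 'pandas.core.frame.DataFrame'>
print(type(scores))         # <class 'pandas.core.series.Series'>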

Basic Terms

Quantitative and Qualitative variables

In statistics two types of variables are used: Quantitative and Qualitative variables.

Quantitative: Quantitative variables are numerical values representing counts or measures. Examples: temperature, counts, percentages, weight. Quantitative variables are of two types – discrete and continuous:

Discrete variables are numeric variables that have a finite number of values between any two values. A discrete variable is always numeric.

Continuous variables are numeric variables that have an infinite number of values between any two values.

Qualitative: Qualitative variables are variables that can be placed into distinct categories according to some attributes or characteristics. They contain a finite number of categories or distinct groups. Examples: Gender, eye color.

Univariate and Bivariate Data

Statistical data are classified according to the number of variables being studied.

Univariate data: This type of data consists of only one variable. The variable is studied individually and we don’t look at more than one variable at a time.

Bivariate data: This type of data involves two different variables, where the two variables are studied to explore the relationship or association between them.

Loading Datasets

Seaborn comes with a few important datasets that can be used to practice. When Seaborn is installed, the datasets are downloaded automatically. To start working with a built-in Seaborn data set, you can make use of the load_dataset() function. By default, the built-in datasets are loaded as Pandas DataFrame. Let us load the ‘tips’ dataset which consists of the tips received by a waiter in a restaurant over a period of time.

from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns

tips = sns.load_dataset('tips')

tips.head(10)

	total_bill	tip	sex	smoker	day	time	size
0	16.99	1.01	Female	No	Sun	Dinner	2
1	10.34	1.66	Male	No	Sun	Dinner	3
2	21.01	3.50	Male	No	Sun	Dinner	3
3	23.68	3.31	Male	No	Sun	Dinner	2
4	24.59	3.61	Female	No	Sun	Dinner	4
5	25.29	4.71	Male	No	Sun	Dinner	4
6	8.77	2.00	Male	No	Sun	Dinner	2
7	26.88	3.12	Male	No	Sun	Dinner	4
8	15.04	1.96	Male	No	Sun	Dinner	2
9	14.78	3.23	Male	No	Sun	Dinner	2

Bar Plot

A Bar Plot is a visual tool that uses bars to compare data among different groups or categories. The graph represents categories on one axis and discrete values on the other. Let us draw a bar plot using the bar() function from Matplotlib's pyplot module with input from the tips dataset.

plt.bar(x=tips['smoker'],height=tips['tip'],data=tips)
plt.xlabel('smoker')
plt.ylabel('tip')
plt.title('Tips/Smoker')
plt.show()


The bar plot above is plotted with Matplotlib; the bars compare the tips received from two groups – Smokers and Non-Smokers. Smoker is a qualitative variable. The Series data is passed to the x and height arguments.

Let us again plot a barplot using the Seaborn library as shown below.

sns.barplot(x='smoker',y='tip',data=tips)
plt.title('Tips vs Smoker')
plt.show()


The bar plot above is plotted in Seaborn. To the barplot function we have passed the column names to the x and y parameters, and the DataFrame is passed to the data parameter. The bars, like in the previous example, compare the tips received from two groups – Smokers and Non-Smokers. Notice how the bars are displayed in different colors, and the axes labels are taken from the input data. We can add custom labels to the plot by calling the set_xlabel() and set_ylabel() functions on the axes object. You can also set the labels using the xlabel() and ylabel() functions defined in the pyplot module of the Matplotlib library.

fig1,axes1 = plt.subplots()
sns.barplot(x='smoker',y='total_bill',data=tips,hue='sex',estimator=np.sum,errcolor='r',errwidth=0.75,capsize=0.2,ax=axes1,)
axes1.set_xlabel('Smoker - Yes/No')
axes1.set_ylabel('Bill amount')
axes1.set_title('Total bill vs Smoker')
plt.show()

We can also create a figure object with multiple axes and render the plots onto a specific axes by using the ‘ax’ argument. If we do not specify any value for the argument, plot is rendered to the current axes.

The ‘hue’ parameter can be used to show information about the different sub-groups in a category. In the above example, the ‘hue’ parameter is assigned to the column ‘sex’ which further categorizes the data and has created two side by side bars. A separate colored bar represents each sub-group and a legend is added to let us know what each sub-category is.

The ‘estimator’ argument can be used to change how the data is aggregated. By default, each bar of a barplot displays the mean(np.mean) value of a variable. Using the estimator argument this behaviour can be changed. The estimator argument can receive a function such as np.sum, len, np.median or any other statistical function.

The red colored cap-tipped lines that extend from the edge of the bars are called Error Bars and they provide an additional layer of detail in the plotted data. Error Bars help to indicate the estimated error or uncertainty of a data point. A short Error Bar shows that the values are concentrated, indicating that the plotted average is more reliable, while a long Error Bar indicates that the values are more spread out and the average is less reliable.

Decision Trees


What are Decision Trees

Let me take you back to the number guessing game that we have played on day 1 of the course. It is a simple game where the computer chooses a random number between 1 and 100 and you have to guess the number. After each guess, the program helps you by telling if your guess is higher or lower than the chosen number. Say the number chosen is 60. Let’s visualize this.

Basically, it is a series of decisions based on the clue you get from the program. For lack of a better intelligence, we just predict the middle number on either side ( higher or lower ). We can think of the same process using a decision tree.

A decision tree is essentially a series of decisions that are based on the data you are working with. For example, if you are guessing a number between 1 and 1000, the decision tree would have been much bigger. In this case, the guesses (cuts in the number line) are exactly in the middle – for lack of a better guessing method. However, a real decision tree makes a much more informed decision. Once again, let me show this with a simple example.
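To make the idea concrete, here is a minimal R sketch of that "guess the middle" strategy – essentially a binary search. The secret number 60 and the 1 to 100 range come from the example above; the variable names are just illustrative.

secret = 60
low    = 1
high   = 100

repeat {
    guess = floor((low + high) / 2)   # always guess the middle of the remaining range
    cat("Guessing", guess, "\n")
    if (guess == secret) break        # found it
    if (guess < secret) low = guess + 1 else high = guess - 1   # use the higher/lower clue
}

Each clue cuts the remaining range in half – exactly the series of decisions described above.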

Take an apple that is rotten somewhere at the side.

Our goal is to find a series of cuts that maximises the fresh apple portion (and minimizes the rotten portion) with the least possible cuts. How would you do it ?

Something like this – the criterion you would use to make the cuts is the maximum area (volume) that you can carve off that is not rotten.

Decision trees also work the same way. For example, let’s take the iris dataset. To make things simple, let’s just focus on

  • setosa and versicolor
  • sepal length and sepal width.

If you are asked to carve out one species from another using just horizontal and vertical lines, how would you do it ? It’s not an easy job to do it efficiently. Probably, we would do it something like this.

What were we basing our decisions (cut-off points) on ? Visually, we were essentially eye-balling to minimize the mix of species (or maximize the grouping of a single species), so that more of a specific species fell on one side than the other.

Decision tree algorithms do just this – except that they use a bit of math to do the same. Most implementations, including rpart (used below) and scikit-learn, provide two cost functions for this

  • Gini Index ( default )
  • Entropy

We will start with the basic implementation and then we will focus on understanding the Gini Index in a bit more detail.

Implementation

library(rpart)
names(iris)

'Sepal.Length'
'Sepal.Width'
'Petal.Length'
'Petal.Width'
'Species'
model <- rpart(Species ~ Sepal.Length + Sepal.Width, method="class", data=iris[1:100,])

plot(model)
text(model, use.n = TRUE)


Visualization

One of the biggest advantages of Decision Trees is that the whole process is very intuitive to humans. It is more or less like a white-box ( as opposed to other methods like Neural Nets that are like black boxes – we just can't make sense of the weights and layers ). A useful way to understand a Decision Tree is to visualize it. In R, we can do that directly with the plot() and text() functions from the rpart package.

With these functions, some of the text gets cut off. So, a better representation is to create a ps or postscript file, which can be either viewed directly or converted to a pdf to be viewed.

post(model, file = "tree.ps",
     title = "Iris (Setosa/Versicolor) simple decision tree ")


You can either use a ps file viewer or convert it to pdf and view it.

y_pred = predict ( model, newdata = iris[1:100,], "class")
library(caret)

cm = confusionMatrix(y_pred,iris[1:100,5])
cm

Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa         48          3         0
  versicolor      2         47         0
  virginica       0          0         0

Overall Statistics
                                          
               Accuracy : 0.95            
                 95% CI : (0.8872, 0.9836)
    No Information Rate : 0.5             
    P-Value [Acc > NIR] : < 2.2e-16
                                          
                  Kappa : 0.9             
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 0.9600            0.9400               NA
Specificity                 0.9400            0.9600                1
Pos Pred Value              0.9412            0.9592               NA
Neg Pred Value              0.9592            0.9412               NA
Prevalence                  0.5000            0.5000                0
Detection Rate              0.4800            0.4700                0
Detection Prevalence        0.5100            0.4900                0
Balanced Accuracy           0.9500            0.9500               NA

Gini Index

Let’s analyze how the Decision Tree algorithm has made these decisions. First, let’s create a scatter plot of our data.

iris_new = iris[1:100,]
plot(iris_new$Sepal.Length, iris_new$Sepal.Width, 
     col = iris_new$Species, pch = 19, 
     xlab = "Sepal Length",
     ylab = "Sepal Width")

The key parameters used by Decision Tree are either of the following

  • gini index
  • entropy

By default, rpart (like scikit-learn's DecisionTreeClassifier) uses the gini index to calculate the cut-offs. Let's focus on the gini index cost function.

Let’s look at the first cut,

Sepal.Length <= 5.45


Let’s do some calculations by hand. It will give us a better understanding of what is going on under the hood.

The formula to calculate the Gini index is

$ Gini = 1 - \sum_{i=1}^{k} p_i^2 $

where $p_i$ is the probability of occurrence of the i-th class. In our case, we have just 2 classes.

  • setosa
  • versicolor

The visual above demonstrates how the calculations have been done.

Initial gini index: with 50 data points each of setosa and versicolor, it works out to 0.5.

Gini indices after the first cut has been made: the two resulting groups have gini indices of 0.208 and 0.183.

So, the gini index after the split is

0.208 + 0.183 = 0.391


which is less than the original gini index that we started with – 0.5
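To make this arithmetic easy to reproduce, here is a small helper – just a sketch, with a name (gini_index) of our own choosing – that computes the gini index of a node from its class counts. For the starting node (50 setosa and 50 versicolor) it returns the 0.5 we started with.

gini_index = function(counts) {
    p = counts / sum(counts)   # class proportions within the node
    1 - sum(p ^ 2)             # gini index = 1 - sum(p_i^2)
}

gini_index(c(50, 50))          # starting node -> 0.5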

Now, the question arises, why did the first cut happen at sepal length <= 5.45 ? Why not at 6.0 ? To understand this, let’s actually make a cut at sepal length <= 6.0 and re-calculate the gini indices.

  • Identify the target counts by class
table(iris_new [iris_new$Sepal.Length <= 6,]$Species)

setosa versicolor  virginica 
        50         30          0 
  • Calculate the Gini Index
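Using the counts above (50 setosa and 30 versicolor fall on the left side of the cut), the arithmetic can be reproduced with a line of R – shown here both with the gini_index helper sketched earlier and written out directly.

gini_index(c(50, 30))            # gini index of the left group -> 0.46875
1 - ((50/80)^2 + (30/80)^2)      # the same calculation written out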

The gini index at the new cut-off sepal length <= 6.0 is 0.468. It is not much different from where we initially started (0.5). By now, you should be able to understand the reasons behind the classifier’s decision points.

Challenge

Try to calculate the gini index by hand (like above) when the sepal width <= 2.75

Here is a visual of how decision tree algorithm has eventually solved the problem.

Now that we understand how decision trees work, let’s try and predict some data. Let’s first split our data into train/test datasets.

data = iris_new

index = sample(1:nrow(data),nrow(data)*.8)
train = data[index,]
test = data[-index,]
library(rpart)

model <- rpart(Species ~ Sepal.Length + Sepal.Width, method="class", data=train)
y_pred = predict ( model, newdata = test, "class")

library(caret)

cm = confusionMatrix(y_pred,test[,5])
cm

Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa          7          3         0
  versicolor      0         10         0
  virginica       0          0         0

Overall Statistics
                                          
               Accuracy : 0.85            
                 95% CI : (0.6211, 0.9679)
    No Information Rate : 0.65            
    P-Value [Acc > NIR] : 0.04438
                                          
                  Kappa : 0.7             
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            0.7692               NA
Specificity                 0.7692            1.0000                1
Pos Pred Value              0.7000            1.0000               NA
Neg Pred Value              1.0000            0.7000               NA
Prevalence                  0.3500            0.6500                0
Detection Rate              0.3500            0.5000                0
Detection Prevalence        0.5000            0.5000                0
Balanced Accuracy           0.8846            0.8846               NA

That's an accuracy score of 85%. Pretty decent.

Decision Trees for Regression

Although decision trees are mostly used for classification problems, you can use them for regression as well.

Let’s try to fit the Boston Housing dataset with decision trees. Just to make things simple, let’s just use the lstat predictor to predict the target (medv).

library(mlbench)
library(caret)      # provides train(), trainControl() and RMSE()
data(BostonHousing)
boston        = BostonHousing
index = sample(1:nrow(boston),nrow(boston)*.8)
train = boston[index,]
test = boston[-index,]
model = train(medv ~ lstat, 
              data = train, 
              method = "rpart", 
              trControl = trainControl("cv", number = 15), 
              tuneLength = 15)
Warning message in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
"There were missing values in resampled performance measures."
y_pred = predict ( model, newdata = test)

# Calculate the RMSE of the predictions on the test data
RMSE(y_pred, test$medv)
5.38848765509317

Let’s plot the predictions on top of the data to visually see how well the prediction does in comparison to the actual data.

plot( test$lstat, test$medv, pch = 19, col="blue")
points ( test$lstat, y_pred, pch = 19, col = "green")

As you can see from this plot, the Decision Tree's prediction for regression is step wise (as opposed to being smooth). This is because a decision tree predicts by averaging the target values of the training points that fall into the same leaf. You can try changing the parameters in the model, like trainControl and tuneLength, and see how the prediction changes. Also, instead of criteria like the Gini Index or Entropy, decision trees use a variance/error based measure (such as RMSE) to calculate the splits in the case of regression.

Overfitting

One of the main drawbacks of Decision Trees is overfitting. We can very well observe that the model fits the training data 100%, but when it comes to test data, there would be a huge variation. Such a large variation is something that you do NOT observe in other models – say linear regression. Let’s try and fit the iris data again (this time with all the 3 species).

index = sample(1:nrow(iris),nrow(iris)*.8)
train = iris[index,]
test = iris[-index,]

library(caret)
library(rpart)


test_score     = list()
train_score = list()

for (i in 1:100) {
    
    index = sample(1:nrow(iris),nrow(iris)*.8)
    train = iris[index,]
    test  = iris[-index,]

    model        = rpart(Species ~ ., method="class", data=train)
    y_pred_train = predict ( model, newdata = train, "class") 
    y_pred_test  = predict ( model, newdata = test, "class")
    
    cm_train       = confusionMatrix(y_pred_train,train[,5])
    accuracy_train = cm_train$overall["Accuracy"]
    
    cm_test       = confusionMatrix(y_pred_test,test[,5])
    accuracy_test = cm_test$overall["Accuracy"]
    
    train_score[[i]] = accuracy_train
    test_score[[i]]  = accuracy_test
}
plot ( 1:100, unlist(train_score), pch = 19, col = "blue",
       xlab = "Number of times the model has predicted",
       ylab = "Score")
points ( 1:100, unlist(test_score), pch = 19, col = "red")
legend ( x= "topleft", legend = c("Train", "Test"), col=c("blue", "red"), pch=19, cex=0.8)


As you can see, the training accuracy stays consistently high, but the test results are all over the place (although the range is quite limited in this case). What is essentially happening is that the model is trying to learn the noise as well. One solution to this is to limit the tree size. This is called pruning. A useful tool for this is a parameter called the Complexity Parameter, or cp. It acts as a cut-off at each decision tree node: a further split is only made if it improves the cost of the entire tree so far ( with all its splits ) by at least cp. So, a simple cutoff can be used to identify at what level (of tree size) the tree should be pruned.

Luckily, rpart can give us the cp values at different tree sizes, and plotcp() can plot this for us.

Tree Depth

Look at the depth of the decision tree for iris data.

In this decision tree, after a tree depth of 3, there is no real value addition. It's basically nitpicking at the small numbers – and that's exactly what leads to overfitting. What if we could restrict the tree to just 3 levels of depth ?

model <- rpart(Species ~ ., method="class", data=iris)
plotcp(model)


As you can see from the plot above, beyond a tree size of three (at a cp value of 0.066), the tree starts to overfit the data. The relative error (on the y-axis) is computed using cross validation. Let's prune the tree at a cp of 0.066.

Pruning – Solution to Overfitting

Overfitting is basically a problem where the model tries to fit all of the training data. Since there are many borderline cases, it is not practical to fit all the data points for any ML model. In order to avoid this, we have to prune the tree (cut off some of its branches) so that it generalizes well – rather than being a 100% fit to the training data. There are 2 ways to prune a tree.

  • Pre-Pruning – Prune the decision tree while it is being created.
  • Post-Pruning – Prune the decision tree after the entire tree has been created.
model_new = prune(model, cp = 0.066)


Let’s plot the decision tree of both the old and new (pruned) model to see how they perform.

plot(model_new, uniform = TRUE)
text(model_new, use.n = TRUE, cex = 0.6) # cex controls scaling

plot(model, uniform = TRUE)
text(model, use.n = TRUE, cex = 0.6) # cex controls scaling

As you can see from the numbers, the actual separation of species is too small to warrant a new branch.

printcp(model)

Classification tree:
rpart(formula = Species ~ ., data = train, method = "class")

Variables actually used in tree construction:
[1] Petal.Length Petal.Width 

Root node error: 77/120 = 0.64167

n= 120 

        CP nsplit rel error   xerror     xstd
1 0.545455      0  1.000000 1.103896 0.064664
2 0.402597      1  0.454545 0.454545 0.064664
3 0.012987      2  0.051948 0.090909 0.033343
4 0.010000      3  0.038961 0.103896 0.035487

If you run the same train vs test plot again with the pruned model and compare how the training and the test accuracies look, you will get an understanding of how we were able to reduce the overfitting.

Naive Bayes Classifier


What is Naive Bayes

Say you get an email like so,

From : njlotterries1234@gmail.com
Subject : You won Lottery
Body : Congratulations !!! You won a lottery of 5 Million dollars. Click here to claim..

What do you think of this ? Is this a spam e-mail or not ? In all probability this is spam. How do you know it ? Well, you look at the index words – words like “lottery” , “viagra” , “free”, “money back”. When you see these words, generally you tend to classify that message as spam. This is exactly how Naive Bayes works. Let’s formalize our understanding a bit by going a bit deeper.


Bayes Theorem & Conditional Probability

Before we get into “Naive” Bayes, we have to first understand Bayes theorem. To understand Bayes theorem, we have to first understand something called Conditional Probability. What exactly is it ?

Say there is a standard deck of cards and you draw a card at random.

  • What is the probability that it is a red card ?
  • What is the probability that it is a face card, given that it is a red card ?

This is called conditional probability. Bayes theorem is an alternate way to compute the same thing.

Now, let’s calculate each one of these probabilities.

  • Probability of face card P(A)
  • Probability of a red card
  • Probability of a red card , given it is a face card.
  • And finally, we calculate the probability of a face card, given it's a red card – P ( face | red )
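Filling in the numbers for a standard 52-card deck ( 26 red cards, 12 face cards, 6 of which are red ), the pieces work out to

$ P(face) = \frac {12}{52} \hspace{0.5cm} P(red) = \frac {26}{52} \hspace{0.5cm} P(red \mid face) = \frac {6}{12} $

and Bayes theorem then gives

$ P(face \mid red) = \frac {P(red \mid face) \times P(face)}{P(red)} = \frac {(6/12) \times (12/52)}{26/52} = \frac {6}{26} \approx 0.23 $

which is the same answer you get by directly counting the 6 red face cards among the 26 red cards.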

What did we achieve here ? Looks like we have made things more complicated, right ? I agree with you. In fact, this formula is not all that useful in machine learning. But there is an assumption that makes this formula extraordinarily useful in ML. Let’s go back to the email example.

Again, not very useful. Calculating the probability of the exact phrase "You won lottery" is very arbitrary. You cannot calculate the probability of occurrence of all the different phrases or combinations of words. The next time around, the subject line might say "Congratulations!! You won lottery" – which is slightly different from "You won lottery". Point being, you cannot possibly calculate all the different combinations of words that could result from the use of all the different words in the English dictionary.

Naive Bayes

This is where the Bayes theorem becomes Naive . Let’s revisit the formula again.

The probability of the word "You" occurring in the email is independent of the word "Won" occurring. e.g.,

  • Do you have the paper with you ?
  • we have won the contract

These sentences are completely independent. When we break down the event into its respective independent events, the probability can be simplified as follows.
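With $w_1, w_2, ..., w_n$ standing for the words of the message (notation introduced here just for this write-up), the simplification is

$ P(spam \mid w_1, w_2, ..., w_n) \propto P(spam) \times P(w_1 \mid spam) \times P(w_2 \mid spam) \times ... \times P(w_n \mid spam) $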

This is actually a "Naive" assumption – because in reality, there is some level of overlap. Meaning, when you mention the word "lottery", you almost always use the word "win" or some variant like "won" or "winning". However, this is where ML is lucky. Even with the naive assumption, results are pretty good with text classification in real life. Let's apply the simplification to the Bayes theorem once again.

With a bit of naivety, this formula became so much more useful. In fact, it makes it so useful that Naive Bayes is almost exclusively used for most text classification tasks. Let’s explore this example with some rough data – just believable, made-up data.

  • Probability of “You won lottery” being spam.
  • Probability of "You won lottery" being NOT spam.

So, the probability of this phrase not being spam is 1.51.

Pretty effective, right? Especially given the simplification. Calculating the probability of the individual words is easy. The heart of this algorithm is, given any sentence, it can break the sentence down into its components (words) and, based on the "spamminess" of each of the words, classify the entire sentence as spam or not.

All we are trying to do in Naive Bayes, is to break down a complicated problem into its components. Once the component is classified, essentially the bigger piece is classified as well.

It is like solving a jigsaw puzzle. How do you solve one typically ? You look for smaller puzzles to solve. Say this is a picture of a car – you start to look for smaller components of the car, like a tire, a windshield and solve for each of these separately. Once you got the pieces figured out, all you have to do is to put them in order. Naive Bayes works more or less like this.

Classify fruits based on Characteristics

Now that we understand the basics of Naive Bayes, let's create a simple dataset and solve it in Excel. The purpose behind this exercise is to get familiar with the Naive Bayes calculation using a smaller dataset. This is going to solidify our understanding a bit further, before we dive into more complicated examples.

Solve the fruits dataset in Excel

The probability of each of the characteristics – round, large, small etc, can be calculated as below.

Now, let’s move on to the individual conditional probabilities. For example, what is the probability that a fruit is round, given that it is an apple ? In all the cases of Apple, the fruit is always round.

However, what is the probability that a fruit is red, given that it's an apple ? One out of three apples is red.
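Written out against the fruits table shown below, those two conditional probabilities are

$ P(round \mid apple) = \frac {3}{3} = 1 \hspace{0.5cm} and \hspace{0.5cm} P(red \mid apple) = \frac {1}{3} \approx 0.33 $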

Like that, we keep calculating the conditional probabilities of all the individual characteristics. Think of this like calculating the probability of each individual word being spam or not.

Time to test our data. Let’s say, we want to calculate the probability of a fruit being an Apple, if it is round and large. All we have to do is plug the numbers.

What is the probability that a fruit is an apple, if it is round, large and smooth ?

Based on our little dataset, we are not doing too bad. let’s do the opposite now. What is the probability of a fruit being a grape, given that it is round, large and smooth ?

Makes sense, right ? grape is never “large”. Hence the probability of a fruit being a grape if it is “large” is relatively small – 16 %.

Solve the fruits dataset in R

library(e1071)

fruits = read.csv("./data/fruits.csv")

fruits
fruit	round	large	small	red	green	black	golden	yellow	smooth	rough
<fct>	<fct>	<fct>	<fct>	<fct>	<fct>	<fct>	<fct>	<fct>	<fct>	<fct>
apple	yes	yes	no	yes	no	no	no	no	yes	no
apple	yes	yes	no	no	yes	no	no	no	yes	no
apple	yes	yes	no	no	no	no	yes	no	yes	no
grape	yes	no	yes	yes	no	no	no	no	yes	no
grape	yes	no	yes	no	yes	no	no	no	yes	no
grape	yes	no	yes	no	no	yes	no	no	yes	no
melon	yes	yes	no	no	yes	no	no	no	yes	no
melon	yes	yes	no	no	no	no	yes	no	no	yes
melon	yes	yes	no	no	no	no	no	yes	no	yes
model = naiveBayes(fruit ~ . , data = fruits)

pred = predict ( model , fruits[,2:11 ])
pred
apple
apple
apple
grape
grape
grape
apple
melon
melon
table(pred, fruits[,1])
pred    apple grape melon
  apple     3     0     1
  grape     0     3     0
  melon     0     0     2

That’s not bad, given such a small set of characteristics. Let’s actually get the confusion matrix to get the accuracy percentage.

library(caret)

cm = confusionMatrix(pred,as.factor(fruits[,1]))
cm

Confusion Matrix and Statistics

          Reference
Prediction apple grape melon
     apple     3     0     1
     grape     0     3     0
     melon     0     0     2

Overall Statistics
                                          
               Accuracy : 0.8889          
                 95% CI : (0.5175, 0.9972)
    No Information Rate : 0.3333          
    P-Value [Acc > NIR] : 0.0009653
                                          
                  Kappa : 0.8333          
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: apple Class: grape Class: melon
Sensitivity                1.0000       1.0000       0.6667
Specificity                0.8333       1.0000       1.0000
Pos Pred Value             0.7500       1.0000       1.0000
Neg Pred Value             1.0000       1.0000       0.8571
Prevalence                 0.3333       0.3333       0.3333
Detection Rate             0.3333       0.3333       0.2222
Detection Prevalence       0.4444       0.3333       0.2222
Balanced Accuracy          0.9167       1.0000       0.8333

That’s an accuracy of almost 90%. We are not very far off, given our dataset is pretty small. The one place where we went wrong is in classifying a melon as an apple. If we compare the predictions vs the actuals, we can see that we went wrong with the 7th entry ( a melon being mis-classified as an apple ).

predict = pred
actual = fruits[,1]

data.frame(predict,actual)

predict	actual
<fct>	<fct>
apple	apple
apple	apple
apple	apple
grape	grape
grape	grape
grape	grape
apple	melon
melon	melon
melon	melon

Let’s check out the actual entry.

As you can see, the entry for melon ( watermelon ) coincides in its data points with the green apple. How could this happen ? This is because of an oversimplification with regards to size. We only have 2 sizes – small and large. However, both the apple and the watermelon are large ( and round and smooth ). And that's why the NB algorithm got it wrong. If we had an extra size characteristic ( say XL ), that would have solved this problem.


Classify messages as Spam

Now that we understood the basics of Naive Bayes along with a simple example in excel and R, we can proceed to solve the problem that we started with – To classify a message as spam or not.


Step 1 – Get the dataset

There is a simple SMS ( text message ) dataset available at Kaggle or at the UCI ML datasets repository. You can also download the file from Ajay Tech's github page. Download the zip file and open it in Excel as a tab delimited format. Each of these messages has been classified as either spam or ham ( ham is just a technical word for "non-spam" ). Open the dataset in Excel as a tab-delimited format and give column names ( if not available already ).

Step 2 – Read the dataset into R

data = read.csv("./data/spam.csv", encoding='ISO-8859-1')
head(data)
class	message	X	X.1	X.2
<fct>	<fct>	<fct>	<fct>	<fct>
1	ham	Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...			
2	ham	Ok lar... Joking wif u oni...			
3	spam	Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's			
4	ham	U dun say so early hor... U c already then say...			
5	ham	Nah I don't think he goes to usf, he lives around here though			
6	spam	FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv			
data = data[,c(1,2)]
head(data)
class	message
<fct>	<fct>
1	ham	Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
2	ham	Ok lar... Joking wif u oni...
3	spam	Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
4	ham	U dun say so early hor... U c already then say...
5	ham	Nah I don't think he goes to usf, he lives around here though
6	spam	FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv

Step 3 – Simple EDA

  • How many messages are there in the dataset ?
nrow(data)
5572
summary(data$class)

ham
4825
spam
747
  • Out of them, count the occurances of spam vs ham(non-spam)
  • What percentage of this is spam ?
summary(data$class)["spam"] / summary(data$class)["ham"] * 100
spam: 15.4818652849741

So there are roughly 15 spam messages for every 100 ham messages – about 13% of all the messages are spam.


Step 4 – Feature Engineering

Just like we converted the fruits dataset’s feature values from “yes” or “no” to a 1 or 0 , Naive Bayes (or for that matter most ML algorithms) need the feature data to be numeric in nature. In order to do it, we have to use some techniques from Natural language processing.

  • Tokenize the message (into words) and create a sparse matrix

This process basically splits the sentence (message) into its individual words. Let's see a sample before we tokenize the entire dataset.
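As a quick illustration – a sketch using the tm package that is also used for the real dataset below, on a single made-up message – tokenization looks like this:

library(tm)

sample_corpus = Corpus(VectorSource(c("You won a lottery of 5 Million dollars")))
sample_dtm    = DocumentTermMatrix(sample_corpus)
inspect(sample_dtm)   # one row for the message, one column per extracted term, cells hold term counts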

Now, let’s do the same on our real messages dataset.

library(tm)
message_corpus = Corpus(VectorSource(data$message))
print ( message_corpus)
<<SimpleCorpus>>
Metadata:  corpus specific: 1, document level (indexed): 0
Content:  documents: 5572

message_dtm <- DocumentTermMatrix(message_corpus)

The document term matrix (DTM) is stored in a sparse format. So, we can't just print it out using indices. Instead, we use the inspect ( ) function.

inspect(message_dtm[1:10,1:20])

<<DocumentTermMatrix (documents: 10, terms: 20)>>
Non-/sparse entries: 21/179
Sparsity           : 90%
Maximal term length: 19
Weighting          : term frequency (tf)
Sample             :
    Terms
Docs amore available buffet... bugis cine crazy.. got great jurong there
  1      1         1         1     1    1       1   1     1      1     1
  10     0         0         0     0    0       0   0     0      0     0
  2      0         0         0     0    0       0   0     0      0     0
  3      0         0         0     0    0       0   0     0      0     0
  4      0         0         0     0    0       0   0     0      0     0
  5      0         0         0     0    0       0   0     0      0     0
  6      0         0         0     0    0       0   0     0      0     1
  7      0         0         0     0    0       0   0     0      0     0
  8      0         0         0     0    0       0   0     0      0     0
  9      0         0         0     0    0       0   0     0      0     0

Step 5 – Train/Test data split

Before we use the DTM as-is, we have to convert the 0,1's to Factors – like a Yes and No. This is because Naive Bayes works well with Factors. Let's write a small function that converts all values greater than 0 to a Yes and otherwise to a No.

counts_to_factor = function(x){
  x = ifelse(x > 0, 1, 0)
  x = factor(x, levels = c(0,1), labels = c("No", "Yes"))
  return (x)
}
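A quick sanity check of the helper on a hypothetical vector of counts:

counts_to_factor(c(0, 2, 0, 1))   # returns the factor: No Yes No Yes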


Before we apply this function to the DTM, let’s split the data into training and test datasets.


index = sample(1:nrow(data),nrow(data)*.8)
train = data[index,]
test  = data[-index,]

msg_cor_train      = Corpus(VectorSource(train$message))
msg_train_dtm      = DocumentTermMatrix(msg_cor_train)
msg_train_dtm      = apply(msg_train_dtm, MARGIN = 2, counts_to_factor)
msg_class_train    = train$class


msg_cor_test       = Corpus(VectorSource(test$message))
msg_test_dtm       = DocumentTermMatrix(msg_cor_test)
msg_test_dtm       = apply(msg_test_dtm, MARGIN = 2, counts_to_factor)
msg_class_test     = test$class
head(msg_train_dtm[,1:5])

2wks	87077	87077:	club	free
1	Yes	Yes	Yes	Yes	Yes
2	No	No	No	No	No
3	No	No	No	No	No
4	No	No	No	No	No
5	No	No	No	No	No
6	No	No	No	No	No
msg_train_df = as.data.frame(as.matrix(msg_train_dtm))
msg_test_df  = as.data.frame(as.matrix(msg_test_dtm))

head(msg_train_df)
been	curtsey?	have	practising	you	your	off.	pissed	pretty	whatever,	...	not..tel	clearer..	sections	above	da..al	coins	factory	chart	heroes,	tips
<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	...	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>	<dbl>
1	1	1	1	1	1	1	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
2	0	0	0	0	0	0	1	1	1	1	...	0	0	0	0	0	0	0	0	0	0
3	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
4	0	0	0	0	0	1	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
5	0	0	0	0	0	0	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0
6	0	0	0	0	3	1	0	0	0	0	...	0	0	0	0	0	0	0	0	0	0

Step 6 – Model the data

library(e1071)
model = naiveBayes(msg_train_dtm, msg_class_train)

Step 7 – Evaluate the model.

pred = predict(model, msg_test_dtm)
table(msg_class_test, pred)

pred
msg_class_test ham spam
          ham  950   13
          spam  18  134 


Measure the accuracy using the confusion matrix from the caret library.

library(caret)

cm = confusionMatrix(pred,msg_class_test)
cm

Confusion Matrix and Statistics

          Reference
Prediction  ham spam
      ham  2385   46
      spam   12  343
                                          
               Accuracy : 0.9792          
                 95% CI : (0.9732, 0.9842)
    No Information Rate : 0.8604          
    P-Value [Acc > NIR] : < 2.2e-16
                                          
                  Kappa : 0.9101          
                                          
 Mcnemar's Test P-Value : 1.47e-05        
                                          
            Sensitivity : 0.9950          
            Specificity : 0.8817          
         Pos Pred Value : 0.9811          
         Neg Pred Value : 0.9662          
             Prevalence : 0.8604          
         Detection Rate : 0.8561          
   Detection Prevalence : 0.8726          
      Balanced Accuracy : 0.9384          
                                          
       'Positive' Class : ham             
                                   

There is scope for a ton of optimization here like

  • convert all characters to lower case
  • remove punctuation
  • remove stop words etc

But that is a subject for another day. Here we will just focus on learning the Naive Bayes algorithm.

Challenge

Let’s solve another problem in Naive Bayes. Load up a dataset called house-votes-84.csv from the data folder. The data set should look like this.

These are the results from Congressmen in the US, voting a Yes ( for ) or No (Against ) on 16 different issues. Instead of putting names, the class column identifies the congressmen as either a Republican or a Democrat.

Task – Identify the congressmen as either a Democrat or Republican based on his voting pattern.

Solution – This problem is very similar to the fruits data we started with at the beginning of learning Naive Bayes.

# 1. Import the dataset
library(mlbench)
library(e1071)    # provides naiveBayes()

data(HouseVotes84, package = "mlbench")

data = HouseVotes84
head(data)

# 2. train/test split
index = sample(1:nrow(data),nrow(data)*.8)
train = data[index,]
test = data[-index,]

# 3. model the data
model = naiveBayes(Class ~ ., data = train)

# 4. predict the data
pred = predict(model, test)

# 5. Accuracy
table(pred, test$Class)

library(caret)

cm = confusionMatrix(pred,test$Class)
print (cm)
Class	V1	V2	V3	V4	V5	V6	V7	V8	V9	V10	V11	V12	V13	V14	V15	V16
<fct>	<fct>	<fct>	<fct>	<fct>	<fct>	<fct>	<fct>	<fct>	<fct>	<fct>	<fct>	<fct>	<fct>	<fct>	<fct>	<fct>
1	republican	n	y	n	y	y	y	n	n	n	y	NA	y	y	y	n	y
2	republican	n	y	n	y	y	y	n	n	n	n	n	y	y	y	n	NA
3	democrat	NA	y	y	NA	y	y	n	n	n	n	y	n	y	y	n	n
4	democrat	n	y	y	n	NA	y	n	n	n	n	y	n	y	n	n	y
5	democrat	y	y	y	n	y	y	n	n	n	n	y	NA	y	y	y	y
6	democrat	n	y	y	n	y	y	n	n	n	n	n	n	y	y	y	y
pred         democrat republican
  democrat        121          5
  republican       21         71
Confusion Matrix and Statistics

            Reference
Prediction   democrat republican
  democrat        121          5
  republican       21         71
                                          
               Accuracy : 0.8807          
                 95% CI : (0.8301, 0.9206)
    No Information Rate : 0.6514          
    P-Value [Acc > NIR] : 1.002e-14
                                          
                  Kappa : 0.7496          
                                          
 Mcnemar's Test P-Value : 0.003264        
                                          
            Sensitivity : 0.8521          
            Specificity : 0.9342          
         Pos Pred Value : 0.9603          
         Neg Pred Value : 0.7717          
             Prevalence : 0.6514          
         Detection Rate : 0.5550          
   Detection Prevalence : 0.5780          
      Balanced Accuracy : 0.8932          
                                          
       'Positive' Class : democrat    

Challenge – IMDB review Sentiment Analysis

Similar to the SPAM/HAM problem, we can also predict if an IMDB review is positive or negative based on the words in it.

# step 1 - Read the data file
library("xlsx")
data = read.xlsx("./data/imdb-reviews-sentiment.xlsx", sheetIndex = 1,  header=TRUE)

# step 2 - Create a DTM based on the text data
library(tm)
message_corpus = Corpus(VectorSource(data$review))
message_dtm <- DocumentTermMatrix(message_corpus)

# step 3 - function to convert the integers to "Yes" or "No" factors in the DTM
counts_to_factor = function(x){
  x = ifelse(x > 0, 1, 0)
  x = factor(x, levels = c(0,1), labels = c("No", "Yes"))
  return (x)
}

# step 4 - Split the data into train and test sets and convert the DTM counts to "Yes"/"No" factors
index = sample(1:nrow(data),nrow(data)*.8)
train = data[index,]
test  = data[-index,]

msg_cor_train      = Corpus(VectorSource(train$review))
msg_train_dtm      = DocumentTermMatrix(msg_cor_train)
msg_train_dtm      = apply(msg_train_dtm, MARGIN = 2, counts_to_factor)
msg_class_train    = train$sentiment


msg_cor_test       = Corpus(VectorSource(test$review))
msg_test_dtm       = DocumentTermMatrix(msg_cor_test)
msg_test_dtm       = apply(msg_test_dtm, MARGIN = 2, counts_to_factor)
msg_class_test     = test$sentiment

# step 5 - model the data using Naive Bayes
library(e1071)
model = naiveBayes(msg_train_dtm, msg_class_train)

# step 6 - predict the results from the model using the test data
pred = predict(model, msg_test_dtm)

# step 7 - get the accuracy from the confusion matrix
library(caret)
cm = confusionMatrix(pred, msg_class_test)
print (cm)

Confusion Matrix and Statistics

          Reference
Prediction negative positive
  negative        0        0
  positive        0     2000
                                     
               Accuracy : 1          
                 95% CI : (0.9982, 1)
    No Information Rate : 1          
    P-Value [Acc > NIR] : 1
                                     
                  Kappa : NaN        
                                     
 Mcnemar's Test P-Value : NA         
                                     
            Sensitivity : NA         
            Specificity :  1         
         Pos Pred Value : NA         
         Neg Pred Value : NA         
             Prevalence :  0         
         Detection Rate :  0         
   Detection Prevalence :  0         
      Balanced Accuracy : NA         
                                     
       'Positive' Class : negative   
                                     

Naive Bayes on continuous variables

So far, we have seen Naive Bayes work on factor variables. Does NB ever work on continuous variables ? Yes, it does – of course, with a discretized version of those variables ( think of binning a normal distribution ). The key assumption there would be that the variable has a normal distribution. For example, think of the iris dataset – is the "Sepal length" of the setosa species normally distributed ? Let's find out.

from sklearn import datasets

iris = datasets.load_iris()

iris_data    = iris.data
iris_target  = iris.target

# matplotlib does not have the ability to plot the kernel density function
import matplotlib.pyplot as plt
# So, we are using seaborn instead
import seaborn as sns
%matplotlib inline

# You can check from these curves that Sepal data is normally distributed, but
# the petal data is not. Try them on one by one.

sns.distplot(iris_data[:,0], hist=True, kde=True)
sns.distplot(iris_data[:,1], hist=True, kde=True)
sns.distplot(iris_data[:,2], hist=True, kde=True)
sns.distplot(iris_data[:,3], hist=True, kde=True)

Only the Sepal data is normally distributed. Ideally, we should just be using the sepal data ( Sepal Length and Sepal Width ). However, let’s just use all of these and see what happens. As an exercise, try using just the sepal data and check for the accuracy.

# 1. train/test split
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(iris_data , iris_target, test_size=0.2)  

# 2. Naive Bayes modeling
from sklearn.naive_bayes import MultinomialNB

model = MultinomialNB().fit(X_train, y_train)  

# 3. Predict data
y_predict = model.predict(X_test)

# 4. Create a confusion matrix to check accuracy
import pandas as pd

print ( pd.crosstab(y_test, y_predict, rownames=['Actual'], colnames=['Predicted'], margins=True) )

# 5. Print the accuracy score
from sklearn.metrics import confusion_matrix, accuracy_score

print ( confusion_matrix(y_test, y_predict) )
print ( accuracy_score(y_test,y_predict))
Predicted  0   1   2  All
Actual                   
0          6   0   0    6
1          0  11   1   12
2          0   1  11   12
All        6  12  12   30
[[ 6  0  0]
 [ 0 11  1]
 [ 0  1 11]]
0.9333333333333333


That’s pretty accurate as well, isn’t it ? In fact, even though one of the assumptions ( that all the variables are independent of each other ) is wrong, Naive Bayes still outperforms some other classification algorithms.

  • The priors ( the probability of a "Setosa" occurring or a "Virginica" occurring ) are 0.33 ( a third ) each – which we know.
  • How about the conditional probabilities ? This is where it gets tricky for continuous variables. You cannot have conditional probabilities for each of the values ( as the number can get infinite ). So, in case of a normal distribution, an approximation is applied based on the following formula.

$ P(x_i \mid y) = \frac {1}{\sqrt{2 \pi \sigma_y^2}} \; e^{- \frac {(x_i - \mu_y)^2}{2 \sigma_y^2}} $

where μ is the mean and σ² is the variance of that feature for the given class.
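As a rough sketch of what that approximation looks like in R ( using dnorm; the choice of setosa's Sepal.Length and the test value 5.0 are just for illustration ):

mu    = mean(iris$Sepal.Length[iris$Species == "setosa"])
sigma = sd(iris$Sepal.Length[iris$Species == "setosa"])

dnorm(5.0, mean = mu, sd = sigma)   # approximate likelihood of Sepal.Length = 5.0 given setosa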

Logistic Regression

What is Logistic Regression

Logistic regression is a type of linear regression. However, it is used for classification only. Huh.. that’s confusing, right ? Let’s dive in.

Let’s take the simple iris dataset. The target variable as you know by now ( from day 9 – Introduction to Classification in Python, where we discussed classification using K Nearest neighbours ) is categorical in nature. Let’s load the data first.

head(iris)
Sepal.Length	Sepal.Width	Petal.Length	Petal.Width	Species
<dbl>	<dbl>	<dbl>	<dbl>	<fct>
1	5.1	3.5	1.4	0.2	setosa
2	4.9	3.0	1.4	0.2	setosa
3	4.7	3.2	1.3	0.2	setosa
4	4.6	3.1	1.5	0.2	setosa
5	5.0	3.6	1.4	0.2	setosa
6	5.4	3.9	1.7	0.4	setosa

Let’s simplify the dataset to just 2 species –

  • 0 – setosa
  • 1 – versi-color

Let’s just take data for 2 of the species ( say setosa and versi-color ) with just the sepal data ( sepal length and sepal width ) and plot it.

iris_data = iris[1:100,]
plot(iris_data$Sepal.Length, iris_data$Sepal.Width, 
     col = iris_data$Species, pch = 19, 
     xlab = "Sepal Length",
     ylab = "Sepal Width")


Let’s simplify this further – say, we wanted to predict the species based on a single parameter – Sepal Length. Let’s first plot it.

plot ( iris_data[,1], iris_data[,5],
       pch = 19, col = "blue",
       xlab = "Sepal Length",
       ylab = "Setosa or Versicolor")


We know that regression is used to predict a continuous variable. What about a categorical variable like this one (species) ? If we can draw a curve like this,

and then, for all target values predicted with a value > 0.5, put them in one category, and for all target values less than 0.5, put them in the other category – like this.

A linear regression (multilinear in this case) equation looks like this.

$ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_n x_n $

Logistic regression is very similar to linear regression. The difference lies in how the predictor is calculated. Let's see it in the next section.


Math

The name logistic regression is derived from the logit function. This function is based on odds.

logit function

Let’s take an example. A standard dice roll has 6 outcomes. So, what is the probability of landing a 4 ?

$ P_4 = \frac {1}{6} $

Now, what about odds ? The odds of landing a 4 are

$ odds = \frac {P_4}{1 - P_4} \hspace{0.5cm} where \hspace{0.5cm} P_4 = \frac {1}{6} $

So, when we substitute the probability into the odds equation, it becomes

$ odds = \frac {1/6}{5/6} = \frac {1}{5} $

OK. Now that we understand Probability and Odds, let's get to the log of odds.

How exactly is logistic regression similar to linear regression ? Like so.

$ log \left( \frac {p}{1-p} \right) = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + ... + \beta_n x_n $

where the predictor ( the log of odds ) varies between -∞ and +∞.

To understand this better, let’s plot the log of odds between a probabilty value of 0 and 1.

x = runif ( 100, min = 0.1, max = 0.9)
x = sort ( x )
y = log ( x / (1-x))

plot ( x, y , type="l")
grid(5, 5, lwd = 2)


This is the logistic regression curve. It maps a probability value ( 0 to 1 ) to a number ( -∞ to +∞ ). However, we are not looking for a continuous variable, right ? The predictor we are looking for is a categorical variable – in our case, we said we would be able to predict this based on probability (a small sketch of this mapping follows the list below).

  • p >= 0.5 – Category 1
  • p < 0.5 – Category 2
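As a tiny sketch of that mapping in R ( p here is a hypothetical vector of predicted probabilities ):

p = c(0.12, 0.67, 0.45, 0.93)
ifelse(p >= 0.5, "Category 1", "Category 2")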

In order to calculate those probabilities, we would have to calculate the inverse function of the logit function.


sigmoid function

The inverse of the logit curve is the inverse-logit or sigmoid function ( or expit function ). The sigmoid function transforms the numbers ( -∞ to +∞ ) back to values between 0 and 1. Here is the formula for the sigmoid function.

$ sigmoid(x) = \frac {e^x}{1 + e^x} = \frac {1}{1 + e^{-x}} $

x_new = y
y_new = exp(x_new) / (1 + exp(x_new))

plot (x_new, y_new, type="l")
grid(5, 5, lwd = 2)

Essentially, if we flip the logit function by 90° (mirror it along the diagonal), we get the sigmoid function.

Here is the trick – As long as we are able to find a curve like the one below, although the target (predictor) is a value between 0 and 1 ( probabilities), we can say that all values below 0.5 ( half way mark ) belongs to one category and the remaining ( values above 0.5 ) belong to the next category. This is the essence of logistic regression.


Implementation

Let’s try to implement the logistic regression function in R step by step.

Data & Modeling

Just to keep the same example going, let’s try to fit the sepal data (sepal length and sepal width) to predict the species as either Setosa or Versicolor.

model = glm(Species ~ Sepal.Length + Sepal.Width, 
            data=iris_data, 
            family=binomial(link="logit"))

Warning message:
"glm.fit: algorithm did not converge"
Warning message:
"glm.fit: fitted probabilities numerically 0 or 1 occurred"

y_pred = predict(model, iris_data, type="response")
y_pred 
1
2.22044604925031e-16
2
2.22044604925031e-16
3
2.22044604925031e-16
4
2.22044604925031e-16
5
2.22044604925031e-16
6
2.22044604925031e-16
7
2.22044604925031e-16
8
2.22044604925031e-16
9
2.22044604925031e-16
10
2.22044604925031e-16
11
2.22044604925031e-16
12
2.22044604925031e-16
13
2.22044604925031e-16
14
2.22044604925031e-16
15
2.22044604925031e-16
16
2.22044604925031e-16
17
2.22044604925031e-16
18
2.22044604925031e-16
19
7.94264732123352e-13
20
2.22044604925031e-16
21
7.2365355113674e-11
22
2.22044604925031e-16
23
2.22044604925031e-16
24
2.22044604925031e-16
25
2.22044604925031e-16
26
2.22044604925031e-16
27
2.22044604925031e-16
28
2.22044604925031e-16
29
2.22044604925031e-16
30
2.22044604925031e-16
31
2.22044604925031e-16
32
7.2365355113674e-11
33
2.22044604925031e-16
34
2.22044604925031e-16
35
2.22044604925031e-16
36
2.22044604925031e-16
37
6.31794371395771e-10
38
2.22044604925031e-16
39
2.22044604925031e-16
40
2.22044604925031e-16
41
2.22044604925031e-16
42
9.0266451369787e-10
43
2.22044604925031e-16
44
2.22044604925031e-16
45
2.22044604925031e-16
46
2.22044604925031e-16
47
2.22044604925031e-16
48
2.22044604925031e-16
49
2.22044604925031e-16
50
2.22044604925031e-16
51
1
52
1
53
1
54
1
55
1
56
1
57
1
58
0.999999999144556
59
1
60
0.999999999998715
61
1
62
1
63
1
64
1
65
1
66
1
67
1
68
1
69
1
70
1
71
1
72
1
73
1
74
1
75
1
76
1
77
1
78
1
79
1
80
1
81
1
82
1
83
1
84
1
85
0.999999998977491
86
1
87
1
88
1
89
1
90
1
91
1
92
1
93
1
94
1
95
1
96
1
97
1
98
1
99
1
100
1

These are probability values. We need to convert them to the actual factors (setosa & versicolor), because we are dealing with just 2 classes. We can just use a simple ifelse ( ) syntax to convert all values > 0.5 to versicolor and < 0.5 to setosa.

y_pred_levels = as.factor(ifelse(y_pred > 0.5,"versicolor","setosa"))
y_pred_levels
1
setosa
2
setosa
3
setosa
4
setosa
5
setosa
6
setosa
7
setosa
8
setosa
9
setosa
10
setosa
11
setosa
12
setosa
13
setosa
14
setosa
15
setosa
16
setosa
17
setosa
18
setosa
19
setosa
20
setosa
21
setosa
22
setosa
23
setosa
24
setosa
25
setosa
26
setosa
27
setosa
28
setosa
29
setosa
30
setosa
31
setosa
32
setosa
33
setosa
34
setosa
35
setosa
36
setosa
37
setosa
38
setosa
39
setosa
40
setosa
41
setosa
42
setosa
43
setosa
44
setosa
45
setosa
46
setosa
47
setosa
48
setosa
49
setosa
50
setosa
51
versicolor
52
versicolor
53
versicolor
54
versicolor
55
versicolor
56
versicolor
57
versicolor
58
versicolor
59
versicolor
60
versicolor
61
versicolor
62
versicolor
63
versicolor
64
versicolor
65
versicolor
66
versicolor
67
versicolor
68
versicolor
69
versicolor
70
versicolor
71
versicolor
72
versicolor
73
versicolor
74
versicolor
75
versicolor
76
versicolor
77
versicolor
78
versicolor
79
versicolor
80
versicolor
81
versicolor
82
versicolor
83
versicolor
84
versicolor
85
versicolor
86
versicolor
87
versicolor
88
versicolor
89
versicolor
90
versicolor
91
versicolor
92
versicolor
93
versicolor
94
versicolor
95
versicolor
96
versicolor
97
versicolor
98
versicolor
99
versicolor
100
versicolor
library(caret)

cm = confusionMatrix(y_pred_levels,iris_data[,5])
cm

Warning message in levels(reference) != levels(data):
"longer object length is not a multiple of shorter object length"
Warning message in confusionMatrix.default(y_pred_levels, iris_data[, 5]):
"Levels are not in the same order for reference and data. Refactoring data to match."

Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa         50          0         0
  versicolor      0         50         0
  virginica       0          0         0

Overall Statistics
                                     
               Accuracy : 1          
                 95% CI : (0.9638, 1)
    No Information Rate : 0.5        
    P-Value [Acc > NIR] : < 2.2e-16
                                     
                  Kappa : 1          
                                     
 Mcnemar's Test P-Value : NA         

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                    1.0               1.0               NA
Specificity                    1.0               1.0                1
Pos Pred Value                 1.0               1.0               NA
Neg Pred Value                 1.0               1.0               NA
Prevalence                     0.5               0.5                0
Detection Rate                 0.5               0.5                0
Detection Prevalence           0.5               0.5                0
Balanced Accuracy              1.0               1.0               NA

Basic Evaluation

Let’s split up the data into training and test data and model it. This time let’s do the full iris dataset. Since there are 3 species to be predicted, we cannot use glm with a “binomial” family. Let’s use another library called nnet instead.

library(nnet)
index = sample(1:nrow(iris),nrow(iris)*.8)
train = iris[index,]
test  = iris[-index,]

model = multinom(Species~.,data = train)

# weights:  18 (10 variable)
initial  value 131.833475 
iter  10 value 11.516467
iter  20 value 4.881298
iter  30 value 4.469920
iter  40 value 4.263054
iter  50 value 3.911756
iter  60 value 3.823284
iter  70 value 3.598069
iter  80 value 3.591202
iter  90 value 3.570975
iter 100 value 3.570835
final  value 3.570835 
stopped after 100 iterations
pred = predict(model,test)


As usual, to evaluate categorical target data, we use a confusion matrix.

library(caret)

cm = confusionMatrix(pred, as.factor(test$Species))
cm

Confusion Matrix and Statistics

            Reference
Prediction   setosa versicolor virginica
  setosa         16          0         0
  versicolor      0          8         1
  virginica       0          2         3

Overall Statistics
                                          
               Accuracy : 0.9             
                 95% CI : (0.7347, 0.9789)
    No Information Rate : 0.5333          
    P-Value [Acc > NIR] : 1.989e-05
                                          
                  Kappa : 0.8315          
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: setosa Class: versicolor Class: virginica
Sensitivity                 1.0000            0.8000           0.7500
Specificity                 1.0000            0.9500           0.9231
Pos Pred Value              1.0000            0.8889           0.6000
Neg Pred Value              1.0000            0.9048           0.9600
Prevalence                  0.5333            0.3333           0.1333
Detection Rate              0.5333            0.2667           0.1000
Detection Prevalence        0.5333            0.3000           0.1667
Balanced Accuracy           1.0000            0.8750           0.8365

That’s a 90% accuracy score (Kappa ≈ 0.83).
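If you want the individual numbers rather than the printed summary, caret’s confusion-matrix object exposes them directly. A minimal sketch, assuming the cm object created above:

# overall metrics such as Accuracy and Kappa are stored in cm$overall
cm$overall["Accuracy"]
cm$overall["Kappa"]

# per-class metrics (one row per species) are stored in cm$byClass
cm$byClass[, c("Sensitivity", "Specificity", "Balanced Accuracy")]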

Optimization

Let’s plot the logistic regression curve for a two-class subset of the data (versicolor and virginica).

Step 1 – Get the data

# keep only versicolor and virginica (rows 51-150) and sort by Sepal.Length
iris_data = iris[51:150,]
iris_data = iris_data[order(iris_data$Sepal.Length),]
head(iris_data)
    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
58           4.9         2.4          3.3         1.0 versicolor
107          4.9         2.5          4.5         1.7  virginica
61           5.0         2.0          3.5         1.0 versicolor
94           5.0         2.3          3.3         1.0 versicolor
99           5.1         2.5          3.0         1.1 versicolor
60           5.2         2.7          3.9         1.4 versicolor

Step 2 – Model the data using a classifier

# Species still carries the unused "setosa" level after subsetting rows 51-150;
# drop it inside the formula so glm fits a proper two-class (versicolor vs virginica) model
model = glm( droplevels(Species) ~ Sepal.Length, data = iris_data , family = binomial)


Step 3 – Plot the Logit curve.

library(pROC)

# plot the two classes (internal factor codes 2 and 3) against Sepal.Length
plot( iris_data$Sepal.Length, iris_data$Species, pch = 19, col = iris_data$Species)

# overlay the fitted logistic curve; the + 2 shifts the 0-1 fitted probabilities
# up to the 2-3 range used by the factor codes on the y-axis
points ( iris_data$Sepal.Length, model$fitted.values + 2, col = "orange", type="l", lty=2, lwd=3)

As you can see, there are still quite a few mis-classifications. All the false negatives and false positives in the plot above are examples of mis-classification. Irrespective of the algorithm used to calculate the fit, there is only so much that can be done to increase classification accuracy given the data as-is. The rates associated with true positives and true negatives have their own names:

  • Sensitivity ( True Positive Rate )
  • Specificity ( True Negative Rate )
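
As a quick illustration of those two rates, here is a minimal sketch that computes them from a hypothetical 2x2 confusion matrix of raw counts (the tp_count, fn_count, fp_count and tn_count names are just placeholders for this example, not part of the dataset):

# hypothetical counts: 11 true positives, 1 false negative,
#                       1 false positive, 12 true negatives
tp_count = 11; fn_count = 1
fp_count = 1;  tn_count = 12

sensitivity = tp_count / (tp_count + fn_count)   # true positive rate
specificity = tn_count / (tn_count + fp_count)   # true negative rate

cat("Sensitivity =", sensitivity, "\n")
cat("Specificity =", specificity, "\n")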

There is a specific optimization that can be done – increase the accuracy of one segment of the confusion matrix at the expense of the other segments. For example, consider the confusion matrix for our dataset.

For this dataset, classifying the species as “setosa” is positive and “versicolor” as negative.

  • setosa – positive
  • versicolor – negative

Let’s actually calculate the values. Say the confusion matrix of counts looks like this: [[11 1] [ 1 12]] – 11 true positives and 1 false positive in the first row, and 1 false negative and 12 true negatives in the second. Each figure below expresses a count as a fraction of its row.

tp = (11) / (11 + 1)
fp = (1)  / (11 + 1)
fn = (1)  / (1 + 12)
tn = (12) / (12 + 1)

cat ("True Positive = ", tp, "\n")
cat ("False Positive = ", fp, "\n")
cat ("True Negative = ", tn, "\n")
cat ("False Negative = ", fn, "\n")
True Positive =  0.9166667 
False Positive =  0.08333333 
True Negative =  0.9230769 
False Negative =  0.07692308 

What if we want to predict 100% of setosa ( or at least a much more accurate classification than 0.9 )? Of course, as we discussed earlier, it will come at a cost. However, there is a use case for this scenario: if getting a particular classification right is extremely important, we focus more on that classification than on the others. Have you seen the Brad Pitt movie “World War Z”? A plague emerges all around the world and a refuge is set up in Israel behind a high wall. Before anyone is allowed inside, they make absolutely sure that the person does not have the plague. If having the plague is what we call positive, then essentially you maximize the true-positive cell of the confusion matrix.

Another example: if you were diagnosing cancer patients, you would rather increase the odds of flagging a patient who really has cancer (a true positive), even if it comes at the cost of wrongly classifying a non-cancer patient as positive (a false positive). The former can save a life, while the latter only leads to unnecessary follow-up.
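
One practical way to do this kind of optimization is to change the probability cutoff used to call the positive class. The sketch below is illustrative only: it reuses the two-class model and iris_data objects from the Optimization steps above (so the positive class here is virginica), and the 0.5 and 0.3 cutoffs are arbitrary example values, not tuned thresholds.

# observed class for each row (versicolor / virginica)
actual = droplevels(iris_data$Species)

# default cutoff of 0.5 on the fitted probabilities P(virginica)
pred_50 = ifelse(model$fitted.values > 0.5, "virginica", "versicolor")
table(Predicted = pred_50, Actual = actual)

# lowering the cutoff to 0.3 calls "virginica" more often: more true positives
# for virginica, at the cost of more versicolor rows wrongly labelled virginica
pred_30 = ifelse(model$fitted.values > 0.3, "virginica", "versicolor")
table(Predicted = pred_30, Actual = actual)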

Evaluation

ROC Curve

Receiver Operating Characteristic – also called the ROC curve – is a measure of how good the classification is. The pROC package has an easy way to create a ROC curve and calculate the area under it. First off, let’s start with a classifier like logistic regression.

Step 1 – Get the data

iris_data = iris[51:150,]
iris_data = iris_data[order(iris_data$Sepal.Length),]

# drop the unused "setosa" level so glm fits a binary (versicolor vs virginica) model
model = glm( droplevels(Species) ~ Sepal.Length, data = iris_data, family = binomial)
library(pROC)

# iris_data$Species still has 3 declared levels; the rows we kept use the internal
# codes 2 (versicolor) and 3 (virginica). as.integer() extracts those codes and
# subtracting 2 turns them into a 0/1 factor for roc().
roc = roc ( factor (as.integer(iris_data$Species) - 2), model$fitted.values, plot = TRUE, legacy.axes = TRUE)

Setting levels: control = 0, case = 1

Setting direction: controls < cases

Area under the curve is an indicator of how accurate our classifier is. You can get it as follows.

roc$auc
0.7896
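
The roc object can also tie the curve back to the cutoff discussion above. For example, pROC’s coords() helper reports the threshold that best balances sensitivity and specificity (Youden’s criterion by default). A minimal sketch, assuming the roc object created above:

# probability cutoff that maximizes sensitivity + specificity, along with the
# sensitivity and specificity achieved at that cutoff
coords(roc, "best", ret = c("threshold", "sensitivity", "specificity"))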
