Distribution Plot

Experimental and Theoretical Probabilities

Suppose you conduct an experiment in which a fair coin is tossed ‘n’ times and the outcome of every toss, heads or tails, is recorded. We now have data consisting of all the observed outcomes of the experiment. Say we want to know the probability of the coin landing heads up; for this we calculate the experimental probability. Experimental probability is the probability of an event estimated from an experiment that was actually conducted, so it is based on direct observation. It is the ratio of the number of trials with a certain outcome to the total number of trials.

Experimental probability = Number of event occurrences/Total number of trials

There is another type of probability, called theoretical probability, for which you do not conduct an experiment. To determine the theoretical probability of an event, you analyze what could happen using the information available, or apply an existing formula or model.

Theoretical probability of an event = Number of favourable outcomes/Total number of possible outcomes (when all outcomes are equally likely)

The experimental and theoretical probabilities may differ, because experimental probability is based on how the event actually occurred during the experiment, while theoretical probability is based on a study of the outcomes that can occur. The experimental probability tends to get closer to the theoretical probability as the number of trials increases, that is, when ‘n’ is very large.
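The convergence described above can be illustrated with a small simulation. This is only a sketch: the function name experimental_probability and the fixed seed are assumptions made for the example, and the coin is simulated rather than tossed.

```python
import random

def experimental_probability(n_tosses, seed=0):
    """Toss a simulated fair coin n_tosses times and return the share of heads."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses

# One favourable outcome (heads) out of two equally likely outcomes.
theoretical = 1 / 2

for n in (10, 1_000, 100_000):
    p = experimental_probability(n)
    print(f"n={n:>7}: experimental={p:.4f}, theoretical={theoretical:.4f}")
```

With only a few tosses the experimental value can be far from 0.5; with many tosses it usually settles close to it.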

Let us now see how to compare the experimental probability with the theoretical probability using distribution plots in the Seaborn library.

Introduction to Distribution Plots

A distribution plot shows the distribution of a single (univariate) variable. It is used to assess the distribution visually by comparing the observed data with the expected outcomes. A distribution plot displays two different plots for the same variable: a histogram that represents the observed data, with a density curve superimposed on it to represent the expected outcomes.

Let us load the ‘tips’ dataset and pass one of its columns to the distplot() function for our analysis. The function has the following signature:

seaborn.distplot(a, bins=None, hist=True, kde=True, rug=False, fit=None, hist_kws=None, kde_kws=None, rug_kws=None, fit_kws=None, color=None, vertical=False, norm_hist=False, axlabel=None, label=None, ax=None)

Create Distribution Plots

Overlay KDE plot on histogram

import seaborn as sns
from matplotlib import pyplot as plt

tips = sns.load_dataset('tips')
tips.head()

   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4

sns.distplot(a=tips['total_bill'])
plt.show()


By default the distplot displays a histogram overlaid with a density curve. If you want to omit the histogram, the parameter ‘hist’ can be set to False. If you want to omit the density curve, the parameter ‘kde’ can be set to False.

In the above distplot, the horizontal axis of the histogram covers the entire range of data values. The vertical axis is rescaled to show density rather than counts: the height of each bar is the relative frequency of its interval divided by the interval width, where the relative frequency is the count of data values that fall into the interval divided by the total number of values. As a result, the area of each bar equals the relative frequency of its interval, and the areas of all the bars sum to 1. The area under the density curve represents probability: the area over an interval estimates the probability that a value falls in that interval. Just as in the histogram, the total area under the density curve equals 1.
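The rescaling of the vertical axis can be checked by hand. The following is a minimal sketch in plain Python, not Seaborn's own binning code; the helper density_histogram and the bill amounts are hypothetical.

```python
def density_histogram(values, n_bins):
    """Return (edges, densities) for equal-width bins whose bar areas sum to 1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value is placed in the last bin rather than past it.
        index = min(int((v - lo) / width), n_bins - 1)
        counts[index] += 1
    # Bar height = relative frequency divided by the bin width.
    densities = [c / (len(values) * width) for c in counts]
    edges = [lo + i * width for i in range(n_bins + 1)]
    return edges, densities

# Hypothetical bill amounts, not taken from the 'tips' dataset.
bills = [3.0, 8.5, 10.3, 12.0, 13.4, 15.0, 16.9, 18.2, 21.0, 23.7, 24.6, 31.0, 48.0]
edges, densities = density_histogram(bills, n_bins=5)

bin_width = edges[1] - edges[0]
total_area = sum(d * bin_width for d in densities)
print(f"total area of the bars: {total_area:.6f}")
```

Each bar's area (height times width) is the share of values in that bin, which is why the areas add up to 1 regardless of the number of bins.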

As the plot shows, the distribution of ‘total_bill’ is unimodal and skewed to the right, which is evident from both the histogram and the density curve. The histogram and the density curve match each other well, indicating that the observed data and the expected outcomes are very close.

Overlay Rug plot on KDE

sns.distplot(a=tips['total_bill'],hist=False,rug=True)
plt.show()

The distplot above displays a density curve and a rug plot. The rug plot is displayed by setting the parameter ‘rug’ to True. A rug plot shows the distribution of a variable by drawing a short vertical line on a single axis for each data value. The height of the vertical lines is arbitrary and carries no information. A rug plot alone does not give an adequate picture of the distribution, so it is used alongside other plots such as a histogram or a kernel density estimate.

In the above plot, most of the values are clustered in the range 10-28. Towards the tail ends there are relatively few values.

Overlay Normal Distribution curve on histogram

from scipy.stats import norm
sns.distplot(a=tips['total_bill'],fit=norm,kde=False)
plt.show()


In the previous example, we superimposed a kernel density curve on the histogram in order to compare the theoretical probability with the experimental probability. We observed that ‘total_bill’ is right-skewed and that the histogram matches the density curve well.

In this example, we will assume that the data in column ‘total_bill’ has a normal distribution. With this assumption, we will overlay a normal distribution curve on the histogram. The scipy.stats module contains a large number of probability distributions. We import the normal distribution object ‘norm’ from this module and pass it to the argument ‘fit’ of the distplot function, which fits it to the data and draws the resulting curve. Set the parameter ‘kde’ to False so that you only see the histogram and the normal distribution curve.

In the above plot, the normal distribution curve, displayed in black, represents the theoretical probability. It has the following properties:

  • The mean, median and mode are equal and are located at the center of the distribution.
  • It is a bell-shaped curve. A bell curve has most of the values at the center of the distribution, about the mean, and relatively few values in the tails.
  • It is symmetric about the mean value and has no skew, which means that the mean is at the center and divides the area into equal halves.

If you observe the histogram, the distribution is skewed to the right. Most of the values are small and are concentrated towards the left of the distribution, while the right tail, consisting of a few exceptionally large values, is stretched to the right. These exceptional values pull the mean to the right, so the mean is greater than the median. We can infer from the distplot above that the data in column ‘total_bill’ is not normally distributed, as the histogram and the normal distribution curve do not match. Note that the normal probability model is not appropriate for data that is strongly skewed.
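The mismatch between a normal model and skewed data can also be seen numerically. The sketch below uses Python's statistics.NormalDist instead of scipy, so the fitted standard deviation may differ slightly from what scipy's norm would estimate; the bill amounts are hypothetical and are not taken from the ‘tips’ dataset.

```python
from statistics import NormalDist, median

# Hypothetical right-skewed bill amounts (assumption made for illustration).
bills = [3, 8, 10, 11, 12, 13, 14, 15, 16, 17, 18, 20, 22, 25, 30, 35, 40, 48, 50]

# Estimate the mean and standard deviation of a normal model from the sample.
fitted = NormalDist.from_samples(bills)

# Under a normal model exactly half of the values lie below the mean.
expected_below_mean = fitted.cdf(fitted.mean)
observed_below_mean = sum(b < fitted.mean for b in bills) / len(bills)

print(f"mean = {fitted.mean:.2f}, median = {median(bills)}")
print(f"share below the mean: observed = {observed_below_mean:.2f}, "
      f"normal model = {expected_below_mean:.2f}")
```

In this sample the mean lies above the median and well over half of the values fall below the mean, which a symmetric normal curve cannot reproduce.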

Customizing the Distribution Plots

sns.distplot(a=tips['total_bill'],hist_kws={'histtype':'step','align':'left','color':'red','lw':3},
                                  kde_kws={'color':'g','shade':True})
plt.show()

We can change the appearance of the distplot by passing dictionaries to the keyword arguments ‘hist_kws’, ‘kde_kws’ and ‘rug_kws’. These let us change properties such as the color, line width, line style and transparency of the individual plots.

Violin plot

Introduction to Violin plots

A box plot provides statistical information such as the median, range and quartiles, but it does not show the shape of the distribution of a data set. If a distribution has multiple peaks, it is not easy to see this from a box plot. In comparison, a violin plot shows the detailed distribution of a numeric variable. A violin plot is a combination of a box plot and a density plot: it is a box plot with a density plot on each side. The density plots on either side of the box plot are symmetric, as they show the distribution of the same numeric variable.

Inside each violin is a box plot of the data. The left and right edges of the inner box represent the first and third quartiles, and the white circle inside the box marks the median of the data. The thin black lines extending from the box are the whiskers, which represent the lowest and highest values in the data, excluding outliers.

The density plot estimates the probability density of a variable producing a smooth curve. A violin plot shows the distribution’s density using the width of the plot.
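To make the idea of a density estimate concrete, here is a minimal sketch of a Gaussian kernel density estimate in plain Python. It is not Seaborn's implementation; the helper gaussian_kde, the sample values and the bandwidth of 0.5 are all assumptions made for illustration.

```python
from statistics import NormalDist

def gaussian_kde(samples, bandwidth):
    """Return a function that evaluates a Gaussian kernel density estimate."""
    # One small bell curve is centered on every data value.
    kernels = [NormalDist(mu=x, sigma=bandwidth) for x in samples]

    def density(t):
        # Average the heights of all the bell curves at position t.
        return sum(k.pdf(t) for k in kernels) / len(kernels)

    return density

# Hypothetical tip amounts; the bandwidth is an arbitrary choice.
tips_sample = [1.0, 1.5, 2.0, 2.0, 2.5, 3.0, 3.0, 3.5, 4.0, 5.0, 6.5]
density = gaussian_kde(tips_sample, bandwidth=0.5)

# A violin is widest where the estimated density is highest.
for t in (2.0, 4.0, 6.0):
    print(f"density at {t}: {density(t):.3f}")

# Riemann sum over a grid from -5 to 15 to check the total area.
step = 0.01
area = sum(density(-5 + i * step) * step for i in range(2000))
print(f"area under the curve: {area:.3f}")
```

A smaller bandwidth gives a bumpier curve and a larger one gives a smoother curve, so the apparent number of peaks in a violin depends partly on this choice.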

Creating plots using violinplot() function

Create a violin plot (Example1)

The violinplot() function in Seaborn is used to visualise the distribution of numeric data and to compare different categories or groups. Let us draw some violin plots using this function.

For our analysis we will load the ‘tips’ dataset, which consists of the tips received by a waiter in a restaurant over a period of time. Let us visualize the quantitative data in column ‘total_bill’ by passing this column to the parameter ‘x’.

import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
import numpy as np

tips = sns.load_dataset('tips')
tips.head()

   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4

sns.set(style="whitegrid")
sns.violinplot(data=tips,x='total_bill',orient='h')
plt.title('Total bill')
plt.show()


The figure above displays a violin plot which is a combination of a box plot and a density plot. Notice that the density plots on either side of the box plot are symmetric as they both display the distribution of the column ‘total_bill’. The x-axis displays the range of values in the column ‘total_bill’.

The range of the box plot is approximately 38, excluding the outliers. The median, indicated by the white circle, is somewhere around 17 and is closer to the lower end of the box, which means the data points to the left of the median are clustered. The median is far from the upper end of the box, which means the data points to the right of the median are scattered.

Notice that the whisker on the upper end of the box is longer. If one whisker is longer than the other, the distribution of the data is skewed in the direction of the longer whisker. Overall, the distribution is positively skewed (skewed right). A distribution that has most of its values clustered on the left side and has a long right tail is said to be positively skewed.

Next let us interpret the density plot. The plot has one distinct peak. The right tail is clearly longer than the left tail, which means the distribution contains a few extremely large values. A long right tail indicates that the distribution is right-skewed, or positively skewed. In a right-skewed distribution, the mean is greater than the median because the tail of the distribution pulls the mean towards it. This shows that the mean is strongly influenced by extreme values, whereas the median is not. The density plot also confirms that the data in column ‘total_bill’ is positively skewed.

Create a violin plot (Example2)

Next let us plot the data in column ‘tip’ using the violinplot() function.

sns.set(style="whitegrid")
sns.violinplot(data=tips,x='tip')
plt.title('Tips')
plt.show()

The distribution of ‘tip’ is bimodal. There are two peaks, one at around 2 and the other slightly above 3. The violin plot shows the detailed distribution of the data, including the positions of the peaks; this information would be missed in a box plot.

Compare violin plots of different categories

In the next example, we will compare the tips received from two groups, male and female customers. Let us plot the data with the violinplot() function using the ‘sex’ and ‘tip’ columns.

sns.set(style="whitegrid")
sns.violinplot(data=tips,y='sex',x='tip')
plt.title('Tips')
plt.show()

The categorical variable ‘sex’ is on the y-axis and the quantitative variable ‘tip’ is on the x-axis.

Let us compare the two violin plots – Male and Female:

  • Both density plots rise up to a certain point, after which the Female plot falls gradually, while the Male plot has another peak at approximately 3.5.
  • Male and Female density plots look similar for low values.
  • Male has a more elongated distribution compared to Female.
  • The median for Male is pulled slightly towards Q3, while for Female the median is at the center.
  • Both the distributions are positively skewed.

Overall, it appears that female customers tip less than male customers.

Compare violin plots of different sub-categories

Previously we have split the data into two groups based on the categorical variable. Let us further split these groups into sub-groups by introducing a third variable.

In the next example, the data in column ‘smoker’ is passed to the ‘hue’ parameter, which further categorizes the data and creates two side-by-side violin plots for each group, Male and Female. Each sub-group is represented by a violin plot of a separate color, and a legend is added to identify each sub-category. Using violin plots, let us see who tips better.

sns.violinplot(data=tips,x='tip',y='sex',hue='smoker')
plt.title('Tips')
plt.show()


In the plot above, the Male-Smoker group has a longer-tailed distribution than any other group. Comparing the violin plots within each category, we see that smokers have a longer-tailed distribution than non-smokers.

Split violin plots

Violin plots are useful for comparing distributions across different categories. When the hue variable is binary, as ‘smoker’ is with its two possible values ‘Yes’ and ‘No’, we can set the parameter ‘split’ to True to draw each violin in two halves, with the distribution of one sub-category on each side. Each half represents one sub-category, so two groups can be compared within the same violin. This is very useful for directly comparing the widths of the distributions.

sns.violinplot(data=tips,x='tip',y='sex',hue='smoker',split=True)
plt.title('Tips')
plt.show()


The box plot inside the first violin displays the distribution of the whole Male group; it does not display the distribution of each sub-group separately. Similarly, the box plot inside the second violin displays the distribution of the whole Female group. It can be seen from the plot that in both the Male and Female categories, smokers have a long-tailed distribution.

Customizing the violin plots

Instead of displaying a box plot inside the density plot, it is possible to display only the quartile positions as lines by setting the parameter ‘inner’ to ‘quartile’. This reduces visual noise. The violin plot below displays the quartiles Q1, Q2 (median) and Q3 as vertical lines at the respective positions.

sns.set(style="whitegrid")
sns.violinplot(data=tips,x='tip',y='day',hue='sex',inner='quartile',split=True)
plt.title('Tips')
plt.show()


It is also possible to display all the data points in the distribution by setting the parameter ‘inner’ to ‘points’ (dots) or ‘sticks’ (vertical lines).

sns.set(style="whitegrid")
sns.violinplot(data=tips,x='tip',y='day',hue='sex',inner='points',split=True)
plt.title('Tips')
plt.show()

sns.set(style="whitegrid")
sns.violinplot(data=tips,x='tip',y='day',hue='sex',inner='sticks',split=True)
plt.title('Tips')
plt.show()


Passing None (the Python value, not a string) to the parameter ‘inner’ will draw unadorned violins. If no value is passed, a box plot will be displayed by default.

Box plots

Spread of data and Central tendency

To analyze the distribution of quantitative data, we use two important features: spread of data and central tendency. The spread of data or the variability of the values in the data indicates how spread out a distribution is. Central tendency provides a descriptive summary of a data set through a single value that reflects the center of the data distribution.

Measure the spread of data

The two most commonly used measures of the spread of a distribution are the range and the interquartile range (IQR).

Range: The range is the difference between the highest and lowest values in a data set and is the simplest measure of spread.

Interquartile Range (IQR): The IQR is the difference between the third quartile (Q3) and the first quartile (Q1), and indicates the spread of the middle 50% of the data.

In the above figure, we have two datasets Height_GroupA and Height_GroupC which represent the heights (in centimeters) of students belonging to different groups. Both groups have 15 students each. The data points for both the datasets are displayed on a number line in the figure.

The range of Height_GroupA is 78.6, while the range of Height_GroupC is 28.4. When the variation is large, the values lie far away from each other; when the variation is small, the values lie close together. It is also evident from the above figure that the values in Height_GroupA are more dispersed, or spread out, than the values in Height_GroupC.
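Both measures of spread are easy to compute. The sketch below uses Python's statistics module with hypothetical heights, since the actual Height_GroupA and Height_GroupC values are not listed in the text. Several conventions exist for computing quartiles, so other tools may return slightly different values for Q1 and Q3.

```python
from statistics import quantiles

# Hypothetical heights in centimeters (assumption, not the data in the figure).
heights = [150, 152, 155, 158, 160, 163, 166, 170, 175]

data_range = max(heights) - min(heights)

# quantiles(..., n=4) returns the three cut points Q1, Q2 (median) and Q3.
q1, q2, q3 = quantiles(heights, n=4, method="inclusive")
iqr = q3 - q1

print(f"range = {data_range}")               # range = 25
print(f"Q1 = {q1}, Q3 = {q3}, IQR = {iqr}")  # Q1 = 155.0, Q3 = 166.0, IQR = 11.0
```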

Measure the Central tendency

The three most common measures of central tendency are the mean, median, and mode.

Mean: The mean is the sum of the values in a set of data divided by the number of values in the data set.

Median: The median is the middle value in the data set when the values are arranged in ascending or descending order. 

Mode: The mode is the most frequently occurring value in the dataset.

The values in Height_GroupC range from 78.4 to 106.8 and the mean is 96.1, which can be used to represent the values in the data set. Let us change the last value in this data set to 300. Can the new mean, 108.9, be used to represent all the values in this group? The new mean is larger than every value in the data set except the modified one, and does not reflect the typical height of a student. The mean is not a robust measure because it is strongly influenced by very low or very high values in the data.

For distributions that have extreme values, we can use the median instead of the mean to represent the entire data set with a single value. The median is the middle value of an ordered data set: half of the data lies below the median and the other half above it. The median is less likely to be influenced by very high or very low values in the data set. In the above figure, the number enclosed in a circle is the median of the data set.
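The effect of a single extreme value on the mean and the median can be reproduced with a few lines of Python. The heights below are hypothetical, since the actual Height_GroupC values are not listed in the text, so the numbers differ from those quoted above.

```python
from statistics import mean, median

# Hypothetical heights in centimeters (assumption, not the Height_GroupC data).
heights = [78, 85, 90, 95, 100, 104, 106]
print(mean(heights), median(heights))              # 94 95

# Replace the last value with an extreme one, as described in the text.
modified = heights[:-1] + [300]
print(round(mean(modified), 1), median(modified))  # 121.7 95
```

The mean jumps above every original value, while the median does not move at all.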

Introduction to Box plots and five-number summary

The spread or variability of a distribution can be represented graphically using box plots. Box plots, also called box-and-whisker plots, are used to visually represent the distribution of numerical data. They display the spread or range of the data and show any signs of skewness. Skewness in statistics represents an imbalance: it measures the lack of symmetry in a data distribution.

Box plots are constructed using five key values. The key values are called a five-number summary. The five-number summary provides information as to where the data in a data set is clustered or where the data is scattered.

A five-number summary consists of the below numbers:

Minimum: The lowest value in the data, excluding outliers.

First Quartile (Q1): Twenty-five percent of values in the data fall below the first quartile value.

Median (Q2): When the data values are arranged in increasing order, the median is the value located at the center. It is generally preferred over the mean for skewed distributions.

Third Quartile (Q3): Seventy-five percent of the values fall below the third quartile.

Maximum: The highest value in the data, excluding outliers.

The first quartile, median and the third quartile divide the data into four equal parts, so that there will be approximately equal number of values in each of the parts.

As the name indicates, a box-and-whisker plot has two parts: a rectangle (the box) and a set of whiskers. Whiskers are the lines that extend from the box out to the lowest and highest values that are not outliers. Outliers are the data points that lie beyond the whiskers; they are unusually large or small values in the data. Because they are so extreme, outliers can distort summary statistics such as the mean. In a box plot, outliers are usually shown as dots beyond the ends of the whiskers, away from the rest of the data.
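The text does not say how far a point must be from the box to count as an outlier. A common convention, which is also the default of the ‘whis’ parameter of boxplot(), is the 1.5 × IQR rule. The sketch below applies that rule to hypothetical bill amounts; it illustrates the convention and is not Seaborn's own code.

```python
from statistics import quantiles

# Hypothetical bill amounts, already sorted; not taken from the 'tips' dataset.
bills = [3, 10, 12, 14, 15, 17, 18, 20, 22, 24, 26, 45, 50]

q1, q2, q3 = quantiles(bills, n=4, method="inclusive")
iqr = q3 - q1

# Fences: values more than 1.5 * IQR beyond the box are treated as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

inside = [b for b in bills if lower_fence <= b <= upper_fence]
outliers = [b for b in bills if b < lower_fence or b > upper_fence]

# The whiskers end at the most extreme values that are not outliers.
five_number_summary = (min(inside), q1, q2, q3, max(inside))
print(five_number_summary)  # (3, 14.0, 18.0, 24.0, 26)
print(outliers)             # [45, 50]
```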

Box plot for symmetric distribution

Observe the above box plot for the dataset Height_GroupA. The lower edge of the rectangle denotes the First quartile, the upper edge denotes the Third quartile and the line at the centre is the median. The median is located at the center of the box, and it divides the dataset into two equal parts. When the median is at the center of the box, and the whiskers are about the same distance from the median on both sides of the box, then the distribution is said to be symmetric.

Range –> 169.2 – 90.6 = 78.6

IQR –> 159.5 – 99.9 = 59.6

The Range and IQR have high values which means the spread of values in the dataset is large.

Box plot for negatively skewed distribution

The median divides the dataset into two equal parts. In the above box plot, the median is not located at the center of the box but is drawn closer to the third quartile(Q3). The median is closer to the upper end of the box because it gets pulled in the direction of cluster of values. Observe the values to the left of the median, they are more spread out or scattered. In this case, the distribution is said to be negatively skewed (skewed left). A distribution that has most of the values clustered on the right side of the distribution and has a long left tail is said to be negatively skewed.

  • Range –> 169.2-90.6 = 78.6
  • IQR –> 159.5-99.8 = 59.7

The Range and IQR have high values which means the spread of values in the dataset is large.

Box plot for negatively skewed distribution with less spread

The median for the above box plot is closer to the third quartile(Q3) for the same reason as explained in the earlier example. In this example, the whisker on the lower end of the box is longer. If one whisker is longer than the other, the distribution of the data is skewed in the direction of the longer whisker. In this example, the distribution is negatively skewed (skewed left).

  • Range –> 106.8-78.4 = 28.4
  • IQR –> 104.7-90.9 = 13.8

Now that we have a basic understanding of box plots, let us graph some box plots using the Seaborn library. The boxplot() function in Seaborn is used to visualise the distribution of numeric data and also compare different categories or groups.

Creating plots using boxplot() function

Box plot for single distribution

Box plots can be used to visualize single or multiple distributions. Let us look at an example that visualizes a single distribution. For our analysis we will load the ‘tips’ dataset, which consists of the tips received by a waiter in a restaurant over a period of time. The input data can be passed directly to the x, y, and/or hue parameters of the boxplot() function. Let us visualize the quantitative data in column ‘total_bill’ by passing this column to the parameter ‘x’.

import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
import numpy as np

tips = sns.load_dataset('tips')
tips.head()

   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4

sns.boxplot(data=tips,x='total_bill')
plt.title('Total bill')
plt.show()

The box plot is placed in the x-y plane and displays the distribution of the quantitative variable using the five-number summary plotted along the x-axis. From the above box plot, we see that the amount spent by customers in the restaurant ranges from about $3 to $41, excluding the outliers. The median is somewhere around 17 and is closer to the lower end of the box, which means the data points to the left of the median are clustered. The median is far from the upper end of the box, which means the data points to the right of the median are scattered.

Notice that the whisker on the upper end of the box is longer. If one whisker is longer than the other, the distribution of the data is skewed in the direction of the longer whisker. Overall, the distribution is positively skewed (skewed right). A distribution that has most of its values clustered on the left side and has a long right tail is said to be positively skewed.

The dots located outside the right whisker are the outliers or the unusually large values in the data.
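The five-number summary and the outlier rule behind the whiskers can be reproduced numerically. The sketch below uses a small made-up list of bill amounts (not the tips dataset) and assumes the usual 1.5 × IQR whisker rule, which is the default in boxplot().

```python
import numpy as np

# Hypothetical bill amounts, made up for illustration (not the tips data)
bills = np.array([8.5, 10.3, 12.0, 13.4, 15.0, 16.9,
                  17.8, 19.5, 22.0, 24.6, 28.1, 48.3])

q1, median, q3 = np.percentile(bills, [25, 50, 75])
iqr = q3 - q1

# Whiskers reach the most extreme points lying within 1.5 * IQR of the box
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr
inside = bills[(bills >= lower_limit) & (bills <= upper_limit)]
outliers = bills[(bills < lower_limit) | (bills > upper_limit)]

print("Q1, median, Q3:", q1, median, q3)
print("whisker ends:", inside.min(), inside.max())
print("outliers:", outliers)
```

Only the value 48.3 falls beyond the upper limit here, so it is the single point that a box plot of this sample would draw as a dot.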

The above box plot is oriented horizontally, orientation can be changed by explicitly passing ‘v’ or ‘h’ to parameter ‘orient’.

Compare box plots of different categories – 1

In the next example, we will compare the tips received from two groups: male and female customers. Let us plot our data with the boxplot() function using the ‘sex’ and ‘tip’ columns.

sns.boxplot(data=tips,x='sex',y='tip')
plt.title('Tips received from male and female customers')
plt.show()

Compare different categories

In the tips dataset, ‘sex’ is a categorical grouping variable that splits the data into two parts: male and female customers. So one box plot is generated for each category. The categorical variable ‘sex’ is mapped to the x-axis and the quantitative variable ‘tip’ is mapped to the y-axis.

Let us compare the two box plots – Male and Female:

  1. The two box plots have roughly identical medians.
  2. Male is more spread out compared to Female.
  3. Male has multiple outliers whereas Female has one outlier.
  4. Both the distributions are positively skewed.

Overall, judging by the spread and the outliers, it appears that female customers tip less than male customers.
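A visual comparison like this can be backed up with numbers. The following sketch computes the quartiles and the interquartile range per group with pandas; the small DataFrame is hypothetical and only stands in for the ‘tips’ data.

```python
import pandas as pd

# Hypothetical tips, made up for illustration (not the seaborn 'tips' data)
df = pd.DataFrame({
    "sex": ["Male", "Male", "Male", "Male",
            "Female", "Female", "Female", "Female"],
    "tip": [2.0, 3.0, 3.5, 6.0, 2.5, 3.0, 3.0, 3.5],
})

# Quartiles per group: these are the numbers that define each box
summary = df.groupby("sex")["tip"].quantile([0.25, 0.5, 0.75]).unstack()
summary["iqr"] = summary[0.75] - summary[0.25]   # height of the box
print(summary)
```

A larger ‘iqr’ value corresponds to a taller box, that is, a more spread-out group.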

Compare box plots of different categories – 2

In this example, let us analyse the effects of the day of week on the bill amount at the restaurant. Let us pass the data in columns ‘day’ and ‘total_bill’ to the arguments ‘x’ and ‘y’ of the boxplot() function.

sns.boxplot(data=tips,x='day',y='total_bill',palette='Set1')
plt.title('Sales Analysis')
plt.show()

Compare different categories

Looking at the four box plots for Thur, Fri, Sat and Sun, we can say that customer spending is higher during the weekend.

Side-by-side box plots

Previously we have split the data into groups based on the categorical variable. Let us further split these groups into sub-groups by introducing a third variable.

In the next example, the data in column ‘smoker’ is passed to the ‘hue’ parameter, which further categorizes the data and creates two side-by-side box plots for each group, that is, Male and Female. Each sub-group is represented by a separately colored box plot, and a legend is added to show what each sub-category is. Using these box plots, let us see who tips better.

sns.boxplot(data=tips,x='sex',y='tip',hue='smoker')
plt.title('Grouped Boxplots')
plt.show()

Side-by-side box plots

It is evident from the box plots that Males tip more than Females. If we compare the sub-groups (Smoker/Non-Smoker), Smokers tip better than Non-Smokers.

KDE plots


Histogram for the sum of two dice

Let us begin by setting up the input data. Suppose we conduct an experiment by rolling two dice multiple times and record every outcome of the experiment, which is the sum of the two faces. We will visualize the distribution of this data using a histogram.

We will call the randint() function defined in the numpy library to generate an array of random integers between 2 (smallest possible sum) and 12 (highest possible sum). Note that randint() draws every integer in this range with equal probability, so this is a simplification: with real dice, a sum of 7 is more likely than a sum of 2 or 12. This array is then passed to the hist() function in the Matplotlib library to generate a histogram. Initially the sample data contains 10 observations.

import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
import numpy as np
x = np.random.randint(2,13,10)

x
array([ 2,  7,  6,  6,  4,  3,  3, 12,  2,  9])
plt.hist(x)
plt.show()

The hist() function automatically calculates the size of each bin of the histogram. The entire range of the input data is split into equal-sized bins, and the number of data points that fall into each bin is counted. A histogram displays the distribution of the data and gives an estimate of where the values are concentrated. From the above histogram, we observe that the outcomes 2, 3 and 6 have each occurred 2 times, while the other observed outcomes (4, 7, 9 and 12) have occurred once.
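To mimic real dice more closely, each die can be rolled separately and the two faces added. This is a minimal sketch, assuming NumPy's default_rng generator (not used in the original code), an arbitrary seed and a sample size of 10,000 chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)   # seed chosen arbitrarily for reproducibility

# Roll two dice 10,000 times: each die is uniform on 1..6
rolls = rng.integers(low=1, high=7, size=(10_000, 2))
sums = rolls.sum(axis=1)

# Count how often each sum from 2 to 12 occurred
counts = np.bincount(sums, minlength=13)[2:]
for total, count in zip(range(2, 13), counts):
    print(total, count)
```

The counts rise towards 7 and fall off towards 2 and 12, which a histogram of ‘sums’ would show as a triangular shape.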

Histogram for the sum of three dice

In the next example, we will see the distribution of the data when 3 dice are rolled. The sample data contains 50 observations.

y = np.random.randint(3,19,50)
y
array([ 7,  9,  6, 13, 12, 11, 16, 13, 17,  6, 16, 10, 14, 18,  7, 11,  6,
       15, 15, 13, 10, 16, 15, 14, 17,  4,  9,  4, 12,  4, 13, 17,  7,  6,
        8, 13,  9, 13,  9, 17,  5, 17,  7,  7, 15,  8, 15, 17, 10, 17])
plt.hist(y)
plt.show()

The above histogram treats the data as continuous and groups the values into intervals. Each bin is spaced two numbers apart: 4-6, 6-8, 8-10 and so on. Suppose the default number of bins does not provide sufficient detail about our distribution. Let us change a few parameters of the histogram: the number of bins and the range of the bins.

plt.hist(y,bins=24,range=(0,20))
plt.show()


As you can see, changing the number of bins and the range affects the appearance of the histogram. When we change the number of bins, the data points get grouped differently, and the different grouping affects the appearance of the histogram. The appearance of a histogram can change markedly with different choices of the number of bins and the end points, leading to different interpretations of the same data. An alternative is to use kernel density plots, which remove the dependency on the end points of the bins.
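The effect of the binning choice can also be seen without drawing anything, by counting with np.histogram(). Because the three-dice values above are random, the sketch uses a fixed, made-up sample instead.

```python
import numpy as np

# Fixed made-up sample standing in for the random three-dice data
data = np.array([4, 5, 6, 6, 7, 7, 7, 8, 9, 9,
                 10, 10, 11, 12, 13, 13, 15, 16, 17, 17])

# Same data, same number of bins, two different ranges
counts_a, edges_a = np.histogram(data, bins=5)                  # range = (min, max)
counts_b, edges_b = np.histogram(data, bins=5, range=(0, 20))   # range set by hand

print(counts_a, edges_a)
print(counts_b, edges_b)
```

Both calls count the same 20 values, yet the counts per bin differ because the bin edges differ.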

Kernel Density Estimate (KDE)

In the previous examples, we used a histogram to estimate the distribution of data. Kernel density estimation (KDE) is another widely used technique for estimating the distribution of data. In a histogram, each value in the dataset is represented by a rectangular block, and the blocks are piled on top of each other in the bins to show the number of values in each bin range. In a KDE plot, each data point is represented by a small curve, called a kernel, which can have different shapes such as a box, a triangle or a Gaussian curve, so each data point contributes a small area around its true value. A KDE plot is produced by drawing this kernel for every individual data point along an axis and then adding all of the curves together to obtain a single smooth density estimate. Unlike a histogram, KDE produces a smooth estimate. Because the individual kernels are allowed to overlap, the dependency on the end points of the bins is removed.
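The "one kernel per data point, then add them up" idea can be written in a few lines of NumPy. This is only an illustrative sketch with a Gaussian kernel and a hand-picked bandwidth of 1.5; it is not how Seaborn computes its estimate internally, and the function name is made up.

```python
import numpy as np

def gaussian_kde_manual(samples, grid, bandwidth):
    """Average one Gaussian bump per sample, evaluated at every grid point."""
    samples = np.asarray(samples, dtype=float)
    # Shape (len(grid), len(samples)): scaled distance of each grid point to each sample
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))
    return kernels.mean(axis=1)

x = np.array([2, 7, 6, 6, 4, 3, 3, 12, 2, 9])   # the ten values from the first example
grid = np.linspace(-5, 20, 501)
density = gaussian_kde_manual(x, grid, bandwidth=1.5)

# A density should enclose an area of (approximately) 1
area = density.sum() * (grid[1] - grid[0])
print(round(area, 3))
```

Plotting ‘density’ against ‘grid’ with plt.plot() gives a smooth curve of the same kind that kdeplot() draws.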

Visualising data using histogram and KDE plot

The kernel width or bandwidth controls the smoothness of the resulting density curve. If the bandwidth is too small, the density estimate has too many peaks making the distribution difficult to interpret. On the other hand, if the bandwidth is too large, then the information about real distribution and subtle features of the data under analysis will be obscured. The bandwidth has to be chosen appropriately such that it highlights all the important features while maintaining smoothness.

KDE plot with low bandwidth
KDE plot with high bandwidth
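The influence of the bandwidth can be checked by counting the peaks of the density curve for several bandwidths. The sketch below uses a hand-written Gaussian KDE and bandwidth values picked only for illustration; the helper names are hypothetical.

```python
import numpy as np

def kde(samples, grid, bandwidth):
    # Average of one Gaussian kernel per sample
    z = (grid[:, None] - samples[None, :]) / bandwidth
    return (np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2 * np.pi))).mean(axis=1)

def count_peaks(curve):
    # A peak is a grid point strictly higher than both of its neighbours
    return int(np.sum((curve[1:-1] > curve[:-2]) & (curve[1:-1] > curve[2:])))

samples = np.array([2, 7, 6, 6, 4, 3, 3, 12, 2, 9], dtype=float)
grid = np.linspace(-10, 25, 1401)

for bandwidth in (0.2, 1.0, 4.0):
    print(bandwidth, count_peaks(kde(samples, grid, bandwidth)))
```

A very small bandwidth leaves roughly one peak per distinct value, while a large one merges everything into a single broad bump.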

Seaborn provides the kdeplot() function to plot a univariate or bivariate kernel density estimate. Let's generate a KDE plot using the dataset ‘x’ created above. The bandwidth of the kernel can be adjusted using the ‘bw’ argument. (In seaborn 0.11 and later, ‘bw’ is deprecated in favor of ‘bw_method’ and ‘bw_adjust’.)

Univariate Kernel Density Estimate

A Univariate plot is based on a single variable.

sns.kdeplot(x)
plt.title('KDE Plot')
plt.show()


Above we see a KDE plot for the dataset ‘x’. Note that the y-axis shows the estimated probability density, not the number of data points falling into a range. The density curve has one distinct peak, indicating that the distribution is unimodal.

KDE plot with low bandwidth

sns.kdeplot(x,bw=0.1)
plt.title('KDE Plot with low bw')
plt.show()


We see that when the bandwidth is too low, the density curve has too many peaks and appears to be multimodal (having multiple peaks).

KDE plot with high bandwidth

sns.kdeplot(x,bw=1)
plt.title('KDE Plot with high bw')
plt.show()


If the bandwidth is too large, then smaller features in the distribution of the data may disappear. In the above plot small bumps got smoothed out obscuring important information. This is also known as over-smoothing of the curve.

Customizing the KDE plot

sns.kdeplot(y,shade=True)
plt.show()


By setting the parameter ‘shade’ to True, the area under the density curve is filled with a color.

sns.kdeplot(y,vertical=True)
plt.show()


If the parameter ‘vertical’ is set to True, the plot is rotated: the data values are placed on the y-axis and the density is plotted along the x-axis.

Bivariate Kernel Density Estimate

While a univariate KDE is based on one random variable, a bivariate KDE is based on two random variables. The kdeplot() function in Seaborn can be used to generate a bivariate KDE, which reveals the relationship between the two variables. The bivariate KDE has a three-dimensional, bell-shaped appearance. Even though 3D plots are visually appealing, they are difficult to read because some parts of the plot are blocked by other parts, and not all applications support rotation of 3D plots.

Figure: Bivariate normal density (image source: https://online.stat.psu.edu/)

One common way of displaying information about a 3D surface using only two dimensions is to use level curves or contour lines. In the next example we will use contour plots to illustrate bivariate KDEs. Contour plots represent data for three variables in two dimensions. Each contour line is drawn in the xy-plane by varying the x and y values while keeping the third variable constant. That means each line joins points that have the same density. Let's now plot a bivariate KDE by passing the two arrays to the kdeplot() function, which specify the x-coordinates and y-coordinates of the points to be plotted. The code below passes them as data and data2, which is the older interface; seaborn 0.11 and later expects them as x and y instead.

z = np.random.randint(3,19,50)

z
array([ 6,  9, 13,  5,  7,  3,  5,  8,  3,  7,  8, 16,  3,  8, 16, 18, 14,
       18,  5,  6,  7, 10, 17, 17, 15,  8, 12,  5, 14,  7,  6, 14,  5,  5,
        6,  4, 13, 18,  5, 16,  9, 11,  9,  4,  9,  3, 16, 17, 15, 14])
sns.kdeplot(data=y,data2=z)
plt.show()


Here is a two dimensional Kernel Density Estimate shown using contour plot. It displays the joint distributions of random variables ‘y’ and ‘z’. From the plot it is not clear as to which regions have high density or which regions have low density. So let us customize the density plot by color coding the contour lines.

Customizing the Bivariate density plot

sns.kdeplot(data=y,data2=z,cmap='Greens',n_levels=20)
plt.show()


Contour plots can use a colormap to color the different levels. We have applied “Greens” colormap to the above plot. “Greens” is a sequential colormap which is best to visualize numeric data that progresses from low to high by gradually increasing darkness and saturation of the color. By default, darker colors represent higher density values.

We can also specify the number of levels that we want to see in the contour plot. If there are too few levels in the map, important details may be lost, while too many levels make the plot look cluttered.

sns.kdeplot(data=y,data2=z,cmap='Blues',shade=True)
plt.show()


By setting the parameter ‘shade’ to True, the area between adjacent contour lines can be filled with varying shades of a color. The darker the shade of the color, the higher is the density.

sns.kdeplot(data=y,data2=z,cmap='Blues',cbar=True,shade=True)
plt.show()


You can add a vertical colorbar to the right side of the plot by setting the parameter ‘cbar’ to True.

Modules vs Packages vs Libraries vs Frameworks


Python Module

A Python module is just a Python file with a .py extension. It can contain variables or functions, elements that can be referred to from another Python file using an import.

For example, let’s create a simple python file – user.py in a directory.

It just has a variable name and a function nameCapitalize(name). These can be referenced in another file using the standard import process that we already know.

Say, we create another file test.py and then import user.py using

import user


That’s it, the variable name inside the user.py module can be referenced using the familiar dot (.) notation.

user.name


You can even reference the functions inside the referenced module.

user.nameCapitalize(user.name)


Python Package

A Python package is just a collection of Python files (or modules). Earlier, we were able to import a specific Python file. Now, suppose we have created a folder called customers and, inside it, another Python file (or module) called customer.py. This module can then be imported using the dotted name customers.customer.

You can have any levels of hierarchy inside a package using folders for nesting python modules.
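The difference between importing a module and importing a module from a package can be tried out in a throwaway directory. The file contents below are hypothetical, since the bodies of user.py and customer.py are not listed here; only the file, variable and function names come from the text.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a small project in a temporary directory (contents are made up)
project = Path(tempfile.mkdtemp())

(project / "user.py").write_text(
    'name = "john"\n'
    "\n"
    "def nameCapitalize(name):\n"
    "    return name.capitalize()\n"
)

package_dir = project / "customers"
package_dir.mkdir()
(package_dir / "__init__.py").write_text("")   # makes the folder a regular package
(package_dir / "customer.py").write_text(
    "def greet(who):\n"
    "    return 'Hello ' + who\n"
)

sys.path.insert(0, str(project))               # make the project importable
importlib.invalidate_caches()

import user                                    # a module
from customers import customer                 # a module inside a package

print(user.name)
print(user.nameCapitalize(user.name))
print(customer.greet(user.name))
```

The function greet() is an invented placeholder; any function defined in customer.py would be reached the same way.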

Python Library

A Python library is a collection of Python packages. The Python standard library, for example, is a collection of Python packages. Theoretically, there is little difference between a Python module and a Python package, because Python treats a package as a module that can contain other modules. Typically, a Python library is a collection of modules or packages. The requests library is one example.

Python Framework

A framework is a collection of python libraries. Some examples of popular python frameworks are

  • django – used to build web apps
  • flask – used to build APIs
  • FastAPI – used to build APIs

Heatmap


What is a Heatmap?

A Heatmap is a graphical representation of data which represents data values using colors. Heat maps make it easy to visualize complex data and are useful for identifying patterns and areas of concentration. They are used to show the relation between two variables, plotted on x and y axis. The variables plotted on each axis can be of any type, categorical labels or numerical values. When plotting categorical variables on the axis, it is good to properly order the variables based on some criteria in order to reveal the patterns in the data.

We can use categorical color palettes to represent categorical data, while numerical data requires a colour scale that blends from one colour to another, in order to represent high and low values.

from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns

x = np.random.rand(5,5)
sample_data = pd.DataFrame(x)
sample_data
0	1	2	3	4
0	0.104282	0.520084	0.586270	0.421520	0.924707
1	0.264854	0.518364	0.965355	0.545920	0.975186
2	0.546176	0.509287	0.142528	0.992683	0.136078
3	0.349071	0.260509	0.267000	0.135422	0.025028
4	0.845259	0.869380	0.178070	0.917876	0.344369

Consider the DataFrame – sample_data which consists of data in a tabular format. The above table has 5 rows and 5 columns. Let us visualize this data using a heatmap, that means we have to transform the values in this DataFrame into different colors. Seaborn provides the heatmap() function for this purpose.

sns.heatmap(data=sample_data)
plt.xlabel('column names-->')
plt.ylabel('index values-->')
plt.title('Heatmap with random values')
plt.show()


The Pandas DataFrame is passed to the heatmap() function and the plot is displayed using the Matplotlib show() function.

The heatmap generated above represents the numerical values in sample_data with different colors. The DataFrame index values are used as y-tick labels and the column names are used as x-tick labels. This heatmap uses dark colors to display low values and light colors to display high values.

A color bar which represents the relationship between colors and values is displayed to the right hand side of the figure. The color bar has ticks positioned at 0.2, 0.4, 0.6, 0.8. The tick positions are calculated depending on the minimum and maximum data values in the input data. The colormap used here has dark colors at one extreme and light colors at the other extreme.

Let us look into another example, by loading the built-in ‘flights’ dataset.

flights_dataset = sns.load_dataset('flights')
flights_dataset.head(15)
year	month	passengers
0	1949	January	112
1	1949	February	118
2	1949	March	132
3	1949	April	129
4	1949	May	121
5	1949	June	135
6	1949	July	148
7	1949	August	148
8	1949	September	136
9	1949	October	119
10	1949	November	104
11	1949	December	118
12	1950	January	115
13	1950	February	126
14	1950	March	141

Data in the ‘flights_dataset’ DataFrame is in long form. We will reorganise the data and convert it to wide form using the Pandas pivot_table() function.

flights= pd.pivot_table(data=flights_dataset,index='month',columns='year',values='passengers')
flights
year	1949	1950	1951	1952	1953	1954	1955	1956	1957	1958	1959	1960
month												
January	112	115	145	171	196	204	242	284	315	340	360	417
February	118	126	150	180	196	188	233	277	301	318	342	391
March	132	141	178	193	236	235	267	317	356	362	406	419
April	129	135	163	181	235	227	269	313	348	348	396	461
May	121	125	172	183	229	234	270	318	355	363	420	472
June	135	149	178	218	243	264	315	374	422	435	472	535
July	148	170	199	230	264	302	364	413	465	491	548	622
August	148	170	199	242	272	293	347	405	467	505	559	606
September	136	158	184	209	237	259	312	355	404	404	463	508
October	119	133	162	191	211	229	274	306	347	359	407	461
November	104	114	146	172	180	203	237	271	305	310	362	390
December	118	140	166	194	201	229	278	306	336	337	405	432

The above data is now in a format that is useful for our analysis. We will now plot a heatmap.
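The long-to-wide conversion is easier to follow on a tiny table. The sketch below copies four rows from the listing above into a hand-built DataFrame. Because ‘month’ is a plain string column here, the months come out in alphabetical order rather than calendar order.

```python
import pandas as pd

# Long form: one row per (year, month) observation
long_form = pd.DataFrame({
    "year": [1949, 1949, 1950, 1950],
    "month": ["January", "February", "January", "February"],
    "passengers": [112, 118, 115, 126],
})

# Wide form: one row per month, one column per year
wide_form = pd.pivot_table(long_form, index="month", columns="year", values="passengers")
print(wide_form)
```

Each (month, year) pair now addresses exactly one cell, which is the layout heatmap() expects.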

sns.heatmap(data=flights)
plt.title('flights data')
plt.xlabel('year')
plt.ylabel('month')
plt.show()

The heatmap above has transformed the numerical values in the ‘flights’ DataFrame to different colors. The index values(‘month’) of the ‘flights’ DataFrame are used as y-tick labels and column names(‘year’) are used as x-tick labels. The heatmap uses dark colors to display low values and light colors to display high values.

In this example, the heatmap is generated using the default parameter values. Let us customize the appearance of the heatmap by changing the default settings.

Customize your heatmap

Example 1

sns.heatmap(data=flights,vmin=100,vmax=630,annot=True,fmt='d',linewidth=0.3,cbar=True)
plt.title('flights data')
plt.xlabel('year')
plt.ylabel('month')
plt.show()


In the above heatmap, we have set the lower and upper bounds for the color bar, displayed numerical values in each cell and added borders to the cells. These parameters are defined below:

Parameters:

  • vmin, vmax : values that set the colorbar range; vmin is the lower bound and vmax is the upper bound for color scaling. If no value is specified, the limits are inferred from the minimum and maximum values in the input data.
  • annot : if True, the data value of each cell is displayed in the cell.
  • fmt : string formatting code to use when adding annotations.
  • linewidth : width of the lines that separate the cells of the heatmap.
  • cbar : whether to draw the colorbar; the default value is True.
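The role of vmin and vmax can be pictured as a linear rescaling of the data onto the 0 to 1 range of the colormap. The following is a conceptual sketch of that idea in NumPy, not Seaborn's own code, and the function name is made up.

```python
import numpy as np

def to_color_position(values, vmin, vmax):
    """Map data values to positions between 0 and 1 along a colormap."""
    scaled = (np.asarray(values, dtype=float) - vmin) / (vmax - vmin)
    return np.clip(scaled, 0.0, 1.0)   # values outside [vmin, vmax] saturate

# Smallest, a middle and the largest value from the flights table
passengers = np.array([104, 364, 622])

print(to_color_position(passengers, vmin=100, vmax=630))
print(to_color_position(passengers, vmin=200, vmax=400))
```

With the narrower range, the smallest and largest values both land on the ends of the colormap, so differences among very low or very high values are no longer visible.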

Example 2

sns.heatmap(data=flights,center=flights.loc['July',1957])
plt.title('flights data')
plt.xlabel('year')
plt.ylabel('month')
plt.show()


The above heatmap is plotted by setting the parameter ‘center’ to a numerical value and this value is used as the center of the colormap when plotting data.

In this example, the value in the cell (July, 1957) is the center of the colormap. If you look carefully, this value acts as the new midpoint of the data. The cells to the left of the midpoint are shown with colors that gradually go from blue to light green, and the cells to the right of the midpoint are shown with colors that gradually go from black to red. In other words, a diverging color scheme is applied to the heatmap.

Example 3

sns.heatmap(data=flights,center=flights.loc['July',1960])
plt.title('flights data')
plt.xlabel('year')
plt.ylabel('month')
plt.show()


In the above heatmap, the center of the colormap is set to the value in the cell (July, 1960). This heatmap uses light colors to display low values and dark colors to display high values. As you go from the left to the right edge of the heatmap, the colors change from light to dark shades, which clearly shows an increase in the number of passengers with each year. The number of passengers is also higher in July and August than in the other months of any given year. This is evident from the cells for July and August, which are darker than the cells above and below them.

Example 4

sns.heatmap(data=sample_data,square=True,xticklabels=2, yticklabels=2,cbar_kws={'orientation':'horizontal'})
plt.show()


By passing True to the parameter ‘square’ you can set the shape of the cells to a square.

By default, the heatmap takes the x-tick labels from the column names and the y-tick labels from the index of the DataFrame. This can be changed using the parameters xticklabels and yticklabels. These two parameters can take any of the following values: “auto”, bool, list, int.

You can specify whether the colorbar is displayed horizontally or vertically by using the color bar keyword arguments(cbar_kws).

Example 5 – Masking

Say, for example, you want to display only the cells above or below the diagonal of the heatmap. This can be achieved using masking. Let us display the cells above the diagonal in the heatmap using the input DataFrame sample_data.

np.triu() is a function in NumPy that returns the upper triangle of any array passed to it, while np.tril() returns the lower triangle of any array passed to it. Let's pass the DataFrame sample_data to np.tril().

lower_triangle  = np.tril(sample_data)
lower_triangle
array([[0.10428215, 0.        , 0.        , 0.        , 0.        ],
       [0.26485421, 0.51836401, 0.        , 0.        , 0.        ],
       [0.54617601, 0.50928675, 0.14252839, 0.        , 0.        ],
       [0.34907137, 0.26050913, 0.26700028, 0.13542169, 0.        ],
       [0.84525918, 0.86937953, 0.17806983, 0.91787589, 0.34436948]])

lower_triangle is the new array formed by extracting the lower (tril) triangular part of sample_data, and setting all other elements to zero.

sns.heatmap(data=sample_data,mask=lower_triangle)
plt.show()


The parameter ‘mask’ accepts an array or DataFrame. To the heatmap() function above we have passed sample_data as the data and lower_triangle as the mask. Cells where the mask is non-zero (treated as True) are hidden. Values are plotted only for cells where ‘mask’ is False, that is, where the mask value is 0.
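Building the mask from the data values has one pitfall: a genuine 0 in the data is read as "not masked". A mask built from cell positions avoids this. The sketch below uses random data with one zero planted below the diagonal; it is a variation on the approach above, not the code used in this example.

```python
import numpy as np

rng = np.random.default_rng(3)        # arbitrary seed
data = rng.random((5, 5))
data[3, 1] = 0.0                      # a genuine zero below the diagonal

# Mask derived from the data values: the zero at [3, 1] ends up unmasked
mask_from_values = np.tril(data).astype(bool)

# Mask derived from positions only: True on and below the diagonal
mask_from_positions = np.tril(np.ones_like(data, dtype=bool))

print(mask_from_values[3, 1], mask_from_positions[3, 1])
```

Either array can be passed as mask= to heatmap(). Using np.tril(..., k=-1) instead would leave the diagonal cells visible.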

Below is another example, which displays the cells below the diagonal.

upper_triangle = np.triu(sample_data)
upper_triangle
array([[0.10428215, 0.52008394, 0.58626989, 0.42152029, 0.92470701],
       [0.        , 0.51836401, 0.96535456, 0.54591957, 0.97518597],
       [0.        , 0.        , 0.14252839, 0.99268308, 0.13607794],
       [0.        , 0.        , 0.        , 0.13542169, 0.02502764],
       [0.        , 0.        , 0.        , 0.        , 0.34436948]])
sns.heatmap(data=sample_data,mask=upper_triangle)
plt.show()

Read and plot data from a csv file

The csv file contains data on the total number of road accidents and their time of occurrence in the different Indian states and union territories in 2017. The file used in this example is available at this url: https://data.gov.in/resources/stateut-wise-road-accidents-time-occurance-during-2017

Let us read the data in the csv file into a DataFrame using the Pandas read_csv() function.

input_file = pd.read_csv(r'C:\Users\Ajay Tech\Documents\training\visualization\Data\Road_Accidents_2017.csv')

input_file.tail()
States/UTs	06-09hrs (Day)	09-12hrs (Day)	12-15hrs (Day)	15-18hrs (Day)	18-21hrs (Night)	21-24hrs (Night)	00-03hrs (Night)	03-06hrs (Night)	Total Accidents
32	Daman & Diu	5	7	15	16	17	13	6	0	79
33	Delhi	747	858	828	807	1008	1159	714	552	6673
34	Lakshadweep	0	0	0	1	0	0	0	0	1
35	Puducherry	250	256	250	257	257	216	118	89	1693
36	Total	51551	71426	71594	82456	85686	49567	25050	27580	464910

The last row and column in the Dataframe contain the sum of all the numerical values in each row and column respectively. This is actually not required for our analysis, so let us delete the last row and column.

input_file.drop(36, axis=0,inplace=True)
input_file.drop('Total Accidents', axis=1,inplace=True)
input_file.tail()

States/UTs	06-09hrs (Day)	09-12hrs (Day)	12-15hrs (Day)	15-18hrs (Day)	18-21hrs (Night)	21-24hrs (Night)	00-03hrs (Night)	03-06hrs (Night)
31	D & N Haveli	7	6	12	14	14	5	5	4
32	Daman & Diu	5	7	15	16	17	13	6	0
33	Delhi	747	858	828	807	1008	1159	714	552
34	Lakshadweep	0	0	0	1	0	0	0	0
35	Puducherry	250	256	250	257	257	216	118	89

Next we will reorganize the data in the DataFrame into the format required for our analysis using the pivot_table() function.

input_data = input_file.pivot_table(index='States/UTs')
input_data.head()
00-03hrs (Night)	03-06hrs (Night)	06-09hrs (Day)	09-12hrs (Day)	12-15hrs (Day)	15-18hrs (Day)	18-21hrs (Night)	21-24hrs (Night)
States/UTs								
A & N Islands	3	7	20	27	33	41	39	19
Andhra Pradesh	1406	1648	2808	3581	3765	4484	5265	2770
Arunachal Pradesh	42	39	40	24	25	24	23	24
Assam	393	391	749	1206	1262	1340	1121	708
Bihar	495	755	1344	1550	1292	1462	1312	645
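Note that the time columns in this output are in a different order than in the csv file: pivot_table() sorts the resulting index and columns, and it aggregates with the mean by default. When every state appears only once, set_index() is an alternative that keeps the original order. A sketch on a hypothetical two-row miniature of the table:

```python
import pandas as pd

# Hypothetical miniature of the accidents table (names and values made up)
df = pd.DataFrame({
    "States/UTs": ["B-state", "A-state"],
    "06-09hrs (Day)": [20, 10],
    "00-03hrs (Night)": [3, 1],
})

via_pivot = df.pivot_table(index="States/UTs")   # sorts the index and the columns
via_index = df.set_index("States/UTs")           # keeps the original order

print(list(via_pivot.columns))
print(list(via_index.columns))
```

For the heatmap in this example the sorted order is convenient, because the time intervals then run from 00-03 to 21-24.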

Let us generate a heatmap for the first 20 rows in the data.

sns.heatmap(data = input_data.head(20),cmap='YlOrRd',linewidths=0.3,xticklabels=['00-03','03-06','06-09','09-12','12-15','15-18','18-21','21-24'])
plt.xticks(rotation=20)
plt.xlabel('Time of occurrence(hrs)')
plt.title('No. of road accidents/time interval of day')
plt.show()

The heatmap uses light colors for low values and dark colors for high values. A few states in the heatmap show a higher number of accidents during the interval 9AM – 9PM. For example, for the last state in the heatmap (Madhya Pradesh), dark colors are used for the cells in this interval. For most states, the maximum number of road accidents occurs during the 6-9 PM interval, which can be inferred from the color of the cells in this interval.

Colors in Visualization


Data visualization is the graphical representation of data. One important step in the data visualization process is the mapping of numbers to colors. Choosing the right color combinations for this mapping is essential: charts whose colors go well with each other make the data easy to read and understand, and the viewer can also easily perform the reverse mapping from colors back to scalar values.

Basic Terms

Palette: In computer graphics, a palette is the set of available colors.

Hue: Hues refer to the set of pure colors within a color space. Hue defines pure colors as one of the six Primary and Secondary colors.

Saturation: Saturation refers to the intensity of color in an image, that is the strength or weakness of a color.

Lightness: Lightness of a color specifies how “bright” the color should be. 0% means the brightness is 0 and the color is black. 100% means maximum brightness and the color is white.

Color wheel: A color wheel is a visual organizational tool of color hues around a circle that help make the basic categories of color easier to understand. It shows the relationships between primary colors, secondary colors, and tertiary colors. It can be used as a point of reference when creating customized palettes.

Color model: Color models provide various methods to define colors, each model defining colors through the use of specific color components.

Source : http://warrenmars.com/visual_art/theory/colour_wheel/martian_colour_wheel_24_hue_r.png

color_palette() function

The color_palette function in Seaborn returns a list of colors defined in a color palette and these colors can be used in a plot. When the function is called without passing any arguments, it returns the current Matplotlib color cycle. The function can also be used in a with statement to temporarily set the color cycle for a plot or set of plots. The colors defined in a color palette can be displayed using the palplot() function, which plots the color palette in a horizontal array.

import seaborn as sns
current_palette = sns.color_palette()  ## The color_palette function returns a list of RGB tuples.
current_palette

[(0.4541176470588235, 0.7066666666666666, 0.6270588235294118),
 (0.8976470588235295, 0.5929411764705883, 0.4749019607843137),
 (0.5894117647058825, 0.6415686274509804, 0.7596078431372548),
 (0.8511764705882354, 0.5958823529411763, 0.7523529411764707),
 (0.632156862745098, 0.7694117647058825, 0.4070588235294117),
 (0.8776470588235292, 0.7733333333333332, 0.30666666666666687)]
sns.palplot(sns.color_palette(current_palette)) 

## palplot() function displays the colors in the current matplotlib color cycle.

The type of color palette to use in a visualization depends on the nature of the input data. Primarily three types of color palettes exist for data visualization:

  • Qualitative palettes
  • Sequential palettes
  • Diverging palettes

Let us discuss these palettes.

Qualitative color palettes

Qualitative or Categorical palettes can be used to represent categorical data, where there is no particular ordering of categories. Each color in the palette represents a distinct category. If the number of categories in the input data is more than the number of colors in the palette, the same set of colors will be looped over.

There are six variations of the default color cycle – deep, muted, pastel, bright, dark, and colorblind. These six palettes have varying degrees of saturation and luminance.

sns.palplot(sns.color_palette(palette='deep'))
sns.palplot(sns.color_palette(palette='muted'))

sns.palplot(sns.color_palette(palette='pastel'))


Circular color systems

hsl color model

HSL is short form for Hue, Saturation and Lightness. The HSL color model defines a given color according to its hue, saturation and lightness components.

The hls_palette() function in Seaborn can be used to create a palette with evenly spaced colors in HLS hue space. The values for the parameters h, l, and s range between 0 and 1. By default, 6 colors in the HLS color model are returned.

The hls_palette() function below returns evenly spaced colors beginning from hue=0 (red), keeping the saturation and lightness at 0.5.

sns.palplot(sns.hls_palette(n_colors=6, h=0, l=0.5, s=0.5))


The function below returns fully saturated colors by setting s=1. Note that it also changes the starting hue to h=0.5 and the lightness to l=0.3, so the colors are not the same set as above.

sns.palplot(sns.hls_palette(n_colors=6, h=0.5, l=0.3, s=1))

Look at the colors below. The lightness of these colors is 0.8.

sns.palplot(sns.hls_palette(n_colors=10, h=0.2, l=0.8, s=0.65))
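What "evenly spaced colors in HLS hue space" means can be sketched with Python's standard colorsys module: the hue is stepped around the circle in equal increments while lightness and saturation stay fixed. This illustrates the idea only and is not presented as Seaborn's implementation; the function name is made up.

```python
import colorsys

def evenly_spaced_hls(n_colors=6, h=0.0, l=0.5, s=0.5):
    """Return n_colors RGB tuples whose hues are evenly spaced around the circle."""
    palette = []
    for i in range(n_colors):
        hue = (h + i / n_colors) % 1.0            # wrap around the color wheel
        palette.append(colorsys.hls_to_rgb(hue, l, s))
    return palette

for rgb in evenly_spaced_hls():
    print(tuple(round(c, 2) for c in rgb))
```

The first tuple has its largest component in the red channel, matching the statement that hue 0 corresponds to red.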

husl color model

The HSLuv color space has perceptually uniform lightness, which means every color with the same lightness value is perceived as equally bright by humans. The HSLuv color space is a human-friendly alternative to the HSL color space. The husl_palette() function in Seaborn can be used to create a palette with evenly spaced colors in HUSL hue space. The values for the parameters h, l, and s range between 0 and 1. By default, 6 colors in the HUSL color model are returned.

In the example below, husl_palette() returns colors beginning from the default starting hue, with the saturation and lightness set to 0.6.

sns.palplot(sns.husl_palette(s=0.6,l=0.6)) 

Darker shades of these hues are displayed below by setting the lightness to 0.3 (the saturation is also raised slightly, to 0.7).

sns.palplot(sns.husl_palette(s=0.7,l=0.3))

Categorical Color Brewer palettes

There are many tools available online that help us to choose and create color palettes that best suit our needs. ColorBrewer is one such tool that offers a number of interesting color palettes of each type.

The choose_colorbrewer_palette() function in Seaborn helps us to choose palettes from the Color Brewer library. This function, which must be used in a Jupyter notebook, launches an interactive IPython widget to choose and customize a color palette. We need to pass the type of color palette that we want to visualize to the choose_colorbrewer_palette() function. Using the interactive widget, we can browse through the various palettes available and adjust the saturation parameter.

This function, by default, returns the list of colors defined in a color palette, but if you need a Matplotlib colormap that can be used with Matplotlib plotting functions you can set the argument as_cmap to True.

sns.choose_colorbrewer_palette(data_type='s')
[(0.9575547866205305, 0.9575547866205305, 0.9575547866205305),
 (0.9012072279892349, 0.9012072279892349, 0.9012072279892349),
 (0.8328950403690888, 0.8328950403690888, 0.8328950403690888),
 (0.7502191464821223, 0.7502191464821223, 0.7502191464821223),
 (0.6434140715109573, 0.6434140715109573, 0.6434140715109573),
 (0.5387158785082661, 0.5387158785082661, 0.5387158785082661),
 (0.440322952710496, 0.440322952710496, 0.440322952710496),
 (0.342883506343714, 0.342883506343714, 0.342883506343714),
 (0.22329873125720878, 0.22329873125720878, 0.22329873125720878),
 (0.10469819300269129, 0.10469819300269129, 0.10469819300269129)]
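Outside a notebook the widget is not available. As a non-interactive alternative, the same ColorBrewer palettes can be requested by name through color_palette(). This sketch assumes Seaborn 0.11 or later for the as_cmap argument of color_palette().

```python
import seaborn as sns

# Ten colors from the sequential 'Greys' ColorBrewer palette, as RGB tuples.
greys = sns.color_palette("Greys", n_colors=10)

# The same palette as a Matplotlib colormap object (assumes Seaborn >= 0.11).
greys_cmap = sns.color_palette("Greys", as_cmap=True)

print(len(greys))
print(greys_cmap(0.5))  # RGBA color at the midpoint of the colormap
```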

Sequential color palettes

Sequential color palettes are best suited to visualizing numeric data that progresses from low to high (or vice versa), shown with light to dark colors. You can use a sequential color palette with a single hue while changing the saturation and lightness. This draws attention to the data with larger values, which are shown in darker colors. Examples of single-hue color palettes are Blues, Greens, Greys and Oranges.

sns.palplot(sns.color_palette('Blues'))

You can also reverse the order of colors in the palette by adding a suffix ‘_r’ to the palette name.

sns.palplot(sns.color_palette('Blues_r'))

sns.palplot(sns.color_palette('Greens_r'))


If you want darker shades of these colors, you can add the suffix ‘_d’ to the palette name.

sns.palplot(sns.color_palette('Greens_d'))

Multi-hue sequential color palettes provide better color contrast, so the colors are easier to differentiate. When multiple hues are used, it is good to choose colors that are next to each other on the color wheel because they blend well together. The saturation and lightness should be adjusted accordingly.

sns.palplot(sns.color_palette('BuGn'))
sns.palplot(sns.color_palette('PuBuGn'))
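A quick way to confirm the ordering of a sequential palette is to compare its first and last colors. The sketch below uses the sum of the RGB channels as a crude lightness measure, which is an illustrative assumption rather than a Seaborn feature.

```python
import seaborn as sns

blues = sns.color_palette("Blues", 6)      # runs from light to dark
blues_r = sns.color_palette("Blues_r", 6)  # reversed: runs from dark to light


def brightness(rgb):
    """Crude lightness proxy: sum of the red, green and blue channels."""
    return sum(rgb)


light_first = brightness(blues[0]) > brightness(blues[-1])
dark_first = brightness(blues_r[0]) < brightness(blues_r[-1])
print(light_first, dark_first)
```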

Cubehelix palette

The Cubehelix system was designed to produce attractive palettes with a wide choice of hue, saturation and brightness. The brightness of the colors in such a palette changes monotonically, both as perceived by the human eye and when printed in black and white.

The cubehelix palettes are a form of sequential color palette in which the hue changes slightly while the brightness changes linearly.

As shown below, you can pass the palette name and the number of colors ‘n’ to the color_palette() function to display the palette.

sns.palplot(sns.color_palette("cubehelix", 8))


You can call the cubehelix_palette() function through the cubehelix interface provided by Seaborn. This function can be used to generate a sequential palette.

sns.palplot(sns.cubehelix_palette())


Parameters:

  • start – hue to start from
  • rot – number of rotations the helix makes
  • hue – saturation of the colors
  • gamma – gamma factor to emphasize low or high intensity values
  • light – intensity of the lightest color in the palette
  • dark – intensity of the darkest color in the palette
  • as_cmap – if True, return a Matplotlib colormap instead of a list of colors

sns.palplot(sns.cubehelix_palette(start=0,rot=2,hue=0.5,gamma=0.3))
sns.palplot(sns.cubehelix_palette(start=0.5,rot=3,hue=1,gamma=1,dark=1,light=0))


It is also possible to generate a cubehelix palette in seaborn using a string-shorthand as shown below:

sns.palplot(sns.color_palette("ch:1.5,r=.4,l=.6"))
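The idea that a cubehelix palette changes steadily in brightness can be checked by converting each color to a grayscale value. The following sketch uses the Rec. 601 luma weights as an assumed approximation of how the colors would print in black and white; the start and rot values are arbitrary examples.

```python
import seaborn as sns

palette = sns.cubehelix_palette(n_colors=8, start=0.5, rot=-0.75)

# Approximate grayscale value of each color (assumed Rec. 601 weights).
gray = [0.299 * r + 0.587 * g + 0.114 * b for r, g, b in palette]

for value in gray:
    print(f"{value:.2f}")
```

With the default light and dark settings the palette runs from light to dark, so the printed values should fall from the first color to the last.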

Custom sequential palettes

Seaborn comes with functions for building custom sequential palettes, such as light_palette and dark_palette. Using the light_palette function, you can generate a palette that blends from a light shade to a base color of your choice. You can use the interactive IPython widget with the choose_light_palette() function to select light palettes.

sns.palplot(sns.light_palette(color='#800080',n_colors=7,reverse=False,input='rgb'))


Parameters:

  • color – HEX code or HTML color name of the chosen color
  • reverse – if True, the order of colors is reversed
  • as_cmap – if True, return a Matplotlib colormap instead of a list of colors
  • input – color space in which the color value is interpreted: {‘rgb’, ‘hls’, ‘husl’, ‘xkcd’}

Using the dark_palette function, you can generate a palette that blends from a dark shade to a base color of your choice. You can use the interactive IPython widget with the choose_dark_palette() function to select dark palettes.

sns.palplot(sns.dark_palette(color='#b60c26',n_colors=6,reverse=False,input='rgb'))
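As a small check of how these two functions blend toward the base color, the sketch below builds both palettes from the same base color and compares their end points. The color name 'purple' is just an example choice.

```python
import numpy as np
import seaborn as sns
from matplotlib.colors import to_rgb

base = "purple"
light = sns.light_palette(base, n_colors=5)  # near-white -> purple
dark = sns.dark_palette(base, n_colors=5)    # near-black -> purple

# Both palettes should end on the base color itself.
light_ends_on_base = np.allclose(light[-1], to_rgb(base), atol=0.01)
dark_ends_on_base = np.allclose(dark[-1], to_rgb(base), atol=0.01)
print(light_ends_on_base, dark_ends_on_base)

# The light palette starts from a brighter color than the dark palette.
print(sum(light[0]) > sum(dark[0]))
```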


Diverging color palettes

Diverging color schemes are best used to highlight both high and low extreme values. This color scheme is usually used for data that has a natural mid-point such as a zero. A diverging scheme shows all values below the mid-point as a sequential color scheme using one hue, and all values above the mid-point in a sequential scheme using a different hue. The hue at one extreme gradually tones down to a neutral color at the midpoint, and then the saturation of the second hue gradually increases to the other extreme.

Below are a few diverging palettes.

sns.palplot(sns.color_palette('RdBu',7))
sns.palplot(sns.color_palette('BrBG',7))


Custom diverging palettes

You can generate custom diverging palettes using the diverging_palette() function.

sns.palplot(sns.diverging_palette(h_neg=260,h_pos=350,n=7,s=75,l=50,sep=10,center='light'))


Parameters:

  • h_neg, h_pos – hues for the left and right extremes of the palette, in the range 0-359
  • n – number of colors in the palette
  • s – saturation for both extremes
  • l – lightness for both extremes
  • sep – size of the intermediate region
  • center – whether the center of the palette is ‘light’ or ‘dark’
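Diverging palettes are typically paired with data centered on a midpoint. The sketch below, which uses made-up random data, draws a heatmap whose colormap comes from diverging_palette() and pins the neutral color to zero with the center argument of heatmap().

```python
import numpy as np
import seaborn as sns

rng = np.random.default_rng(0)
values = rng.normal(loc=0.0, scale=1.0, size=(6, 6))  # synthetic data around 0

# as_cmap=True returns a Matplotlib colormap instead of a list of colors.
cmap = sns.diverging_palette(h_neg=260, h_pos=350, s=75, l=50, as_cmap=True)

# center=0 maps the neutral middle of the palette to the value zero.
ax = sns.heatmap(values, cmap=cmap, center=0)
ax.set_title("Diverging palette centered at 0")
```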

The set_palette function is used to set the matplotlib color cycle using a seaborn palette.

sns.set_palette(palette=sns.color_palette('Set2'), n_colors=6, desat=0.7, color_codes=False)

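from matplotlib import pyplot as plt
import seaborn as sns
tips = sns.load_dataset('tips')
# Pass the columns by keyword; recent Seaborn versions no longer accept them positionally.
sns.barplot(x='sex', y='tip', data=tips)
plt.show()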
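set_palette() changes the color cycle for all subsequent plots. When the change should apply to only a few plots, color_palette() can be used in a with statement instead. The sketch below inspects Matplotlib's axes.prop_cycle setting to show that the change is temporary; the helper current_cycle() is a hypothetical name introduced for this example.

```python
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.colors import to_hex


def current_cycle():
    """Return the active Matplotlib color cycle as hex strings."""
    colors = plt.rcParams["axes.prop_cycle"].by_key()["color"]
    return [to_hex(c) for c in colors]


before = current_cycle()
with sns.color_palette("Set2", 3):
    inside = current_cycle()  # three 'Set2' colors while the block is active
after = current_cycle()

print(inside)
print(before == after)  # the original cycle is restored after the block
```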


Plot Styling and Scaling of Elements

Plot styling

Seaborn splits Matplotlib parameters into two independent groups: the first group sets the aesthetic style of the plot, and the second group scales various elements of the figure.

Let us first see how to customize the look and style of the plots. Seaborn has five built-in themes to style its plots: darkgrid, whitegrid, dark, white, and ticks. darkgrid is the default theme applied by the set() function, but we can change this style to suit our requirements.

We can customize style elements such as the background color, the color of tick marks, the text color and the font type using the functions axes_style() and set_style(). Both functions take the same set of arguments.

The axes_style() function returns a dictionary of rc parameters related to the styling of the plots. The returned object can also be used in a with statement to temporarily change the style parameters.

The set_style() function sets the aesthetic style of the plots. The rc parameters can be customized using this function.

from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns


Below we see the rc parameters returned by the axes_style function when it is called without arguments, which are the values currently in effect. These settings can be changed to fine-tune the look of the plots.

sns.axes_style()

{'axes.facecolor': 'white',
 'axes.edgecolor': 'black',
 'axes.grid': False,
 'axes.axisbelow': 'line',
 'axes.labelcolor': 'black',
 'figure.facecolor': (1, 1, 1, 0),
 'grid.color': '#b0b0b0',
 'grid.linestyle': '-',
 'text.color': 'black',
 'xtick.color': 'black',
 'ytick.color': 'black',
 'xtick.direction': 'out',
 'ytick.direction': 'out',
 'lines.solid_capstyle': 'projecting',
 'patch.edgecolor': 'black',
 'image.cmap': 'viridis',
 'font.family': ['sans-serif'],
 'font.sans-serif': ['DejaVu Sans',
  'Bitstream Vera Sans',
  'Computer Modern Sans Serif',
  'Lucida Grande',
  'Verdana',
  'Geneva',
  'Lucid',
  'Arial',
  'Helvetica',
  'Avant Garde',
  'sans-serif'],
 'patch.force_edgecolor': False,
 'xtick.bottom': True,
 'xtick.top': False,
 'ytick.left': True,
 'ytick.right': False,
 'axes.spines.left': True,
 'axes.spines.bottom': True,
 'axes.spines.right': True,
 'axes.spines.top': True}
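Passing a theme name to axes_style() returns the parameter values for that theme without applying them, which makes it easy to see what a theme would change. A small sketch that prints two of the parameters for each built-in theme:

```python
import seaborn as sns

themes = ["darkgrid", "whitegrid", "dark", "white", "ticks"]
for name in themes:
    style = sns.axes_style(name)
    # Print two parameters that differ between the themes.
    print(f"{name:10s} facecolor={style['axes.facecolor']!s:8s} grid={style['axes.grid']}")
```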

Scatter Plot

A scatterplot is used to display the correlation between two numerical variables. The values of both the variables are displayed with dots. Each dot on the scatterplot represents one observation from the data set.

We will draw a scatter plot that uses the ‘tips’ dataset to show the tip received against the total_bill, both of which are quantitative variables. The scatter plot is rendered in the darkgrid theme, which is Seaborn’s default style.

tips = sns.load_dataset('tips')
sns.set_style('darkgrid')  # apply the default Seaborn theme explicitly
sns.scatterplot(x='total_bill', y='tip', data=tips)
plt.title('Default theme - Darkgrid')
plt.show()

As you can see, the default theme has a light grey background with white gridlines.

Let’s look at another example. In the code below, we pass the ‘ticks’ theme to the style parameter of set_style(). The ‘ticks’ theme allows the colors of the dataset to show more visibly. Apart from this, we also change the axes edge color, text color and tick color of the plot by overriding the default rc parameter values.

sns.set_style(style='ticks',rc={'axes.edgecolor': 'b','text.color': 'r','xtick.color': 'r','ytick.color': 'r'})
sns.scatterplot(x='total_bill', y='tip', data=tips)
plt.title('Ticks theme')
plt.show()


We can temporarily change the style parameters of a plot by using the axes_style function in a with statement as shown in the example below.

with sns.axes_style(style='whitegrid',rc={'font.family': 'serif','font.serif':'Times'}):
    sns.scatterplot(x='total_bill', y='tip', data=tips)
plt.title('Whitegrid theme')
plt.show()
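Because style parameters take effect when an axes is created, a with block also makes it possible to mix themes within one figure. The following sketch uses synthetic data (not the tips dataset) and creates each subplot inside its own with block.

```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

rng = np.random.default_rng(1)
x = rng.uniform(5, 50, size=40)           # synthetic bill amounts
y = 0.15 * x + rng.normal(0, 1, size=40)  # synthetic tips

fig = plt.figure(figsize=(8, 3))
axes = []
for position, theme in enumerate(["darkgrid", "white"], start=1):
    # The theme applies to the axes created inside the with block.
    with sns.axes_style(theme):
        ax = fig.add_subplot(1, 2, position)
    sns.scatterplot(x=x, y=y, ax=ax)
    ax.set_title(theme)
    axes.append(ax)
# plt.show()  # uncomment when running as a script
```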


Scaling of plot elements

Next we will see how to scale the various elements in the plot. Seaborn has four preset contexts which set the size of the plot elements and allow us to customize the plot depending on how it will be presented. The four preset contexts, in order of relative size, are paper, notebook, talk and poster. The notebook context is the default, which can be changed depending on our requirements.

We can customize the size of plot elements such as labels, ticks, markers and line widths using the functions plotting_context() and set_context(). Both functions take the same set of arguments.

The plotting_context() function returns a dictionary of rc parameters related to plot elements such as label size, tick size and marker size. The returned object can also be used in a with statement to temporarily change the context parameters.

The set_context() function is used to set the plotting context parameters.

Below we see the rc parameters returned by the plotting_context function. The default settings of the parameters can be changed to scale plot elements.

sns.plotting_context()
{'font.size': 12.0,
 'axes.labelsize': 12.0,
 'axes.titlesize': 12.0,
 'xtick.labelsize': 11.0,
 'ytick.labelsize': 11.0,
 'legend.fontsize': 11.0,
 'axes.linewidth': 1.25,
 'grid.linewidth': 1.0,
 'lines.linewidth': 1.5,
 'lines.markersize': 6.0,
 'patch.linewidth': 1.0,
 'xtick.major.width': 1.25,
 'ytick.major.width': 1.25,
 'xtick.minor.width': 1.0,
 'ytick.minor.width': 1.0,
 'xtick.major.size': 6.0,
 'ytick.major.size': 6.0,
 'xtick.minor.size': 4.0,
 'ytick.minor.size': 4.0}
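To compare the four preset contexts, the sketch below looks up two sizes for each one. It also shows the font_scale argument of plotting_context(), which scales the font-related sizes independently of the other elements.

```python
import seaborn as sns

contexts = ["paper", "notebook", "talk", "poster"]
for name in contexts:
    ctx = sns.plotting_context(name)
    print(f"{name:9s} labelsize={ctx['axes.labelsize']:5.1f} "
          f"linewidth={ctx['lines.linewidth']:4.2f}")

# font_scale multiplies the font sizes of the chosen context.
scaled = sns.plotting_context("notebook", font_scale=1.5)
print(scaled["axes.labelsize"])
```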

The scatter plot below uses the ‘tips’ dataset to show the tip received against the total_bill. The scatter plot is rendered in the notebook context, which is the default context.

sns.set()
sns.scatterplot(x='total_bill', y='tip', data=tips)
plt.title('Notebook context')
plt.show()


In the example below, we pass ‘talk’ to the context parameter. Apart from this, we also change the label size, title size and grid line width of the plot by overriding the default rc parameter values.

sns.set_context(context='talk',rc={'axes.labelsize': 20.0,'axes.titlesize': 20.0,'grid.linewidth': 2.5})
sns.scatterplot(x='total_bill', y='tip', data=tips)
plt.title('Talk context')
plt.show()

We can temporarily change the context parameters of a plot by using the plotting_context function in a with statement as shown in the example below.

with sns.plotting_context(context='poster'):
    sns.scatterplot(x='total_bill', y='tip', data=tips)
plt.title('Poster context')
plt.show()


set() Function

The set() function in Seaborn sets the style and context parameters in one step, as shown in the example below.

sns.set(style='white',context='paper',rc={'axes.edgecolor': 'b','axes.titlesize': 20.0})
sns.scatterplot(x='total_bill', y='tip', data=tips)
plt.title('Paper context')
plt.show()


If you want to switch back to the Seaborn default settings, call the set() function without passing any arguments.

Default settings:

  • context=‘notebook’
  • style=‘darkgrid’
  • palette=‘deep’
  • font=‘sans-serif’
  • font_scale=1
  • color_codes=True

sns.set()
sns.scatterplot(x='total_bill', y='tip', data=tips)
plt.title('Plot with default settings')
plt.show()
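The font_scale and rc arguments of set() can be combined in one call. This sketch checks the effect by reading back Matplotlib's rcParams; the chosen values are arbitrary examples.

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Whitegrid theme, 'talk' context, fonts enlarged by 20%, thinner grid lines.
sns.set(style="whitegrid", context="talk", font_scale=1.2,
        rc={"grid.linewidth": 0.5})

grid_on = plt.rcParams["axes.grid"]          # grid switched on by the theme
grid_width = plt.rcParams["grid.linewidth"]  # value overridden through rc
label_size = plt.rcParams["axes.labelsize"]  # context size times font_scale
print(grid_on, grid_width, label_size)

sns.set()  # back to the Seaborn defaults
```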

despine() Function

Spines are the borders on the sides of a graph or plot. By default, a plot has four spines. The despine() function can be used to remove spines from the plot; by default, it removes the top and right spines.

sns.set(style='ticks')
sns.scatterplot(x='total_bill', y='tip', data=tips)
plt.title('Plot with four spines/borders')
plt.show()

sns.set(style='ticks')
sns.scatterplot(x='total_bill', y='tip', data=tips)
sns.despine()
plt.title('Plot with top and right spines removed')
plt.show()


You can choose to remove all the spines if you think they are unnecessary and distracting, as in the example below.

sns.set(style='white')
sns.scatterplot(x='total_bill', y='tip', data=tips)
sns.despine(left=True,bottom=True)
plt.title('Plot with all spines removed')
plt.show()
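despine() also accepts an ax argument to target a specific axes, plus offset and trim arguments that move the remaining spines away from the data and shorten them to the tick range. A sketch with synthetic data:

```python
import matplotlib.pyplot as plt
import numpy as np
import seaborn as sns

rng = np.random.default_rng(2)
x = rng.uniform(0, 10, size=30)  # synthetic data
y = rng.uniform(0, 5, size=30)

with sns.axes_style("ticks"):
    fig, ax = plt.subplots()
sns.scatterplot(x=x, y=y, ax=ax)

# Remove the top/right spines, push the others 10 points outward and trim them.
sns.despine(ax=ax, offset=10, trim=True)

visible = {side: spine.get_visible() for side, spine in ax.spines.items()}
print(visible)
# plt.show()  # uncomment when running as a script
```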

Introduction to Seaborn

Seaborn is a data visualization library that provides a high-level interface for drawing statistical graphics. It is built on top of Python’s core visualization library, Matplotlib, and extends it for creating aesthetically pleasing graphs. Internally, Seaborn uses Matplotlib to draw plots, so it complements the Matplotlib library but is not a replacement for it. Matplotlib is highly customizable, but it is hard to know which settings to tweak to render nice plots. Seaborn comes with a number of customized themes and a high-level interface for controlling the look of Matplotlib figures.

Seaborn comes with preset styles and color palettes which can be used to create aesthetically pleasing charts with a few lines of code. It is closely integrated with the Pandas and NumPy libraries.

Below are the dependencies of the Seaborn library:

  • Python 3.6+
  • numpy (>= 1.13.3)
  • scipy (>= 1.0.1)
  • pandas (>= 0.22.0)
  • matplotlib (>= 2.1.2)

Once the required dependencies are installed, you are ready to install and use Seaborn.

The latest version of Seaborn can be installed using pip with the command: pip install seaborn

You can also install Seaborn from the Anaconda prompt with the command: conda install seaborn

Seaborn is closely integrated with Pandas data structures. The Pandas library has two primary containers of data – DataFrame and Series.

DataFrames – A DataFrame is a collection of data arranged in rows and columns, similar to an Excel spreadsheet. It is a two-dimensional structure with two axes, the “index” axis and the “columns” axis.

Series – Series is a single column of the DataFrame. So a Pandas DataFrame is a collection of Series objects.
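The relationship between the two containers can be seen in a few lines. The column names and values below are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01],
    "tip": [1.01, 1.66, 3.50],
})

tip = df["tip"]            # selecting one column gives a Series
print(type(df).__name__)   # DataFrame
print(type(tip).__name__)  # Series
print(df.shape)            # (number of rows, number of columns)
```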

Basic Terms

Quantitative and Qualitative variables

In statistics two types of variables are used: Quantitative and Qualitative variables.

Quantitative: Quantitative variables are numerical values representing counts or measures. Examples: temperature, counts, percentages, weight. Quantitative variables are of two types, discrete and continuous:

Discrete variables are numeric variables that have a countable number of values between any two values. A discrete variable is always numeric.

Continuous variables are numeric variables that have an infinite number of values between any two values.

Qualitative: Qualitative variables are variables that can be placed into distinct categories according to some attributes or characteristics. They contain a finite number of categories or distinct groups. Examples: Gender, eye color.

Univariate and Bivariate Data

Statistical data are classified according to the number of variables being studied.

Univariate data: This type of data consists of only one variable. The variable is studied individually and we don’t look at more than one variable at a time.

Bivariate data: This type of data involves two different variables, where the two variables are studied to explore the relationship or association between them.
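In plotting terms, a univariate view needs one column and a bivariate view needs two. The sketch below uses a small hand-made DataFrame and two Seaborn functions, boxplot() for one variable and scatterplot() for two; the choice of these plot types is only an example.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

df = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01, 23.68, 24.59, 25.29],
    "tip": [1.01, 1.66, 3.50, 3.31, 3.61, 4.71],
})

fig, (ax_uni, ax_bi) = plt.subplots(1, 2, figsize=(8, 3))

# Univariate: the distribution of a single variable.
sns.boxplot(x=df["total_bill"], ax=ax_uni)
ax_uni.set_title("Univariate: total_bill")

# Bivariate: the relationship between two variables.
sns.scatterplot(x="total_bill", y="tip", data=df, ax=ax_bi)
ax_bi.set_title("Bivariate: total_bill vs tip")
# plt.show()  # uncomment when running as a script
```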

Loading Datasets

Seaborn provides access to a few example datasets that can be used for practice. These datasets are not bundled with the library: the load_dataset() function downloads the requested dataset from an online repository, so an internet connection is needed the first time a dataset is loaded. The built-in datasets are loaded as Pandas DataFrames. Let us load the ‘tips’ dataset, which consists of the tips received by a waiter in a restaurant over a period of time.

from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns

tips = sns.load_dataset('tips')

tips.head(10)

	total_bill	tip	sex	smoker	day	time	size
0	16.99	1.01	Female	No	Sun	Dinner	2
1	10.34	1.66	Male	No	Sun	Dinner	3
2	21.01	3.50	Male	No	Sun	Dinner	3
3	23.68	3.31	Male	No	Sun	Dinner	2
4	24.59	3.61	Female	No	Sun	Dinner	4
5	25.29	4.71	Male	No	Sun	Dinner	4
6	8.77	2.00	Male	No	Sun	Dinner	2
7	26.88	3.12	Male	No	Sun	Dinner	4
8	15.04	1.96	Male	No	Sun	Dinner	2
9	14.78	3.23	Male	No	Sun	Dinner	2

Bar Plot

A Bar Plot is a visual tool that uses bars to compare data among different groups or categories. The graph represents categories on one axis and numeric values on the other. Let us draw a bar plot using the bar function defined in the pyplot module of the Matplotlib library, with input from the tips dataset.

# The Series are passed directly, so the data argument is not needed.
plt.bar(x=tips['smoker'], height=tips['tip'])
plt.xlabel('smoker')
plt.ylabel('tip')
plt.title('Tips/Smoker')
plt.show()

The bar plot above is plotted in Matplotlib. The bars compare the tips received from two groups – Smokers and Non-Smokers. The smoker column is a qualitative variable. The Series data is passed to the x and height arguments.

Let us again plot a barplot using the Seaborn library as shown below.

sns.barplot(x='smoker',y='tip',data=tips)
plt.title('Tips vs Smoker')
plt.show()


The bar plot above is plotted in Seaborn. We have passed the column names to the x and y parameters of the barplot function, and the DataFrame to the data parameter. As in the previous example, the bars compare the tips received from two groups – Smokers and Non-Smokers. Notice how the bars are displayed in different colors, and that the axes labels are taken from the input data. We can add custom labels to the plot by calling the set_xlabel() and set_ylabel() functions on the axes object. You can also set the labels using the xlabel() and ylabel() functions defined in the pyplot module of the Matplotlib library.

fig1,axes1 = plt.subplots()
sns.barplot(x='smoker',y='total_bill',data=tips,hue='sex',estimator=np.sum,errcolor='r',errwidth=0.75,capsize=0.2,ax=axes1)
axes1.set_xlabel('Smoker - Yes/No')
axes1.set_ylabel('Bill amount')
axes1.set_title('Total bill vs Smoker')
plt.show()

We can also create a figure object with multiple axes and render each plot onto a specific axes by using the ‘ax’ argument. If we do not specify a value for this argument, the plot is rendered on the current axes.

The ‘hue’ parameter can be used to show information about the different sub-groups in a category. In the above example, the ‘hue’ parameter is assigned to the column ‘sex’ which further categorizes the data and has created two side by side bars. A separate colored bar represents each sub-group and a legend is added to let us know what each sub-category is.

The ‘estimator’ argument can be used to change how the data is aggregated. By default, each bar of a barplot displays the mean (np.mean) of a variable. The estimator argument accepts a function such as np.sum, len, np.median or any other statistical function.
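To make the effect of the estimator concrete, the sketch below draws the same made-up data twice, once with the default mean and once with np.sum, and compares the bar heights with the corresponding Pandas groupby results. The helper bar_heights() is a hypothetical name introduced for this example.

```python
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

df = pd.DataFrame({
    "smoker": ["Yes", "Yes", "No", "No", "No"],
    "tip": [1.0, 3.0, 2.0, 4.0, 6.0],
})

fig, (ax_mean, ax_sum) = plt.subplots(1, 2, figsize=(8, 3))
# Default estimator: each bar shows the mean tip of its group.
sns.barplot(x="smoker", y="tip", data=df, ax=ax_mean)
# estimator=np.sum: each bar shows the total tip of its group.
sns.barplot(x="smoker", y="tip", data=df, estimator=np.sum, ax=ax_sum)


def bar_heights(ax):
    """Heights of the drawn bars, sorted so the category order does not matter."""
    return sorted(p.get_height() for p in ax.patches if p.get_height() > 0)


print(bar_heights(ax_mean), sorted(df.groupby("smoker")["tip"].mean()))
print(bar_heights(ax_sum), sorted(df.groupby("smoker")["tip"].sum()))
```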

The red cap-tipped lines that extend from the edge of the bars are called error bars, and they provide an additional layer of detail in the plotted data. Error bars indicate the estimated error or uncertainty of a data point. A short error bar shows that the values are concentrated, indicating that the plotted average is more reliable, while a long error bar indicates that the values are more spread out and the average is less reliable.
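By default, barplot() draws a 95% confidence interval for the estimate, computed by bootstrapping. The NumPy sketch below reproduces the idea by hand on made-up numbers; it is a simplified illustration, not Seaborn's exact implementation.

```python
import numpy as np

rng = np.random.default_rng(42)
sample = np.array([1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 5.0, 6.5])  # made-up tips

# Resample the data with replacement many times and record each mean.
n_boot = 1000
boot_means = np.array([
    rng.choice(sample, size=sample.size, replace=True).mean()
    for _ in range(n_boot)
])

# The middle 95% of the bootstrap means gives the error-bar limits.
low, high = np.percentile(boot_means, [2.5, 97.5])
print(f"mean={sample.mean():.2f}, 95% CI=({low:.2f}, {high:.2f})")
```

A sample with more spread, or with fewer observations, would give a wider interval and therefore a longer error bar.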