Distribution plot

Experimental and Theoretical Probabilities

Suppose you conduct an experiment in which a fair coin is tossed ‘n’ times and the outcome of every toss, heads or tails, is recorded. The data now consists of all the observed outcomes of the experiment. To find the probability of the coin landing heads up from this data, we calculate the experimental probability. Experimental probability is the probability of an event estimated from an experiment that was actually conducted, so it is based on direct observation. It is the ratio of the number of trials with a certain outcome to the total number of trials.

Experimental probability = Number of event occurrences/Total number of trials

There is another type of probability, called theoretical probability, for which you do not conduct an experiment. To determine the theoretical probability of an event, you analyze what could happen using the information available, or use an existing formula or model. When all outcomes are equally likely:

Theoretical probability of an event = Number of favourable outcomes/Total number of possible outcomes

The experimental and theoretical probabilities may differ, because experimental probability is based on how the event actually occurred during the experiment, while theoretical probability is based on a study of the possible outcomes of the experiment. The experimental probability tends to get closer to the theoretical probability as the number of trials increases, that is, when ‘n’ is very large.
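The convergence described here can be checked with a short simulation. The following is a minimal sketch using only the Python standard library; the function name experimental_probability and the fixed seed are illustrative assumptions, not part of the original experiment.

```python
import random


def experimental_probability(n_tosses, seed=0):
    """Toss a simulated fair coin n_tosses times and return the share of heads."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    heads = sum(rng.random() < 0.5 for _ in range(n_tosses))
    return heads / n_tosses


# One favourable outcome (heads) out of two equally likely outcomes.
THEORETICAL = 1 / 2

for n in (10, 100, 10_000, 1_000_000):
    p = experimental_probability(n)
    print(f"n={n:>9}: experimental={p:.4f}, gap={abs(p - THEORETICAL):.4f}")
```

The gap is not guaranteed to shrink at every step, but for large ‘n’ it should be small.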

Let us now see how to compare the experimental probability with the theoretical probability using distribution plots in the Seaborn library.

Introduction to Distribution Plots

Distribution plots show the univariate distribution of data. They are used to assess the distribution visually by comparing the observed data with the expected outcomes. A distribution plot displays two different plots for the same variable: a histogram that represents the observed data, superimposed with a density curve that represents the expected outcomes.

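Let us load the ‘tips’ dataset and pass a column of its data to the distplot() function, whose signature is shown below. Note that distplot() has been deprecated since seaborn 0.11 in favor of histplot() and displot(), so recent versions emit a warning when it is called.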

seaborn.distplot(a, bins=None, hist=True, kde=True, rug=False, fit=None, hist_kws=None, kde_kws=None, rug_kws=None, fit_kws=None, color=None, vertical=False, norm_hist=False, axlabel=None, label=None, ax=None)

Create Distribution Plots

Overlay KDE plot on histogram

import seaborn as sns
from matplotlib import pyplot as plt

tips = sns.load_dataset('tips')
tips.head()

   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4

sns.distplot(tips['total_bill'])
plt.show()

By default, distplot() displays a histogram overlaid with a density curve. To omit the histogram, set the parameter ‘hist’ to False. To omit the density curve, set the parameter ‘kde’ to False.

In the distplot above, the horizontal axis of the histogram covers the entire range of data values. The vertical axis is rescaled to show density rather than counts: the relative frequency of each interval (the count of data values that fall into it divided by the total count), divided by the width of the interval. With this scaling, the areas of all the bars in the histogram sum to 1. The area under the density curve represents the theoretical probability and, just as for the histogram, it equals 1.
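To make the scaling concrete, here is a small sketch in plain Python that bins a handful of values and checks that the bar areas sum to 1. It uses the five total_bill values from the table shown earlier. The helper density_histogram and the choice of three equal-width bins are assumptions made for illustration; seaborn chooses its own bins.

```python
def density_histogram(values, n_bins):
    """Return (bin_width, relative_frequencies, densities) for equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    counts = [0] * n_bins
    for v in values:
        # The maximum value is placed in the last bin rather than a new one.
        idx = min(int((v - lo) / width), n_bins - 1)
        counts[idx] += 1
    total = len(values)
    rel_freq = [c / total for c in counts]    # bar heights that sum to 1
    density = [f / width for f in rel_freq]   # bar heights whose areas sum to 1
    return width, rel_freq, density


bills = [16.99, 10.34, 21.01, 23.68, 24.59]
width, rel_freq, density = density_histogram(bills, n_bins=3)

print("sum of relative frequencies:", sum(rel_freq))
print("total bar area:", sum(d * width for d in density))
```

Both sums equal 1 up to floating-point rounding. The difference is that relative frequencies sum to 1 as heights, while densities sum to 1 once each height is multiplied by the bin width.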

As you can see from the plot, the distribution of ‘total_bill’ is unimodal and skewed to the right, which is evident from both the histogram and the density curve. The histogram and the density curve match each other well, indicating that the observed data and the expected outcomes are very close.

Overlay Rug plot on KDE


The distplot above displays a density curve and a rug plot. The rug plot is displayed by setting the parameter ‘rug’ to True. A rug plot shows the distribution of a single variable: for each data value, a short vertical line is drawn along one axis. The height of the lines is arbitrary. A rug plot alone does not provide adequate information about the distribution, so it is used alongside other plots such as a histogram or a kernel density estimate.

In the plot above, most of the values are clustered in the range 10-28. Towards the tails there are relatively few values.

Overlay Normal Distribution curve on histogram

from scipy.stats import norm

sns.distplot(tips['total_bill'], fit=norm, kde=False)

In the previous example, we superimposed a kernel density curve over the histogram in order to compare the theoretical probability with the experimental probability. We observed that ‘total_bill’ is right-skewed and that the histogram matches the density curve well.

In this example, we will assume that the data in column ‘total_bill’ has a normal distribution. With this assumption, we will overlay a normal distribution curve on the histogram. The scipy.stats module contains a large number of probability distributions; let us import the normal distribution from this module and fit it to the data using the distplot function. The normal distribution object, ‘norm’, is passed to the argument ‘fit’. Set the parameter ‘kde’ to False so that you see only the histogram and the normal distribution curve.

In the plot above, the normal distribution curve drawn in black represents the theoretical probability. It has the following properties:

- The mean, median and mode are equal and are located at the center of the distribution.

- It is a bell-shaped curve. A bell curve has most of the values at the center of the distribution, about the mean, and relatively few values in the tails.

- It is symmetric about the mean and has no skew, which means that the mean is at the center and divides the area into equal halves.

If you look at the histogram, the distribution is skewed to the right: most of the values are relatively small and concentrated on the left side of the distribution, while the right tail, which consists of a few exceptionally large values, is stretched out. These exceptional values pull the mean to the right, so the mean is greater than the median. We can infer from the distplot above that the data in column ‘total_bill’ is not normally distributed, as the histogram and the normal distribution curve do not match. Note that the normal probability model is not appropriate for skewed data.
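The effect of skew on the mean and median can be illustrated numerically. The sketch below uses only the Python standard library and synthetic gamma-distributed values as a stand-in for right-skewed bill amounts; these are an assumption, not the real ‘tips’ data. It compares the sample with a normal model fitted to it.

```python
import random
import statistics

rng = random.Random(1)
# Synthetic right-skewed data standing in for bill amounts.
bills = [rng.gammavariate(2.0, 10.0) for _ in range(5000)]

mean = statistics.mean(bills)
median = statistics.median(bills)

# A normal model fitted to the same data is symmetric: its mean equals its median.
fitted = statistics.NormalDist.from_samples(bills)

# Share of observations below the mean: 0.5 under a normal model,
# larger when a long right tail pulls the mean upwards.
share_below_mean = sum(b < mean for b in bills) / len(bills)

print(f"sample: mean={mean:.2f}, median={median:.2f}")
print(f"normal model: mean={fitted.mean:.2f}, median={fitted.median:.2f}")
print(f"share below mean: observed={share_below_mean:.3f}, "
      f"normal model={fitted.cdf(fitted.mean):.3f}")
```

For the skewed sample the mean exceeds the median and more than half of the values fall below the mean, which a symmetric normal curve cannot reproduce.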

Customizing the Distribution Plots


We can change the appearance of the distplot by specifying values for the keyword arguments ‘hist_kws’, ‘kde_kws’ and ‘rug_kws’. Each takes a dictionary of properties, such as color, linewidth, linestyle and transparency, that is applied to the corresponding plot.
