Table of Contents
- Histogram for the sum of two dice
- Histogram for the sum of three dice
- Kernel Density Estimate (KDE)
- Univariate Kernel Density Estimate
- Bivariate Kernel Density Estimate
Histogram for the sum of two dice
Let us begin by setting up the input data. Suppose we conduct an experiment in which two dice are rolled multiple times and the outcome of each roll, the sum of the two faces, is recorded. We will visualize the distribution of this data using a histogram.
We call the randint() function from the NumPy library to generate an array of random integers between 2 (the smallest possible sum) and 12 (the largest possible sum); because the upper bound of randint() is exclusive, we pass 13. Note that randint() draws every value in this range with equal probability, which is a simplification: with real dice, middle sums such as 7 are more likely than 2 or 12. The array is then passed to the hist() function in the Matplotlib library to generate a histogram. Initially the sample data contains 10 observations.
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
import numpy as np
x = np.random.randint(2,13,10)
array([ 2, 7, 6, 6, 4, 3, 3, 12, 2, 9])
The hist() function automatically calculates the size of each bin of the histogram. By default, the entire range of the input data is split into 10 equal-width bins, and the number of data points that fall into each bin is counted. A histogram displays the distribution of the data and gives an estimate of where the values are concentrated. From the above histogram, we observe that the outcomes 2, 3 and 6 each occurred twice, while the other outcomes that appear in the sample (4, 7, 9 and 12) each occurred once.
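The binning step described above can be reproduced without drawing anything. The following is a minimal sketch that assumes the ten values printed earlier and uses NumPy's histogram() function, which Matplotlib's hist() relies on for the counting; bins=10 is Matplotlib's default bin count.

```python
import numpy as np

# Sample from the text: ten recorded sums of two dice.
x = np.array([2, 7, 6, 6, 4, 3, 3, 12, 2, 9])

# Split the range min(x)..max(x) into 10 equal-width bins and count the points.
counts, edges = np.histogram(x, bins=10)

# Each bin includes its left edge; only the last bin also includes its right edge.
for left, right, count in zip(edges[:-1], edges[1:], counts):
    print(f"{left:5.1f} to {right:5.1f}: {count}")
```

Printing the edges next to the counts makes it easy to check which outcomes share a bin before interpreting the bars.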
Histogram for the sum of three dice
In the next example, we will look at the distribution of the data when three dice are rolled. The sample data contains 50 observations.
y = np.random.randint(3,19,50)
array([ 7, 9, 6, 13, 12, 11, 16, 13, 17, 6, 16, 10, 14, 18, 7, 11, 6, 15, 15, 13, 10, 16, 15, 14, 17, 4, 9, 4, 12, 4, 13, 17, 7, 6, 8, 13, 9, 13, 9, 17, 5, 17, 7, 7, 15, 8, 15, 17, 10, 17])
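The sample above comes from randint(3,19,50), which treats every sum from 3 to 18 as equally likely. If the goal is to mimic physical dice, one option is to roll each die separately and add the faces. The sketch below is an illustration under that assumption; the helper name roll_dice_sums and the seed value are hypothetical and not part of the original example.

```python
import numpy as np

def roll_dice_sums(n_dice, n_rolls, seed=None):
    """Return the sum of n_dice fair six-sided dice for each of n_rolls rolls."""
    rng = np.random.default_rng(seed)
    # integers() excludes the upper bound, so 7 yields faces 1 through 6.
    faces = rng.integers(1, 7, size=(n_rolls, n_dice))
    return faces.sum(axis=1)

# 50 rolls of three dice; the seed only makes the sample reproducible.
sums = roll_dice_sums(n_dice=3, n_rolls=50, seed=0)
print(sums.min(), sums.max())
```

With this approach, sums near the middle of the range appear more often than the extremes, which is what a histogram of real dice rolls would show.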
The above histogram groups the 50 sums into bins. With the default settings, hist() divides the range of the data (4 to 18 here) into 10 equal-width bins, so each bin is 1.4 units wide and the bin edges do not line up with the integer sums. Suppose the default number of bins does not show enough detail of the distribution. Let's change a few parameters of the histogram: the number of bins and the lower end of the bin range.
As you can see, changing the number of bins and the range affects the appearance of the histogram. When we change the number of bins, the data points are grouped differently, and different choices of bin count and end points can lead to different interpretations of the same data. Histograms can therefore sometimes be misleading. An alternative is to use a kernel density plot, which removes the dependency on the end points of the bins.
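To make the effect of the bin settings concrete, here is a small sketch that draws the three-dice sample twice: once with the defaults and once with explicit bins and range arguments to hist(). The particular values (16 bins over 2.5 to 18.5) are an assumption chosen so that each bin is centered on one integer sum; they are not the settings used for the original figures.

```python
import numpy as np
from matplotlib import pyplot as plt

# Three-dice sample from the text (50 observations).
y = np.array([7, 9, 6, 13, 12, 11, 16, 13, 17, 6, 16, 10, 14, 18, 7, 11, 6,
              15, 15, 13, 10, 16, 15, 14, 17, 4, 9, 4, 12, 4, 13, 17, 7, 6,
              8, 13, 9, 13, 9, 17, 5, 17, 7, 7, 15, 8, 15, 17, 10, 17])

fig, (ax_default, ax_custom) = plt.subplots(1, 2, figsize=(10, 4))

# Default: 10 equal-width bins spanning min(y) to max(y).
default_counts, default_edges, _ = ax_default.hist(y)
ax_default.set_title("Default bins")

# Hypothetical alternative: 16 bins of width 1, each centered on an integer sum.
custom_counts, custom_edges, _ = ax_custom.hist(y, bins=16, range=(2.5, 18.5))
ax_custom.set_title("bins=16, range=(2.5, 18.5)")
```

Call plt.show() afterwards to display the figure. The same 50 values produce visibly different shapes in the two panels, which is the sensitivity to bin choice discussed above.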
Kernel Density Estimate (KDE)
In the previous examples, we used a histogram to estimate the distribution of data. Kernel density estimation (KDE) is another widely used technique for estimating the distribution of data. In a histogram, each value in the dataset is represented by a rectangular block, and the blocks are piled on top of each other within the bins to show the number of values in each bin range. In a KDE plot, each data point is instead represented by a small continuous curve, called a kernel, centered on its true value, so that each point contributes a small area around that value. The kernel can take different shapes, such as a box, a triangle or a Gaussian curve. All of these curves are then added together to obtain a single smooth density estimate. Because the kernels are allowed to overlap with each other, the result does not depend on the end points of any bins.
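The construction described above, one kernel per data point with all kernels added together, can be written out directly. The following is a minimal NumPy sketch using a Gaussian kernel; the function name gaussian_kde_manual, the grid and the bandwidth of 1.0 are illustrative assumptions, and Seaborn's own implementation may differ in its details.

```python
import numpy as np

def gaussian_kde_manual(data, grid, bandwidth):
    """Evaluate a Gaussian KDE on `grid` by combining one kernel per data point."""
    data = np.asarray(data, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # Row i holds the normal curve centered on data[i], evaluated on the grid.
    z = (grid[None, :] - data[:, None]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2.0 * np.pi))
    # Averaging (sum divided by n) keeps the total area under the curve at 1.
    return kernels.mean(axis=0)

x = [2, 7, 6, 6, 4, 3, 3, 12, 2, 9]
grid = np.linspace(-5.0, 20.0, 2501)
density = gaussian_kde_manual(x, grid, bandwidth=1.0)

# Riemann-sum check that the estimate integrates to roughly 1.
area = density.sum() * (grid[1] - grid[0])
print(round(area, 3))
```

Because every kernel is a full probability density and the kernels are averaged, the result is itself a density, which is why the y-axis of a KDE plot is not a count.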
The kernel width, or bandwidth, controls the smoothness of the resulting density curve. If the bandwidth is too small, the density estimate has too many peaks, making the distribution difficult to interpret. On the other hand, if the bandwidth is too large, information about the real distribution and subtle features of the data under analysis will be obscured. The bandwidth has to be chosen so that it highlights all the important features while maintaining smoothness.
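One rough way to see this trade-off numerically is to count the local maxima of the density curve for several bandwidths. The sketch below assumes a Gaussian kernel and the ten-value sample from earlier; the helper names and the list of bandwidths are hypothetical choices for illustration.

```python
import numpy as np

def kde(data, grid, bandwidth):
    """Gaussian KDE: the average of one normal curve per data point."""
    z = (grid[None, :] - np.asarray(data, dtype=float)[:, None]) / bandwidth
    return (np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2.0 * np.pi))).mean(axis=0)

def count_peaks(density):
    """Count strict local maxima of a sampled curve."""
    inner = density[1:-1]
    return int(np.sum((inner > density[:-2]) & (inner > density[2:])))

x = [2, 7, 6, 6, 4, 3, 3, 12, 2, 9]
grid = np.linspace(-5.0, 20.0, 2501)

# Hypothetical bandwidths, from very narrow to very wide kernels.
for bandwidth in (0.1, 0.5, 1.0, 2.0, 4.0):
    print(bandwidth, count_peaks(kde(x, grid, bandwidth)))
```

Narrow kernels leave a separate bump at nearly every distinct value, while wide kernels merge the bumps into a few broad ones.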
Seaborn provides the kdeplot() function to plot a univariate or bivariate kernel density estimate. Let's generate a KDE plot using the dataset ‘x’ created above. The bandwidth of the kernel can be adjusted using the ‘bw’ argument. In Seaborn 0.11 and later, ‘bw’ is deprecated in favor of the ‘bw_method’ and ‘bw_adjust’ arguments.
Univariate Kernel Density Estimate
A Univariate plot is based on a single variable.
sns.kdeplot(x)
plt.title('KDE Plot')
plt.show()
Above we see a KDE plot for the dataset ‘x’. Note that the y-axis shows probability density (the total area under the curve is 1), not the number of data points falling into a range. The density curve has one distinct peak, indicating that the distribution is unimodal.
KDE plot with low bandwidth
sns.kdeplot(x, bw=0.1)
plt.title('KDE Plot with low bw')
plt.show()
We see that when the bandwidth is too low, the density curve has too many peaks and appears to be multimodal (having multiple peaks).
KDE plot with high bandwidth
sns.kdeplot(x, bw=1)
plt.title('KDE Plot with high bw')
plt.show()
If the bandwidth is too large, then smaller features in the distribution of the data may disappear. In the above plot small bumps got smoothed out, obscuring important information. This is also known as over-smoothing of the curve.
Customizing the KDE plot
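By setting the parameter ‘shade’ to True, the area under the density curve is filled with a color. In Seaborn 0.11 and later, the equivalent parameter is ‘fill’.
If the parameter ‘vertical’ is set to True, the plot is rotated: the data values are placed on the y-axis and the density is plotted on the x-axis. In Seaborn 0.11 and later, the same effect is obtained by passing the data as ‘y’.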
Bivariate Kernel Density Estimate
While a univariate KDE is based on one random variable, a bivariate KDE is based on two random variables. The kdeplot() function in Seaborn can be used to generate a bivariate KDE, which reveals the relationship between the two variables. The bivariate KDE has a three-dimensional, bell-shaped appearance. Even though 3D plots are visually appealing, they are difficult to read because some parts of the plot are blocked by other parts, and not all applications support rotation of 3D plots.
One common way of displaying information about a 3D surface using only two dimensions is to use level curves, or contour lines. In the next example we will use contour plots to illustrate bivariate KDEs. Contour plots represent data for three variables in two dimensions. Each contour line is drawn in the xy-plane by varying the x and y values while keeping the third variable constant. That means each line joins points that have the same density. Let's now plot a bivariate KDE by passing the arguments data and data2 to the kdeplot() function, which specify the x-coordinates and y-coordinates of the points to be plotted. In Seaborn 0.11 and later, the same data is passed through the ‘x’ and ‘y’ arguments instead.
z = np.random.randint(3,19,50)
array([ 6, 9, 13, 5, 7, 3, 5, 8, 3, 7, 8, 16, 3, 8, 16, 18, 14, 18, 5, 6, 7, 10, 17, 17, 15, 8, 12, 5, 14, 7, 6, 14, 5, 5, 6, 4, 13, 18, 5, 16, 9, 11, 9, 4, 9, 3, 16, 17, 15, 14])
Here is a two-dimensional kernel density estimate shown using a contour plot. It displays the joint distribution of the random variables ‘y’ and ‘z’. From the plot it is not clear which regions have high density and which have low density, so let us customize the density plot by color coding the contour lines.
Customizing the Bivariate density plot
Contour plots can use a colormap to color the different levels. We have applied the “Greens” colormap to the above plot. “Greens” is a sequential colormap, which is best suited to numeric data that progresses from low to high, shown by gradually increasing the darkness and saturation of the color. With this colormap, darker colors represent higher density values.
We can also specify the number of levels that we want to see in the contour plot. If there are too few levels, important details may be lost, while too many levels make the plot look cluttered.
By setting the parameter ‘shade’ to True, the area between adjacent contour lines can be filled with varying shades of a color. The darker the shade of the color, the higher is the density.
You can add a vertical colorbar to the right side of the plot by setting the parameter ‘cbar’ to True.