Table of Contents
- Spread of data and Central tendency
- Introduction to Box plots and five-number summary
- Box plot for symmetric distribution
- Box plot for negatively skewed distribution
- Box plot for negatively skewed distribution with less spread
- Creating plots using boxplot() function
Spread of data and Central tendency
To analyze the distribution of quantitative data, we use two important features: spread of data and central tendency. The spread of data or the variability of the values in the data indicates how spread out a distribution is. Central tendency provides a descriptive summary of a data set through a single value that reflects the center of the data distribution.
Measure the spread of data
The two most commonly used measures of the spread of a distribution are the range and the interquartile range (IQR).
Range: The range is the difference between the highest and lowest values in a data set and is the simplest measure of spread.
Interquartile Range (IQR): The IQR indicates the spread of the middle 50% of the data. It is the difference between the third quartile (Q3) and the first quartile (Q1).
In the above figure, we have two datasets Height_GroupA and Height_GroupC which represent the heights (in centimeters) of students belonging to different groups. Both groups have 15 students each. The data points for both the datasets are displayed on a number line in the figure.
The range of Height_GroupA is 78.6, while the range of Height_GroupC is 28.4. When the variation is large, the values lie far away from each other; when it is small, the values lie close together. It is also evident from the above figure that the values in Height_GroupA are more dispersed than the values in Height_GroupC.
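Both measures take only a few lines to compute. The sketch below is illustrative: the two lists are made-up heights, not the Height_GroupA and Height_GroupC values from the figure, and `spread` is a hypothetical helper name. Quartiles are taken with the "inclusive" method of Python's statistics module; other quartile conventions can give slightly different IQR values.

```python
from statistics import quantiles

def spread(values):
    """Return (range, IQR) for a list of numbers."""
    value_range = max(values) - min(values)
    # quantiles(..., n=4) returns the three cut points Q1, Q2 (median), Q3.
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    return value_range, q3 - q1

# Made-up heights in centimeters (assumption: not the article's data).
widely_spread = [90, 100, 110, 120, 130, 140, 150, 160, 170]
tightly_packed = [78, 82, 86, 90, 94, 98, 100, 104, 106]

print(spread(widely_spread))   # (80, 40.0)
print(spread(tightly_packed))  # (28, 14.0)
```

The first group has both a larger range and a larger IQR, which matches the idea that its values lie farther apart.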
Measure the Central tendency
The three most common measures of central tendency are the mean, median, and mode.
Mean: The mean is the sum of the values in a set of data divided by the number of values in the data set.
Median: The median is the middle value in the data set when the values are arranged in ascending or descending order.
Mode: The mode is the most frequently occurring value in the dataset.
The values in Height_GroupC range from 78.4 to 106.8, and the mean is 96.1, which is a reasonable single value to represent the data set. Now let us change the last value in this data set to 300. Can the new mean, 108.9, still represent all the values in this group? It is larger than every value in the group except the modified one, so it does not reflect the typical height of a student. The mean is not a robust measure because it is strongly influenced by very low or very high values in the data.
For distributions with extreme values, we can use the median instead of the mean to represent the data set with a single value. The median is the middle value of an ordered data set: half of the data lies below it and the other half above it. The median is much less affected by very high or very low values. In the above figure, the number enclosed in a circle is the median of the data set.
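The effect of a single extreme value on the mean and the median can be checked directly. This is a minimal sketch with five made-up heights rather than the 15 values of Height_GroupC, so the numbers differ from those quoted above.

```python
from statistics import mean, median

# Made-up heights in centimeters (assumption: not the article's data).
heights = [90.0, 94.0, 96.0, 100.0, 105.0]
# Same group with the last value replaced by an extreme one.
heights_with_extreme = heights[:-1] + [300.0]

print(mean(heights), median(heights))                            # 97.0 96.0
print(mean(heights_with_extreme), median(heights_with_extreme))  # 136.0 96.0
```

The mean jumps from 97 to 136, while the median stays at 96.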
Introduction to Box plots and five-number summary
The spread or variability of a distribution can be represented graphically using box plots. Box plots, also called box-and-whisker plots, visually represent the distribution of numerical data. They display the spread of the data and reveal any skewness. Skewness is a measure of the lack of symmetry in a data distribution.
Box plots are constructed using five key values. The key values are called a five-number summary. The five-number summary provides information as to where the data in a data set is clustered or where the data is scattered.
A five-number summary consists of the below numbers:
Minimum: The lowest value in the data, excluding outliers.
First Quartile (Q1): Twenty-five percent of values in the data fall below the first quartile value.
Median (Q2): When the data values are arranged in increasing order, the median is the value located at the center. It is generally preferred over the mean for skewed distributions.
Third Quartile (Q3): Seventy-five percent of the values fall below the third quartile.
Maximum: The highest value in the data, excluding outliers.
The first quartile, the median and the third quartile divide the data into four parts, each containing approximately the same number of values.
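The four-way split can be verified by counting. The sketch below uses made-up scores and the "inclusive" quartile method of Python's statistics module, which is one of several conventions for computing quartiles.

```python
from statistics import quantiles

# Made-up scores, already sorted (assumption: illustration only).
scores = [12, 15, 17, 20, 22, 25, 27, 30, 33, 35, 38, 40]

q1, q2, q3 = quantiles(scores, n=4, method="inclusive")
print(q1, q2, q3)  # 19.25 26.0 33.5

# Count how many values land in each of the four parts.
parts = [
    sum(1 for s in scores if s <= q1),
    sum(1 for s in scores if q1 < s <= q2),
    sum(1 for s in scores if q2 < s <= q3),
    sum(1 for s in scores if s > q3),
]
print(parts)  # [3, 3, 3, 3]
```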
As the name indicates, a box-and-whisker plot has two parts: a rectangle (the box) and a set of whiskers. Whiskers are the lines that extend from the box out to the lowest and highest values that are not outliers. Outliers are data points that lie far beyond the rest of the data; they are unusually large or small values. Because they are so extreme, outliers can distort summary measures such as the mean and the range. In a box plot, outliers are usually shown as dots beyond the ends of the whiskers, away from the rest of the data.
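The text above does not say how far from the box a point must be to count as an outlier. A common convention, and the default in Matplotlib and Seaborn (whis=1.5), is the 1.5 × IQR rule: a point is an outlier if it lies more than 1.5 × IQR below Q1 or above Q3. The sketch below applies that rule. The choice of rule, the helper name and the data are assumptions made for illustration.

```python
from statistics import quantiles

def whiskers_and_outliers(values, k=1.5):
    """Return (lower whisker end, upper whisker end, outliers) using the k * IQR rule."""
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - k * iqr
    upper_fence = q3 + k * iqr
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    outliers = [v for v in values if v < lower_fence or v > upper_fence]
    # Whiskers end at the most extreme values that are still inside the fences.
    return min(inside), max(inside), outliers

# Made-up values with one unusually large entry.
data = [52, 55, 57, 58, 60, 61, 63, 65, 120]
print(whiskers_and_outliers(data))  # (52, 65, [120])
```

Here Q1 is 57 and Q3 is 63, so the fences sit at 48 and 72. The value 120 falls outside them and would be drawn as a dot, while the whiskers stop at 52 and 65.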
Box plot for symmetric distribution
Observe the above box plot for the dataset Height_GroupA. The lower edge of the rectangle denotes the first quartile, the upper edge denotes the third quartile, and the line inside the box is the median. The median divides the dataset into two equal parts. When the median is at the center of the box and the whiskers are about the same length on both sides, the distribution is said to be symmetric.
- Range –> 169.2-90.6 = 78.6
- IQR –> 159.5-99.9 = 59.6
Both the range and the IQR are large, which means the values in the dataset are widely spread.
Box plot for negatively skewed distribution
The median divides the dataset into two equal parts. In the above box plot, the median is not located at the center of the box but is drawn closer to the third quartile (Q3). The median is closer to the upper end of the box because it is pulled toward the cluster of values. The values below the median (to its left on a number line) are more spread out. In this case, the distribution is said to be negatively skewed (skewed left). A distribution that has most of its values clustered on the right side and has a long left tail is said to be negatively skewed.
- Range –> 169.2-90.6 = 78.6
- IQR –> 159.5-99.8 = 59.7
Both the range and the IQR are large, which means the values in the dataset are widely spread.
Box plot for negatively skewed distribution with less spread
The median in the above box plot is closer to the third quartile (Q3) for the same reason as in the earlier example. Here the whisker on the lower end of the box is also longer. If one whisker is longer than the other, the distribution is skewed in the direction of the longer whisker, so this distribution is negatively skewed (skewed left). The smaller range and IQR below show that the values are less spread out than in the previous examples.
- Range –> 106.8-78.4 = 28.4
- IQR –> 104.7-90.9 = 13.8
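The reading of skew from the position of the median inside the box can be written as a small rule of thumb. The sketch below is a rough heuristic rather than a formal skewness measure: it looks only at the box (not the whiskers), the function name and the tolerance are arbitrary choices, and the quartile and median values passed in are hypothetical.

```python
def skew_from_box(q1, median, q3, tolerance=0.1):
    """Classify skew from where the median sits inside the box.

    tolerance is the fraction of the box length by which the two halves
    may differ before the box is called skewed (an arbitrary choice).
    """
    lower_half = median - q1   # box length below the median
    upper_half = q3 - median   # box length above the median
    if abs(lower_half - upper_half) <= tolerance * (q3 - q1):
        return "roughly symmetric"
    # The longer half marks the side where values are more spread out.
    return "negatively skewed" if lower_half > upper_half else "positively skewed"

print(skew_from_box(q1=100, median=130, q3=160))  # roughly symmetric
print(skew_from_box(q1=100, median=150, q3=160))  # negatively skewed
print(skew_from_box(q1=100, median=110, q3=160))  # positively skewed
```

A median close to Q3 gives a long lower half of the box, which is the negatively skewed case described above. A median close to Q1 gives the opposite, positively skewed case.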
Now that we have a basic understanding of box plots, let us graph some box plots using the Seaborn library. The boxplot() function in Seaborn is used to visualise the distribution of numeric data and also compare different categories or groups.
Creating plots using boxplot() function
Box plot for single distribution
Box plots can be used to visualize single or multiple distributions. Let us look at an example that visualizes a single distribution. For our analysis we will load the ‘tips’ dataset, which consists of the tips received by a waiter in a restaurant over a period of time. The input data can be passed directly to the x, y, and/or hue parameters of the boxplot() function. Let us visualize the quantitative data in the column ‘total_bill’ by passing it to the parameter ‘x’.
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
import numpy as np

# Load the example 'tips' dataset that ships with Seaborn
tips = sns.load_dataset('tips')
# Show the first five rows
tips.head()
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4
sns.boxplot(data=tips, x='total_bill')
plt.title('Total bill')
plt.show()
The box plot displays the distribution of the quantitative variable through its five-number summary, plotted along the x-axis. From the above box plot, we see that the amount spent by customers in the restaurant ranges from about $3 to $41, excluding the outliers. The median is somewhere around 17 and is closer to the lower end of the box, which means the values between the first quartile and the median are clustered. The median is farther from the upper end of the box, which means the values between the median and the third quartile are more scattered.
Notice that the whisker on the upper (right) end of the box is longer. If one whisker is longer than the other, the distribution is skewed in the direction of the longer whisker. Overall, the distribution is positively skewed (skewed right). A distribution that has most of its values clustered on the left side and has a long right tail is said to be positively skewed.
The dots located outside the right whisker are the outliers or the unusually large values in the data.
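The above box plot is oriented horizontally because the column was passed to ‘x’. Passing the column to ‘y’ instead draws the plot vertically. The orientation can also be set explicitly by passing ‘v’ or ‘h’ to the parameter ‘orient’.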
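The two orientations can be compared side by side. The sketch below uses a small made-up column of bill amounts instead of the ‘tips’ dataset so that it runs without a download. The helper name and the data are assumptions, and the orientation is switched simply by assigning the column to ‘x’ or to ‘y’.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

def draw_both_orientations():
    """Draw the same column as a horizontal and a vertical box plot."""
    # Made-up bill amounts (assumption: stands in for the 'tips' data).
    bills = pd.DataFrame(
        {"total_bill": [10.3, 13.4, 16.9, 17.8, 21.0, 23.7, 24.6, 48.2]}
    )
    fig, (ax_h, ax_v) = plt.subplots(1, 2, figsize=(8, 3))
    sns.boxplot(data=bills, x="total_bill", ax=ax_h)  # column on x: horizontal
    sns.boxplot(data=bills, y="total_bill", ax=ax_v)  # column on y: vertical
    ax_h.set_title("Horizontal")
    ax_v.set_title("Vertical")
    fig.tight_layout()
    return ax_h, ax_v

ax_h, ax_v = draw_both_orientations()
```

Calling plt.show() afterwards displays the figure.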
Compare box plots of different categories – 1
In the next example, we will compare the tips received from two groups: male and female customers. Let us plot the data with the boxplot() function using the ‘sex’ and ‘tip’ columns.
sns.boxplot(data=tips, x='sex', y='tip')
plt.title('Tips received from male and female customers')
plt.show()
In the tips dataset, ‘sex’ is a categorical grouping variable that splits the data into two parts: male and female customers. One box plot is therefore drawn for each category. The categorical variable ‘sex’ is mapped to the x-axis and the quantitative variable ‘tip’ is mapped to the y-axis.
Let us compare the two box plots – Male and Female:
- The two box plots have roughly identical medians.
- Male is more spread out compared to Female.
- Male has multiple outliers whereas Female has one outlier.
- Both the distributions are positively skewed.
Overall, it appears that female customers tip slightly less than male customers, although the medians are close.
Compare box plots of different categories – 2
In this example, let us analyse the effects of the day of week on the bill amount at the restaurant. Let us pass the data in columns ‘day’ and ‘total_bill’ to the arguments ‘x’ and ‘y’ of the boxplot() function.
# Note: newer Seaborn releases (0.13+) may warn when 'palette' is passed without 'hue'
sns.boxplot(data=tips, x='day', y='total_bill', palette='Set1')
plt.title('Sales Analysis')
plt.show()
Comparing the four box plots for Thur, Fri, Sat and Sun, we can say that customer spending is higher on the weekend (Sat and Sun).
Side-by-side box plots
So far, we have split the data into groups based on a categorical variable. Let us further split these groups into sub-groups by introducing a third variable.
In the next example, the data in column ‘smoker’ is passed to the ‘hue’ parameter, which further categorizes the data and creates two side-by-side box plots for each group (Male and Female). Each sub-group is drawn as a separately colored box plot, and a legend identifies the sub-categories. Let us use these box plots to see who tips better.
sns.boxplot(data=tips, x='sex', y='tip', hue='smoker')
plt.title('Grouped Boxplots')
plt.show()
The box plots suggest that male customers tip more than female customers. Comparing the sub-groups (Smoker/Non-Smoker), smokers appear to tip more than non-smokers.