Table of Contents
- Introduction to Violin plots
- Creating plots using violinplot() function
Introduction to Violin plots
A box plot provides statistical information such as the median, range and quartiles, but it does not show the shape of a data set's distribution. If a distribution has multiple peaks, this is not easy to see in a box plot. A violin plot, by comparison, shows the detailed distribution of a numeric variable. It combines a box plot with a density plot: it is a box plot with a density plot on each side. The density plots on either side of the box plot are symmetric because they show the distribution of the same numeric variable.
Inside each violin is a box plot of the data. The left and right edges of the inner box represent the first and third quartiles, and the white circle inside the box marks the median of the data. The thin black lines extending from the box are the whiskers; they reach the lowest and highest values that are not treated as outliers (conventionally, values within 1.5 times the interquartile range of the box edges).
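The box-plot elements described above can be computed by hand. The following is a minimal sketch, assuming a small made-up sample (`bills` is hypothetical, not the tips dataset) and the common 1.5 × IQR whisker convention. Quartile values depend on the interpolation method, so they may differ slightly from what a plotting library draws.

```python
import statistics

# Hypothetical sample of bill amounts (illustrative values, not from the tips dataset)
bills = [8.5, 10.3, 12.0, 13.4, 15.0, 16.9, 17.8, 19.5, 21.0, 24.6, 29.8, 48.3]

# statistics.quantiles with n=4 returns the three cut points [Q1, Q2 (median), Q3]
q1, median, q3 = statistics.quantiles(bills, n=4, method="inclusive")
iqr = q3 - q1

# Whiskers reach the most extreme observations that lie inside the 1.5 * IQR fences
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
whisker_low = min(x for x in bills if x >= lower_fence)
whisker_high = max(x for x in bills if x <= upper_fence)

# Anything beyond the fences is treated as an outlier
outliers = [x for x in bills if x < lower_fence or x > upper_fence]

print(f"Q1={q1:.2f}, median={median:.2f}, Q3={q3:.2f}")
print(f"whiskers: {whisker_low} to {whisker_high}, outliers: {outliers}")
```

In this sample the largest value falls beyond the upper fence, so the upper whisker stops at the next-largest value.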
The density plot estimates the probability density of the variable, producing a smooth curve. The width of the violin at each value shows the estimated density at that value.
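To make the idea of a density estimate concrete, here is a minimal sketch of a Gaussian kernel density estimate written with the standard library only. The sample values and the fixed `bandwidth` of 0.5 are assumptions chosen for illustration; plotting libraries typically pick the bandwidth automatically.

```python
import math

def gaussian_kde(sample, bandwidth):
    """Return a function that estimates the density of `sample` with Gaussian kernels."""
    n = len(sample)
    norm = n * bandwidth * math.sqrt(2 * math.pi)

    def density(x):
        # Sum one Gaussian bump per observation, then normalise
        return sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample) / norm

    return density

sample = [1.0, 1.5, 2.0, 2.0, 2.5, 3.0, 3.5, 5.0]  # hypothetical values
density = gaussian_kde(sample, bandwidth=0.5)

# Evaluate on a grid from -3.0 to 10.0; in a violin plot the half-width
# at each x is proportional to density(x)
step = 0.1
grid = [i * step for i in range(-30, 101)]
widths = [density(x) for x in grid]

# A probability density integrates to 1; check with a simple Riemann sum
area = sum(w * step for w in widths)
print(f"approximate area under the curve: {area:.3f}")
```

The curve is highest where observations are concentrated (around 2 in this sample) and falls towards zero away from the data, which is what the widest and narrowest parts of a violin represent.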
Creating plots using violinplot() function
Create a violin plot (Example1)
The violinplot() function in Seaborn is used to visualize the distribution of numeric data and to compare different categories or groups. Let us graph some violin plots using this function.
For our analysis, we will load the ‘tips’ dataset, which consists of the tips received by a waiter in a restaurant over a period of time. Let us visualize the quantitative data in column ‘total_bill’ by passing this column to the parameter ‘x’.
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt
import numpy as np
tips = sns.load_dataset('tips')
tips.head()
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4
sns.set(style="whitegrid")
sns.violinplot(data=tips, x='total_bill', orient='h')
plt.title('Total bill')
plt.show()
The figure above displays a violin plot which is a combination of a box plot and a density plot. Notice that the density plots on either side of the box plot are symmetric as they both display the distribution of the column ‘total_bill’. The x-axis displays the range of values in the column ‘total_bill’.
The range of the box plot is approximately 38, excluding the outliers. The median, indicated by the white circle, is around 17 and lies closer to the lower end of the box, which means the data points between the first quartile and the median are tightly clustered. The median is farther from the upper end of the box, which means the data points between the median and the third quartile are more spread out.
Notice that the whisker on the upper end of the box is longer. If one whisker is longer than the other, the distribution of the data is skewed in the direction of the longer whisker. Overall, the distribution is positively skewed (skewed right). A distribution that has most of its values clustered on the left side and a long right tail is said to be positively skewed.
Next, let us interpret the density plot. The plot has one distinct peak. The right tail is clearly longer than the left tail, which means the distribution contains a few extremely large values. A long right tail indicates that the distribution is right-skewed, or positively skewed. In a right-skewed distribution, the mean is greater than the median because the tail pulls the mean towards it: the mean is strongly influenced by extreme values, whereas the median is not. The density plot therefore confirms that the data in column ‘total_bill’ is positively skewed.
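The relationship between skewness, mean and median can be checked numerically. This is a small sketch on a made-up right-skewed sample (the numbers are hypothetical and are not taken from the tips dataset), using only the standard library.

```python
import statistics

# Hypothetical right-skewed sample: most values are small, a few are very large
values = [10, 12, 13, 14, 15, 16, 17, 18, 20, 22, 35, 48]

mean = statistics.mean(values)
median = statistics.median(values)

# Moment-based skewness: positive when the right tail is the longer one
sd = statistics.pstdev(values)
skewness = sum((v - mean) ** 3 for v in values) / (len(values) * sd ** 3)

# Make the largest value ten times larger: the mean moves, the median does not
stretched = values[:-1] + [480]
stretched_mean = statistics.mean(stretched)
stretched_median = statistics.median(stretched)

print(f"mean={mean}, median={median}, skewness={skewness:.2f}")
print(f"after stretching the tail: mean={stretched_mean}, median={stretched_median}")
```

The mean sits above the median in the original sample, and exaggerating a single extreme value shifts only the mean.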
Create a violin plot (Example2)
Next let us plot the data in column ‘tip’ using the violinplot() function.
sns.set(style="whitegrid")
sns.violinplot(data=tips, x='tip')
plt.title('Tips')
plt.show()
The distribution of ‘tip’ is bimodal. There are two peaks: one at around 2 and the other slightly above 3. The violin plot shows the detailed distribution of the data, including the peaks and their positions; this information would have been missed in a box plot.
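To see what a box plot can hide, the sketch below draws a box plot and a violin plot of the same data. It assumes a synthetic bimodal sample generated with NumPy (two made-up clusters centred at 2 and 5) rather than the tips data, so that the two peaks are unmistakable.

```python
import numpy as np
import seaborn as sns
from matplotlib import pyplot as plt

rng = np.random.default_rng(0)
# Synthetic bimodal sample: two normal clusters centred at 2 and 5 (assumed values)
sample = np.concatenate([rng.normal(2, 0.4, 200), rng.normal(5, 0.4, 200)])

# Stack the two plots so they share the same x-axis
fig, (ax_box, ax_violin) = plt.subplots(2, 1, sharex=True, figsize=(6, 4))
sns.boxplot(x=sample, ax=ax_box)        # summarises quartiles only
sns.violinplot(x=sample, ax=ax_violin)  # the outline reveals both peaks
ax_box.set_title('Box plot')
ax_violin.set_title('Violin plot')
fig.tight_layout()
# Call plt.show() to display the figure in an interactive session
```

The box plot shows a single wide box, while the violin outline narrows in the middle and widens around each cluster.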
Compare violin plots of different categories
In the next example, we will compare the tips received from two groups, male and female customers. Let's plot the data with the violinplot() function using the ‘sex’ and ‘tip’ columns.
sns.set(style="whitegrid")
sns.violinplot(data=tips, y='sex', x='tip')
plt.title('Tips')
plt.show()
The categorical variable ‘sex’ is on the y-axis and the quantitative variable ‘tip’ is on the x-axis.
Let us compare the two violin plots – Male and Female:
- Both density plots rise up to a certain point, after which the Female plot falls gradually, whereas the Male plot has another peak at approximately 3.5.
- Male and Female density plots look similar for low values.
- Male has a more elongated distribution compared to Female.
- The median for Male is pulled slightly towards Q3; for Female, the median is at the center of the box.
- Both the distributions are positively skewed.
Overall, it appears that female customers tip less than male customers.
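A visual comparison like the one above can be backed up with per-group summary statistics. The following is a minimal sketch using pandas; the `sample` frame is a made-up miniature with the same column names as the tips data, so the numbers it prints are illustrative only.

```python
import pandas as pd

# Hypothetical miniature of the tips data (made-up values, same column names)
sample = pd.DataFrame({
    'sex': ['Male', 'Male', 'Male', 'Male', 'Female', 'Female', 'Female', 'Female'],
    'tip': [2.0, 3.0, 3.5, 6.0, 1.5, 2.5, 3.0, 4.0],
})

# describe() reports count, mean, std, min, quartiles and max for each group
summary = sample.groupby('sex', observed=True)['tip'].describe()
print(summary[['count', 'mean', '25%', '50%', '75%']])
```

With the real dataset loaded, replacing `sample` with `tips` gives the medians and quartiles that the inner box plots of the two violins display.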
Compare violin plots of different sub-categories
Previously, we split the data into two groups based on a categorical variable. Let us further split these groups into sub-groups by introducing a third variable.
In the next example, the column ‘smoker’ is passed to the ‘hue’ parameter, which further categorizes the data and creates two side-by-side violin plots for each group, Male and Female. Each sub-group is drawn as a separately colored violin, and a legend identifies the sub-categories. Let us use the violin plots to see who tips better.
sns.violinplot(data=tips, x='tip', y='sex', hue='smoker')
plt.title('Tips')
plt.show()
In the plot above, we notice that the Male smoker group has a longer-tailed distribution than any other group. Comparing the violin plots within each category, we see that smokers have longer-tailed distributions than non-smokers.
Split violin plots
Violin plots are useful for comparing distributions across different categories. By setting the parameter ‘split’ to True, we can draw half a violin for each of two sub-categories and join the halves, so that the two groups can be compared within the same violin. Here the hue variable ‘smoker’ is binary: it has two possible values, ‘Yes’ (smoker) and ‘No’ (non-smoker). Each side of the violin shows the distribution of one of these sub-categories, which makes it easy to compare the widths of the two distributions directly.
sns.violinplot(data=tips, x='tip', y='sex', hue='smoker', split=True)
plt.title('Tips')
plt.show()
The box plot inside the first violin displays the distribution of the whole Male group; it does not display the distribution of each sub-group separately. Similarly, the box plot inside the second violin displays the distribution of the whole Female group. It can be seen from the plot that, in both the Male and Female categories, smokers have a long-tailed distribution.
Customizing the violin plots
Instead of displaying a box plot inside the density plot, it is possible to display only the quartile positions as vertical lines by setting the parameter ‘inner’ to ‘quartile’. This reduces visual noise. The violin plot below marks the quartiles Q1, Q2 (median) and Q3 with vertical lines at the respective positions.
sns.set(style="whitegrid")
sns.violinplot(data=tips, x='tip', y='day', hue='sex', inner='quartile', split=True)
plt.title('Tips')
plt.show()
It is also possible to display every data point in the distribution by setting the parameter ‘inner’ to ‘points’ (dots) or ‘sticks’ (vertical lines).
sns.set(style="whitegrid")
sns.violinplot(data=tips, x='tip', y='day', hue='sex', inner='points', split=True)
plt.title('Tips')
plt.show()
sns.set(style="whitegrid")
sns.violinplot(data=tips, x='tip', y='day', hue='sex', inner='sticks', split=True)
plt.title('Tips')
plt.show()
Passing None (the Python value, not the string ‘None’) to the parameter ‘inner’ draws unadorned violins. If no value is passed, a box plot is displayed by default.
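As an illustration of the last point, the sketch below draws the same data twice: once with inner=None and once with the default inner box. It assumes a synthetic stand-in for the tips data (the `demo` frame and its values are made up) so that it runs without downloading anything.

```python
import numpy as np
import pandas as pd
import seaborn as sns
from matplotlib import pyplot as plt

rng = np.random.default_rng(1)
# Synthetic stand-in for the tips data (made-up values); swap in `tips` if it is loaded
demo = pd.DataFrame({
    'day': np.repeat(['Thur', 'Fri', 'Sat', 'Sun'], 50),
    'tip': rng.gamma(shape=4.0, scale=0.75, size=200),
})

sns.set(style="whitegrid")
fig, axes = plt.subplots(1, 2, sharey=True, figsize=(9, 4))
# Left: outline only, nothing drawn inside the violins
sns.violinplot(data=demo, x='tip', y='day', inner=None, ax=axes[0])
# Right: no value passed for inner, so the default box plot is drawn
sns.violinplot(data=demo, x='tip', y='day', ax=axes[1])
axes[0].set_title("inner=None")
axes[1].set_title("default inner")
fig.tight_layout()
# Call plt.show() to display the figure in an interactive session
```

Unadorned violins are a reasonable choice when only the shapes of the distributions are being compared and the quartiles are not needed.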