Table of Contents
- Introduction to Joint plots
- Create Joint plots using the jointplot() function
Introduction to Joint plots
In the previous tutorial, we discussed distribution plots, which consist of two plots, one superimposed on the other. A joint plot comprises three plots. The main plot is a bivariate graph that shows how the dependent variable (Y) varies with the independent variable (X). A second plot sits horizontally above the bivariate graph and shows the distribution of the independent variable (X). The third plot sits vertically on the right margin of the bivariate graph and shows the distribution of the dependent variable (Y). Having univariate and bivariate plots together in one figure is very helpful: univariate analysis focuses on a single variable, describing and summarizing it and showing any patterns in it, while bivariate analysis explores the relationship between two variables and the strength of that relationship. By default, the jointplot() function in the Seaborn library creates a scatter plot with two histograms, one on the top margin and one on the right margin of the graph. Let us load the ‘tips’ dataset and pass the column data to the jointplot() function for our analysis.
seaborn.jointplot(*, x=None, y=None, data=None, kind='scatter', color=None, height=6, ratio=5, space=0.2, dropna=False, xlim=None, ylim=None, marginal_ticks=False, joint_kws=None, marginal_kws=None, hue=None, palette=None, hue_order=None, hue_norm=None, **kwargs)
Create Joint plots using the jointplot() function
import seaborn as sns
from matplotlib import pyplot as plt
tips = sns.load_dataset('tips')
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4
The above plot displays a scatterplot with two histograms at the margins of the graph. The scatterplot suggests a positive relationship between the columns ‘total_bill’ and ‘tip’: as the values of one variable increase, so do the values of the other. The strength of the relationship appears to be moderate because the points are fairly widely scattered. Both marginal histograms are right-skewed: most values are concentrated on the left side of the distribution, while the right tail is longer. Outliers are data points that lie far away from the rest of the data values; in the graph, outliers are visible in the scatterplot as well as in the histograms.
Scatterplot with color dimension
In the above plot, the data points for smokers and non-smokers are plotted in different colors by setting the ‘hue’ parameter to the column ‘smoker’. The two levels of the column ‘smoker’ can now be easily distinguished. On the margins, density plots are drawn instead of histograms, showing the data distribution for each level of the hue variable separately.
Kernel density plots in a Jointplot
By default, jointplot() creates a scatterplot with two marginal histograms. Other plots can be drawn on the main axes by setting the parameter ‘kind’ to ‘scatter’, ‘kde’, ‘hex’, ‘hist’, ‘reg’ or ‘resid’. In the above function, ‘kind’ is set to ‘kde’, so the jointplot displays bivariate density contours on the main plot and univariate density curves on the margins. Notice also that the density curves for the two levels of the hue variable are plotted in different colors.
A regression line, or “line of best fit”, gives a visual presentation of the relationship between a dependent variable and one or more independent variables. The line is drawn so that it lies as close as possible to all the data points. The line is described by an equation, and with this equation we can predict the dependent variable for different values of the independent variable. The jointplot() function above is called with the parameter ‘kind’ set to ‘reg’, so a regression line is drawn on the scatter plot. The regression line also helps identify outliers: they are the points that lie farthest from the line. The scatterplot above has a few outliers.
Hexagonal bin plotting
One of the drawbacks of scatterplots is overplotting. When you have large volumes of data to plot, the data points overlap with each other, which makes the plot difficult to interpret. Overplotting can be overcome by grouping the values into ranges, which is known as binning. First, the entire plot is divided into a grid of bins. The grid cells can have different shapes, such as triangles, squares or hexagons. Every bin in the plot represents an interval and contains all the data points within the defined x and y value ranges. With hexagonal bins, the number of points falling into each bin is counted and the hexagons are colored based on a color gradient. Darker colors indicate that the data points are concentrated in that region, while white bins indicate that there is no data.
The ‘kind’ parameter is set to ‘hex’ in the above function call and the jointplot displays the relation between the columns ‘total_bill’ and ‘tip’ using hexagonal bins. The data points are binned into the hexagons and the distribution is displayed using different colors.
The ‘kind’ parameter is set to ‘hist’ in the above function call and the jointplot displays a two-dimensional histogram. A two-dimensional histogram shows the distribution of data for two continuous numeric variables. In a one-dimensional histogram, the height of each bar represents the count. In a two-dimensional histogram, every bin covers an interval on both axes and holds the number of observations whose values fall within both intervals. Notice that the main plot is filled with square bins that are colored based on a color gradient.
Change layout of the plot
You can adjust the size of the figure, the ratio of the joint axes height to the marginal axes height, and the space between the joint and marginal axes by passing numeric values to the ‘height’, ‘ratio’ and ‘space’ parameters, as shown in the above function call.