Introduction to Seaborn

Seaborn is a data visualization library that provides a high-level interface for drawing statistical graphics. It is built on top of Matplotlib, Python’s core visualization library, and uses Matplotlib internally to draw its plots, so it complements Matplotlib rather than replacing it. Matplotlib is highly customizable, but it can be hard to know which settings to tweak to produce attractive plots. Seaborn addresses this with a number of built-in themes and a high-level interface for controlling the look of Matplotlib figures.

Its preset styles and color palettes make it possible to create aesthetically pleasing charts with a few lines of code. It is also closely integrated with the Pandas and NumPy libraries.

Below are the dependencies of the Seaborn library:

  • Python 3.6+
  • numpy (>= 1.13.3)
  • scipy (>= 1.0.1)
  • pandas (>= 0.22.0)
  • matplotlib (>= 2.1.2)

Once the required dependencies are installed, you are ready to install and use Seaborn.

The latest version of Seaborn can be installed using pip with the command: pip install seaborn

You can also install Seaborn from the Anaconda prompt with the command: conda install seaborn
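After installing, a quick way to confirm that Seaborn and its dependencies are visible to your interpreter is to print their installed versions. The sketch below is only an illustration: the helper name installed_versions is made up for this example, and it relies on importlib.metadata, which assumes Python 3.8 or later.

```python
from importlib import metadata


def installed_versions(packages):
    """Return {package name: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# The package list mirrors the dependency list above.
versions = installed_versions(["numpy", "scipy", "pandas", "matplotlib", "seaborn"])
for name, version in versions.items():
    print(f"{name}: {version or 'not installed'}")
```

Any package reported as not installed can then be added with pip or conda before moving on.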

Seaborn is closely integrated with Pandas data structures. The Pandas library has two primary containers of data – DataFrame and Series.

DataFrames – A DataFrame is a collection of data arranged in rows and columns, similar to an Excel spreadsheet. It is a two-dimensional structure with two axes, the “index” axis and the “columns” axis.

Series – A Series is a single column of a DataFrame, so a Pandas DataFrame can be viewed as a collection of Series objects.
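The relationship between the two containers can be sketched with a tiny hand-made table. The values below are illustrative and merely resemble the kind of data used later in this article.

```python
import pandas as pd

# A small, made-up table (illustrative values only).
df = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01],
    "tip": [1.01, 1.66, 3.50],
    "smoker": ["No", "No", "Yes"],
})

print(df.shape)          # (number of rows, number of columns)
print(list(df.index))    # labels on the "index" axis
print(list(df.columns))  # labels on the "columns" axis

# Selecting a single column returns a Series that shares the DataFrame's index.
tip = df["tip"]
print(type(tip).__name__)
```

Seaborn functions accept either form: column names together with a DataFrame, or the Series objects themselves.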

Basic Terms

Quantitative and Qualitative variables

In statistics, two types of variables are used: quantitative and qualitative variables.

Quantitative: Quantitative variables are numerical values representing counts or measures. Examples: temperature, counts, percentages, weight. Quantitative variables are of two types, discrete and continuous:

Discrete variables are numeric variables that have a finite number of values between any two values. A discrete variable is always numeric.

Continuous variables are numeric variables that have an infinite number of values between any two values.

Qualitative: Qualitative variables are variables that can be placed into distinct categories according to some attributes or characteristics. They contain a finite number of categories or distinct groups. Examples: Gender, eye color.
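In a Pandas DataFrame, the column dtype is a reasonable first hint about which kind of variable a column holds. The sketch below uses made-up columns and treats numeric dtypes as quantitative and everything else as qualitative. This is only a heuristic: a numeric code such as a postal code is still a qualitative variable.

```python
import pandas as pd

df = pd.DataFrame({
    "party_size": [2, 3, 4, 2],                           # quantitative, discrete (a count)
    "total_bill": [16.99, 10.34, 21.01, 23.68],           # quantitative, continuous (a measure)
    "day": pd.Categorical(["Sun", "Sun", "Sat", "Sat"]),  # qualitative
})

# Split the columns by dtype.
quantitative = df.select_dtypes(include="number").columns.tolist()
qualitative = df.select_dtypes(exclude="number").columns.tolist()

print(quantitative)
print(qualitative)
print(df["day"].cat.categories.tolist())  # the finite set of categories
```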

Univariate and Bivariate Data

Statistical data are classified according to the number of variables being studied.

Univariate data: This type of data consists of only one variable. The variable is studied individually and we don’t look at more than one variable at a time.

Bivariate data: This type of data involves two different variables, where the two variables are studied to explore the relationship or association between them.
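As a rough illustration of the difference, a univariate summary describes one column on its own, while a bivariate summary describes how two columns vary together. The numbers below are made up for the example.

```python
import pandas as pd

df = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0, 40.0],
    "tip": [1.0, 2.5, 3.0, 5.0],
})

# Univariate: summarize a single variable.
mean_tip = df["tip"].mean()
print(mean_tip)

# Bivariate: measure the association between two variables (Pearson correlation).
r = df["total_bill"].corr(df["tip"])
print(round(r, 3))
```

A correlation close to 1 indicates that larger bills tend to come with larger tips in this toy data.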

Loading Datasets

Seaborn provides access to a few sample datasets that can be used for practice. The datasets are not bundled with the library: the load_dataset() function downloads the requested dataset from Seaborn's online data repository the first time it is called, so an internet connection is required, and then caches it locally. By default, the datasets are loaded as Pandas DataFrames. Let us load the ‘tips’ dataset, which consists of the tips received by a waiter in a restaurant over a period of time.

from matplotlib import pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns

tips = sns.load_dataset('tips')

tips.head(10)

	total_bill	tip	sex	smoker	day	time	size
0	16.99	1.01	Female	No	Sun	Dinner	2
1	10.34	1.66	Male	No	Sun	Dinner	3
2	21.01	3.50	Male	No	Sun	Dinner	3
3	23.68	3.31	Male	No	Sun	Dinner	2
4	24.59	3.61	Female	No	Sun	Dinner	4
5	25.29	4.71	Male	No	Sun	Dinner	4
6	8.77	2.00	Male	No	Sun	Dinner	2
7	26.88	3.12	Male	No	Sun	Dinner	4
8	15.04	1.96	Male	No	Sun	Dinner	2
9	14.78	3.23	Male	No	Sun	Dinner	2

Bar Plot

A bar plot is a visual tool that uses bars to compare data among different groups or categories. The graph represents categories on one axis and a numeric value on the other. Let us first draw a bar plot using the bar() function defined in Matplotlib's pyplot module, with input from the tips dataset.

# The Series are passed directly, so no data argument is needed
plt.bar(x=tips['smoker'], height=tips['tip'])
plt.xlabel('smoker')
plt.ylabel('tip')
plt.title('Tips/Smoker')
plt.show()

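The bar plot above is drawn with Matplotlib and compares the tips received from two groups, smokers and non-smokers. The smoker column is a qualitative variable. Each argument is passed as a Series taken from the DataFrame. Note that plt.bar() does not aggregate the data: it draws one bar per row, and bars in the same category overlap, so the visible height of each bar is the largest tip in that group.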
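To compare group averages in Matplotlib, the data can be aggregated before plotting. A minimal sketch, using a small made-up DataFrame in place of the tips data so that it runs without downloading anything:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values standing in for the tips dataset.
df = pd.DataFrame({
    "smoker": ["Yes", "Yes", "No", "No", "No"],
    "tip": [1.0, 3.0, 2.0, 4.0, 6.0],
})

# Aggregate first: one mean tip per category.
mean_tip = df.groupby("smoker")["tip"].mean()

fig, ax = plt.subplots()
bars = ax.bar(mean_tip.index, mean_tip.values)
ax.set_xlabel("smoker")
ax.set_ylabel("mean tip")
ax.set_title("Mean tip by smoker")

# Each bar now has the height of its group's mean.
heights = [bar.get_height() for bar in bars]
print(dict(zip(mean_tip.index, heights)))
```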

Let us now draw the same comparison with the barplot() function of the Seaborn library, as shown below.

sns.barplot(x='smoker',y='tip',data=tips)
plt.title('Tips vs Smoker')
plt.show()


The bar plot above is drawn with Seaborn. We pass the column names to the x and y parameters of barplot() and the DataFrame to the data parameter. As in the previous example, the bars compare the tips received from two groups, smokers and non-smokers, but here each bar shows the mean tip of its group. Notice how the bars are displayed in different colors and how the axis labels are taken from the input data. We can add custom labels by calling the set_xlabel() and set_ylabel() methods on the axes object, or by using the xlabel() and ylabel() functions defined in Matplotlib's pyplot module.

fig1, axes1 = plt.subplots()
# errcolor and errwidth style the error bars (seaborn 0.13+ deprecates them in favor of err_kws)
sns.barplot(x='smoker', y='total_bill', data=tips, hue='sex',
            estimator=np.sum, errcolor='r', errwidth=0.75,
            capsize=0.2, ax=axes1)
axes1.set_xlabel('Smoker - Yes/No')
axes1.set_ylabel('Bill amount')
axes1.set_title('Total bill vs Smoker')
plt.show()

We can also create a figure with multiple axes and render each plot onto a specific axes by using the ‘ax’ argument. If no value is given for this argument, the plot is rendered on the current axes.
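A minimal sketch of drawing two bar plots side by side on one figure follows. It uses a small made-up DataFrame rather than the tips data, and the variable names ax_left and ax_right are chosen only for this example.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; omit this line in a notebook
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up values standing in for the tips dataset.
df = pd.DataFrame({
    "smoker": ["Yes", "Yes", "No", "No"],
    "tip": [1.0, 3.0, 2.0, 4.0],
    "total_bill": [10.0, 30.0, 20.0, 40.0],
})

# One figure, two axes arranged in a single row.
fig, (ax_left, ax_right) = plt.subplots(1, 2, figsize=(8, 3))

# The ax argument decides which axes each plot is drawn on.
sns.barplot(x="smoker", y="tip", data=df, ax=ax_left)
sns.barplot(x="smoker", y="total_bill", data=df, ax=ax_right)

ax_left.set_title("Mean tip")
ax_right.set_title("Mean total bill")
fig.tight_layout()
```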

The ‘hue’ parameter can be used to show information about the different sub-groups in a category. In the above example, ‘hue’ is set to the column ‘sex’, which further splits the data and produces two side-by-side bars for each category. Each sub-group is drawn as a separately colored bar, and a legend is added to identify the sub-groups.

The ‘estimator’ argument can be used to change how the data is aggregated. By default, each bar of a barplot displays the mean (np.mean) of the variable within its group. The estimator argument changes this behaviour: it can receive a function such as np.sum, len, np.median or any other statistical function.
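One way to picture what hue and estimator do together is as a Pandas group-by: the rows are grouped by the x column and the hue column, and the estimator is applied to the y values of each group. The sketch below computes those bar heights with plain Pandas on made-up data. It is meant as a mental model, not as Seaborn's internal code.

```python
import pandas as pd

# Made-up values standing in for the tips dataset.
df = pd.DataFrame({
    "smoker": ["Yes", "Yes", "Yes", "No", "No", "No"],
    "sex": ["Male", "Female", "Male", "Male", "Female", "Female"],
    "total_bill": [10.0, 20.0, 30.0, 40.0, 50.0, 60.0],
})

# One group per (x category, hue category) pair, i.e. one group per bar.
grouped = df.groupby(["smoker", "sex"])["total_bill"]

heights_mean = grouped.mean()   # corresponds to the default estimator
heights_sum = grouped.sum()     # corresponds to estimator=np.sum
heights_count = grouped.size()  # corresponds to estimator=len

print(heights_sum)
```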

The red, cap-tipped lines that extend from the top of the bars are called error bars, and they add a layer of detail to the plotted data. Error bars indicate the estimated error or uncertainty in a plotted value. A short error bar shows that the values are concentrated, so the plotted estimate is more reliable, while a long error bar indicates that the values are more spread out and the estimate is less reliable.
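By default, Seaborn's barplot draws each error bar as a 95% confidence interval for the estimate, obtained by bootstrapping (resampling the data with replacement). The sketch below approximates that idea with NumPy on made-up values. The function name bootstrap_ci and its parameters are hypothetical, and this is not Seaborn's own implementation.

```python
import numpy as np


def bootstrap_ci(values, estimator=np.mean, n_boot=1000, level=95, seed=0):
    """Percentile bootstrap confidence interval for estimator(values)."""
    rng = np.random.default_rng(seed)
    values = np.asarray(values, dtype=float)
    # Resample with replacement and recompute the estimate for each resample.
    samples = rng.choice(values, size=(n_boot, values.size), replace=True)
    estimates = np.apply_along_axis(estimator, 1, samples)
    tail = (100 - level) / 2
    low, high = np.percentile(estimates, [tail, 100 - tail])
    return low, high


# Two made-up groups with the same mean but different spread.
tight = [2.9, 3.0, 3.1, 3.0, 2.95, 3.05]
spread = [0.5, 6.0, 1.0, 5.5, 2.0, 3.0]

tight_low, tight_high = bootstrap_ci(tight)
spread_low, spread_high = bootstrap_ci(spread)

# The more spread-out group gets the longer interval, hence the longer error bar.
print(tight_high - tight_low, spread_high - spread_low)
```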
