Histogram



A frequency distribution is a table that shows how many times each distinct value occurs in a dataset. A histogram displays the frequency distribution of a variable visually, by showing the number of data points that fall within each range of values, which makes it especially useful for analysing large datasets. As in a bar graph, the data in a histogram are represented by vertical bars or rectangles. The difference is that the bars in a bar graph are usually separated, whereas the bars in a histogram are adjacent to each other.

Suppose, for example, that you are conducting an experiment and want to represent its outcome visually. In this experiment you roll two dice 1000 times and record the outcome of each roll (the sum of the two dice) by appending it to a list.

outcomes = [2,5,6,3,3,2,4,2,6,11,12,10,7,8,9,12,11,…]

It is difficult to see any pattern in the outcomes by reading the raw list. We can visualize the pattern by generating a histogram that shows how often each sum of the two dice occurs. Histograms are useful for displaying the pattern in your data and getting an idea of the frequency distribution of the variable. To plot a histogram, the entire range of the input dataset is split into equal-sized groups, or bins. A bar is drawn for each bin, with its height proportional to the number of input values that fall in that bin.
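The dice experiment described above can be sketched with the standard library alone. This is a minimal illustration, not the code used for this post: the function name roll_two_dice and the fixed seed are assumptions made for the example, and the histogram is rendered as text so that no plotting library is needed.

```python
import random
from collections import Counter

def roll_two_dice(n_rolls, seed=0):
    """Return a list holding the sum of two fair dice for each roll."""
    rng = random.Random(seed)  # seeded so that the run is repeatable
    return [rng.randint(1, 6) + rng.randint(1, 6) for _ in range(n_rolls)]

outcomes = roll_two_dice(1000)

# Frequency distribution: maps each sum (2 to 12) to the number of times it occurred.
frequency = Counter(outcomes)

# Text rendering of the histogram: one '#' for every 10 occurrences.
for total in range(2, 13):
    bar = '#' * (frequency[total] // 10)
    print(f"{total:2d} {bar} ({frequency[total]})")
```

Because each possible sum gets its own bin here, the bar lengths should peak around 7 and taper off towards 2 and 12.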

Plot a Histogram with random numbers

The histogram below is plotted from random numbers using the ‘hist’ function defined in the pyplot module. The rand function in NumPy's random module creates an array of the specified shape and fills it with random numbers drawn uniformly from 0 (included) to 1 (excluded).

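# Jupyter/IPython magic; omit this line when running as a plain Python script
%matplotlib inline
import numpy as np
from matplotlib import pyplot as plt

# 1000 random numbers drawn uniformly from [0, 1)
input_data = np.random.rand(10**3)

# Distribute the values over 50 equal-width bins
plt.hist(input_data, bins=50, color='r', alpha=0.4)
plt.title('Histogram')
plt.xlabel('bins')
plt.ylabel('frequency')
plt.show()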

In the above example, the rand function generates 1000 random numbers and the ‘hist’ function distributes them across 50 bins. The histogram shows that some bins contain more values than others: although the numbers are drawn uniformly, 1000 values spread over 50 bins average only 20 per bin, so the counts fluctuate noticeably. You can generate 10^4, 10^5 or 10^6 random numbers and see how the distribution changes.
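To see the effect of the sample size without plotting, the bin counts can be computed directly. The sketch below uses Python's random module rather than NumPy, and bin_counts is a hypothetical helper written for this illustration. It mimics equal-width binning on the interval from 0 to 1.

```python
import random

def bin_counts(values, n_bins):
    """Count how many values from [0, 1) fall into each of n_bins equal-width bins."""
    counts = [0] * n_bins
    for v in values:
        # min() guards against floating-point rounding at the upper edge
        index = min(int(v * n_bins), n_bins - 1)
        counts[index] += 1
    return counts

rng = random.Random(42)  # fixed seed, chosen arbitrarily for repeatability
for n in (10**3, 10**4, 10**5):
    counts = bin_counts([rng.random() for _ in range(n)], 50)
    expected = n / 50
    spread = (max(counts) - min(counts)) / expected
    print(f"n={n:>6}: expected per bin={expected:.0f}, "
          f"min={min(counts)}, max={max(counts)}, relative spread={spread:.2f}")
```

The relative spread between the fullest and emptiest bin should shrink as n grows, which is why the histogram looks flatter for larger samples.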

Plot a Histogram to analyze Airline On-time performance

The U.S. Department of Transportation’s (DOT) Bureau of Transportation Statistics (BTS) releases a summary of statistics and basic analysis on airline performance each month. The dataset summarizes, for different air carriers, their departure delays, arrival delays, scheduled departure times, etc. Let us analyse the flight data released by BTS. For this example, I have downloaded data from the following website (https://transtats.bts.gov/ONTIME/Departures.aspx) into a csv file. The data covers American Airlines departures from JFK International Airport during January 2019.

Let us plot a histogram that shows the distribution of departure delays (in minutes) for all flights. The departure delay is calculated as the difference in minutes between the scheduled and actual departure times. In the input dataset, early departures are represented as negative numbers and on-time departures as zero.
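The sign convention described above can be illustrated with a small helper. The dataset already contains the delay as a ready-made column, so this function is not part of the analysis. The name departure_delay_minutes and the sample times are made up for the example.

```python
from datetime import datetime

def departure_delay_minutes(scheduled, actual):
    """Return actual minus scheduled departure time in whole minutes.

    Negative values mean an early departure; zero means on time.
    """
    return int((actual - scheduled).total_seconds() // 60)

scheduled = datetime(2019, 1, 15, 8, 30)  # hypothetical scheduled departure

print(departure_delay_minutes(scheduled, datetime(2019, 1, 15, 8, 25)))  # -5 (early)
print(departure_delay_minutes(scheduled, datetime(2019, 1, 15, 8, 30)))  # 0 (on time)
print(departure_delay_minutes(scheduled, datetime(2019, 1, 15, 9, 12)))  # 42 (late)
```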

import csv
import numpy as np
from matplotlib import pyplot as plt

# Read the departure delay (in minutes) from column 5 of each data row
with open(r'C:\Users\Ajay Tech\Documents\training\visualization\Data\flight_delay_american.csv') as input_file1:
    csv_file = csv.reader(input_file1)
    header = next(csv_file)  # skip the header row
    delay_min = []
    for row in csv_file:
        delay_min.append(int(row[5]))

# Explicit bin edges: 50-minute-wide bins from -50 to 750
bins = [-50,0,50,100,150,200,250,300,350,400,450,500,550,600,650,700,750]
plt.hist(delay_min, bins=bins, log=True, color='c')  # log=True puts the counts on a log scale
# Mark the mean delay with a dashed vertical line
plt.axvline(np.mean(delay_min), color='r', linestyle='dashed', linewidth=1)
plt.title('Histogram of Departure Delays(AA)')
plt.xlabel('Delay(min)')
plt.ylabel('No of flights')
plt.xticks(bins, rotation=30)
plt.show()

In the above script, the y-axis is set to a log scale instead of a linear scale (log=True), because a log scale lets us see the sparsely populated bins that would otherwise be barely visible next to the tallest bars. We have marked the average departure delay on the histogram with a vertical reference line drawn using the axvline function. The axvline function draws a vertical line at a given x position, which can be used to highlight specific points on the histogram. The dashed vertical line indicates that, on average, American Airlines flights departing from JFK airport took off 7 minutes late in January 2019.
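A rough way to see why the log scale helps is to compare relative bar heights on both scales. The sketch below reuses the bin edges from the script above but runs on made-up delay values, not the BTS data. The height calculation is a simplification that assumes the log axis starts at a count of 1.

```python
import math
from bisect import bisect_right
from statistics import mean

# Bin edges from the script above; bin i covers [edges[i], edges[i + 1]).
edges = list(range(-50, 751, 50))

# Made-up delays in minutes, for illustration only (not BTS data).
delays = [-5] * 600 + [10] * 300 + [75] * 40 + [180] * 8 + [420] * 2

# Count the values in each bin; bisect_right finds the first edge greater than d.
counts = [0] * (len(edges) - 1)
for d in delays:
    counts[bisect_right(edges, d) - 1] += 1

tallest = max(counts)
for i, c in enumerate(counts):
    if c:  # print only the non-empty bins
        linear = c / tallest
        log = math.log10(c) / math.log10(tallest)
        print(f"[{edges[i]:>4}, {edges[i + 1]:>4}): count={c:>3}  "
              f"linear height={linear:.3f}  log height={log:.3f}")

# The value that axvline would mark on the plot
print(f"mean delay = {mean(delays):.1f} min")
```

On the linear scale the bins holding only a handful of flights are a tiny fraction of the tallest bar, while on the log scale they remain clearly visible.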

Let us also see the performance of another carrier at JFK airport for the same period.

# Reuses csv, np and plt imported earlier
with open(r'C:\Users\Ajay Tech\Documents\training\visualization\Data\flight_delay_jetblue.csv') as input_file2:
    csv_file = csv.reader(input_file2)
    header = next(csv_file)  # skip the header row
    delay_min = []
    for row in csv_file:
        delay_min.append(int(row[5]))

# bins=25 lets matplotlib split the data range into 25 equal-width bins
plt.hist(delay_min, bins=25, log=True, color='b', alpha=0.3)
# Mark the mean delay with a dashed vertical line
plt.axvline(np.mean(delay_min), color='r', linestyle='dashed', linewidth=1)
plt.title('Histogram of Departure Delays(JB)')
plt.xlabel('Delay(min)')
plt.ylabel('No of flights')
plt.xticks(rotation=30)
plt.show()

The vertical line drawn using the axvline function indicates that the average departure delay for JetBlue Airways flights flying out of JFK was 14 minutes, twice the 7-minute average for American Airlines. In fact, JetBlue Airways was named as the most-delayed airline at JFK airport.
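Since both scripts read their file in the same way, the loading step could be factored into one function. The sketch below is one possible version, not the code used for this post. The name load_delays is hypothetical, skipping rows without an integer delay is an assumption about the file layout, and the demo runs on a tiny made-up file with placeholder columns.

```python
import csv
import os
import tempfile
from statistics import mean

def load_delays(csv_path, delay_column=5):
    """Return the departure delays (minutes) from one CSV file as a list of ints."""
    delays = []
    with open(csv_path, newline='') as input_file:
        reader = csv.reader(input_file)
        next(reader)  # skip the header row
        for row in reader:
            try:
                delays.append(int(row[delay_column]))
            except (IndexError, ValueError):
                continue  # assumption: skip rows without an integer delay
    return delays

# Demo on a tiny made-up file; the column names and values are placeholders.
demo_rows = [
    ["c0", "c1", "c2", "c3", "c4", "delay_min"],
    ["x", "x", "x", "x", "x", "-3"],
    ["x", "x", "x", "x", "x", "0"],
    ["x", "x", "x", "x", "x", "12"],
]
with tempfile.NamedTemporaryFile("w", newline="", suffix=".csv", delete=False) as tmp:
    csv.writer(tmp).writerows(demo_rows)
demo_delays = load_delays(tmp.name)
os.remove(tmp.name)

print(demo_delays, mean(demo_delays))
```

With the two files from this post, load_delays could be called once per carrier and each result passed to plt.hist with the same bins list, which would put both histograms on comparable axes.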
