Probability & Statistics

for Data Science, Machine Learning, Deep Learning and AI

  Machine Learning in Python

Table of Contents

Intro

What is Statistics

Statistics is probably THE MOST IMPORTANT tool in the hands of Data Scientists and Machine Learning experts. In fact, statisticians often refer to Machine Learning as Statistical Learning. Statistics started as a formal science almost three centuries ago and is a pretty broad subject. We are only interested in learning the specific aspects of it that relate to data science and machine learning. However, we have to understand that Machine Learning is an evolving field – statistics happens to be just one of its tools. Some say that ML is just glorified statistics; that is not true.

For example, if you look at some of the algorithms used in Machine Learning, like K-Nearest Neighbours ( KNN ), the significance of statistics ( at least in its classic sense ) is very minimal.

Why learn Statistics in Machine Learning

Every phase of Machine Learning requires some level of statistics. It is beyond the scope at this point to discuss how specifically statistics can be used in ML. However, as we progress through the course, you will see that the underlying mathematics of most of ML is still Statistics.

Our Approach to learning Statistics

You don’t need a Ph.D in statistics to be a Data Scientist or ML expert. In fact, when it comes to deep learning, the significance of statistics drops even further. However, statistics still remains the most useful tool in most areas of ML. So, the way we learn statistics would be limited to just the fundamentals of statistical methods as they apply to Machine Learning. Also, we will learn the relevant statistical methods in an incremental fashion. And instead of learning the basic statistical methods as they are, we will keep our approach data oriented. This will provide the context under which we are learning specific techniques.

Broadly, there are two branches in statistics

  • Descriptive
  • Inferential

Before we understand what each of these means, let’s define a couple of terms.

Key Terms

Before we even get to the basics of statistics, we have to understand a few key terms.

  • Population
  • Sample
  • Variables
  • Distribution

Population vs Sample

A population represents the entire data set. For example, if you have the data of all the grades in a particular class, that dataset is called the population. The word population can be confusing. It is natural for non-statisticians to think of a population as people or living beings. However, we are not talking about the people in the class – only the data points. You could equally be referring to a dataset of median house prices in a county or the ages of stars in a particular galaxy. As long as you refer to the entire dataset, it is called the population.

If you take a part out of it ( either randomly, sequentially or via other criteria ), it is called a sample. For example, say Uber provides the rating data for all the drivers in New York City. Say there are 40,000 drivers. The entire dataset is referred to as the population. If you want to consider a particular part of the dataset, say a subset of 10,000 drivers, it is called a sample.

In simple words

Population – The entire data set

Sample – A subset of the entire data set
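The distinction is easy to demonstrate in code. A minimal sketch, using simulated ratings for the 40,000-driver Uber example above ( the rating values themselves are made up for illustration ):

```python
import numpy as np

rng = np.random.default_rng(0)

# Population: ratings for all 40,000 drivers (simulated values)
population = rng.uniform(1.0, 5.0, size=40_000)

# Sample: 10,000 drivers drawn at random, without replacement
sample = rng.choice(population, size=10_000, replace=False)

print(len(population), len(sample))  # 40000 10000
```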

Variables

In the Uber example above, there are 2 variables.

  • Driver
  • Rating

There are 2 types of variables in general.

Continuous

Continuous variables are typically numbers ( integers or floats ). The point being, they are on a continuum. For example, ratings are a continuous variable. A rating of 4.5 is better than 3.5 – you can compare one rating with another numerically.

Categorical

On the other hand, suppose you had another variable, say Gender . What kind of variable would that be ? It can only take 2 values – Male or Female. Since there is only a fixed set ( categories ) of values it can take, these kinds of variables are called categorical variables.
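pandas makes this distinction explicit through column dtypes. A small hypothetical frame ( the values below are made up for illustration ):

```python
import pandas as pd

# Rating is continuous (numeric, comparable);
# Gender is categorical (fixed set of values)
df = pd.DataFrame({
    "Rating": [4.5, 3.5, 4.8],
    "Gender": pd.Categorical(["Male", "Female", "Male"]),
})

print(df.dtypes)  # Rating -> float64, Gender -> category
```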

Distribution

There are many ways to define a distribution. In layman’s terms, a distribution is basically a data pattern. For example, think of a subway owner in downtown Chicago. Most of his customers are office goers, so most of his business happens on weekdays. Mondays and Fridays are a bit slower, and business generally peaks on Wednesdays. Now, if you draw this pattern out with the ( average ) number of customers on each day of the week, you get a pattern like this.

What is the point in understanding this pattern ? A pattern is like a blue-print. For example,

  • if you understand the pattern of subway sales in downtown Chicago, you can very well understand the pattern in downtown New York, LA, SFO or any major city.
  • if a particular subway in downtown Boston does not fit this pattern, then something is wrong. So, deviation from a pattern can be used to spot anomalies.

Remember, this is just a pattern. The numbers need not match, but the pattern should. For example, look at the picture below – the average sales in New York and Chicago do not match, but the pattern does.

Visualize a Distribution

Now that we know what a distribution is, we have to define its characteristics. Let’s take the most well known distribution – the Gaussian Distribution . Many datasets tend to follow a Gaussian distribution. It goes by many names

  • Normal Distribution
  • Bell Curve

Say a group of 400 people are applying for admission to a college and their GRE scores are as follows.

%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np

# Read the scores file
scores = np.genfromtxt("../data/gre_scores.csv",delimiter=",")

# Plot the histogram
plt.hist(scores,bins=30)
plt.show()

Although this does not look very smooth, the distribution is essentially very close to a Gaussian distribution. We have taken the help of a package called matplotlib ( which we will explore later ) to plot this distribution. To understand the way we draw a histogram, let’s look at how this plot was drawn.

plt.hist(scores,bins=30)
(array([ 2.,  1.,  2.,  9., 11.,  8., 17., 10.,  8., 14., 19.,  6., 20.,
        30., 14., 24., 21.,  9., 25., 24., 19., 24., 18.,  9., 14., 10.,
         7.,  9.,  5., 11.]),
 array([290.        , 291.66666667, 293.33333333, 295.        ,
        296.66666667, 298.33333333, 300.        , 301.66666667,
        303.33333333, 305.        , 306.66666667, 308.33333333,
        310.        , 311.66666667, 313.33333333, 315.        ,
        316.66666667, 318.33333333, 320.        , 321.66666667,
        323.33333333, 325.        , 326.66666667, 328.33333333,
        330.        , 331.66666667, 333.33333333, 335.        ,
        336.66666667, 338.33333333, 340.        ]),
 <a list of 30 Patch objects>)

The first array represents the y-axis

array([ 2.,  1.,  2.,  9., 11.,  8., 17., 10.,  8., 14., 19.,  6., 20.,
        30., 14., 24., 21.,  9., 25., 24., 19., 24., 18.,  9., 14., 10.,
         7.,  9.,  5., 11.]),

and the second array represents the x-axis

array([290.        , 291.66666667, 293.33333333, 295.        ,
        296.66666667, 298.33333333, 300.        , 301.66666667,
        303.33333333, 305.        , 306.66666667, 308.33333333,
        310.        , 311.66666667, 313.33333333, 315.        ,
        316.66666667, 318.33333333, 320.        , 321.66666667,
        323.33333333, 325.        , 326.66666667, 328.33333333,
        330.        , 331.66666667, 333.33333333, 335.        ,
        336.66666667, 338.33333333, 340.        ]),

What has been done here is essentially to divide the 400 scores into 30 bins ( or buckets ). The edges of the bins are what is shown on the x-axis. The frequency of occurrence of the scores within each bin is what the y-axis represents.

For example,

  • how many scores are there between 290 and 291.66 ? – 2
  • how many between 291.66 and 293.33 ? – 1
  • and so on.

Once you plot all of them, you get a plot like the one shown above – it is called a histogram . We will learn more about histograms later.
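The binning that plt.hist performs can be reproduced with numpy’s histogram function, which returns the same two arrays without drawing anything. A quick sketch ( with simulated scores, since the CSV file itself is not reproduced here ):

```python
import numpy as np

rng = np.random.default_rng(42)
# Simulated stand-in for the 400 GRE scores
scores = rng.normal(316, 10, size=400).clip(290, 340)

counts, edges = np.histogram(scores, bins=30)

print(len(counts), len(edges))  # 30 bin counts, 31 bin edges
print(counts.sum())             # 400 - every score falls into exactly one bin
```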

Some kinds of data are not Gaussian. A classic example is US income. Just to keep things simple, here is a sample picture of US income by education level.

Image taken from Wikipedia

As you can see clearly, this data is not Gaussian by any measure. We will see more examples of different types of distributions later.

Define a Distribution

Central Tendency

Mean

One of the most basic properties of a dataset is its mean. The mean is a measure of the Central Tendency of the data. In layman’s terms, if you were to pick one point in a dataset that is representative of the entire dataset, it would be the mean. Also called the “average” in common vocabulary, the mean is really simple to calculate. Just add up all the values of a particular variable in the dataset and divide by the total number of values.

For example, if there are 5 Uber drivers with 5 different ratings, what is their mean ?
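For instance, with five hypothetical ratings ( the numbers below are made up ):

```python
# Five hypothetical driver ratings
ratings = [4.5, 3.9, 4.8, 4.2, 4.6]

# Mean = sum of the values / number of values
mean = sum(ratings) / len(ratings)
print(round(mean, 2))  # 4.4
```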

Similarly, you can calculate the mean of all 400 GRE scores we loaded above using numpy’s handy mean ( ) function.

mean = np.mean(scores)
mean

316.8075

Median

The mean does not always represent the typical value of the data. For normal distributions it typically does. However, for many other distributions, it does not. For example, take the US income distribution data. As you can see, the data is heavily skewed toward the high end ( from the income vs education plot above ). Here is another dataset of income vs the number of households in the US. Since the number of households is huge, the data has been bucketed into income brackets for easy analysis.

income_bracket    households
10-15k            5700000
15-20k            5620000
20-25k            5930000
25-30k            5500000
30-35k            5780000
35-40k            5340000
40-45k            5380000
45-50k            4730000
50-60k            9210000
60-75k           11900000
75-100k          14700000
100-125k         10300000
125-150k          6360000
150-200k          6920000
200k-plus         7600000
import matplotlib.pyplot as plt
import pandas as pd

income_data = pd.read_csv("../data/us_income_brackets.csv")
plt.bar(income_data["income_bracket"], income_data["households"])
plt.show()

As you can see from the picture above, the data is skewed. Now, let’s calculate the mean and median and you will know the difference.

income_data["households"] = pd.to_numeric(income_data["households"])
income_data["households"].mean()

7398000.0

income_data["households"].median()

5930000.0

As you can see, there is a difference between the mean of the households ( 7.3 M ) and the median of the households ( 5.9 M ). For skewed distributions like this, the median is more representative than the mean.

The way the median is calculated is also simple. Just order the values in increasing order and pick the middle element.

For example, in our case, take the households and arrange them in order.

households = income_data["households"]
households

0      5700000
1      5620000
2      5930000
3      5500000
4      5780000
5      5340000
6      5380000
7      4730000
8      9210000
9     11900000
10    14700000
11    10300000
12     6360000
13     6920000
14     7600000
Name: households, dtype: int64
households.sort_values()
7      4730000
5      5340000
6      5380000
3      5500000
1      5620000
0      5700000
4      5780000
2      5930000
12     6360000
13     6920000
14     7600000
8      9210000
11    10300000
9     11900000
10    14700000
Name: households, dtype: int64

There are 15 values and the middle value is the 8th one. So the median of the households is 5.93 M.
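The same result can be reproduced by hand from the household counts listed above – sort, then pick the middle ( 8th of 15 ) value:

```python
import numpy as np

# The household counts from the income-bracket table above
households = np.array([
    5700000, 5620000, 5930000, 5500000, 5780000,
    5340000, 5380000, 4730000, 9210000, 11900000,
    14700000, 10300000, 6360000, 6920000, 7600000,
])

middle = np.sort(households)[len(households) // 2]  # index 7 -> the 8th value
print(middle)                 # 5930000
print(np.median(households))  # numpy agrees: 5930000.0
```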

Mode

The mode is the most frequently occurring value in the dataset. If the dataset follows a Gaussian distribution, it is the peak of the histogram.
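In code, the mode can be found by counting occurrences – a sketch with a small made-up list of scores:

```python
from collections import Counter

# Hypothetical scores with one repeated value
scores = [300, 310, 310, 320, 330]

# most_common(1) returns the (value, count) pair with the highest count
mode, count = Counter(scores).most_common(1)[0]
print(mode, count)  # 310 occurs 2 times
```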

Mean vs Median vs Mode

The reason these terms all exist is that there is no single way to define the Central Tendency of a dataset. It depends on the nature of the distribution that the dataset conforms to. For example, if the dataset follows a normal distribution, then most of the time the mean, median and mode are pretty close together. The relationship is beautifully visualized in this Wikipedia diagram.

income_data.mode()
      income_bracket	households
0	10-15k	        4730000
1	100-125k	5340000
2	125-150k	5380000
3	15-20k	        5500000
4	150-200k	5620000
5	20-25k	        5700000
6	200k-plus	5780000
7	25-30k	        5930000
8	30-35k	        6360000
9	35-40k          6920000
10	40-45k	        7600000
11	45-50k	        9210000
12	50-60k	        10300000
13	60-75k	        11900000
14	75-100k	        14700000

Note that in the output above, every value occurs exactly once, so mode ( ) simply returns all the values of each column, sorted independently. For a normal distribution, the 3 measures ( mean, median, mode ) are pretty close together. For a skewed distribution, they are pretty staggered.

Spread

Standard Deviation

While the mean represents the “central” or “average” value of a dataset, the variance and standard deviation represent how spread out the data is. For example, look at the 3 histograms below. They represent histograms of GRE scores from 3 different groups.

import matplotlib.pyplot as plt
import numpy as np

# Read the scores file
scores = np.genfromtxt("../data/gre_scores_3_sets.csv",delimiter=",")

f, (ax1, ax2,ax3) = plt.subplots(1, 3, sharey=True)
ax1.set_xlim(200,400)
ax1.hist(scores[:,0],bins=300)
ax2.set_xlim(200,400)
ax2.hist(scores[:,1],bins=300)
ax3.set_xlim(200,400)
ax3.hist(scores[:,2],bins=300)
plt.show()

All 3 graphs represent 10K rows of GRE scores from 3 different groups. All 3 have the same mean – 300. However, they are different, right ? What I want you to focus on is the shape of the distribution, not the height. Specifically, look at the x-axis. In the first plot, the data is concentrated pretty much around the average ( 300 ) mark. In the second plot, it is a bit more spread out ( hence the reduction in height ), and the third plot is very spread out.

The green line represents the mean and the red line represents the Standard Deviation or σ ( the Greek letter sigma ). It is a measure of how spread out the distribution is. The more spread out the distribution, the flatter the bell curve.

You can also think of Standard Deviation as a measure of uncertainty in Data Science. Imagine drug trials working on reducing blood pressure. Say 3 different drugs each produce an average reduction of 30 points, but trial 1 has a very low standard deviation while trial 3 has a high one. Obviously, you would want to go with the first drug – because the third drug is more uncertain about the result.

import matplotlib.pyplot as plt
import numpy as np

# Read the scores file
scores = np.genfromtxt("../data/gre_scores_3_sets.csv",delimiter=",")

f, (ax1, ax2,ax3) = plt.subplots(1, 3, sharey=True)
ax1.set_xlim(200,400)
ax1.axvline(x=305, ymin=0, ymax = 800, linewidth=2, color='r')
ax1.axvline(x=300, ymin=0, ymax = 800, linewidth=2, color='g')
ax1.hist(scores[:,0],bins=300)
ax2.set_xlim(200,400)
ax2.axvline(x=315, ymin=0, ymax = 800, linewidth=2, color='r')
ax2.axvline(x=300, ymin=0, ymax = 800, linewidth=2, color='g')
ax2.hist(scores[:,1],bins=300)
ax3.set_xlim(200,400)
ax3.axvline(x=330, ymin=0, ymax = 800, linewidth=2, color='r')
ax3.axvline(x=300, ymin=0, ymax = 800, linewidth=2, color='g')
ax3.hist(scores[:,2],bins=300)
plt.show()

Now that we have seen what standard deviation is visually, calculating it is just as easy in numpy.

print ( scores )
np.std(scores,axis=0) # axis=0 to specify that we want standard deviation along columns.

[[296. 295. 293.]
 [301. 277. 324.]
 [299. 316. 311.]
 ...
 [299. 283. 286.]
 [301. 299. 272.]
 [298. 301. 285.]]

array([ 4.99548675, 15.01404265, 30.08496664])

Variance

Standard Deviation is actually derived from variance – it is the square root of the variance, so the variance is calculated first. However, to illustrate that variance or standard deviation represents spread on a plot, we looked at standard deviation first. We still have to understand how variance is calculated, because it is the basis for calculating the standard deviation .

Let’s calculate the variance of a small dataset by hand to understand the process better.

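The by-hand calculation can also be sketched in a few lines of Python ( the eight numbers below are a made-up example, not the course data ):

```python
import math

data = [2, 4, 4, 4, 5, 5, 7, 9]

# Step 1: the mean
mean = sum(data) / len(data)        # 5.0

# Step 2: squared deviations from the mean
sq_dev = [(x - mean) ** 2 for x in data]

# Step 3: variance = the average squared deviation (population variance)
variance = sum(sq_dev) / len(data)  # 4.0

# Step 4: standard deviation = square root of the variance
std = math.sqrt(variance)           # 2.0

print(variance, std)
```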

You can use numpy or pandas to calculate variance. Continuing with our scores data above,

print(scores)
np.var(scores,axis=0) ## axis=0 to specify that we want variance across the columns

[[296. 295. 293.]
 [301. 277. 324.]
 [299. 316. 311.]
 ...
 [299. 283. 286.]
 [301. 299. 272.]
 [298. 301. 285.]]

array([ 24.95488791, 225.42147676, 905.10521756])

What does Standard Deviation represent

Now that we know the basic definitions of Variance and Standard Deviation, let’s look at what they really represent.

How far from the mean

Wikipedia has a nice picture to show this in the context of Gaussian distribution.

In a normal distribution, a single standard deviation to one side of the mean covers roughly 34% of the values in the entire distribution. If you look at both the positive and negative deviations together ( mean ± 1 σ ), that is roughly 68% of the total values. In the context of GRE scores, if your score is 1 standard deviation above the mean, you have scored higher than roughly 84% of test takers ( the 50% below the mean plus the 34% within one deviation above it ). That is not a bad thing for a GRE score – you would not want just an average score, right ?
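The 68% figure is easy to verify empirically – a quick check with simulated normal data ( not the GRE file ):

```python
import numpy as np

rng = np.random.default_rng(0)
# Simulated scores: mean 300, standard deviation 10
scores = rng.normal(300, 10, size=100_000)

mean, sigma = scores.mean(), scores.std()

# Fraction of values within one standard deviation of the mean
within_1_sigma = np.mean(np.abs(scores - mean) < sigma)
print(round(within_1_sigma, 2))  # close to 0.68
```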

How spread out

Another way to look at the Standard Deviation is , how spread out the distribution is. Once again, there is a nice representation of this in Wikipedia.

The larger the Standard Deviation , the more spread out the distribution.

Why 2 different metrics to measure spread

Why are we using 2 different metrics ( Variance & Standard Deviation ) to measure the same property – spread ?

For one, Standard Deviation is calculated from the variance, which in turn is calculated from the mean. More importantly, variance is expressed in squared units of the data, while Standard Deviation is in the same units as the data itself. Since Standard Deviation is directly comparable with the data, it is used more often than variance.

Skewness

Look at the picture below ( courtesy of Wikipedia )

These are distributions that are skewed ( either to the right or to the left ) – meaning they are asymmetric compared to a normal distribution.

Skewness is a measure of that asymmetry . A very popular example of skewed data is income data. The following picture shows the income distribution of the US; almost all countries follow a similar pattern.

This is an example of a right-skewed distribution ( where the tail is more pronounced to the right ). Similarly, a left-skewed distribution would be the opposite ( long tail to the left ). To measure the skewness of a distribution, pandas has a function called skew ( ).

import pandas as pd

scores = pd.read_csv("../data/gre_scores_3_sets.csv")
scores.head()

        296	295	293
0	301	277	324
1	299	316	311
2	296	275	353
3	299	309	281
4	302	285	298
scores.skew()
296   -0.015089
295   -0.042408
293    0.037187
dtype: float64

skew ( ) calculates the skewness of all the numeric columns in the dataframe. ( Note that read_csv treated the first row of scores as the column header here, which is why the columns are named 296, 295 and 293 – passing header=None would avoid losing that row. ) You could also look at a specific column by selecting it like so.

scores.iloc[:,0].skew()

-0.015089

Kurtosis

Quite simply, kurtosis measures the sharpness of the peak ( and the heaviness of the tails ) of a probability density function. For example, look at the following picture from Wikipedia.

The higher the kurtosis, the sharper the peak. Once again, pandas has a function called kurtosis ( ) that measures the kurtosis of a dataset.

scores.kurtosis()
296   -0.011459
295    0.002555
293   -0.009964
dtype: float64

Branches of Statistics

Descriptive Statistics

When you are trying to describe or summarize data, the stats that you produce are typically called descriptive statistics. We have seen some examples of these

  • mean
  • standard deviation
  • skewness
  • kurtosis
  • defining the distribution ( for example as normal or not ) etc

These are all examples of descriptive statistics. Why use descriptive statistics ? Because raw numbers don’t mean much on their own. For example, when we looked at the GRE scores dataset above, the human brain couldn’t infer much from the raw numbers as-is. However, summary stats and pictures mean much more, as we have seen above. Many means of describing data can be used – pictures, summary statistics, tables, etc. As long as your intent is to describe the data in a more meaningful way, it is called descriptive statistics.
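pandas bundles the common descriptive statistics into a single describe ( ) call. A sketch on a small made-up series:

```python
import pandas as pd

# A small hypothetical set of GRE scores
scores = pd.Series([300, 310, 315, 320, 340])

summary = scores.describe()  # count, mean, std, min, quartiles, max
print(summary["mean"], summary["50%"])  # 317.0 315.0
```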

Inferential Statistics

If you wanted to describe the data of all the GRE scores in, say, India, is it possible ? Well, it might be, if you were the authority that administers the GRE. However, if you don’t have access to the entire dataset, how can you make any kind of inference about the entire set of GRE scores in India ?

You start with a Sample – say you pull out the scores of some of the students who have taken the exam, and you try to extrapolate the results across the entire scores dataset.

When you take a small sample dataset (from a population) and try to infer the characteristics of the larger dataset (the entire population), it is called Inferential Statistics

Let’s talk about some more examples to understand this.

Example 1 – Say you are a child psychologist trying to understand the effect of a particular diet on kids with a certain disorder. Obviously, you cannot test every kid in the country. So, you pick a sample – say 10 or 20 kids. To make the results more general, you have to ensure that the kids are well representative of the entire population of kids that have that particular disorder. So, you would ensure that the kids are

  • from geographically diverse areas of the country
  • from all different ethnic or racial backgrounds
  • from all different age groups etc

The key however is that you don’t work with the entire population. You just work with a sample, and assume that this works with the rest of the population as well.

Example 2 – Say you are a scientist at a drug company, trying to develop a drug for a certain type of disease. Once again, you cannot test every patient in the country. So, you enlist some patients and start trials on them. Some of them might show an effect and some might not. You cannot draw a solid conclusion from this. You switch the drug and start your trials again. This time, let’s say all of them showed positive results.

Now, that does not necessarily mean that the drug will work on the general population. You would have to expand your trial – say

  • start multiple trials with different groups of patients – to ensure the drug works on the general population as well.
  • start trials on placebo groups

The point being, you have to make the sample selection as exhaustive as possible.

Central Limit Theorem

That brings us to one of the most important theorems in statistics – the Central Limit Theorem . But before we explore it, it is better to understand another important result called the Law of Large Numbers .

Law of Large Numbers

It is a pretty simple law, and pretty intuitive as well. In simple words, the larger the sample size, the more accurate the result will be .

Sounds intuitive enough, doesn’t it ? The more observations you take, the closer your results will be to the ideal result. For example, we know that a fair coin has an equal chance of landing heads or tails – that is the ideal result. The more times you toss it, the closer the observed fraction of heads gets to 50%.
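The coin example can be simulated directly – as the number of tosses grows, the observed fraction of heads converges to the ideal 0.5:

```python
import numpy as np

rng = np.random.default_rng(1)

fractions = {}
for n in [10, 1_000, 100_000]:
    tosses = rng.integers(0, 2, size=n)  # 0 = tails, 1 = heads
    fractions[n] = tosses.mean()         # observed fraction of heads

print(fractions)  # the larger n is, the closer the fraction gets to 0.5
```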

Or take this example – A bunch of circles and squares.

The larger the sample size, the better a representative it is of the population itself.

import numpy as np

# population size
p_size = 1000

# sample size
s_size = 10

# Generate a random population
population = np.random.randn(p_size)

np.mean(population)
0.0375967463932297
import matplotlib.pyplot as plt
%matplotlib inline

plt.hist(population,bins=50)

(array([ 1.,  0.,  1.,  3.,  0.,  1.,  2., 10.,  7., 15., 12., 16., 17.,
        17., 35., 31., 36., 27., 31., 45., 46., 50., 48., 58., 56., 49.,
        43., 55., 35., 42., 30., 36., 26., 24., 10., 14., 15., 12., 12.,
         5.,  8.,  7.,  2.,  4.,  2.,  0.,  1.,  0.,  1.,  2.]),
 array([-2.95130726, -2.82617151, -2.70103575, -2.57589999, -2.45076424,
        -2.32562848, -2.20049273, -2.07535697, -1.95022122, -1.82508546,
        -1.69994971, -1.57481395, -1.44967819, -1.32454244, -1.19940668,
        -1.07427093, -0.94913517, -0.82399942, -0.69886366, -0.5737279 ,
        -0.44859215, -0.32345639, -0.19832064, -0.07318488,  0.05195087,
         0.17708663,  0.30222239,  0.42735814,  0.5524939 ,  0.67762965,
         0.80276541,  0.92790116,  1.05303692,  1.17817268,  1.30330843,
         1.42844419,  1.55357994,  1.6787157 ,  1.80385145,  1.92898721,
         2.05412296,  2.17925872,  2.30439448,  2.42953023,  2.55466599,
         2.67980174,  2.8049375 ,  2.93007325,  3.05520901,  3.18034477,
         3.30548052]),
 <a list of 50 Patch objects>)

This more or less represents a decent Gaussian distribution with a mean close to 0 ( 0.03759 ). Now, let’s draw samples from this population.

sample = np.random.choice(population,s_size,replace=False)
np.mean(sample)

0.2746352091713792
s_size = 25
sample = np.random.choice(population,s_size,replace=False)
np.mean(sample)

-0.15866852660180053
s_size = 50
sample = np.random.choice(population,s_size,replace=False)
np.mean(sample)

0.10483477943669493

See, as the sample size increases, the properties of the sample get closer and closer to those of the original population. This is the Law of Large Numbers. We can in fact plot the accuracy of the mean as the sample size increases.

# Let's define an array to hold the sample means for different sample sizes.
means = []

# Start from 1: np.mean of an empty sample is undefined (nan)
for size in range(1, p_size + 1) :
    sample = np.random.choice(population,size,replace=False)
    mean   = np.mean(sample)
    means.append(mean)

plt.plot(means)

See, as the sample size increases ( toward the right ), the mean of the sample gets more and more accurate.

Significance of Law of Large Numbers

The key takeaway for us is that when you sample a population, go for the biggest sample size possible. Of course, there are other things to consider when taking a sample, like it being

  • representative of the entire population
  • diverse
  • consistent etc

Later on, when we start preparing training and test datasets, we will understand how important random sampling is in picking samples from a population.

Central Limit Theorem

Sometimes people get confused between the Law of Large Numbers and the Central Limit Theorem . There is a very distinct difference between them. Let’s explore the Central Limit Theorem now.

We have seen some normal distributions so far ( also called Gaussian distributions – just remember the bell curve ). However, not all distributions are Gaussian. So, how can we apply statistical methods ( which show good results with normal distributions ) to these other distributions ?

That’s precisely where the Central Limit Theorem comes in. Say we have a large population, and we draw a bunch of samples from it. We know from the Law of Large Numbers that as the sample size increases, the sample resembles the properties of the population more closely.

What if, instead, we take a whole bunch of random samples from the population, all of the same size, and compute the mean of each one ? How do those means fare ?

For example, say there is a large population of coin tosses ( with Heads or Tails as outcomes ). If we repeatedly take samples ( say of size 10 ) and compute the mean of each sample, do those sample means form a pattern ? The Central Limit Theorem says they do – the sample means form a normal distribution. That is actually surprising, but true.

Let’s see it in action.

import numpy as np

# population size
p_size = 1000

# sample size
s_size = 10

# Generate a random population
population = np.random.randn(p_size)
np.mean(population)

-0.025421597275081576

Now that we know the population mean, let’s start pulling a bunch of random samples and see how their means fare.

# Let's define an array to hold the means of the different samples.
means = []

# The sample size is always a constant of 10
s_size = 10

# This time, we are not changing the sample size - we just take a whole bunch of samples

sample_count = 1000

for count in range(sample_count) :
    sample = np.random.choice(population,s_size,replace=False)
    mean   = np.mean(sample)
    means.append(mean)

# Now, plot a histogram
plt.hist(means,bins=50)
(array([ 1.,  0.,  0.,  1.,  2.,  0.,  1.,  2.,  5.,  7.,  4., 11., 12.,
        16., 16., 27., 26., 40., 40., 43., 35., 46., 54., 68., 44., 64.,
        45., 53., 47., 55., 37., 23., 28., 30., 24., 15., 15., 10., 19.,
         9.,  6.,  3.,  6.,  6.,  1.,  0.,  1.,  1.,  0.,  1.]),
 array([-1.14203098, -1.09789265, -1.05375433, -1.009616  , -0.96547767,
        -0.92133934, -0.87720101, -0.83306268, -0.78892435, -0.74478602,
        -0.7006477 , -0.65650937, -0.61237104, -0.56823271, -0.52409438,
        -0.47995605, -0.43581772, -0.39167939, -0.34754107, -0.30340274,
        -0.25926441, -0.21512608, -0.17098775, -0.12684942, -0.08271109,
        -0.03857277,  0.00556556,  0.04970389,  0.09384222,  0.13798055,
         0.18211888,  0.22625721,  0.27039554,  0.31453386,  0.35867219,
         0.40281052,  0.44694885,  0.49108718,  0.53522551,  0.57936384,
         0.62350217,  0.66764049,  0.71177882,  0.75591715,  0.80005548,
         0.84419381,  0.88833214,  0.93247047,  0.97660879,  1.02074712,
          1.06488545]),
 <a list of 50 Patch objects>)

See, that’s a histogram that closely resembles a Gaussian curve. That is what the Central Limit Theorem says. Well, of course, there are some assumptions as usual, like

  • All the samples should be equally random
  • None of the subsequent samples should be dependent on the previous samples

Significance of Central Limit Theorem

Not all distributions are normal, and we have to work with varied distributions in real life. However, thanks to the Central Limit Theorem, we know that the distribution of the means of samples drawn from any distribution is approximately normal.
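A sketch of this in action – the population below is exponential ( heavily right-skewed ), yet the means of repeated samples cluster tightly and symmetrically around the population mean:

```python
import numpy as np

rng = np.random.default_rng(0)

# A heavily right-skewed population (exponential, mean 1)
population = rng.exponential(1.0, size=100_000)

# Means of 2,000 samples of size 50 each
sample_means = np.array([
    rng.choice(population, size=50).mean() for _ in range(2_000)
])

# The sample means are far less spread out and centred on the population mean
print(round(population.mean(), 2), round(sample_means.mean(), 2))
```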

Correlation

Correlation is a measure of the relationship between 2 variables. For example, is there a correlation between the height and weight of an individual ? Actually, there is a good correlation – with outliers and exceptions, of course. Statistics is all about data and numbers, so when we say there is a correlation, we have to quantify it. There are multiple methods to quantify a correlation.

Here is a sample height-and-weight dataset that we have downloaded from Kaggle (https://www.kaggle.com/mustafaali96/weight-height)

import pandas as pd

height_weight = pd.read_csv("../data/weight-height.csv")
height_weight.head()

  Gender     Height      Weight
0   Male  73.847017  241.893563
1   Male  68.78190   162.310473
2   Male  74.110105  212.740856
3   Male  71.730978  220.042470
4   Male  69.881796  206.349801

Let’s plot the height vs weight to see if there is any correlation at all. The simplest plot to do this is a Scatter Plot

import matplotlib.pyplot as plt

plt.scatter(height_weight["Height"],height_weight["Weight"])

As you can see, there is a definite correlation between height and weight. As height increases, weight increases as well.

height_weight.iloc[:,1:3].corr()

        Height	        Weight
Height	1.000000	0.924756
Weight	0.924756	1.000000

Quantitatively, Correlation is defined as a value between -1 and 1.

  • Negative values signify a negative correlation ( one variable increases and the other decreases)
  • Positive values signify a positive correlation ( one variable increases and the other increases as well)
  • zero means that there is no correlation at all.

So, the correlation between Height and Weight in this case is 0.92 – meaning they are very positively correlated. Which is to be expected, right ? We know that as the height increases, the weight also increases.

There are multiple methods for calculating the correlation coefficient. By default, pandas uses the Pearson method. Of course, other methods ( such as Spearman and Kendall ) are also available.
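The Pearson coefficient itself is just the covariance of the two variables divided by the product of their standard deviations. A sketch with made-up height/weight pairs ( not the Kaggle file ):

```python
import numpy as np

# Hypothetical heights (inches) and weights (pounds)
height = np.array([63.0, 65.0, 67.0, 69.0, 71.0, 73.0])
weight = np.array([120.0, 140.0, 150.0, 170.0, 185.0, 210.0])

# Pearson r = covariance(x, y) / (std(x) * std(y))
cov = np.mean((height - height.mean()) * (weight - weight.mean()))
r = cov / (height.std() * weight.std())

print(round(r, 4))
print(round(np.corrcoef(height, weight)[0, 1], 4))  # numpy's built-in agrees
```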

Interquartile Range or IQR

The Interquartile Range or IQR is a good metric to describe the spread of the data. It is useful in understanding

  • Where does the bulk of data lie ?
  • Are there any outliers ?

Let’s take a simple dataset to understand this. Say a random set of 10 numbers between 1 and 100,

import numpy as np

data = np.random.randint(1,100,10)
data
array([63, 90, 54, 58, 53, 98,  9, 80, 80, 32])

Let’s explore how this data is distributed. Here are the steps we will be doing

  • Calculate median
  • Calculate Quartiles
  • Calculate IQR or Interquartile Range

Calculate Median

Sort the numbers and pick the middle number. If there is an even count of numbers, take the average of the middle two.

Use numpy’s median function to calculate the same programmatically.

data = np.array([79,24,87,77,37,69,58,17,65,17])
np.median(data)

61.5

Calculate Quartiles

quantile_1 = np.quantile(data,0.25,interpolation ="midpoint")
quantile_2 = np.quantile(data,0.5,interpolation ="midpoint")
quantile_3 = np.quantile(data,0.75,interpolation ="midpoint")

print ( quantile_1, quantile_2, quantile_3)
30.5 61.5 73.0
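The IQR itself is just the difference between the third and first quartiles. Continuing with the same data ( note that newer versions of numpy spell the keyword method= where older releases, as in the snippet above, used interpolation= ):

```python
import numpy as np

data = np.array([79, 24, 87, 77, 37, 69, 58, 17, 65, 17])

# Quartiles with midpoint interpolation, as above
q1 = np.quantile(data, 0.25, method="midpoint")
q3 = np.quantile(data, 0.75, method="midpoint")

iqr = q3 - q1
print(q1, q3, iqr)  # 30.5 73.0 42.5
```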

Boxplot

A box plot ( also called a box-and-whisker plot ) does exactly that. It shows the quartile ranges of the data.

import matplotlib.pyplot as plt
%matplotlib inline

plt.boxplot(data)
plt.savefig("boxplot.png")

What does IQR represent ?

The IQR ( the box in the plot ) represents the middle 50% of the data. If you flip the box plot sideways and compare it to a normal distribution, you will understand this better.

Outliers

Outliers are data points that lie far away from the rest of the data – often because something is incorrect, such as a data collection error. Outliers affect our interpretation of the data, and hence it is essential that they be identified ( and often removed ) to get a real sense of the data.

Let’s insert a couple of large numbers.

data_new = np.append(data,[150,180])

import matplotlib.pyplot as plt
%matplotlib inline

plt.boxplot(data_new)

Why is it that you only see one outlier ( 180 ) and not two ( 150 and 180 ) ?

quantile_1 = np.quantile(data_new,0.25,interpolation ="midpoint")
quantile_2 = np.quantile(data_new,0.5,interpolation ="midpoint")
quantile_3 = np.quantile(data_new,0.75,interpolation ="midpoint")
quantile_4 = np.quantile(data_new,1,interpolation ="midpoint")

print ( quantile_1, quantile_2, quantile_3,quantile_4)

30.5 67.0 83.0 180.0
outlier_upper = quantile_3 + 1.5 * ( quantile_3 - quantile_1)

outlier_upper

161.75

161.75 is greater than 150 but less than 180, so only 180 has been classified as an outlier. For normally distributed data, this 1.5 × IQR rule classifies anything beyond roughly 2.7 σ ( standard deviations ) from the mean as an outlier.


A boxplot is essentially a good visual way to

  • identify outliers
  • see where the bulk (IQR) of the data is