# Statistics

#### Central Tendency

### Mean

One of the most basic properties of a dataset is its mean. The mean is a measure of the **Central Tendency** of data. In layman's terms, if you were to pick one point that is representative of the entire dataset, it would be the mean. Also called the “average” in common vocabulary, the mean is simple to calculate: add up all the values of a particular variable in a dataset and divide the sum by the number of values.

For example, if 5 Uber drivers each have a rating, what is their mean rating?

```
from statistics import mean
ratings = [4.5, 3.9, 4.6, 4.8, 3.9]
mean(ratings)
```

```
4.34
```
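The add-and-divide recipe described above can also be written out by hand. The following is a minimal sketch using the same ratings; the names `total`, `count` and `manual_mean` are only illustrative.

```python
from math import isclose
from statistics import mean

ratings = [4.5, 3.9, 4.6, 4.8, 3.9]

# Add up all the values, then divide by how many there are.
total = sum(ratings)
count = len(ratings)
manual_mean = total / count

# Floating-point addition can introduce tiny rounding differences,
# so compare with a tolerance instead of ==.
print(round(manual_mean, 2))
print(isclose(manual_mean, mean(ratings)))
```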

### Median

The mean does not always represent the *typical* value of the data. For symmetric distributions such as the normal distribution, it usually does. For many other distributions, especially skewed ones, it does not. For example, take the US household income distribution data. Since the number of households is huge, the data has been bucketed into income brackets for easy analysis.

```
income_bracket households
10-15k 5700000
15-20k 5620000
20-25k 5930000
25-30k 5500000
30-35k 5780000
35-40k 5340000
40-45k 5380000
45-50k 4730000
50-60k 9210000
60-75k 11900000
75-100k 14700000
100-125k 10300000
125-150k 6360000
150-200k 6920000
200k-plus 7600000
```

```
from statistics import mean, median
# Number of households in each income bracket, taken from the table above
income = [5700000, 5620000, 5930000, 5500000, 5780000, 5340000, 5380000,
          4730000, 9210000, 11900000, 14700000, 10300000, 6360000, 6920000, 7600000]
print("mean =", mean(income))
print("median =", median(income))
```

```
mean = 7398000
median = 5930000
```

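As you can see, depending on the data distribution, the mean and median can be very different from each other. Keep in mind that the values summarized here are the household counts per bracket, not the incomes themselves.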
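To find where the median household actually falls, one option is to accumulate the bracket counts until half of all households are covered. The sketch below does this for the brackets listed in the table; the names `brackets`, `running` and `median_bracket` are illustrative, not part of any library.

```python
# (bracket label, number of households), copied from the table above
brackets = [
    ("10-15k", 5700000), ("15-20k", 5620000), ("20-25k", 5930000),
    ("25-30k", 5500000), ("30-35k", 5780000), ("35-40k", 5340000),
    ("40-45k", 5380000), ("45-50k", 4730000), ("50-60k", 9210000),
    ("60-75k", 11900000), ("75-100k", 14700000), ("100-125k", 10300000),
    ("125-150k", 6360000), ("150-200k", 6920000), ("200k-plus", 7600000),
]

total_households = sum(count for _, count in brackets)
halfway = total_households / 2

# Walk up the brackets until the running total reaches the halfway point.
running = 0
median_bracket = None
for label, count in brackets:
    running += count
    if running >= halfway:
        median_bracket = label
        break

print("total households =", total_households)
print("median bracket =", median_bracket)
```

This treats the bucketed table as a frequency distribution, so the result is a bracket rather than a single income figure.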

### Mode

The mode is the value that occurs most often in the dataset. If the dataset follows a Gaussian distribution, it is the peak of the histogram. Think of the **mode** of a dataset as its most commonly occurring number.

```
from statistics import mode
ages = [12, 13, 14, 11, 12, 13, 15, 10, 13]
mode(ages)
```

```
13
```
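To see why 13 is the answer, it can help to count how often each age occurs. Here is a small sketch using `collections.Counter` from the standard library. `multimode` (available in Python 3.8 and later) returns every value tied for the highest count, which matters when a dataset has more than one mode; the short list passed to it is a made-up example.

```python
from collections import Counter
from statistics import mode, multimode

ages = [12, 13, 14, 11, 12, 13, 15, 10, 13]

# Count the occurrences of each value.
counts = Counter(ages)
print(counts.most_common(2))   # the two most frequent values with their counts

# The mode is the value with the highest count.
print(mode(ages))

# A dataset can have several modes; multimode lists all of them.
print(multimode([1, 1, 2, 2, 3]))
```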

### Mean vs Median vs Mode

The reason these terms exist is that there is no single way to define the *Central Tendency* of a dataset. It depends on the nature of the distribution that the dataset follows. The relationship is beautifully visualized in this Wikipedia diagram.

For a normal distribution, the 3 measures (mean, median, mode) coincide in theory and sit very close together in sampled data. For a skewed distribution, they are staggered, and the mean is typically pulled furthest toward the long tail.
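The difference is easy to reproduce with two tiny datasets. Both lists below are hypothetical, chosen only so that one is symmetric and the other has a long right tail.

```python
from statistics import mean, median, mode

# Hypothetical symmetric dataset: values balanced around 3.
symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]

# Hypothetical right-skewed dataset: a few large values form a long tail.
skewed = [1, 1, 1, 2, 2, 3, 4, 8, 14]

for name, data in [("symmetric", symmetric), ("skewed", skewed)]:
    print(name, "mean =", mean(data), "median =", median(data), "mode =", mode(data))
```

In the symmetric list all three measures land on 3, while in the skewed list the mode, median and mean come out as 1, 2 and 4 respectively.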

### Spread

### Standard Deviation

While the mean represents the “Central” or “Average” value of a dataset, the variance and the standard deviation represent how spread out the data is. For example, look at the 3 *histograms* below. They show GRE scores among 3 different groups.

Each of these 3 graphs represents 10K rows of GRE scores from a different group. All 3 of them have the same mean, 300. However, they are different, right? What I want you to focus on is the shape of the distribution, not the height. Specifically, look at the x-axis. In the first plot, the data is concentrated around the average (300) mark. In the second plot, it is a bit more spread out (hence the reduction in height), and in the third plot it is very spread out.

The green line represents the mean and the red line represents the **Standard Deviation**, or σ (the Greek letter sigma). It is a measure of how spread out the distribution is. The more spread out the distribution is, the flatter the bell curve.

You can also think of *Standard Deviation* as a measure of uncertainty in data science. Imagine drug trials aimed at reducing blood pressure. Say 3 different trials each produce an average reduction of 30 points, but trial 1 has a very low standard deviation while trial 3 has a high one. You would want to go with the first drug, because the result of the third drug is much less certain.

```
from statistics import stdev
income = [5700000, 5620000, 5930000, 5500000, 5780000, 5340000, 5380000,
          4730000, 9210000, 11900000, 14700000, 10300000, 6360000, 6920000, 7600000]
stdev(income)
```

```
2884498.470890811
```
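The drug trial scenario can be sketched with made-up numbers. The three lists below are hypothetical blood pressure reductions, chosen so that every trial averages 30 points while the spread differs.

```python
from statistics import mean, stdev

# Hypothetical blood pressure reductions (in points) for three trials.
trial_1 = [29, 30, 31, 30, 30]   # tightly clustered
trial_2 = [25, 35, 30, 28, 32]   # moderately spread
trial_3 = [10, 50, 30, 20, 40]   # widely spread

for name, data in [("trial 1", trial_1), ("trial 2", trial_2), ("trial 3", trial_3)]:
    print(name, "mean =", mean(data), "stdev =", round(stdev(data), 2))
```

The means are identical, so only the standard deviation tells the trials apart.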

### Variance

**Standard Deviation** is actually derived from variance: it is the square root of the variance, so variance is calculated first. We covered *standard deviation* first only to illustrate how spread shows up on a plot. We still need to understand how variance is calculated, because it is the basis for calculating *standard deviation*.

Let’s calculate the variance of a small dataset in Excel to understand the process better.

[Figure: variance calculation]
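The same step-by-step calculation can be sketched in Python. The dataset below is hypothetical, picked so the arithmetic comes out in round numbers. Note that there are two conventions: dividing by n gives the population variance, while dividing by n - 1 gives the sample variance, which is what `statistics.variance` and `statistics.stdev` use.

```python
from math import isclose, sqrt
from statistics import mean, pvariance, variance

data = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical small dataset

# Step 1: find the mean.
mu = mean(data)

# Step 2: square each value's distance from the mean.
squared_deviations = [(x - mu) ** 2 for x in data]

# Step 3: average the squared distances.
n = len(data)
population_variance = sum(squared_deviations) / n      # divide by n
sample_variance = sum(squared_deviations) / (n - 1)    # divide by n - 1

# Step 4: the standard deviation is the square root of the variance.
population_stdev = sqrt(population_variance)

print("mean =", mu)
print("population variance =", population_variance)
print("sample variance =", sample_variance)
print("population standard deviation =", population_stdev)

# Cross-check the hand calculation against the standard library.
print(isclose(population_variance, pvariance(data)))
print(isclose(sample_variance, variance(data)))
```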

### What does Standard Deviation represent

Now that we know the basic definitions of **Variance** and *Standard Deviation*, let’s look at what they really represent.

### How far from the mean

Wikipedia has a nice picture to show this in the context of a Gaussian distribution.
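For a Gaussian distribution, the share of values that fall within a given number of standard deviations of the mean is fixed: roughly 68% within 1σ, 95% within 2σ and 99.7% within 3σ. Below is a short sketch that computes these shares with `statistics.NormalDist` (Python 3.8 and later). The mean of 300 echoes the GRE example, while the standard deviation of 10 and the helper `share_within` are assumptions made for illustration.

```python
from statistics import NormalDist

# Assumed parameters: mean 300 as in the GRE example, sigma 10 chosen arbitrarily.
scores = NormalDist(mu=300, sigma=10)

def share_within(dist, k):
    """Fraction of the distribution lying within k standard deviations of the mean."""
    low = dist.mean - k * dist.stdev
    high = dist.mean + k * dist.stdev
    return dist.cdf(high) - dist.cdf(low)

for k in (1, 2, 3):
    print(f"within {k} sigma: {share_within(scores, k):.4f}")
```

The shares do not depend on the chosen mean or sigma; any Gaussian gives the same three fractions.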

### How spread out

Another way to look at the *Standard Deviation* is as a measure of how spread out the distribution is. Once again, there is a nice representation of this on Wikipedia.

The larger the *Standard Deviation*, the more spread out the distribution is.

### Why 2 different metrics to measure spread

Why are we using 2 different metrics (Variance & Standard Deviation) to measure the same property, spread?

For one, *Standard Deviation* is calculated from the **Variance**, which in turn is calculated from the mean. Also, **Variance** is used to measure how far the data is spread out, while *Standard Deviation* is used to measure how much the data differs from its mean. However, since *Standard Deviation* is in the same units as the data, it is used more often than Variance.
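The link between the two metrics can be checked directly: the standard deviation is the square root of the variance. Here is a minimal sketch reusing the household counts from earlier. The names `var` and `sd` are illustrative.

```python
from math import isclose, sqrt
from statistics import stdev, variance

# Number of households per income bracket, as used earlier.
income = [5700000, 5620000, 5930000, 5500000, 5780000, 5340000, 5380000,
          4730000, 9210000, 11900000, 14700000, 10300000, 6360000, 6920000, 7600000]

var = variance(income)   # expressed in squared units (households squared)
sd = stdev(income)       # expressed in the same units as the data (households)

print("variance =", var)
print("standard deviation =", sd)
print("sqrt(variance) matches stdev:", isclose(sqrt(var), sd))
```

The variance is a very large number in squared units, which is hard to read against the data, while the standard deviation can be compared directly with the household counts.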