# How do you eliminate outliers in R

### What is an outlier?

Outliers are data points that fall far outside the range of the rest of the data. For example, let’s take 3 sensors measuring the temperature in Chicago on a winter day.

```
# Simulate 7 readings per sensor: normal with mean 5 and sd 5, rounded to 1 decimal
s_1 = round(rnorm(7,5,5),1)
s_2 = round(rnorm(7,5,5),1)
s_3 = round(rnorm(7,5,5),1)
```

Say one of the sensors malfunctioned and recorded a faulty reading.

```
# Inject a faulty recording
s_3[7] = -100
```

Let’s visualize the data.

```
# Temperature in Chicago on a wintry day
s_data = c( s_1 , s_2, s_3 )
# Let's visualize
hist(s_data, breaks = 10)
```

As you can see, the distribution is highly skewed. The single faulty reading of -100 stretches the axis and distorts the shape of the distribution.

There are several methods for removing outliers. We are going to use one of the simplest and most intuitive: Tukey’s method, which flags anything more than 1.5 × IQR below the first quartile or above the third quartile, where IQR is the interquartile range.
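To make the rule concrete, here is a minimal sketch that computes Tukey's fences by hand. The vector `x` and the names `lower_fence`, `upper_fence` and `tukey_outliers` are illustrative and not part of the walkthrough. Note that `boxplot()` derives its hinges from `fivenum()` rather than `quantile()`, so its cut-offs can differ slightly from these on small samples.

```r
# Illustrative data (hypothetical values, not the sensor readings)
x <- c(2.1, 4.8, 5.0, 5.5, 6.3, 7.9, -100)

q1  <- quantile(x, 0.25, names = FALSE)  # first quartile
q3  <- quantile(x, 0.75, names = FALSE)  # third quartile
iqr <- q3 - q1                           # interquartile range, same as IQR(x)

# Tukey's fences: 1.5 * IQR beyond the quartiles
lower_fence <- q1 - 1.5 * iqr
upper_fence <- q3 + 1.5 * iqr

# Anything outside the fences is flagged as an outlier
tukey_outliers <- x[x < lower_fence | x > upper_fence]
tukey_outliers
```

Only the faulty value should be flagged here; the remaining readings all sit inside the fences.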

```
# Draw Tukey's boxplot; points more than 1.5 * IQR beyond the box are flagged as outliers
b_before = boxplot(s_data)
```

Data that is actually close to a normal distribution (this becomes more apparent after the cleanup) looks badly skewed because of the outlier.

```
# Get the outliers
b_before$out
# [1] 13.1 -100.0
```

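The outliers are stored in the **out** component of the list returned by `boxplot()`. Note that in this run the legitimate reading 13.1 was flagged along with the injected -100.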
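If you only need the flagged values and not the plot, base R's `boxplot.stats()` (from grDevices, and used by `boxplot()` internally) computes the same `out` component without drawing anything. A small sketch, using a hypothetical vector `readings`:

```r
# Hypothetical readings with one obviously faulty value
readings <- c(3.2, 4.1, 5.0, 5.6, 6.4, 7.3, 8.0, -100)

# coef = 1.5 is the default multiplier for the whisker length
stats <- boxplot.stats(readings, coef = 1.5)

stats$stats  # lower whisker, lower hinge, median, upper hinge, upper whisker
stats$out    # values lying beyond the whiskers
```

Raising `coef` makes the rule more tolerant, so fewer points are flagged.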

Let’s remove the outliers by replacing them with `NA`.

```
# Replace the outliers with NA so the vector keeps its original length
s_data_normalize = ifelse(s_data %in% b_before$out, NA, s_data)
```
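Because the outliers are masked with `NA` rather than dropped, the positions of the remaining readings are preserved, but summary functions then need `na.rm = TRUE`. The sketch below contrasts masking with dropping; the vector `v` and the other names are made up for illustration.

```r
# Hypothetical readings with one faulty value
v       <- c(4.0, 5.0, 6.0, 5.5, 4.5, -100)
flagged <- boxplot.stats(v)$out

# Option 1: mask with NA (length and positions preserved)
v_masked <- ifelse(v %in% flagged, NA, v)
mean(v_masked, na.rm = TRUE)  # without na.rm = TRUE the mean would be NA

# Option 2: drop the flagged values (the vector gets shorter)
v_dropped <- v[!v %in% flagged]
length(v_dropped)
```

Masking is convenient when the readings must stay aligned with something else, such as timestamps; dropping is simpler when only the values matter.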

Let’s do another boxplot now and see if things changed.

```
# Boxplot again: looking much better
b_after = boxplot(s_data_normalize)
```

You can see that the new histogram is much better as well.

```
hist(s_data_normalize, breaks = 20)
```