# What is a confusion matrix

A confusion matrix is typically used to evaluate the performance of a classification model. For example, when you run KNN on the iris data set, you can use the confusion matrix to compare the actual values against the predicted values.

```
# Split the data into train and test sets
require(caret)
set.seed(100)

# Get the split indices
trainIndex <- createDataPartition(iris$Species, p = .8,
                                  list = FALSE,
                                  times = 1)
# Generate the training and test data
train = iris[trainIndex,]
test = iris[-trainIndex,]

# The "class" package provides knn()
require(class)
predict = knn(train = train[,1:4],
              test = test[,1:4],
              cl = train[,5],
              k = 5)

table(predict, test[,5])

predict      setosa versicolor virginica
  setosa         10          0         0
  versicolor      0          8         1
  virginica       0          2         9
```

What do you see in the confusion matrix?

In table(predict, test[,5]), rows are the predictions and columns are the actual species. All 10 observations predicted as “setosa” were actually setosa. Of the 9 observations predicted as “versicolor”, 8 were actually versicolor, and of the 11 predicted as “virginica”, 9 were actually virginica.
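The overall accuracy can be read off the same table by dividing the diagonal (the correct predictions) by the total number of test observations. A minimal sketch, reusing the `predict` vector and `test` set from above:

```
# Confusion matrix: rows = predictions, columns = actuals
cm = table(predict, test[,5])
# Overall accuracy = sum of the diagonal / total observations
accuracy = sum(diag(cm)) / sum(cm)
accuracy   # 27 correct out of 30, i.e. 0.9
```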

# What plots in R did you produce to communicate your linear regression test results

### One Predictor and one response variable

If you have just one predictor variable and one response variable, the visualization is pretty easy. For example, take the built-in cars data set, which gives braking distance against the car’s speed. We are predicting the braking distance based on the car’s speed.

```
> model = lm(dist ~ speed, data = cars)
> model$coefficients
(Intercept)       speed
-17.579095    3.932409
```

Now, let’s plot the data vs the predicted line.

```
require(ggplot2)
ggplot() +
  geom_point(data = cars, aes(x = speed, y = dist)) +
  geom_abline(intercept = model$coefficients[1],
              slope     = model$coefficients[2])
```

This gives a good visual on how the model fits the data.

### Multiple Predictors

What about regression with multiple predictors? Well, the straight answer is that there is no good way to visualize a linear regression model with more than one predictor. You can go 3-D for two predictors, but it is not easy to interpret.

Since visualizing more than two predictors is a challenge, we tend to focus more on validation. For example, the plot(model) function gives the standard diagnostic plots: Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage.

```
model = lm(mpg ~ ., data = mtcars)
```
```
> summary(model)
Call:
lm(formula = mpg ~ ., data = mtcars)
Residuals:
Min      1Q  Median      3Q     Max
-3.4506 -1.6044 -0.1196  1.2193  4.6271
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 12.30337   18.71788   0.657   0.5181
cyl         -0.11144    1.04502  -0.107   0.9161
disp         0.01334    0.01786   0.747   0.4635
hp          -0.02148    0.02177  -0.987   0.3350
drat         0.78711    1.63537   0.481   0.6353
wt          -3.71530    1.89441  -1.961   0.0633 .
qsec         0.82104    0.73084   1.123   0.2739
vs           0.31776    2.10451   0.151   0.8814
am           2.52023    2.05665   1.225   0.2340
gear         0.65541    1.49326   0.439   0.6652
carb        -0.19942    0.82875  -0.241   0.8122
```
```
> plot(model)
```

# How did you split the data into training and test data

Say you have a dataset – iris. To test your ML models, you typically need to divide your data into training and test data sets. Here are some methods to do it.

### From Scratch

You can do this from scratch without using libraries.

```
# A simple function to take in data and split by percentage
train_test_split = function(data, perc_split) {
  # Get the size of the dataset
  data_size = nrow(data)
  # Calculate the sample size required
  sample_size = round(data_size * perc_split)
  # Get the sample indices
  sample = sample(1:data_size,
                  size = sample_size,
                  replace = FALSE)
  # Sort the sample vector before returning
  return(sort(sample))
}
```

Now you can use this function to split the data into training and test data sets. Since the function draws a random sample, call it once and store the indices; calling it twice would draw two different samples, so the test set would not be the complement of the training set.

### Training and test datasets

```
> split_indices = train_test_split(iris, 0.75)
> train = iris[split_indices,]
> test = iris[-split_indices,]
```

### Using libraries

There are many libraries to do this split for us.

### caTools package

```
require(caTools)
set.seed(100)
# Returns a logical vector: TRUE for training rows, FALSE for test rows
sample = sample.split(iris$Sepal.Length, SplitRatio = 3/4)
# Training and test data sets
train = iris[sample == TRUE,]
test = iris[sample == FALSE,]
```

### CARET package

```
library(caret)
set.seed(100)
# Get the split indices
trainIndex <- createDataPartition(iris$Species, p = .8,
                                  list = FALSE,
                                  times = 1)
# Generate the training and test data
train = iris[trainIndex,]
test = iris[-trainIndex,]
```

# How to do Linear Regression in R

R has built-in linear regression functions. Most of the time, the lm() function should be good enough.

### Simple Linear Regression

For example, if you wanted to find the relationship between the waiting time and the eruption time in the Old Faithful data, you can do it like this.

```
model = lm(waiting ~ eruptions, data = faithful)
```
```
> coef = model$coefficients
> coef
(Intercept)   eruptions
33.47440    10.72964
```
```
require(ggplot2)
g = ggplot(data = faithful, aes(x = eruptions, y = waiting)) +
  geom_point() +
  geom_abline(intercept = coef[1],
              slope     = coef[2])
```

### Multiple Linear Regression

Multiple linear regression is where more than one variable is used as a predictor. For example, take the Boston Housing data in the mlbench package.

```
> library(mlbench)
> data(BostonHousing)
> head(BostonHousing)
     crim zn indus chas   nox    rm  age    dis rad tax ptratio      b lstat medv
1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98 24.0
2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14 21.6
3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03 34.7
4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94 33.4
5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33 36.2
6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21 28.7
```

medv (median value in $1000s; for example, 24.0 means $24,000) is the response variable and the rest are all predictors. As usual, not all of the predictors have the same predictive power. There are many methods to identify the best predictors. Let’s just use all the predictors and see how the linear model performs.

```
> model = lm(medv ~ ., data = BostonHousing)
> summary(model)
Call:
lm(formula = medv ~ ., data = BostonHousing)
Residuals:
Min      1Q  Median      3Q     Max
-15.595  -2.730  -0.518   1.777  26.199
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.646e+01  5.103e+00   7.144 3.28e-12 ***
crim        -1.080e-01  3.286e-02  -3.287 0.001087 **
zn           4.642e-02  1.373e-02   3.382 0.000778 ***
indus        2.056e-02  6.150e-02   0.334 0.738288
chas1        2.687e+00  8.616e-01   3.118 0.001925 **
nox         -1.777e+01  3.820e+00  -4.651 4.25e-06 ***
rm           3.810e+00  4.179e-01   9.116  < 2e-16 ***
age          6.922e-04  1.321e-02   0.052 0.958229
dis         -1.476e+00  1.995e-01  -7.398 6.01e-13 ***
rad          3.060e-01  6.635e-02   4.613 5.07e-06 ***
tax         -1.233e-02  3.760e-03  -3.280 0.001112 **
ptratio     -9.527e-01  1.308e-01  -7.283 1.31e-12 ***
b            9.312e-03  2.686e-03   3.467 0.000573 ***
lstat       -5.248e-01  5.072e-02 -10.347  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.745 on 492 degrees of freedom
Multiple R-squared:  0.7406,    Adjusted R-squared:  0.7338
F-statistic: 108.1 on 13 and 492 DF,  p-value: < 2.2e-16
```

Or you can use a specific subset of the predictors based on stepwise regression in R. Say you have determined that the best predictors are

• crim ( crime rate )
• chas ( Charles River dummy variable )
• nox ( nitric oxides concentration )
• rm ( average number of rooms )
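The stepwise selection itself can be sketched with the built-in step() function, which adds and drops predictors based on AIC (shown here as one possible way to arrive at a subset; the predictors it picks may differ from the four listed above):

```
# Start from the full model and let step() add/drop predictors by AIC
full_model = lm(medv ~ ., data = BostonHousing)
step_model = step(full_model, direction = "both", trace = FALSE)
# Inspect the predictors that survived
formula(step_model)
```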

The final model now is

```
> model = lm(medv ~ crim + chas + nox + rm, data = BostonHousing)
> summary(model)
Call:
lm(formula = medv ~ crim + chas + nox + rm, data = BostonHousing)
Residuals:
Min      1Q  Median      3Q     Max
-21.797  -3.194  -0.803   2.461  39.301
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.25962    3.21548  -5.368 1.22e-07 ***
crim         -0.18461    0.03449  -5.353 1.32e-07 ***
chas1         4.67381    1.06618   4.384 1.42e-05 ***
nox         -14.96036    2.63920  -5.669 2.43e-08 ***
rm            7.70684    0.40273  19.137  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5.987 on 501 degrees of freedom
Multiple R-squared:  0.5796,    Adjusted R-squared:  0.5763
F-statistic: 172.7 on 4 and 501 DF,  p-value: < 2.2e-16
```

### Formula to specify multiple linear regression

```
model = lm(medv ~ crim + chas + nox + rm, data = BostonHousing)
```

The formula to specify multiple variables in a linear regression is easy. Just separate all the predictor variables with a +.
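A few other standard formula shorthands are worth knowing, sketched here on the same Boston Housing model:

```
# "." means all other columns as predictors
model_all  = lm(medv ~ .,       data = BostonHousing)
# "-" excludes a predictor from the "." shorthand
model_most = lm(medv ~ . - age, data = BostonHousing)
# ":" adds an interaction term between two predictors
model_int  = lm(medv ~ crim + nox + crim:nox, data = BostonHousing)
```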

# What kind of version management did you use in your previous projects

All of your R code can be version-controlled using any of the following source control systems. However, keep in mind that these kinds of version control software are not well suited for large volumes of data (especially non-text media); for those you would have to rely on file storage with manual versioning.

### git

git is probably the most popular version control in use today. All the code can be fully version managed using git.

### Subversion

Subversion is another popular version control software.

RStudio supports integration with both git and Subversion. Go to Tools -> Global Options and open the Git/SVN tab.

Your git or svn executable should be picked up automatically. If not, specify the path to the executable.

As soon as you create a project and start writing code, the Git tab will automatically start tracking changes in the underlying file system.

You can perform commits, diffs between files, and other basic source control operations right from there.

# How to split your program across multiple files in R

Although R programs are typically not as large as Python programs, real programs in production can be pretty big depending on the nature of the problem. Naturally, you would want to split your source into multiple files depending on the complexity of the program.

Including source code from other files is really easy in R. Use the source() function to specify the file to be included.

```
# add_numbers.r
add = function(i, j) {
  return(i + j)
}
```

Now, to include the file add_numbers.r, all you have to do is source it as follows.

```
# main.r
source("add_numbers.r")
add(2, 3)
```

# What strategies did you use to eliminate NA

### Substituting with feature average

Step 1 – Get the column mean

```
> s_data
s_1 s_2 s_3
1  103 102 113
2  101 108 122
3   98 106  88
4  101 102 106
5   98  NA 101
6  108 103  95
7  100  99 106
8   94  80 107
9   97 104  99
10  93  96  97
```
```
> mean(s_data$s_2, na.rm = TRUE)
[1] 100
```

Step 2 – Replace the NA with the column mean

```
> s_data[5, "s_2"] = mean(s_data$s_2, na.rm = TRUE)
```
```
> s_data
s_1 s_2 s_3
1  103 102 113
2  101 108 122
3   98 106  88
4  101 102 106
5   98 100 101
6  108 103  95
7  100  99 106
8   94  80 107
9   97 104  99
10  93  96  97
```

If you wanted to do this programmatically (for any number of rows and columns), do this:

```
replaceNA = function(data) {
  # Loop through each column
  for (var in 1:ncol(data)) {
    # Get the mean of the column
    col_mean = mean(data[, var], na.rm = TRUE)
    # is.na(data[, var]) gives a TRUE/FALSE vector marking the NAs
    data[is.na(data[, var]), var] = col_mean
  }
  # Return the dataset
  data
}
```
```
> replaceNA(s_data)
s_1 s_2 s_3
1  103 102 113
2  101 108 122
3   98 106  88
4  101 102 106
5   98 100 101
6  108 103  95
7  100  99 106
8   94  80 107
9   97 104  99
10  93  96  97
```

### Eliminate the entire observation

If the number of observations with NAs is quite small, the easy way is to get rid of the entire observation.

For example, getting rid of the 5th row above, which contains the NA, is very easy. Just use the function na.omit().

```
> s_data
s_1 s_2 s_3
1  103 102 113
2  101 108 122
3   98 106  88
4  101 102 106
5   98  NA 101
6  108 103  95
7  100  99 106
8   94  80 107
9   97 104  99
10  93  96  97
```
```
> na.omit(s_data)
s_1 s_2 s_3
1  103 102 113
2  101 108 122
3   98 106  88
4  101 102 106
6  108 103  95
7  100  99 106
8   94  80 107
9   97 104  99
10  93  96  97
```

As you can see, the 5th row is gone.

# How to convert a factor variable to a numeric variable

Converting factors to numeric or character values, and vice versa, comes up in a variety of cases.

### Factor to String

```
> str(iris)
'data.frame':    150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
```

Let’s convert the Species column from factor to string

```
> iris_1 = iris
> iris_1$Species = as.character(iris$Species)
> str(iris_1)
'data.frame':    150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : chr  "setosa" "setosa" "setosa" "setosa" ...
```

### String to Factor

Let’s revert to the original factor based iris dataset.

```
> iris_2 = iris_1
> iris_2$Species = as.factor(iris$Species)
> str(iris_2)
'data.frame':    150 obs. of  5 variables:
 $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
 $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
 $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
 $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
 $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
```

### Factor to Numeric

```
> iris_3 = iris
> iris_3$Species = as.numeric(iris$Species)
> iris_3
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2       1
2          4.9         3.0          1.4         0.2       1
3          4.7         3.2          1.3         0.2       1
4          4.6         3.1          1.5         0.2       1
```

# Why is the which() function used

Normally, a logical comparison on a dataframe column or a vector returns a logical vector of the same length. For example, if you take the iris dataset,

```
> iris$Species == "setosa"
  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE FALSE
 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
 FALSE FALSE FALSE FALSE FALSE FALSE FALSE
```

The resulting vector is as long as the dataframe itself – 150 elements. However, if only a fraction of the values are TRUE (the bulk of them are FALSE), it is more economical to use the which() function, which returns only the indices of the TRUE values. For example,

```
> which(iris$Species == "setosa")
  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27
 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50
```

# How did you eliminate outliers in R

### What is an outlier ?

Outliers are data points that fall way out of the normal data range. For example, let’s take 3 sensors measuring temperature in Chicago on a winter day.

```
# Simulate 7 readings from each of 3 temperature sensors
s_1 = round(rnorm(7, 5, 5), 1)
s_2 = round(rnorm(7, 5, 5), 1)
s_3 = round(rnorm(7, 5, 5), 1)
```

Say one of the readings in one of the sensors malfunctioned.

```
# Inject a faulty recording into one element of s_3
s_3[7] = -100
```

Let’s visualize the data

```
# Temperature in Chicago on a wintry day
s_data = c( s_1 , s_2, s_3 )
# Let's visualize
hist(s_data, breaks = 10)
```

As you can see, the distribution is highly skewed. The outlier is distorting the shape of the distribution.

There are a couple of methods to remove outliers; we are going to use the most intuitive and simplest one – Tukey’s method, which flags anything more than 1.5 × IQR below the first quartile or above the third quartile.

```
# Use Tukey's boxplot to get the outliers ( > 1.5 × IQR )
> b_before = boxplot(s_data)
```

See how a distribution that is actually close to normal (more apparent after the cleanup) looks so skewed because of the outlier.

```
# Get the outliers
> b_before$out
   13.1 -100.0
```

And the outliers are given in the out variable of the boxplot.

Let’s remove the outliers.

```
# Remove the outliers by replacing them with NA
> s_data_normalize = ifelse(s_data %in% b_before$out, NA, s_data)
```

Let’s do another boxplot now and see if things changed.

```
# Box plot again - looking much better
b_after = boxplot(s_data_normalize)
```

You can see that the new histogram is much better as well.

```
> hist(s_data_normalize, breaks = 20)
```
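The same Tukey fences can also be computed directly with quantile(), without drawing a boxplot (a sketch; note boxplot() uses slightly different quartile conventions, so results can differ marginally at the edges):

```
# Compute Tukey's fences from the quartiles
q = quantile(s_data, c(0.25, 0.75), na.rm = TRUE)
iqr = q[2] - q[1]
lower = q[1] - 1.5 * iqr
upper = q[2] + 1.5 * iqr
# Keep only the readings inside the fences
s_data_clean = s_data[!is.na(s_data) & s_data >= lower & s_data <= upper]
```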