# How to do Linear Regression in R

R has built-in support for linear regression. Most of the time, the `lm()` function is all you need.

### Simple Linear Regression

For example, to find the relationship between waiting time and eruption time in the built-in Old Faithful data set (`faithful`), you can fit a model like this:

```r
> model = lm(waiting ~ eruptions, data = faithful)
> coef = model$coefficients
> coef
(Intercept)   eruptions
   33.47440    10.72964
```

```r
require(ggplot2)
g = ggplot(data = faithful, aes(x = eruptions, y = waiting)) +
  geom_point() +
  geom_abline(aes(intercept = coef[1],
                  slope     = coef[2]))
```
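Once the model is fitted, `predict()` gives expected waiting times for new eruption durations. A quick sketch (the data frame column name must match the predictor used in the formula):

```r
# Fit the simple linear regression on the built-in faithful data
model = lm(waiting ~ eruptions, data = faithful)

# Predict waiting times for eruptions lasting 2 and 4 minutes;
# from the coefficients above, these come out to roughly 54.9 and 76.4
new_data = data.frame(eruptions = c(2, 4))
predict(model, newdata = new_data)
```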

### Multiple Linear Regression

Multiple linear regression uses more than one variable as a predictor. For example, take the Boston Housing data from the `mlbench` package:

```r
> library(mlbench)
> data(BostonHousing)
> head(BostonHousing)
     crim zn indus chas   nox    rm  age    dis rad tax ptratio      b lstat medv
1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98 24.0
2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14 21.6
3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03 34.7
4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94 33.4
5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33 36.2
6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21 28.7
```

`medv` (median home value in $1000s; for example, 24.0 means $24,000) is the response variable and the rest are predictors. As usual, not all predictors have the same predictive power, and there are many methods for identifying the best ones. For now, let's use all the predictors and see how the linear model performs.

```r
> model = lm(medv ~ ., data = BostonHousing)
> summary(model)

Call:
lm(formula = medv ~ ., data = BostonHousing)

Residuals:
    Min      1Q  Median      3Q     Max
-15.595  -2.730  -0.518   1.777  26.199

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.646e+01  5.103e+00   7.144 3.28e-12 ***
crim        -1.080e-01  3.286e-02  -3.287 0.001087 **
zn           4.642e-02  1.373e-02   3.382 0.000778 ***
indus        2.056e-02  6.150e-02   0.334 0.738288
chas1        2.687e+00  8.616e-01   3.118 0.001925 **
nox         -1.777e+01  3.820e+00  -4.651 4.25e-06 ***
rm           3.810e+00  4.179e-01   9.116  < 2e-16 ***
age          6.922e-04  1.321e-02   0.052 0.958229
dis         -1.476e+00  1.995e-01  -7.398 6.01e-13 ***
rad          3.060e-01  6.635e-02   4.613 5.07e-06 ***
tax         -1.233e-02  3.760e-03  -3.280 0.001112 **
ptratio     -9.527e-01  1.308e-01  -7.283 1.31e-12 ***
b            9.312e-03  2.686e-03   3.467 0.000573 ***
lstat       -5.248e-01  5.072e-02 -10.347  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 4.745 on 492 degrees of freedom
Multiple R-squared:  0.7406,    Adjusted R-squared:  0.7338
F-statistic: 108.1 on 13 and 492 DF,  p-value: < 2.2e-16
```
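The pieces of that summary are also available programmatically, which is handy when comparing models in a script. A quick sketch using standard accessors on the summary object:

```r
library(mlbench)
data(BostonHousing)

model = lm(medv ~ ., data = BostonHousing)
s = summary(model)

s$r.squared        # multiple R-squared (about 0.74 here)
s$adj.r.squared    # adjusted R-squared (about 0.73)
s$sigma            # residual standard error (about 4.745)
s$coefficients["lstat", "Pr(>|t|)"]  # p-value for a single predictor
```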

Alternatively, you can use a specific subset of the predictors, chosen for example by stepwise regression (R's `step()` function). Say you have determined that the best predictors are

- crim (crime rate)
- chas (Charles River dummy variable)
- nox (nitric oxides concentration)
- rm (average number of rooms)
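As a sketch of how such a selection might be made, R's built-in `step()` performs stepwise selection by AIC. Note that the set of variables it keeps can differ from the four listed above; on this data, backward elimination typically drops only the weakest predictors (`indus` and `age`, the two with large p-values in the full summary):

```r
library(mlbench)
data(BostonHousing)

# Start from the full model and let step() drop predictors by AIC
full    = lm(medv ~ ., data = BostonHousing)
reduced = step(full, direction = "backward", trace = 0)

formula(reduced)  # the formula of the selected model
```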

The final model is then:

```r
> model = lm(medv ~ crim + chas + nox + rm, data = BostonHousing)
> summary(model)

Call:
lm(formula = medv ~ crim + chas + nox + rm, data = BostonHousing)

Residuals:
    Min      1Q  Median      3Q     Max
-21.797  -3.194  -0.803   2.461  39.301

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) -17.25962    3.21548  -5.368 1.22e-07 ***
crim         -0.18461    0.03449  -5.353 1.32e-07 ***
chas1         4.67381    1.06618   4.384 1.42e-05 ***
nox         -14.96036    2.63920  -5.669 2.43e-08 ***
rm            7.70684    0.40273  19.137  < 2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 5.987 on 501 degrees of freedom
Multiple R-squared:  0.5796,    Adjusted R-squared:  0.5763
F-statistic: 172.7 on 4 and 501 DF,  p-value: < 2.2e-16
```

### Formula to Specify Multiple Linear Regression

```r
model = lm(medv ~ crim + chas + nox + rm, data = BostonHousing)
```

The formula syntax for multiple variables in a linear regression is easy: just separate the predictors with a `+`.
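The formula interface also supports a few other useful shorthands beyond `+`. A quick sketch, all standard R formula syntax, shown on the same Boston Housing data:

```r
library(mlbench)
data(BostonHousing)

lm(medv ~ ., data = BostonHousing)                # all predictors
lm(medv ~ . - age - indus, data = BostonHousing)  # all except age and indus
lm(medv ~ crim * chas, data = BostonHousing)      # main effects plus interaction
lm(medv ~ poly(rm, 2), data = BostonHousing)      # quadratic term in rm
```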