How to do Linear Regression in R

R has built-in functions for linear regression. Most of the time, the lm() function is all you need.

Simple Linear Regression

For example, to explore the relationship between waiting time and eruption time in the Old Faithful data (the built-in faithful dataset), you can do it like this:

> model = lm(waiting ~ eruptions, data = faithful)
> coef = model$coefficients
> coef
(Intercept)   eruptions 
   33.47440    10.72964 

> library(ggplot2)
> g = ggplot(data = faithful, aes(x = eruptions, y = waiting)) +
      geom_point() +
      geom_abline(intercept = coef[1], slope = coef[2])
> g   # display the scatter plot with the fitted line

Multiple Linear Regression

Multiple linear regression uses more than one variable as a predictor. For example, take the BostonHousing data from the mlbench package:

> library(mlbench)
> data(BostonHousing)    # load the Boston housing data shipped with mlbench
> head(BostonHousing)
     crim zn indus chas   nox    rm  age    dis rad tax ptratio      b lstat medv
1 0.00632 18  2.31    0 0.538 6.575 65.2 4.0900   1 296    15.3 396.90  4.98 24.0
2 0.02731  0  7.07    0 0.469 6.421 78.9 4.9671   2 242    17.8 396.90  9.14 21.6
3 0.02729  0  7.07    0 0.469 7.185 61.1 4.9671   2 242    17.8 392.83  4.03 34.7
4 0.03237  0  2.18    0 0.458 6.998 45.8 6.0622   3 222    18.7 394.63  2.94 33.4
5 0.06905  0  2.18    0 0.458 7.147 54.2 6.0622   3 222    18.7 396.90  5.33 36.2
6 0.02985  0  2.18    0 0.458 6.430 58.7 6.0622   3 222    18.7 394.12  5.21 28.7

medv (median home value in thousands of dollars; for example, 24.0 means $24,000) is the response variable and the rest are predictors. As usual, not all predictors carry the same predictive power, and there are many methods for identifying the best ones. For now, let's just use all the predictors and see how the linear model performs.

> model = lm ( medv ~ . , data = BostonHousing)
> summary(model)
Call:
lm(formula = medv ~ ., data = BostonHousing)
Residuals:
    Min      1Q  Median      3Q     Max 
-15.595  -2.730  -0.518   1.777  26.199 
Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  3.646e+01  5.103e+00   7.144 3.28e-12 ***
crim        -1.080e-01  3.286e-02  -3.287 0.001087 ** 
zn           4.642e-02  1.373e-02   3.382 0.000778 ***
indus        2.056e-02  6.150e-02   0.334 0.738288    
chas1        2.687e+00  8.616e-01   3.118 0.001925 ** 
nox         -1.777e+01  3.820e+00  -4.651 4.25e-06 ***
rm           3.810e+00  4.179e-01   9.116  < 2e-16 ***
age          6.922e-04  1.321e-02   0.052 0.958229    
dis         -1.476e+00  1.995e-01  -7.398 6.01e-13 ***
rad          3.060e-01  6.635e-02   4.613 5.07e-06 ***
tax         -1.233e-02  3.760e-03  -3.280 0.001112 ** 
ptratio     -9.527e-01  1.308e-01  -7.283 1.31e-12 ***
b            9.312e-03  2.686e-03   3.467 0.000573 ***
lstat       -5.248e-01  5.072e-02 -10.347  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 4.745 on 492 degrees of freedom
Multiple R-squared:  0.7406,    Adjusted R-squared:  0.7338 
F-statistic: 108.1 on 13 and 492 DF,  p-value: < 2.2e-16
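
If you want the headline fit numbers programmatically rather than reading them off the printout, they are stored in the summary object. A small sketch (r.squared and adj.r.squared are standard components of summary.lm):

> s = summary(model)
> s$r.squared       # 0.7406, the multiple R-squared shown above
> s$adj.r.squared   # 0.7338, the adjusted R-squared shown above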

Or you can use a specific subset of the predictors, chosen for example by stepwise regression in R (a sketch of step() follows the final model below). Say you have determined that the best predictors are

  • crim (per-capita crime rate)
  • chas (Charles River dummy variable)
  • nox (nitric oxides concentration)
  • rm (average number of rooms per dwelling)

The final model now is

> model = lm ( medv ~ crim + chas + nox + rm, data = BostonHousing)
> summary(model)
Call:
lm(formula = medv ~ crim + chas + nox + rm, data = BostonHousing)
Residuals:
    Min      1Q  Median      3Q     Max 
-21.797  -3.194  -0.803   2.461  39.301 
Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) -17.25962    3.21548  -5.368 1.22e-07 ***
crim         -0.18461    0.03449  -5.353 1.32e-07 ***
chas1         4.67381    1.06618   4.384 1.42e-05 ***
nox         -14.96036    2.63920  -5.669 2.43e-08 ***
rm            7.70684    0.40273  19.137  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5.987 on 501 degrees of freedom
Multiple R-squared:  0.5796,    Adjusted R-squared:  0.5763 
F-statistic: 172.7 on 4 and 501 DF,  p-value: < 2.2e-16
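
For reference, here is a minimal sketch of stepwise selection using the base step() function, which searches by AIC. The subset it keeps is not guaranteed to be the four predictors used above.

> full  = lm(medv ~ ., data = BostonHousing)
> model = step(full, direction = "both", trace = 0)  # stepwise search over predictors by AIC
> formula(model)                                     # the predictors that were kept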

Formula to specify multiple linear regression

model = lm ( medv ~ crim + chas + nox + rm, data = BostonHousing)

The formula for specifying multiple predictors in a linear regression is easy: just separate the variables with a +.
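
The two formula styles used in this post are equivalent ways of naming the predictors:

> lm(medv ~ crim + chas + nox + rm, data = BostonHousing)  # list the predictors explicitly, separated by +
> lm(medv ~ ., data = BostonHousing)                       # "." is shorthand for every column except medv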
