What are the R packages you used most in your previous projects


  R Interview Questions

The packages a project relies on depend on the nature of the project. For example, a biotech company might use packages from Bioconductor, CROME etc. A stock analytics company might use packages like quantmod, stocks etc.

In general, however, there are packages that almost every project uses. Here is a quick ( and by no means exhaustive ) list.

  • Data Import
    • readxl ( import Excel files into R )
    • readr ( import csv and other file formats into R )
  • Data Preparation
    • dplyr ( Easy to use functions for data manipulation )
    • tidyr ( To tidy up data )
    • data.table ( Large data sets )
    • sqldf ( Run SQL queries on data frames/tables )
  • Data Visualization ( EDA )
    • ggplot2 ( Core Grammar of Graphics package from Hadley Wickham )
    • lattice ( Data visualization )
    • ggvis ( Interactive graphics, something ggplot2 alone can’t do )
    • plotly ( Interactive graphics )
    • shiny ( Interactive web applications built in R )
  • Data Modeling ( Machine Learning )
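    • caret ( Modelling for classification and regression problems )
    • car ( Companion to Applied Regression )
    • e1071 ( Support vector machines, naive Bayes and other utilities )
    • randomForest ( Random forests )
    • rpart ( Recursive partitioning, i.e. decision trees )
    • nnet ( Single-hidden-layer neural networks )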

What is the difference between library() and require() functions in R

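Both functions load and attach a package, but they differ in what they return and in how they fail. If the package is not installed, library() stops with an error, while require() only issues a warning and returns FALSE.

The library function invisibly returns the list of attached packages.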

> a = library(caret)
> a
 [1] "caret"     "lattice"   "caTools"   "e1071"     "ggplot2"   "stats"    
 [7] "graphics"  "grDevices" "utils"     "datasets"  "methods"   "base"

The require function, on the other hand, returns TRUE or FALSE depending on whether the package could be loaded.

> b = require(caret)
> b
[1] TRUE
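
The difference matters most when a package is missing. The sketch below uses a deliberately made-up package name ( notARealPackage123, an assumption chosen only so that the package cannot be found ) to show how each function reacts.

```r
# "notARealPackage123" is a hypothetical name chosen so that the package is missing.
pkg <- "notARealPackage123"

# require() returns FALSE (with a warning) when the package cannot be loaded.
ok <- suppressWarnings(require(pkg, character.only = TRUE, quietly = TRUE))

# library() stops with an error instead, which is captured here with tryCatch().
lib_result <- tryCatch({
  library(pkg, character.only = TRUE)
  "loaded"
}, error = function(e) "error")

print(ok)          # FALSE
print(lib_result)  # "error"
```

Because of this, require() is handy inside an if() condition, while library() is the safer choice at the top of a script, where a missing package should stop execution immediately.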

How to install a package in R

Packages are where the power of R lies. CRAN hosts well over 10,000 packages, so there is a package for just about anything, and installing one is easy.

Install a package

Installing a package in R is simple. For example, the following statement installs the “plotly” package.

> install.packages("plotly")

Install multiple packages in one go

> install.packages( c("plotly","ggplot2") )
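
When a script depends on several packages, it can help to install only the ones that are not already present. The following is a minimal sketch of that idea; the helper name missing_packages is hypothetical and not part of R, and the actual install call is left commented out because it needs network access.

```r
# Hypothetical helper: returns the subset of `pkgs` that is not yet installed.
missing_packages <- function(pkgs) {
  installed <- rownames(installed.packages())
  setdiff(pkgs, installed)
}

wanted <- c("plotly", "ggplot2")
to_install <- missing_packages(wanted)

if (length(to_install) > 0) {
  message("Would install: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to actually install
}
```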

Install packages in RStudio

RStudio makes life easy by providing a GUI for this ( Tools > Install Packages ).

You can specify multiple packages there as well.

Where are these packages installed?

These packages are installed in the library path, which you can find with

> .libPaths()
[1] "C:/Users/AjayTech/Documents/R/win-library/3.4"
[2] "C:/Program Files/R/R-3.4.3/library"

How can you load packages?

Installing a package is one thing, but you have to load it before your programs can use it. You load a package as below.

# Note that no quotes are needed this time.
> library(ggplot2)

How can you remove a package in R?

It is as easy as installing a package.

> remove.packages( c("ggplot2","plotly"))
Removing packages from 'C:/Users/AjayTech/Documents/R/win-library/3.4'
(as 'lib' is unspecified)

Or use the RStudio GUI.

How to get help on a particular R package

> help(package="caret")

How to get samples from a data set in R

sample() function

# Draws 10 random integers between 1 and 100 ( without replacement, so no repeats )
> sample(1:100,10)
 [1] 31 26 55  6 45 46 77 35 51 16

You can sample with replacement as well, in which case the same value can be drawn more than once.

# Draws 10 random integers between 1 and 100 with replacement
> sample(1:100,10, replace = TRUE)
 [1] 54 72 54 75 43 18 78 89 55 28

Sampling is also used to generate training and test data from a dataset.
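
As an illustration, here is a minimal sketch of a 70/30 train/test split of the built-in iris data set. The 70% proportion and the seed value are arbitrary choices made for this example.

```r
set.seed(42)  # arbitrary seed so the split is repeatable

n <- nrow(iris)                                       # 150 rows in the built-in iris data set
train_idx <- sample(seq_len(n), size = round(0.7 * n))  # 70% of the row numbers, without replacement

train <- iris[train_idx, ]   # rows that were drawn
test  <- iris[-train_idx, ]  # all remaining rows

nrow(train)  # 105
nrow(test)   # 45
```

Because sample() draws without replacement by default, no row ends up in both sets.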

What is the difference between runif() and rnorm() functions in R

runif()

The runif function generates a vector of random numbers drawn uniformly from an interval. For example,

# Generates 100 random numbers between 0 and 100
> runif(100, min = 0, max = 100)

rnorm()

The rnorm function generates a vector of random numbers drawn from a normal distribution with a given mean and standard deviation. For example,

# Generates 100 random numbers with a mean of 100 and standard deviation of 10
> rnorm(100,mean=100,sd=10)
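
To make the contrast concrete, the sketch below draws a larger sample from each function and checks a few summary properties. The sample size and seed are arbitrary choices for illustration.

```r
set.seed(1)  # arbitrary seed, only to make the run repeatable

u <- runif(10000, min = 0, max = 100)   # uniform: every value in the interval is equally likely
x <- rnorm(10000, mean = 100, sd = 10)  # normal: values cluster around the mean

range(u)  # always stays inside the interval 0 to 100
mean(x)   # close to 100
sd(x)     # close to 10

# Roughly 68% of normal draws fall within one standard deviation of the mean.
mean(abs(x - 100) < 10)
```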

If you want to generate the same set of random numbers every time in R ( useful during testing ), use set.seed().

How to generate the same set of random numbers in R

Sometimes in testing, it is necessary to generate the same set of random numbers every time the program is run. Use one of the following methods to achieve it.

Reuse

Generate the set of random numbers once, save it and reuse it.

# Use the dist variable every time you need the same distribution
> dist = rnorm(100, mean=100, sd=10 )

set.seed()

You can use the same seed to get the same set of random numbers every time you run rnorm.

> set.seed(100)
> rnorm(100,mean=100,sd=10)
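
A quick way to convince yourself that this works is to draw twice with the same seed and compare the results. This is a small sketch; the variable names are made up for the example.

```r
set.seed(100)
first <- rnorm(5, mean = 100, sd = 10)

set.seed(100)  # reset to the same seed before drawing again
second <- rnorm(5, mean = 100, sd = 10)

third <- rnorm(5, mean = 100, sd = 10)  # no reset: continues the random stream

identical(first, second)  # TRUE
identical(first, third)   # FALSE
```

Note that the seed has to be set again before each draw that should repeat; otherwise the generator simply continues from where it left off.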

How to get correlation matrix for a set of variables in R

The correlation coefficient between 2 variables ( or any number of variables ) can be found using the cor function.

For example, to find the correlation coefficient between the sepal length and sepal width in the iris data set, use

> cor(iris[1:2])
             Sepal.Length Sepal.Width
Sepal.Length    1.0000000  -0.1175698
Sepal.Width    -0.1175698   1.0000000

You can use the cor() function to find the correlation matrix ( a matrix of correlation coefficients ) across all the numeric variables in the iris data set as well.

> cor(iris[1:4])
             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000
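
By default cor() computes the Pearson coefficient. As a sanity check, the sketch below writes that formula out by hand for the two sepal columns and compares it with the built-in result.

```r
x <- iris$Sepal.Length
y <- iris$Sepal.Width

# Pearson correlation written out: covariance divided by the product of the standard deviations.
manual_r <- cov(x, y) / (sd(x) * sd(y))

manual_r   # about -0.1175698, matching the matrix above
cor(x, y)  # same value from the built-in function

# cor() also supports rank-based coefficients through the method argument.
cor(x, y, method = "spearman")
```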

How do you calculate mean, median, MAD and SD in R

Let’s take a normal distribution.

> dist = rnorm(1000,mean=100,sd=10)
> plot(density(dist))

Mean

> mean(dist)
[1] 100.0442

Median

> median(dist)
[1] 100.0193

Median Absolute Deviation

> mad(dist)
[1] 9.395154

Standard Deviation

> sd(dist)
[1] 9.87821
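
Note that mad() in R computes the median absolute deviation, scaled by a constant of 1.4826 by default so that it is comparable to the standard deviation for normally distributed data. The sketch below writes the calculation out and shows why it is considered a robust measure; the seed and the outlier value are arbitrary choices.

```r
set.seed(7)  # arbitrary seed for a repeatable example
x <- rnorm(1000, mean = 100, sd = 10)

# Median absolute deviation written out, with R's default scaling constant.
manual_mad <- 1.4826 * median(abs(x - median(x)))

all.equal(manual_mad, mad(x))  # TRUE

# A single extreme outlier moves sd() a lot but barely moves mad().
x_out <- c(x, 10000)
sd(x_out)
mad(x_out)
```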

How is kurtosis of a distribution calculated in R

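While skewness measures whether a distribution is left skewed or right skewed, kurtosis measures whether its tails are thin or thick compared with a normal distribution. The kurtosis function in the e1071 package reports excess kurtosis: roughly 0 for a normal distribution, positive for thicker tails and negative for thinner tails.

A large standard deviation makes a distribution wider, but it does not by itself make the tails thicker in the kurtosis sense. For example, take a widely spread normal sample

> p_kurtosis = data.frame(v=rnorm(1000,mean=100,sd=50))

and one with a small sd ( more tightly packed around the center ).

> n_kurtosis = data.frame(v=rnorm(1000,mean=100,sd=10))

Let’s visualize this.

library(ggplot2)
# fill and alpha are constants, so they are set outside aes()
ggplot() + 
    geom_density(data = p_kurtosis, aes(x = v), fill = "red", alpha = 0.1) + 
    geom_density(data = n_kurtosis, aes(x = v), fill = "green", alpha = 0.1)

The distribution in red is much wider than the one in green, but both are normal, so the kurtosis of each is close to 0. The small positive and negative values below are sampling noise.

> library(e1071)
> kurtosis(p_kurtosis$v)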
[1] 0.1529404
> kurtosis(n_kurtosis$v)
[1] -0.01435051
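
Excess kurtosis can also be computed directly from the sample moments, which makes it easy to compare distributions whose tails really do differ. The sketch below is an illustration under stated assumptions: the helper excess_kurtosis is hypothetical ( it is not the e1071 function, whose default estimator differs slightly ), and the Student t and uniform distributions were picked here as examples of thicker and thinner tails.

```r
# Excess kurtosis from the sample moments: m4 / m2^2 - 3 (about 0 for a normal distribution).
excess_kurtosis <- function(x) {
  d <- x - mean(x)
  mean(d^4) / mean(d^2)^2 - 3
}

set.seed(123)  # arbitrary seed for a repeatable example
n <- 100000

k_narrow <- excess_kurtosis(rnorm(n, mean = 100, sd = 10))  # normal, small spread
k_wide   <- excess_kurtosis(rnorm(n, mean = 100, sd = 50))  # normal, large spread
k_heavy  <- excess_kurtosis(rt(n, df = 10))                 # Student t: thicker tails
k_light  <- excess_kurtosis(runif(n))                       # uniform: thinner tails

round(c(narrow = k_narrow, wide = k_wide, heavy = k_heavy, light = k_light), 2)
```

Both normal samples should come out near 0 regardless of their standard deviation, the t sample should be clearly positive and the uniform sample clearly negative.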

Skewness and kurtosis are related measures of a distribution's shape. Measuring the skewness of a distribution in R is covered in the next question.

How is skewness of a distribution calculated in R

Skew measures the asymmetry of a distribution. There is a right skew ( long tail to the right ) and a left skew ( vice-versa ). Let’s overlay a left and right skewed distribution and see how they look.

# A right skewed distribution ( mass near 0, long tail to the right )
> right_skew = data.frame(r=rbeta(1000,1,5))
# A left skewed distribution ( mass near 1, long tail to the left )
> left_skew = data.frame(l=rbeta(1000,5,1))

And let’s plot this to get a better picture.

library(ggplot2)
ggplot() + 
    # fill and alpha are constants, so they are set outside aes()
    geom_density(data = left_skew, aes(x = l), fill = "red", alpha = 0.1) + 
    geom_density(data = right_skew, aes(x = r), fill = "green", alpha = 0.1) +
    # dashed lines mark the means, solid lines the medians
    geom_vline(xintercept = mean(left_skew$l), color = "red", linetype = "dashed") +
    geom_vline(xintercept = median(left_skew$l), color = "red") +  
    geom_vline(xintercept = mean(right_skew$r), color = "green", linetype = "dashed") +
    geom_vline(xintercept = median(right_skew$r), color = "green")

Don’t let the syntax get to you. Focus on the resulting plot.

The red distribution is left skewed and the green distribution is right skewed. Let’s find out the skewness now.

> library(e1071)
> mean(right_skew$r)   # Mean is > Median in a right skewed distribution
[1] 0.1641156
> median(right_skew$r)
[1] 0.1298181
> skewness(right_skew$r)   # Skewness of a right skewed distribution is positive.
[1] 1.1802

Similarly, for the left skewed distribution

> mean(left_skew$l)   # Mean is < Median in a left skewed distribution
[1] 0.8281012
> median(left_skew$l)
[1] 0.8659932
> skewness(left_skew$l)   # Skewness of a left skewed distribution is negative.
[1] -1.16096
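
The sign convention is easy to verify from first principles: a long tail to the right gives positive skewness and pulls the mean above the median, and a long tail to the left does the opposite. The sketch below computes skewness from the sample moments; the helper sample_skewness and the variable names are hypothetical, and its estimator differs slightly from the e1071 default.

```r
# Sample skewness from the moments: m3 / m2^(3/2).
sample_skewness <- function(x) {
  d <- x - mean(x)
  mean(d^3) / mean(d^2)^1.5
}

set.seed(2024)  # arbitrary seed for a repeatable example
right_tail <- rbeta(10000, 1, 5)  # mass near 0, long tail to the right
left_tail  <- rbeta(10000, 5, 1)  # mass near 1, long tail to the left

sample_skewness(right_tail)            # positive
sample_skewness(left_tail)             # negative
mean(right_tail) > median(right_tail)  # TRUE: the long right tail pulls the mean up
mean(left_tail)  < median(left_tail)   # TRUE: the long left tail pulls the mean down
```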