# What is the difference between apply vs for loop

Let’s make a small data set from iris.

```
# Get 9 rows (3 from each species) and drop the Species column.
> iris_small = iris[c(1:3,51:53,101:103),-5]
> iris_small
Sepal.Length Sepal.Width Petal.Length Petal.Width
1            5.1         3.5          1.4         0.2
2            4.9         3.0          1.4         0.2
3            4.7         3.2          1.3         0.2
51           7.0         3.2          4.7         1.4
52           6.4         3.2          4.5         1.5
53           6.9         3.1          4.9         1.5
101          6.3         3.3          6.0         2.5
102          5.8         2.7          5.1         1.9
103          7.1         3.0          5.9         2.1
```

Say, we want to get the mean across all columns.

### 1. You can go the apply () route

```
> apply(iris_small, 2, mean)
Sepal.Length  Sepal.Width Petal.Length  Petal.Width
6.022222     3.133333     3.911111     1.277778

```

The second argument to apply(), MARGIN, dictates whether the function is applied row-wise or column-wise. In this case we wanted the mean of each column, so we passed 2. For the mean of each row, pass 1.
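As a minimal sketch of the two margins, the snippet below rebuilds the same 9-row subset and compares apply() with the base R shortcuts colMeans() and rowMeans(). The variable names col_means and row_means are only illustrative.

```r
# Same 9-row subset used above (Species column dropped).
iris_small <- iris[c(1:3, 51:53, 101:103), -5]

# MARGIN = 2: call mean() once per column.
col_means <- apply(iris_small, 2, mean)

# MARGIN = 1: call mean() once per row (one value per flower).
row_means <- apply(iris_small, 1, mean)

# colMeans() and rowMeans() are base R shortcuts for these two cases.
all.equal(col_means, colMeans(iris_small))
all.equal(row_means, rowMeans(iris_small))
```

For plain means, the dedicated shortcuts are usually the simpler choice; apply() earns its keep when the function is something other than a sum or a mean.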

### 2. or use the loop ( say for loop )

```
for (col in colnames(iris_small)) {
  m = mean(iris_small[[col]])
  print(m)
}
[1] 6.022222
[1] 3.133333
[1] 3.911111
[1] 1.277778

```

### Which is faster – apply () or for loop ?

apply() is essentially a for loop under the hood, so neither is meaningfully faster than the other. The advantage of apply() is readability. A for loop comes with some paraphernalia –

• specifying the iteration explicitly
• curly braces to open and close the body, etc.

which can make things a little less readable.
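The extra bookkeeping becomes more visible when the loop has to keep its results instead of printing them. The sketch below is one way to do that; the pre-allocated vector named means is an illustrative choice, not something the loop requires.

```r
iris_small <- iris[c(1:3, 51:53, 101:103), -5]

# Pre-allocate a named numeric vector to hold one mean per column.
means <- numeric(ncol(iris_small))
names(means) <- colnames(iris_small)

# Fill the vector one column at a time, indexing by column name.
for (col in colnames(iris_small)) {
  means[col] <- mean(iris_small[[col]])
}

# The apply() call produces the same named vector in one line.
all.equal(means, apply(iris_small, 2, mean))
```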

### Limitations on apply ()

apply() is designed for matrices and arrays. A data frame passed to it is first coerced to a matrix, and a matrix can hold only one type. If any column contains strings, for example, all the numeric elements are coerced to strings too. This can be a problem in data frames when you want to perform row-wise operations (which are rare anyway). For column-wise operations where every column is numeric, you are good to go.
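The coercion is easy to see on a few rows of iris with the Species column kept. This is a small sketch; iris_mixed and numeric_only are hypothetical names chosen for the example.

```r
# Keep the Species column, so the data frame mixes numbers and text.
iris_mixed <- iris[c(1, 51, 101), ]

# apply() converts the data frame to a matrix first; one non-numeric
# column turns every cell into a character string.
row_classes <- apply(iris_mixed, 1, class)
row_classes

# Selecting the numeric columns first keeps row-wise arithmetic possible.
numeric_only <- iris_mixed[, sapply(iris_mixed, is.numeric)]
row_means <- apply(numeric_only, 1, mean)
row_means
```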

# What is the difference between lapply vs sapply

sapply() is just a wrapper around lapply(). lapply() always returns a list, whereas sapply() tries to simplify the result to a vector (or a matrix) and returns a list only when it cannot.

```
# Here is a small data frame from the iris data set.
> iris_small
Sepal.Length Sepal.Width Petal.Length Petal.Width
1            5.1         3.5          1.4         0.2
2            4.9         3.0          1.4         0.2
3            4.7         3.2          1.3         0.2
51           7.0         3.2          4.7         1.4
52           6.4         3.2          4.5         1.5
53           6.9         3.1          4.9         1.5
101          6.3         3.3          6.0         2.5
102          5.8         2.7          5.1         1.9
103          7.1         3.0          5.9         2.1
```

Let’s convert this into a list (which you don’t need to do to use lapply, since a data frame is already a list of columns, but just play along for now).

```
> iris_small_list = as.list(iris_small)
> iris_small_list
$Sepal.Length
[1] 5.1 4.9 4.7 7.0 6.4 6.9 6.3 5.8 7.1
$Sepal.Width
[1] 3.5 3.0 3.2 3.2 3.2 3.1 3.3 2.7 3.0
$Petal.Length
[1] 1.4 1.4 1.3 4.7 4.5 4.9 6.0 5.1 5.9
$Petal.Width
[1] 0.2 0.2 0.2 1.4 1.5 1.5 2.5 1.9 2.1

```

Let’s calculate the mean across each of the list elements

```
> means = lapply(iris_small_list,mean)
> means
$Sepal.Length
[1] 6.022222
$Sepal.Width
[1] 3.133333
$Petal.Length
[1] 3.911111
$Petal.Width
[1] 1.277778
```

Each element of the list is passed to the mean function as an argument, and the results are returned in a list.

Now, let’s perform the same with sapply

```
> means = sapply(iris_small_list,mean)
> means
Sepal.Length  Sepal.Width Petal.Length  Petal.Width
6.022222     3.133333     3.911111     1.277778

```

This time the results are simplified to a named vector instead of a list.
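What sapply() returns depends on the shape of the individual results. The sketch below shows the three outcomes on the same data, plus vapply(), a stricter base R alternative in which the expected result type is declared up front. The variable names are illustrative.

```r
iris_small <- iris[c(1:3, 51:53, 101:103), -5]

# One value per column: simplified to a named numeric vector.
means <- sapply(iris_small, mean)

# Two values per column (min and max): simplified to a 2 x 4 matrix.
ranges <- sapply(iris_small, range)

# Results of different lengths cannot be simplified, so a list comes back.
uniques <- sapply(iris_small, unique)

# vapply() fails loudly if a result is not a single number.
means_checked <- vapply(iris_small, mean, numeric(1))

class(means)
dim(ranges)
class(uniques)
```

Because the return type of sapply() can change with the data, vapply() is often the safer choice inside functions and scripts.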

# What is ‘Stack’ing and ‘Unstack’ing in R

Stacking and unstacking are functions that help in reshaping data in R. Data is not always in the format you require. For example, say you have data from 3 temperature sensors in different areas of Chicago, and the data you get from these sensors is horizontal, with one column per sensor.

Even if you wanted to compare the temperatures across sensors, this is not a good format. Instead, if all the temperature data were in a single column, with the second column identifying where the sensor data came from, you can analyze the data much better. This is one of the data reshaping steps.

```
# Let's take the first 5 rows.
> sensors
sensor_1 sensor_2 sensor_3
1      105       98       90
2      105      113       98
3      102       92      110
4       95      101      103
5       97      113       96

```
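The source of the sensors data is not shown here. To follow along, one option is simply to type in the five rows printed above, as in this sketch:

```r
# Rebuild the 5 x 3 table printed above, one vector per sensor column.
sensors <- data.frame(
  sensor_1 = c(105, 105, 102, 95, 97),
  sensor_2 = c(98, 113, 92, 101, 113),
  sensor_3 = c(90, 98, 110, 103, 96)
)
sensors
```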

How do we stack these 3 columns into a single column? Try this:

```
> stacked = stack(sensors)
> stacked
values      ind
1     105 sensor_1
2     105 sensor_1
3     102 sensor_1
4      95 sensor_1
5      97 sensor_1
6      98 sensor_2
7     113 sensor_2
8      92 sensor_2
9     101 sensor_2
10    113 sensor_2
11     90 sensor_3
12     98 sensor_3
13    110 sensor_3
14    103 sensor_3
15     96 sensor_3
```

R automatically created a new column, called ind, to indicate which column each row came from. If you do not want to stack all the columns, you can specify the ones you want with the select argument.

```
# Say you only wanted the first and third sensors' data
> sensors = stack(sensors, select = c("sensor_1","sensor_3"))
> sensors
values      ind
1     105 sensor_1
2     105 sensor_1
3     102 sensor_1
4      95 sensor_1
5      97 sensor_1
6      90 sensor_3
7      98 sensor_3
8     110 sensor_3
9     103 sensor_3
10     96 sensor_3
```

Unstacking

Unstacking the data is just as easy. Let’s unstack the dataframe we just stacked.

```
> sensors = unstack(sensors)
> sensors
sensor_1 sensor_3
1      105       90
2      105       98
3      102      110
4       95      103
5       97       96

```

That was easy, wasn’t it?
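As a sketch of the full round trip, the snippet below stacks the table, summarises it per sensor (which is where the long format pays off), and unstacks it again. It retypes the values printed earlier; long and wide are illustrative names, and the formula values ~ ind just spells out the default that unstack() would pick for stacked data.

```r
sensors <- data.frame(
  sensor_1 = c(105, 105, 102, 95, 97),
  sensor_2 = c(98, 113, 92, 101, 113),
  sensor_3 = c(90, 98, 110, 103, 96)
)

# Wide to long: one row per reading, with ind naming the source column.
long <- stack(sensors)

# Long format makes per-sensor summaries a one-liner.
sensor_means <- tapply(long$values, long$ind, mean)
sensor_means

# Long to wide again: values are split by the levels of ind.
wide <- unstack(long, values ~ ind)
all.equal(wide, sensors)
```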

# How to visualize multi-dimensional data in R

Multi-dimensional data cannot be visualized easily. Here are some methods.

### 2D Scatter plots

Scatter plots typically are 2 dimensional. You can use the following methods to include more variables into the plot.

Color

Color can be used to map the third variable in a scatter plot. This is typically used when the third variable is categorical.

```
> attach(iris)
> plot(Sepal.Length, Sepal.Width, col = Species)
```

Shape

Shape can also be used to show the third variable. Similar to color, using shape for the 3rd variable makes sense when it is categorical. You can also use a combination of color and shape like below.

```
plot(Sepal.Length, Sepal.Width, type = "p",
     pch = c(16, 17, 18)[as.numeric(Species)],
     col = c("red", "green", "blue")[as.numeric(Species)])

```

Size

If the third variable is continuous, you can use the size of the points to show big vs small values.

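```
> plot(Sepal.Length, Sepal.Width, type = "p",
       cex = Petal.Length,         # point size scales with petal length
       bg = as.numeric(Species),   # fill color, one palette color per species
       pch = 21)                   # filled circle, so that bg takes effect
> legend("topleft", legend = levels(Species),
         pt.bg = 1:3,              # same palette colors as the points
         pch = 21)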

```

The parameter cex controls the scaling of the dots. In fact, we were able to plot 4 parameters here

• Sepal.Length ( x-axis )
• Sepal.Width ( y-axis )
• Species ( with color )
• Petal.Length ( with size )

One conclusion here could be that the petal length is generally greater for the virginica species than for setosa.
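A quick numeric check of that visual reading is to compute the mean petal length per species. This is only a sketch using base R's tapply(); mean_petal_length is an illustrative name.

```r
# Mean petal length for each species in the full iris data set.
mean_petal_length <- tapply(iris$Petal.Length, iris$Species, mean)
mean_petal_length

# TRUE if the ordering matches what the point sizes suggest.
mean_petal_length[["virginica"]] > mean_petal_length[["setosa"]]
```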

### 3D Scatter plots

3-dimensional plots take this to the next level. You can use the “z” dimension to map a third variable.

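```
library(scatterplot3d)  # install.packages("scatterplot3d") if missing

# One color per species; assumes iris is still attached from above
col = c("#FF0000", "#00FF00", "#0000FF")
col = col[as.numeric(Species)]
scatterplot3d(Sepal.Length, Sepal.Width, Petal.Length,
              pch = 16,
              color = col)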

```

# How to make interactive plots in R

Neither base R nor ggplot2 produces interactive plots on its own. You can use third-party libraries like plotly to produce interactive plots. Say you have a data frame like this and you want to make an interactive (x, y) plot.

1. Using PLOTLY

```> data
x   y
1   1   1
2   2   4
3   3   9
4   4  16
5   5  25
6   6  36
7   7  49
8   8  64
9   9  81
10 10 100
```

Step 1 – Start by making a plot in ggplot

```
> library(ggplot2)
> p = ggplot(data, aes(x, y)) + geom_point()

```

Step 2 – pass this object to the ggplotly() function

```
> library(plotly)
> ggplotly(p)
```

That’s it.

2. Using GGVIS

ggvis is another interactive graphing package from the makers of RStudio. For example, the same plot can be made in ggvis as below (for the same set of data).

```
> library(ggvis)
> p = ggvis(data, x = ~x, y = ~y)
> layer_points(p)

```

# How to add grids to plots in R

The plot function in Base R does not automatically put up a grid. For example, if you do a simple plot like this,

```
# Plot y = square(x) for x between 1 and 10
> plot(1:10, (1:10)^2)

```

It is pretty hard to figure out where exactly each of these points lie. A grid would be nice to quickly get the x and y values.

```
# nx is the number of grid cells along the x-axis.
# ny defaults to nx, so this draws a 10 x 10 grid.
> grid(nx = 10)

```

The default grid lines are light gray. You can change the color by specifying the col argument.

```
> grid(nx = 10, col = "lightblue")
```
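A grid with a fixed number of cells does not always line up with the axis labels. As a sketch of two alternatives: leaving nx and ny as NULL makes grid() draw its lines at the axis tick marks, and abline() can add lines at hand-picked positions. The specific positions and colors below are arbitrary choices for illustration.

```r
# Same plot as above, with axis labels added.
plot(1:10, (1:10)^2, xlab = "x", ylab = "x squared")

# nx = NULL and ny = NULL align the grid with the axis tick marks.
grid(nx = NULL, ny = NULL, col = "lightgray", lty = "dotted")

# abline() draws vertical (v) and horizontal (h) lines wherever you ask.
abline(v = c(5, 10), h = c(25, 100), col = "lightblue")
```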

Check out this post on how to make interactive plots in R using plotly

# How to view scatter plots for more than 2 variables in R

Plotting 2 variables against each other is probably the most common graphic you would use to spot patterns between them. However, if there are more than 2 variables and you want to find out visually which of them are correlated, you might want a scatter plot for every pair of variables. This is more or less a visual version of the cor() function in R.

For example, there are 4 variables in the iris data set that can be used to identify the species. However, we want to identify if there is any correlation across these variables. The easier way to do it is via the pairs () function.

1. Using PAIRS () function

```
# We only want to find out the correlation between the first 4 fields.
> pairs(iris[,1:4])

```

R gives out this convenient plot that draws a scatter plot for each pair of fields. In case there is confusion reading this graph, the way to interpret it is as below.

For example, the plot in the red box is equivalent to

```> library(GGally)
> ggpairs(iris[1:4])

```

and the plot in the blue box is equivalent to

```
> plot(iris$Petal.Length ~ iris$Petal.Width)
```
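For a numeric counterpart to the panels, cor() returns the correlation coefficient for every pair of columns at once. The sketch below rounds the matrix for readability; cors is an illustrative name.

```r
# Pairwise Pearson correlations between the four measurements.
cors <- cor(iris[, 1:4])
round(cors, 2)

# A single pair can be picked out by row and column name.
cors["Petal.Length", "Petal.Width"]
```

Pairs whose panel looks like a tight diagonal band should show a coefficient close to 1 or -1 here.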

2. Using GGPAIRS () function from the library GGally

This one is like a turbo version of the pairs () function. It gives you

• scatter plots for each pair of variables
• correlation coefficients
• the distribution of each variable on the diagonal.

```
> library(GGally)
> ggpairs(iris[1:4])

```

and the output looks something like this.

# Why are box plots used in R

If you already know how to create box plots in R, let’s dig into the “why” of box plots.

• Visually represent the following attributes.
• minimum
• maximum
• median
• quartiles
• outliers

```
# box plot for the "sepal length" of the setosa species
> b = boxplot(Sepal.Length ~ Species, data = iris[iris$Species=="setosa",])
```
```
> b$stats
[,1] [,2] [,3]
[1,]  4.3   NA   NA
[2,]  4.8   NA   NA
[3,]  5.0   NA   NA
[4,]  5.2   NA   NA
[5,]  5.8   NA   NA

```
• Visually compare distributions

```
> b = boxplot(Sepal.Length ~ Species, data = iris)
> b$stats
[,1] [,2] [,3]
[1,]  4.3  4.9  5.6
[2,]  4.8  5.6  6.2
[3,]  5.0  5.9  6.5
[4,]  5.2  6.3  6.9
[5,]  5.8  7.0  7.9

```

You can visually draw quick conclusions about the spread of the data across groups – in this case, species. For example, you can see that

• sepal lengths are markedly distinct across the species. That is good news for classification.
• the sepal length of setosa falls in a markedly different bucket, while there is some overlap between the versicolor and virginica species.
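The five rows of b$stats are, from top to bottom, the lower whisker, lower hinge, median, upper hinge and upper whisker. The whisker ends match the minimum and maximum only when there are no outliers. The sketch below uses boxplot.stats(), the base R function that computes these numbers without drawing anything, to show both cases; the vector names are illustrative.

```r
setosa    <- iris$Sepal.Length[iris$Species == "setosa"]
virginica <- iris$Sepal.Length[iris$Species == "virginica"]

# No outliers in setosa: the whiskers are the minimum and the maximum.
setosa_stats <- boxplot.stats(setosa)$stats
setosa_stats
range(setosa)

# virginica has a low outlier, so its lower whisker sits above the minimum.
virginica_box <- boxplot.stats(virginica)
virginica_box$stats[1]   # lower whisker
virginica_box$out        # points drawn individually as outliers
min(virginica)
```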

# How to create a boxplot in R

A box plot is one of the most intuitive tools for measuring the spread of a variable. It is also called a box-and-whisker plot.

For example, to see how the sepal length is spread across the different iris species:

```
> b = boxplot(Sepal.Length ~ Species, data = iris)
```

# How to create a histogram in R

A histogram is a wonderful tool to check the spread of data – specifically frequency distribution.

### 1. Basic histogram

```
> hist(iris$Sepal.Length)
```

### 2. Specify ‘breaks’

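Since a histogram uses binning to ‘break’ the range of numeric data into bins, specifying breaks gives you more control over the construction of the histogram, especially for small data sets. A single number is treated as a suggestion for the number of bins; R rounds the break points to convenient values.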

```
> hist(iris$Sepal.Length,breaks=20)
```
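To see which bins R actually chose, hist() can be called with plot = FALSE, which returns the bin information without drawing. The sketch below also passes an explicit vector of break points, which R uses exactly as given; the 0.5-wide bins from 4 to 8 are an arbitrary choice that happens to cover the data.

```r
# A single number is only a suggestion for the number of bins.
h_suggested <- hist(iris$Sepal.Length, breaks = 20, plot = FALSE)
h_suggested$breaks
length(h_suggested$counts)

# A vector of break points is used as is.
h_exact <- hist(iris$Sepal.Length, breaks = seq(4, 8, by = 0.5), plot = FALSE)
h_exact$breaks
h_exact$counts
```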

### 3. Density plot

Histograms by default are based on frequency distribution. You could also base it on density.

```
> h = hist(iris$Sepal.Length,breaks=20,freq=FALSE)

```

You can get the actual values from the density component of the returned object.

```
> h$density
[1] 0.13333333 0.16666667 0.23333333 0.53333333 0.43333333 0.23333333
[7] 0.43333333 0.50000000 0.30000000 0.33333333 0.53333333 0.23333333
[13] 0.36666667 0.16666667 0.13333333 0.06666667 0.03333333 0.13333333
[19] 0.03333333

```
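The density values are the bin counts rescaled so that the total area of the bars is 1, that is, count / (n × bin width). The following sketch recomputes them by hand to make the relationship concrete; bin_width and manual_density are illustrative names.

```r
# plot = FALSE returns the same components without drawing.
h <- hist(iris$Sepal.Length, breaks = 20, plot = FALSE)

# Width of each bin, taken from consecutive break points.
bin_width <- diff(h$breaks)

# Density = count / (total number of observations * bin width).
manual_density <- h$counts / (sum(h$counts) * bin_width)
all.equal(manual_density, h$density)

# The bar areas (density * width) add up to 1.
sum(h$density * bin_width)
```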
