# How to create a pie chart in R

Pie charts carry little analytical value from a statistician’s perspective, but they are a simple way to show proportions to a lay audience.

```
> summary = table(mtcars$cyl)
> summary
 4  6  8
11  7 14
> pie(summary)
```

Unfortunately, the standard pie() function does not show percentages on the chart. We have to calculate them ourselves and pass them in as labels.

```
> labels = summary * 100 / sum(summary)
> labels
     4      6      8
34.375 21.875 43.750
> labels = round(labels)
> labels
 4  6  8
34 22 44
> pie(summary,labels)
```
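The bare numbers are easier to read when each label also carries the category name and a percent sign. A minimal sketch, assuming the built-in mtcars data; the variable names counts, pct and slice_labels, the label wording and the chart title are illustrative choices, not part of the example above:

```r
# Count cars per cylinder class in the built-in mtcars data set
counts <- table(mtcars$cyl)

# Convert the counts to whole-number percentages
pct <- round(100 * counts / sum(counts))

# Build labels such as "4 cyl: 34%" (the wording is an illustrative choice)
slice_labels <- paste0(names(counts), " cyl: ", pct, "%")

# Draw the pie chart with the custom labels
pie(counts, labels = slice_labels, main = "Cars by number of cylinders")
```

Rounded percentages do not always add up to exactly 100, so it can be worth checking the sum before publishing a chart.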

# How to find out elements in one vector that are not in another vector

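Say you have 2 vectors,

```
> all_cities = c("hyderabad","delhi","mumbai","koklata","chennai")
> # south_cities as implied by the setdiff() result below
> south_cities = c("hyderabad","chennai")
```

How do you find out the cities that are in the north?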

```
> north_cities = setdiff(all_cities,south_cities)
> north_cities
[1] "delhi"   "mumbai"  "koklata"

```

The setdiff() function is one of R's set operations, alongside union(), intersect() and setequal().

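In this case, all_cities is a superset and south_cities is a subset. The two sets do not have to be related that way; they can also overlap only partially. For example, think of all cities with

• a population > 10M
• a metro rail transport

```
> cities_10m = c("hyderabad","delhi","mumbai","koklata","chennai")
> # membership of cities_metro is inferred from the outputs that follow; the order is assumed
> cities_metro = c("hyderabad","delhi","chennai","bangalore")
```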
```
# cities with a metro that do not have a population of 10M
> setdiff(cities_metro,cities_10m)
[1] "bangalore"
```
```
# cities that have a population of 10M without a metro
> setdiff(cities_10m,cities_metro)
[1] "mumbai"  "koklata"

```

This shows why setdiff() is an asymmetric function: swapping its two arguments changes the result.

```
# cities that have a population of 10M with a metro
> intersect(all_cities,cities_metro)
[1] "hyderabad" "delhi"     "chennai"
```
```
# cities with either a population of 10M or a metro
> union(cities_metro,all_cities)
[1] "hyderabad" "delhi"     "chennai"
[4] "bangalore" "mumbai"    "koklata"
```
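The four set operations can be compared side by side on a tiny example. This is a minimal sketch; the vectors a and b and their values are hypothetical and chosen only to make the results easy to check by eye:

```r
# Two hypothetical, partially overlapping sets (values are illustrative only)
a <- c("x", "y", "z")
b <- c("y", "z", "w")

only_in_a <- setdiff(a, b)    # elements of a that are not in b
only_in_b <- setdiff(b, a)    # swapping the arguments gives a different answer
in_both   <- intersect(a, b)  # elements common to both
in_either <- union(a, b)      # all distinct elements, those of a first

# setequal() ignores both order and duplicates
same_set <- setequal(a, c("z", "y", "x", "x"))

print(only_in_a)
print(only_in_b)
print(in_both)
print(in_either)
print(same_set)
```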

# What is Vector Recycling in R?

Element-wise operations on vectors need the vectors to be of the same size. Say you add two vectors of different lengths:

```
> cats
John  Ajay Aishu
1     2     1
> dogs
John  Ajay Aishu  Mary Emily  Siva
0     1     2     0     2     1
> pets = cats + dogs
> pets
John  Ajay Aishu  Mary Emily  Siva
1     3     3     1     4     2

```

As you can see, the elements of the cats vector are recycled until it matches the length of the dogs vector, and only then is the addition performed. Since R needs the vectors to be of the same length to operate on them, it adjusts the shorter one automatically by reusing its elements (and warns if the longer length is not a multiple of the shorter one). In fact, this is also what happens when you perform operations on a vector and a single number.

```
> avg = pets-mean(pets)
> avg
John       Ajay      Aishu       Mary      Emily       Siva
-1.3333333  0.6666667  0.6666667 -1.3333333  1.6666667 -0.3333333
```

Although mean(pets) looks like a single number, R represents it internally as a numeric vector of length 1, so it is recycled in exactly the same way.

```
> mean(pets)
[1] 2.333333
> class(mean(pets))
[1] "numeric"

```
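Recycling can be made visible by writing it out by hand with rep(). A minimal sketch, assuming two illustrative vectors named short and long that mirror the cats and dogs counts without their names:

```r
# Illustrative vectors: one of length 3, one of length 6
short <- c(1, 2, 1)
long  <- c(0, 1, 2, 0, 2, 1)

# R recycles `short` to length 6 before adding
recycled <- short + long

# The same computation with the recycling written out explicitly
explicit <- rep(short, length.out = length(long)) + long

# A single number is a length-1 vector, so it is recycled the same way
is_vector  <- is.vector(5)
length_one <- length(5)

# When the longer length is not a multiple of the shorter one,
# R still recycles but raises a warning; capture its message here
warn_msg <- tryCatch(
  c(1, 2) + c(1, 2, 3),
  warning = function(w) conditionMessage(w)
)

print(recycled)
print(warn_msg)
```

The warning case is worth remembering: a silent length mismatch in real data is usually a bug rather than intended recycling.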

# How to remove duplicate rows from a data frame in R

### 1 – duplicated()

In the iris_dup data frame below, the last 3 rows are duplicates of rows 4, 5 and 6.

```
> iris_dup
Sepal.Length Sepal.Width Petal.Length petal.Width    species
1            5.1         3.5          1.4         0.2     Setosa
2            4.9         3.0          1.4         0.2     Setosa
3            4.7         3.2          1.3         0.2     Setosa
51           7.0         3.2          4.7         1.4  Virginica
52           6.4         3.2          4.5         1.5  Virginica
53           6.9         3.1          4.9         1.5  Virginica
101          6.3         3.3          6.0         2.5 Versicolor
102          5.8         2.7          5.1         1.9 Versicolor
103          7.1         3.0          5.9         2.1 Versicolor
511          7.0         3.2          4.7         1.4  Virginica
521          6.4         3.2          4.5         1.5  Virginica
531          6.9         3.1          4.9         1.5  Virginica

```

Step 1 – Find out the duplicates

```
> duplicated(iris_dup)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE
```

Step 2 – Use this vector to select only the duplicate rows

```
> iris_dup[duplicated(iris_dup),]
Sepal.Length Sepal.Width Petal.Length petal.Width   species
511          7.0         3.2          4.7         1.4 Virginica
521          6.4         3.2          4.5         1.5 Virginica
531          6.9         3.1          4.9         1.5 Virginica
```

or negate it to keep only the non-duplicate rows.

```
> iris_dup[!duplicated(iris_dup),]
Sepal.Length Sepal.Width Petal.Length petal.Width    species
1            5.1         3.5          1.4         0.2     Setosa
2            4.9         3.0          1.4         0.2     Setosa
3            4.7         3.2          1.3         0.2     Setosa
51           7.0         3.2          4.7         1.4  Virginica
52           6.4         3.2          4.5         1.5  Virginica
53           6.9         3.1          4.9         1.5  Virginica
101          6.3         3.3          6.0         2.5 Versicolor
102          5.8         2.7          5.1         1.9 Versicolor
103          7.1         3.0          5.9         2.1 Versicolor

```

### 2 – unique()

This is much more straightforward.

```
> unique(iris_dup)
Sepal.Length Sepal.Width Petal.Length petal.Width    species
1            5.1         3.5          1.4         0.2     Setosa
2            4.9         3.0          1.4         0.2     Setosa
3            4.7         3.2          1.3         0.2     Setosa
51           7.0         3.2          4.7         1.4  Virginica
52           6.4         3.2          4.5         1.5  Virginica
53           6.9         3.1          4.9         1.5  Virginica
101          6.3         3.3          6.0         2.5 Versicolor
102          5.8         2.7          5.1         1.9 Versicolor
103          7.1         3.0          5.9         2.1 Versicolor
```
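Both approaches can be tried end to end on the built-in iris data. A minimal sketch, assuming a data frame built by repeating three rows; the names base_rows, dup_flags, deduped and dup_from_last are illustrative:

```r
# Build a small data frame with three repeated rows from the built-in iris data
base_rows <- iris[c(1, 2, 3, 51, 52, 53), ]
iris_dup  <- rbind(base_rows, base_rows[4:6, ])

# duplicated() flags the second and later occurrences of a row
dup_flags <- duplicated(iris_dup)

# Keep only the first occurrence of every row
deduped <- iris_dup[!dup_flags, ]

# fromLast = TRUE flags the earlier occurrences instead of the later ones
dup_from_last <- duplicated(iris_dup, fromLast = TRUE)

print(dup_flags)
print(nrow(deduped))
```

For a data frame, unique() gives the same rows as the negated duplicated() index, so the choice between them is mostly about whether you also need to inspect the duplicates.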

# How to remove particular values from a data frame in R

Say you are compiling the temperature (in Celsius) in Chicago from 5 sensors, each at a different location. You might want to perform different kinds of analysis on the data, so you put it in a data frame with one column (feature) per sensor.

```
# Data from 5 sensors for the week
> s1 = c(12.4, 13.5, 15.6, 20, 21.5, 13.6, 12.4)
> s2 = c(12.3, 13.9, 15.3, 20.4, 21.9, 13.4, 12.7)
> s3 = c(11.9, 12.8, 15.4, 19.8, 21.3, 13.8, 12.2)
> s4 = c(12.7, 13.2, 14.7, 20.9, 20.9, 13.2, 12.1)
> s5 = c(12.1, 13.1, -100, 19.8, 20.3, 12.9, 12.3)
# Create a data frame
> temp_week = data.frame(s1,s2,s3,s4,s5)
> temp_week
s1   s2   s3   s4   s5
1 12.4 12.3 11.9 12.7 12.1
2 13.5 13.9 12.8 13.2 13.1
3 15.6 15.3 15.4 14.7 -100
4 20.0 20.4 19.8 20.9 19.8
5 21.5 21.9 21.3 20.9 20.3
6 13.6 13.4 13.8 13.2 12.9
7 12.4 12.7 12.2 12.1 12.3

```

As you can see, there is a weird data point. Say all malfunctioning sensors send -100 as their output, so you know for sure that -100 is not a real reading and should be replaced with NA for now.

A simple way to do it is by just saying

```
> temp_week[3,5]=NA

```

But in general you would not know in advance which sensor failed. Imagine this were thousands of rows of data: you need a simple way to replace all malfunctioning sensor data (the value -100) with NA.

Step 1 – Figure out which value in each column has -100. We are starting with the 5th column just for convenience

```
# The third row of the fifth sensor is failing
> temp_week$s5 == -100
[1] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
```

Step 2 – Using this logical (TRUE/FALSE) vector as the index into the data frame column returns just that element

```
> temp_week$s5[temp_week$s5 == -100]
[1] -100

```

Step 3 – Now that we know how to identify the element in a column, set it to NA

```
> temp_week$s5[temp_week$s5 == -100] = NA
> temp_week$s5
[1] 12.1 13.1   NA 19.8 20.3 12.9 12.3

```

Step 4 – We can repeat this across all the columns. Let’s use a for loop

```
for(i in names(temp_week)){
  temp_week$i[temp_week$i == -100] = NA
}
Error in `$<-.data.frame`(`*tmp*`, "i", value = logical(0)) :
  replacement has 0 rows, data has 7
```

Why do you see an error? Because the loop variable i is never substituted into temp_week$i. The $ operator treats whatever follows it as a literal column name, so R looks for a column that is actually called i instead of s1, s2 and so on. A more programmatic way to access the columns of a data frame is via double square brackets [[ ]], which accept a column name held in a variable. Let’s try this again.

```
for(i in names(temp_week)){
  temp_week[[i]][temp_week[[i]] == -100] = NA
}
> temp_week
s1   s2   s3   s4   s5
1 12.4 12.3 11.9 12.7 12.1
2 13.5 13.9 12.8 13.2 13.1
3 15.6 15.3 15.4 14.7   NA
4 20.0 20.4 19.8 20.9 19.8
5 21.5 21.9 21.3 20.9 20.3
6 13.6 13.4 13.8 13.2 12.9
7 12.4 12.7 12.2 12.1 12.3

```

A shorter way would be to use the lapply() function, but a for loop shows more clearly what is happening.
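For comparison, here is a minimal sketch of two loop-free alternatives. It assumes a small stand-in for temp_week with only two sensors; the names via_lapply and via_mask are illustrative:

```r
# A small stand-in for temp_week; -100 marks a failed reading
temp_week <- data.frame(
  s1 = c(12.4, 13.5, -100),
  s2 = c(12.3, -100, 15.3)
)

# Option 1: apply the replacement to every column with lapply().
# Assigning into via_lapply[] keeps the data frame structure.
via_lapply <- temp_week
via_lapply[] <- lapply(via_lapply, function(col) replace(col, col == -100, NA))

# Option 2: index the whole data frame with a logical matrix
via_mask <- temp_week
via_mask[via_mask == -100] <- NA

print(via_lapply)
print(via_mask)
```

Both options produce the same data frame here; the second is the most compact, while the first generalizes more easily to replacements that need per-column logic.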

So far so good. Let’s go one step further and replace NA with the average value of the remaining sensors for the day. That way, we can better use the available data (otherwise, we would probably have to throw away that row).

### Replace NA with the average of the rows

Step 1 – Find out the mean for each row.

```
# Find the mean for each row using rowMeans()
> row_means = rowMeans(temp_week, na.rm=TRUE)
> row_means
[1] 12.28 13.30 15.25 20.18 21.18 13.38 12.34

```

Step 2 – Replace the NA with the average of each row

```
# column by column manually
temp_week$s5[is.na(temp_week$s5)] = row_means[is.na(temp_week$s5)]

```
```
# or via a loop (this looks scary at first because of the brackets,
# but it is just the statement above with $s5 generalized into a column name)
for ( name in colnames(temp_week)){
  temp_week[[name]][is.na(temp_week[[name]])] =
    row_means[is.na(temp_week[[name]])]
}

```
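The same fill can be done without a loop by working on a matrix, where each NA has an explicit (row, column) position. A minimal sketch, assuming a small three-sensor stand-in for temp_week; the names readings, na_pos and filled are illustrative:

```r
# A small stand-in for temp_week with one missing reading
temp_week <- data.frame(
  s1 = c(12.4, 15.6),
  s2 = c(12.3, 15.3),
  s3 = c(11.9, NA)
)

# Mean of each row, ignoring missing values
row_means <- rowMeans(temp_week, na.rm = TRUE)

# Work on a numeric matrix so that every NA can be addressed by (row, col)
readings <- as.matrix(temp_week)
na_pos   <- which(is.na(readings), arr.ind = TRUE)

# Replace each NA with the mean of the row it sits in
# (the first column of na_pos holds the row numbers)
readings[na_pos] <- row_means[na_pos[, 1]]
filled <- as.data.frame(readings)

print(filled)
```

Converting to a matrix only makes sense when all columns are numeric, as they are for the sensor data; with mixed column types the loop over columns is the safer choice.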

# How to remove all rows in a data frame with NAs in R

R has built-in helpers for this, such as na.omit() and complete.cases(), but doing it by hand shows what they do. Let’s go through it step by step.

```
> persons
age names        cities   zip
1  NA  Ajay San Francisco 94000
3  21  Mary     Sunnyvale 94010
4  60 Aishu      San Jose 94001

```

Step 1 – Identify all the elements that are NA

```
> is.na(persons)
age names cities   zip
[1,]  TRUE FALSE  FALSE FALSE
[2,] FALSE FALSE   TRUE FALSE
[3,] FALSE FALSE  FALSE FALSE
[4,] FALSE FALSE  FALSE FALSE

```

Step 2 – Invert this so that NAs show up as FALSE

```
> !is.na(persons)
age names cities  zip
[1,] FALSE  TRUE   TRUE TRUE
[2,]  TRUE  TRUE  FALSE TRUE
[3,]  TRUE  TRUE   TRUE TRUE
[4,]  TRUE  TRUE   TRUE TRUE
```

Step 3 – Find out if all elements in a row are true

Step 4 – Apply this function across all rows of the data frame

```
# TRUE marks the rows without any NA; FALSE marks the rows with at least one NA
> apply(!is.na(persons),1,all)
[1] FALSE FALSE  TRUE  TRUE
```

The first argument of the apply() function is the logical matrix computed from the data frame, the second argument is 1 (for a row-wise operation), and all is the function to be called on each row.

Step 5 – Now let’s put it all together.

```
> persons = persons[apply(!is.na(persons),1,all),]
> persons
age names    cities   zip
3  21  Mary Sunnyvale 94010
4  60 Aishu  San Jose 94001

```
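The step-by-step result can be checked against the built-in helpers complete.cases() and na.omit(). A minimal sketch; the data frame is hypothetical and only loosely modelled on persons (in particular, the second row's name and zip are made up for illustration):

```r
# Hypothetical data, loosely modelled on the persons data frame above
persons <- data.frame(
  age    = c(NA, 35, 21, 60),
  names  = c("Ajay", "Ravi", "Mary", "Aishu"),
  cities = c("San Francisco", NA, "Sunnyvale", "San Jose"),
  zip    = c(94000, 94002, 94010, 94001)
)

# Step-by-step approach from the text: keep rows where no value is NA
by_apply <- persons[apply(!is.na(persons), 1, all), ]

# complete.cases() returns the same kind of logical vector in one call
by_complete <- persons[complete.cases(persons), ]

# na.omit() drops the incomplete rows directly
by_omit <- na.omit(persons)

print(by_apply)
```

All three keep the same rows. na.omit() additionally records which rows it removed in an attribute of the result, which can be handy for auditing.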

# How to delete rows from a data frame in R

Deleting rows is typically done by subsetting: you keep the rows that do not match the condition.

For example, say you want to delete all rows where species is “Setosa” or “Virginica”.

```
> iris_small
Sepal.Length Sepal.Width Petal.Length petal.Width    species
1            5.1         3.5          1.4         0.2     Setosa
2            4.9         3.0          1.4         0.2     Setosa
3            4.7         3.2          1.3         0.2     Setosa
51           7.0         3.2          4.7         1.4  Virginica
52           6.4         3.2          4.5         1.5  Virginica
53           6.9         3.1          4.9         1.5  Virginica
101          6.3         3.3          6.0         2.5 Versicolor
102          5.8         2.7          5.1         1.9 Versicolor
103          7.1         3.0          5.9         2.1 Versicolor

```

Use subsetting with a negated condition and assign the result to a new variable. Note that the comparison is case sensitive.

```
> iris_versicolor = iris_small[!(iris_small$species %in% c("Setosa","Virginica")),]
> iris_versicolor
    Sepal.Length Sepal.Width Petal.Length petal.Width    species
101          6.3         3.3          6.0         2.5 Versicolor
102          5.8         2.7          5.1         1.9 Versicolor
103          7.1         3.0          5.9         2.1 Versicolor
```
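The same pattern can be run directly against the built-in iris data. A minimal sketch; note that built-in iris spells the column Species and uses lower-case species names, unlike the hand-made iris_small above, and that the name to_delete is illustrative:

```r
# Nine rows of the built-in iris data, three per species
iris_small <- iris[c(1:3, 51:53, 101:103), ]

# TRUE for the rows to delete (built-in iris uses lower-case species names)
to_delete <- iris_small$Species %in% c("setosa", "virginica")

# Negate the condition to keep everything else
iris_versicolor <- iris_small[!to_delete, ]

# Optionally drop the now-unused factor levels
iris_versicolor$Species <- droplevels(iris_versicolor$Species)

print(iris_versicolor)
```

Keeping the condition in its own variable makes it easy to inspect how many rows will be removed before actually removing them.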

# How to delete columns from a data frame in R

### 1 – Setting the column to NULL

Say you want to remove the species column.

```
> iris_small
Sepal.Length Sepal.Width Petal.Length petal.Width    species
1            5.1         3.5          1.4         0.2     Setosa
2            4.9         3.0          1.4         0.2     Setosa
3            4.7         3.2          1.3         0.2     Setosa
51           7.0         3.2          4.7         1.4  Virginica
52           6.4         3.2          4.5         1.5  Virginica
53           6.9         3.1          4.9         1.5  Virginica
101          6.3         3.3          6.0         2.5 Versicolor
102          5.8         2.7          5.1         1.9 Versicolor
103          7.1         3.0          5.9         2.1 Versicolor

```

Set the column to NULL

```
> iris_small$species = NULL
> iris_small
Sepal.Length Sepal.Width Petal.Length petal.Width
1            5.1         3.5          1.4         0.2
2            4.9         3.0          1.4         0.2
3            4.7         3.2          1.3         0.2
51           7.0         3.2          4.7         1.4
52           6.4         3.2          4.5         1.5
53           6.9         3.1          4.9         1.5
101          6.3         3.3          6.0         2.5
102          5.8         2.7          5.1         1.9
103          7.1         3.0          5.9         2.1

```

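### 2 – Subsetting

Starting again from the original iris_small, which still has the species column, drop the fifth column with a negative index.

```
> iris_no_species = iris_small[,-5]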
> iris_no_species
Sepal.Length Sepal.Width Petal.Length petal.Width
1            5.1         3.5          1.4         0.2
2            4.9         3.0          1.4         0.2
3            4.7         3.2          1.3         0.2
51           7.0         3.2          4.7         1.4
52           6.4         3.2          4.5         1.5
53           6.9         3.1          4.9         1.5
101          6.3         3.3          6.0         2.5
102          5.8         2.7          5.1         1.9
103          7.1         3.0          5.9         2.1

```
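Dropping a column by position breaks silently if the column order changes, so it can be safer to drop it by name. A minimal sketch comparing the approaches on the built-in iris data (where the column is spelled Species); the names by_null, by_position and by_name are illustrative:

```r
# Nine rows of the built-in iris data, including the Species column
iris_small <- iris[c(1:3, 51:53, 101:103), ]

# 1. Assign NULL to the column (here on a copy, to keep iris_small intact)
by_null <- iris_small
by_null$Species <- NULL

# 2. Negative position: fragile if the column order ever changes
by_position <- iris_small[, -5]

# 3. By name: keep every column except "Species"
by_name <- iris_small[, setdiff(names(iris_small), "Species")]

print(names(by_name))
```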

# How can you add rows to a data frame in R

### 1 – rbind()

```
> iris_small
Sepal.Length Sepal.Width Petal.Length Petal.Width
1            5.1         3.5          1.4         0.2
2            4.9         3.0          1.4         0.2
3            4.7         3.2          1.3         0.2
51           7.0         3.2          4.7         1.4
52           6.4         3.2          4.5         1.5
53           6.9         3.1          4.9         1.5
101          6.3         3.3          6.0         2.5
102          5.8         2.7          5.1         1.9
103          7.1         3.0          5.9         2.1

```

Say we have a new observation we want to add to the data set above.

```
> new_obs = c(8.1, 3.1,6.1,2.5)
```

Use the rbind() (“row bind”) function

```
> rbind(iris_small,new_obs)
Sepal.Length Sepal.Width Petal.Length Petal.Width
1            5.1         3.5          1.4         0.2
2            4.9         3.0          1.4         0.2
3            4.7         3.2          1.3         0.2
51           7.0         3.2          4.7         1.4
52           6.4         3.2          4.5         1.5
53           6.9         3.1          4.9         1.5
101          6.3         3.3          6.0         2.5
102          5.8         2.7          5.1         1.9
103          7.1         3.0          5.9         2.1
10           8.1         3.1          6.1         2.5

```

### 2 – rbind() with more than one observation

In fact, rbind() can be used to add more than just 1 observation, for example to append an entire data frame to another.

Let’s create a new data frame as a subset of iris

```
> new_obs = iris_small[c(1,3,9),]
> new_obs
Sepal.Length Sepal.Width Petal.Length Petal.Width
1            5.1         3.5          1.4         0.2
3            4.7         3.2          1.3         0.2
103          7.1         3.0          5.9         2.1
```

new_obs is an entirely new data frame with 3 rows. Use rbind() to add this new data frame to the iris_small data frame.

```
> rbind(iris_small,new_obs)
Sepal.Length Sepal.Width Petal.Length Petal.Width
1             5.1         3.5          1.4         0.2
2             4.9         3.0          1.4         0.2
3             4.7         3.2          1.3         0.2
51            7.0         3.2          4.7         1.4
52            6.4         3.2          4.5         1.5
53            6.9         3.1          4.9         1.5
101           6.3         3.3          6.0         2.5
102           5.8         2.7          5.1         1.9
103           7.1         3.0          5.9         2.1
11            5.1         3.5          1.4         0.2
31            4.7         3.2          1.3         0.2
1031          7.1         3.0          5.9         2.1

```

Caution – Ensure that the column names of the two data frames are an exact match, including case. Otherwise rbind() fails with an error:

```
> rbind(iris_small,new_obs)
Error in match.names(clabs, names(xi)) :
names do not match previous names
```
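The name-matching rule can be reproduced safely by catching the error instead of letting it stop the script. A minimal sketch on the built-in iris data; the one-row data frames new_obs and bad_obs and the variable err_msg are illustrative:

```r
# Six rows and the four numeric columns of the built-in iris data
iris_small <- iris[c(1:3, 51:53), 1:4]

# A new observation as a one-row data frame with matching column names
new_obs <- data.frame(
  Sepal.Length = 8.1, Sepal.Width = 3.1,
  Petal.Length = 6.1, Petal.Width = 2.5
)
combined <- rbind(iris_small, new_obs)

# Same values, but one column name differs in case
bad_obs <- new_obs
names(bad_obs)[4] <- "petal.width"

# rbind() refuses mismatched names; capture the error message instead of stopping
err_msg <- tryCatch(
  {
    rbind(iris_small, bad_obs)
    "no error"
  },
  error = function(e) conditionMessage(e)
)

print(nrow(combined))
print(err_msg)
```

Passing the new observation as a named one-row data frame, rather than a bare vector, means the values are matched to columns by name instead of by position.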

# How can you add columns to a data frame in R

### 1 – cbind() to add columns

```
> iris_small
Sepal.Length Sepal.Width Petal.Length Petal.Width
1            5.1         3.5          1.4         0.2
2            4.9         3.0          1.4         0.2
3            4.7         3.2          1.3         0.2
51           7.0         3.2          4.7         1.4
52           6.4         3.2          4.5         1.5
53           6.9         3.1          4.9         1.5
101          6.3         3.3          6.0         2.5
102          5.8         2.7          5.1         1.9
103          7.1         3.0          5.9         2.1

```

Say, we want to add the species column.

```
> species = c("Setosa","Setosa","Setosa","Virginica","Virginica","Virginica","Versicolor","Versicolor","Versicolor")
```

Use the “column bind” function, cbind(), to add the species vector as a new column.

```
> iris_small_species = cbind(iris_small,species)
> iris_small_species
Sepal.Length Sepal.Width Petal.Length Petal.Width    species
1            5.1         3.5          1.4         0.2     Setosa
2            4.9         3.0          1.4         0.2     Setosa
3            4.7         3.2          1.3         0.2     Setosa
51           7.0         3.2          4.7         1.4  Virginica
52           6.4         3.2          4.5         1.5  Virginica
53           6.9         3.1          4.9         1.5  Virginica
101          6.3         3.3          6.0         2.5 Versicolor
102          5.8         2.7          5.1         1.9 Versicolor
103          7.1         3.0          5.9         2.1 Versicolor
```

cbind() is good if you want to add data programmatically, say in a loop. If you want to do it ad hoc, you can assign the new column directly with the $ operator.

### 2 – $ method

```
> iris_small$species = species
> iris_small
Sepal.Length Sepal.Width Petal.Length Petal.Width    species
1            5.1         3.5          1.4         0.2     Setosa
2            4.9         3.0          1.4         0.2     Setosa
3            4.7         3.2          1.3         0.2     Setosa
51           7.0         3.2          4.7         1.4  Virginica
52           6.4         3.2          4.5         1.5  Virginica
53           6.9         3.1          4.9         1.5  Virginica
101          6.3         3.3          6.0         2.5 Versicolor
102          5.8         2.7          5.1         1.9 Versicolor
103          7.1         3.0          5.9         2.1 Versicolor
```
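The two methods, plus the [[ ]] form that takes the column name as a string, can be compared directly. A minimal sketch on the built-in iris data; the names with_cbind, with_dollar, with_brackets and new_col are illustrative:

```r
# Six rows and the four numeric columns of the built-in iris data
iris_small <- iris[c(1:3, 51:53), 1:4]

# The matching species labels, taken from the same rows
species <- as.character(iris$Species[c(1:3, 51:53)])

# cbind() returns a new data frame; the column is named after the variable
with_cbind <- cbind(iris_small, species)

# $ assignment adds the column to an existing data frame
with_dollar <- iris_small
with_dollar$species <- species

# [[ ]] takes the column name as a string, which helps inside loops
with_brackets <- iris_small
new_col <- "species"
with_brackets[[new_col]] <- species

print(names(with_cbind))
```

Whichever form is used, the new vector should have one value per row; a shorter vector is either recycled or rejected, depending on its length.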