How to remove particular values from a data frame in R


Say you are compiling the temperature (in Celsius) in Chicago from 5 sensors, each at a different location. You might want to run different kinds of analysis on the readings, so you put them in a data frame with one column (feature) per sensor:

# Data from 5 sensors for the week
> s1 = c(12.4, 13.5, 15.6, 20, 21.5, 13.6, 12.4)
> s2 = c(12.3, 13.9, 15.3, 20.4, 21.9, 13.4, 12.7)
> s3 = c(11.9, 12.8, 15.4, 19.8, 21.3, 13.8, 12.2)
> s4 = c(12.7, 13.2, 14.7, 20.9, 20.9, 13.2, 12.1)
> s5 = c(12.1, 13.1, -100, 19.8, 20.3, 12.9, 12.3)
# Create a data frame and print it
> temp_week = data.frame(s1,s2,s3,s4,s5)
> temp_week
    s1   s2   s3   s4   s5
1 12.4 12.3 11.9 12.7 12.1
2 13.5 13.9 12.8 13.2 13.1
3 15.6 15.3 15.4 14.7 -100
4 20.0 20.4 19.8 20.9 19.8
5 21.5 21.9 21.3 20.9 20.3
6 13.6 13.4 13.8 13.2 12.9
7 12.4 12.7 12.2 12.1 12.3

As you can see, there is a weird data point. Say every malfunctioning sensor sends -100 as its output. You then know for sure that -100 is not a real reading and should be replaced with NA for now.

A simple way to do it is by just saying

> temp_week[3,5]=NA

But you wouldn’t know in advance which sensor failed, right? Imagine this were thousands of rows of data. You need a simple way to replace all malfunctioning sensor data (the -100 values) with NA.

Step 1 – Figure out which values in a column equal -100. We start with the 5th column just for convenience.

# The third row of the fifth sensor is failing
> temp_week$s5 == -100
[1] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE
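
The comparison above checks a single column. If you just want to see where the faulty readings sit across the whole data frame, one possible sketch is to compare the entire data frame with -100 and ask which() for row and column positions. The variable name bad is only illustrative, and the data is rebuilt here so the snippet runs on its own.

```r
# Rebuild the example data so this snippet is self-contained
temp_week <- data.frame(
  s1 = c(12.4, 13.5, 15.6, 20, 21.5, 13.6, 12.4),
  s2 = c(12.3, 13.9, 15.3, 20.4, 21.9, 13.4, 12.7),
  s3 = c(11.9, 12.8, 15.4, 19.8, 21.3, 13.8, 12.2),
  s4 = c(12.7, 13.2, 14.7, 20.9, 20.9, 13.2, 12.1),
  s5 = c(12.1, 13.1, -100, 19.8, 20.3, 12.9, 12.3)
)

# Comparing a data frame with a number gives a logical matrix;
# arr.ind = TRUE makes which() report row and column positions
bad <- which(temp_week == -100, arr.ind = TRUE)
bad

# Translate the column positions into sensor names
names(temp_week)[bad[, "col"]]
```

For this data, bad has a single entry pointing at row 3, column 5, which is sensor s5.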

Step 2 – Passing this vector of TRUE/FALSE values as the index to the data frame column returns just that element.

> temp_week$s5[temp_week$s5 == -100]
[1] -100

Step 3 – Now that we know how to identify the element in a column, set it to NA.

> temp_week$s5[temp_week$s5 == -100] = NA
> temp_week$s5
[1] 12.1 13.1   NA 19.8 20.3 12.9 12.3

Step 4 – We can repeat this across all the columns. Let’s use a for loop

for(i in names(temp_week)){
  temp_week$i[temp_week$i == -100] = NA
}
Error in `$<-.data.frame`(`*tmp*`, "i", value = logical(0)) : 
  replacement has 0 rows, data has 7

Why do you see an error? Because temp_week$i is not replaced with temp_week$s1, temp_week$s2 and so on as the loop runs. The $ operator treats the name after it as a literal column name, so temp_week$i looks for a column called "i" (which does not exist) instead of using the value of the variable i. A more programmatic way to access the columns of a data frame is via double square brackets [[ ]], which accept a variable. Let’s try this again.
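
To see the difference between $ and [[ ]] in isolation, here is a minimal sketch on a tiny made-up data frame (df and its values are invented purely for illustration):

```r
# A tiny illustrative data frame (values are made up)
df <- data.frame(s1 = c(1, -100), s2 = c(-100, 4))

i <- "s1"

# $ takes the name literally: there is no column called "i", so this is NULL
df$i

# [[ ]] evaluates i first, so this returns the s1 column
df[[i]]

# Writing the name as a string gives the same column
df[["s1"]]
```

The loop variable only has an effect in the [[ ]] form, which is why the corrected loop uses it.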

for(i in names(temp_week)){
  temp_week[[i]][temp_week[[i]] == -100] = NA
}
> temp_week
    s1   s2   s3   s4   s5
1 12.4 12.3 11.9 12.7 12.1
2 13.5 13.9 12.8 13.2 13.1
3 15.6 15.3 15.4 14.7   NA
4 20.0 20.4 19.8 20.9 19.8
5 21.5 21.9 21.3 20.9 20.3
6 13.6 13.4 13.8 13.2 12.9
7 12.4 12.7 12.2 12.1 12.3

A shorter way would be to use the lapply() function, but a for loop explains what is happening much better.
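
For reference, here is a sketch of what the lapply() version might look like, together with an even shorter form that indexes the data frame with a logical matrix. Both are shown on copies (named a and b here just for illustration) of the original data, rebuilt so the snippet runs on its own.

```r
# Rebuild the example data, including the faulty -100 reading
temp_week <- data.frame(
  s1 = c(12.4, 13.5, 15.6, 20, 21.5, 13.6, 12.4),
  s2 = c(12.3, 13.9, 15.3, 20.4, 21.9, 13.4, 12.7),
  s3 = c(11.9, 12.8, 15.4, 19.8, 21.3, 13.8, 12.2),
  s4 = c(12.7, 13.2, 14.7, 20.9, 20.9, 13.2, 12.1),
  s5 = c(12.1, 13.1, -100, 19.8, 20.3, 12.9, 12.3)
)

# Option 1: lapply() applies the same fix to every column.
# Assigning into a[] keeps the result a data frame instead of a plain list.
a <- temp_week
a[] <- lapply(a, function(col) {
  col[col == -100] <- NA
  col
})

# Option 2: index the whole data frame with a logical matrix
b <- temp_week
b[b == -100] <- NA

a$s5
b$s5
```

Both options leave NA in the third position of s5 and do not touch the other values.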

So far so good. Let’s go one step further and replace NA with the average value of the remaining sensors for the day. That way, we can better use the available data (otherwise, we would probably have to throw away that row).

Replace NA with the average of the rows

Step 1 – Find out the mean for each row.

# Find the mean for each row using the rowMeans function
> row_means = rowMeans(temp_week, na.rm=TRUE)
> row_means
[1] 12.28 13.30 15.25 20.18 21.18 13.38 12.34
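
The na.rm=TRUE argument matters here. As a small sketch (with the data rebuilt, NA already in place), compare the mean of the third row with and without it:

```r
# Example data after the -100 reading has been replaced with NA
temp_week <- data.frame(
  s1 = c(12.4, 13.5, 15.6, 20, 21.5, 13.6, 12.4),
  s2 = c(12.3, 13.9, 15.3, 20.4, 21.9, 13.4, 12.7),
  s3 = c(11.9, 12.8, 15.4, 19.8, 21.3, 13.8, 12.2),
  s4 = c(12.7, 13.2, 14.7, 20.9, 20.9, 13.2, 12.1),
  s5 = c(12.1, 13.1, NA, 19.8, 20.3, 12.9, 12.3)
)

# Without na.rm the NA propagates, so the mean of row 3 is NA
without_na_rm <- rowMeans(temp_week)[3]

# With na.rm = TRUE the NA is dropped and the mean uses the other four sensors
with_na_rm <- rowMeans(temp_week, na.rm = TRUE)[3]

without_na_rm
with_na_rm
```

With na.rm = TRUE the third row averages 15.6, 15.3, 15.4 and 14.7, which gives the 15.25 shown above.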

Step 2 – Replace each NA with the average of its row

# column by column manually
temp_week$s5[is.na(temp_week$s5)] = row_means[is.na(temp_week$s5)]

# or via a loop (this looks scary at first because of the brackets,
# but it is just the statement above with $s5 generalized into a column name)
for ( name in colnames(temp_week)){
  temp_week[[name]][is.na(temp_week[[name]])] =
        row_means[is.na(temp_week[[name]])]
}
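
If you prefer to avoid the loop, one possible loop-free sketch uses a logical matrix of the NA positions. The name na_mask is just illustrative, and the data is rebuilt so the snippet runs on its own.

```r
# Example data after the -100 reading has been replaced with NA
temp_week <- data.frame(
  s1 = c(12.4, 13.5, 15.6, 20, 21.5, 13.6, 12.4),
  s2 = c(12.3, 13.9, 15.3, 20.4, 21.9, 13.4, 12.7),
  s3 = c(11.9, 12.8, 15.4, 19.8, 21.3, 13.8, 12.2),
  s4 = c(12.7, 13.2, 14.7, 20.9, 20.9, 13.2, 12.1),
  s5 = c(12.1, 13.1, NA, 19.8, 20.3, 12.9, 12.3)
)

row_means <- rowMeans(temp_week, na.rm = TRUE)

# Logical matrix marking every NA cell
na_mask <- is.na(temp_week)

# row(na_mask)[na_mask] is the row number of each NA cell,
# so every NA gets the mean of the row it sits in
temp_week[na_mask] <- row_means[row(na_mask)[na_mask]]

temp_week$s5
```

The third value of s5 becomes 15.25, the mean of the other four sensors for that day.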
