How to remove particular values from a data frame in R
Say you are compiling the temperature (in Celsius) in Chicago from 5
sensors, each at a different location. You might want to perform
different kinds of analysis on it. So, you put it in a data frame with
one column (feature) per sensor:
# Data from 5 sensors for the week
> s1 = c(12.4, 13.5, 15.6, 20, 21.5, 13.6, 12.4)
> s2 = c(12.3, 13.9, 15.3, 20.4, 21.9, 13.4, 12.7)
> s3 = c(11.9, 12.8, 15.4, 19.8, 21.3, 13.8, 12.2)
> s4 = c(12.7, 13.2, 14.7, 20.9, 20.9, 13.2, 12.1)
> s5 = c(12.1, 13.1, -100, 19.8, 20.3, 12.9, 12.3)
# Create a data frame
> temp_week = data.frame(s1,s2,s3,s4,s5)
> temp_week
s1 s2 s3 s4 s5
1 12.4 12.3 11.9 12.7 12.1
2 13.5 13.9 12.8 13.2 13.1
3 15.6 15.3 15.4 14.7 -100
4 20.0 20.4 19.8 20.9 19.8
5 21.5 21.9 21.3 20.9 20.3
6 13.6 13.4 13.8 13.2 12.9
7 12.4 12.7 12.2 12.1 12.3
As you can see, there is a weird data point: -100 in the third row of
s5. Say all malfunctioning sensors send -100 as their output. So, you
know for sure that -100 needs to be eliminated and replaced with NA for
now.
A simple way to do it is to assign NA directly to that one cell, using
its row and column position.
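As a minimal sketch, assuming we already know that the bad reading sits in row 3 of column s5, the direct assignment could look like this (the data frame is rebuilt here so the block runs on its own):

```r
# Rebuild the week's readings from the five sensors
temp_week <- data.frame(
  s1 = c(12.4, 13.5, 15.6, 20, 21.5, 13.6, 12.4),
  s2 = c(12.3, 13.9, 15.3, 20.4, 21.9, 13.4, 12.7),
  s3 = c(11.9, 12.8, 15.4, 19.8, 21.3, 13.8, 12.2),
  s4 = c(12.7, 13.2, 14.7, 20.9, 20.9, 13.2, 12.1),
  s5 = c(12.1, 13.1, -100, 19.8, 20.3, 12.9, 12.3)
)

# Overwrite the one known bad cell: row 3 of column s5
temp_week$s5[3] <- NA

# Equivalent form using row and column positions:
# temp_week[3, 5] <- NA

temp_week$s5
```

This only works because the position was spotted by eye in a seven-row table.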
But you wouldn’t know in advance which sensor failed, right? Imagine
this were thousands of rows of data. You need a simple way to replace
all malfunctioning sensor data (the -100 values) with NA.
Step 1 – Figure out which values in a column are -100. We start with the 5th column just for convenience
# The third row of the fifth sensor is failing
> temp_week$s5 == -100
[1] FALSE FALSE TRUE FALSE FALSE FALSE FALSE
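The same comparison can be run on the whole data frame at once to locate failed readings without knowing the column in advance. The sketch below uses base R's which() with arr.ind = TRUE; the names bad_cells and failed_sensors are made up for illustration:

```r
# Rebuild the week's readings from the five sensors
temp_week <- data.frame(
  s1 = c(12.4, 13.5, 15.6, 20, 21.5, 13.6, 12.4),
  s2 = c(12.3, 13.9, 15.3, 20.4, 21.9, 13.4, 12.7),
  s3 = c(11.9, 12.8, 15.4, 19.8, 21.3, 13.8, 12.2),
  s4 = c(12.7, 13.2, 14.7, 20.9, 20.9, 13.2, 12.1),
  s5 = c(12.1, 13.1, -100, 19.8, 20.3, 12.9, 12.3)
)

# temp_week == -100 is a logical matrix with the same shape as the data frame;
# arr.ind = TRUE reports each TRUE as a (row, col) pair
bad_cells <- which(temp_week == -100, arr.ind = TRUE)
bad_cells

# Names of the sensors (columns) with at least one failed reading
failed_sensors <- names(temp_week)[unique(bad_cells[, "col"])]
failed_sensors
```

With this data, the only pair reported is row 3, column 5, which is the s5 sensor.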
Step 2 – Passing this vector of TRUE/FALSE values as the index to the data frame column returns just that element
> temp_week$s5[temp_week$s5 == -100]
[1] -100
Step 3 – Now that we know how to identify the element in a column, set it to NA
> temp_week$s5[temp_week$s5 == -100] = NA
> temp_week$s5
[1] 12.1 13.1 NA 19.8 20.3 12.9 12.3
Step 4 – We can repeat this across all the columns. Let’s use a for loop
for (i in names(temp_week)) {
  temp_week$i[temp_week$i == -100] = NA
}
Error in `$<-.data.frame`(`*tmp*`, "i", value = logical(0)) :
replacement has 0 rows, data has 7
Why do you see an error? Because temp_week$i is not being replaced
with temp_week$s1, temp_week$s2 and so on from the column names. The $
operator treats whatever follows it as a literal column name, so it
looks for a column actually called i instead of evaluating the variable
i. A more programmatic way to access the columns of a data frame is via
double square brackets [[ ]], which accept a variable holding the column
name. Let’s try this again.
for (i in names(temp_week)) {
  temp_week[[i]][temp_week[[i]] == -100] = NA
}
> temp_week
s1 s2 s3 s4 s5
1 12.4 12.3 11.9 12.7 12.1
2 13.5 13.9 12.8 13.2 13.1
3 15.6 15.3 15.4 14.7 NA
4 20.0 20.4 19.8 20.9 19.8
5 21.5 21.9 21.3 20.9 20.3
6 13.6 13.4 13.8 13.2 12.9
7 12.4 12.7 12.2 12.1 12.3
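The difference between $ and [[ ]] is easy to check in isolation. This is a small sketch on a two-column subset of the data; the variable i stands in for the loop variable:

```r
# Two of the sensor columns are enough to show the difference
temp_week <- data.frame(
  s1 = c(12.4, 13.5, 15.6, 20, 21.5, 13.6, 12.4),
  s5 = c(12.1, 13.1, -100, 19.8, 20.3, 12.9, 12.3)
)

i <- "s5"

# $ does not evaluate i: it looks for a column literally named "i"
dollar_result <- temp_week$i
is.null(dollar_result)

# [[ ]] evaluates i first, so this fetches the s5 column
bracket_result <- temp_week[[i]]
bracket_result
```

Since temp_week$i is NULL, the comparison inside the failing loop produces an empty logical vector, which is where the "replacement has 0 rows" message comes from.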
A shorter way would be to use the lapply() function, but a for loop makes it clearer what is happening.
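For comparison, here is one way the lapply() version might look, together with an even shorter alternative that indexes the data frame with a logical matrix. Both are sketches; via_lapply and via_mask are illustrative names:

```r
# Rebuild the week's readings from the five sensors
temp_week <- data.frame(
  s1 = c(12.4, 13.5, 15.6, 20, 21.5, 13.6, 12.4),
  s2 = c(12.3, 13.9, 15.3, 20.4, 21.9, 13.4, 12.7),
  s3 = c(11.9, 12.8, 15.4, 19.8, 21.3, 13.8, 12.2),
  s4 = c(12.7, 13.2, 14.7, 20.9, 20.9, 13.2, 12.1),
  s5 = c(12.1, 13.1, -100, 19.8, 20.3, 12.9, 12.3)
)

# lapply() applies the same replacement to every column.
# Assigning into via_lapply[] keeps the result a data frame, not a plain list.
via_lapply <- temp_week
via_lapply[] <- lapply(via_lapply, function(column) {
  column[column == -100] <- NA
  column
})

# Alternative: index the whole data frame with a logical matrix
via_mask <- temp_week
via_mask[via_mask == -100] <- NA

# Both approaches should give the same data frame
all.equal(via_lapply, via_mask)
```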
So far so good. Let’s go one step further and replace NA with the
average value of the remaining sensors for the day. That way, we can
better use the available data (otherwise, we would probably have to
throw away that row).
Replace NA with the average of the rows
Step 1 – Find out the mean for each row.
# Find the mean for each row using the rowMeans function, ignoring NA values
row_means = rowMeans(temp_week, na.rm=TRUE)
> row_means
[1] 12.28 13.30 15.25 20.18 21.18 13.38 12.34
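To see why na.rm=TRUE matters, the third row can be checked by hand. This is a small sketch; day3 is just an illustrative name for that row's readings:

```r
# Row 3 of the data frame, with the failed s5 reading already set to NA
day3 <- c(15.6, 15.3, 15.4, 14.7, NA)

# Without na.rm, a single NA makes the whole mean NA
mean(day3)

# With na.rm = TRUE, the NA is dropped and the four valid readings are averaged:
# (15.6 + 15.3 + 15.4 + 14.7) / 4 = 15.25
mean(day3, na.rm = TRUE)
```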
Step 2 – Replace the NA with the average of each row
# column by column manually
temp_week$s5[is.na(temp_week$s5)] = row_means[is.na(temp_week$s5)]
# or via a loop (this looks scary at first because of the brackets,
# but it is just the statement above with $s5 generalized into a column name)
for (name in colnames(temp_week)) {
  temp_week[[name]][is.na(temp_week[[name]])] =
    row_means[is.na(temp_week[[name]])]
}
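The loop can also be written without iterating over columns. The sketch below is one possible vectorised form, assuming the -100 reading has already been replaced with NA; na_mask is an illustrative name, and row() is the base R function that returns each cell's row number:

```r
# Week's readings with the failed s5 reading already set to NA
temp_week <- data.frame(
  s1 = c(12.4, 13.5, 15.6, 20, 21.5, 13.6, 12.4),
  s2 = c(12.3, 13.9, 15.3, 20.4, 21.9, 13.4, 12.7),
  s3 = c(11.9, 12.8, 15.4, 19.8, 21.3, 13.8, 12.2),
  s4 = c(12.7, 13.2, 14.7, 20.9, 20.9, 13.2, 12.1),
  s5 = c(12.1, 13.1, NA, 19.8, 20.3, 12.9, 12.3)
)

# Row means must be computed before any NA is filled in
row_means <- rowMeans(temp_week, na.rm = TRUE)

# Logical matrix marking every missing cell
na_mask <- is.na(temp_week)

# row(na_mask)[na_mask] gives the row number of each missing cell,
# so each NA is replaced by the mean of its own row
temp_week[na_mask] <- row_means[row(na_mask)[na_mask]]

temp_week$s5
```

Here the third s5 reading ends up as the mean of the other four sensors for that day, and all other values are left unchanged.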