How to remove particular values from a data frame in R
Say you are compiling the temperature (in Celsius) in Chicago from 5
sensors, each at a different location. You might want to run different
kinds of analysis on the readings, so you put them in a data frame with
one column per sensor.
# Data from 5 sensors for the week
> s1 = c(12.4, 13.5, 15.6, 20, 21.5, 13.6, 12.4)
> s2 = c(12.3, 13.9, 15.3, 20.4, 21.9, 13.4, 12.7)
> s3 = c(11.9, 12.8, 15.4, 19.8, 21.3, 13.8, 12.2)
> s4 = c(12.7, 13.2, 14.7, 20.9, 20.9, 13.2, 12.1)
> s5 = c(12.1, 13.1, -100, 19.8, 20.3, 12.9, 12.3)
# Create a data frame
> temp_week = data.frame(s1,s2,s3,s4,s5)
> temp_week
    s1   s2   s3   s4     s5
1 12.4 12.3 11.9 12.7   12.1
2 13.5 13.9 12.8 13.2   13.1
3 15.6 15.3 15.4 14.7 -100.0
4 20.0 20.4 19.8 20.9   19.8
5 21.5 21.9 21.3 20.9   20.3
6 13.6 13.4 13.8 13.2   12.9
7 12.4 12.7 12.2 12.1   12.3
As you can see, there is a weird data point. Say all malfunctioning sensors send -100 as the output, so you know for sure that every -100 needs to be replaced with NA for now.
A simple way to do it is to assign NA to that cell directly:
> temp_week[3,5]=NA
But in general you wouldn’t know in advance which sensor failed, or on which day. Imagine this were thousands of rows of data. You need a simple way to replace every malfunctioning reading (the value -100) with NA.
Step 1 – Find which values in a column equal -100. We start with the 5th column just for convenience.
# The third row of the fifth sensor is failing
> temp_week$s5 == -100
[1] FALSE FALSE TRUE FALSE FALSE FALSE FALSE
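To locate failing readings in all columns at once, one option is to compare the whole data frame against -100 and ask which() for row and column positions. This is only a sketch: it rebuilds the same data so it runs on its own, and bad is a name made up for this example.

```r
# Same sensor data as above, rebuilt so the snippet is self-contained
temp_week <- data.frame(
  s1 = c(12.4, 13.5, 15.6, 20, 21.5, 13.6, 12.4),
  s2 = c(12.3, 13.9, 15.3, 20.4, 21.9, 13.4, 12.7),
  s3 = c(11.9, 12.8, 15.4, 19.8, 21.3, 13.8, 12.2),
  s4 = c(12.7, 13.2, 14.7, 20.9, 20.9, 13.2, 12.1),
  s5 = c(12.1, 13.1, -100, 19.8, 20.3, 12.9, 12.3)
)

# Comparing a data frame with a number gives a logical matrix;
# arr.ind = TRUE makes which() report (row, col) positions
bad <- which(temp_week == -100, arr.ind = TRUE)
bad

# Translate the column numbers into sensor names
names(temp_week)[bad[, "col"]]
```

Each row of bad is one failing reading, so on a large data set this shows both which sensor failed and on which day.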
Step 2 – Using this TRUE/FALSE vector as the index into the data frame column returns just that element
> temp_week$s5[temp_week$s5 == -100]
[1] -100
Step 3 – Now that we know how to identify the element in a column, set it to NA
> temp_week$s5[temp_week$s5 == -100] = NA
> temp_week$s5
[1] 12.1 13.1 NA 19.8 20.3 12.9 12.3
Step 4 – We can repeat this across all the columns. Let’s use a for loop
for (i in names(temp_week)) {
  temp_week$i[temp_week$i == -100] = NA
}
Error in `$<-.data.frame`(`*tmp*`, "i", value = logical(0)) :
replacement has 0 rows, data has 7
Why do you see an error? Because temp_week$i is not turned into temp_week$s1, temp_week$s2 and so on from the names of the columns. The $ operator treats whatever follows it as a literal column name, so it looks for a column called "i" and never evaluates the variable i. The programmatic way to access a column of a data frame is via the double square brackets [[ ]], which accept a variable holding the column name. Let’s try this again.
for (i in names(temp_week)) {
  temp_week[[i]][temp_week[[i]] == -100] = NA
}
> temp_week
    s1   s2   s3   s4   s5
1 12.4 12.3 11.9 12.7 12.1
2 13.5 13.9 12.8 13.2 13.1
3 15.6 15.3 15.4 14.7   NA
4 20.0 20.4 19.8 20.9 19.8
5 21.5 21.9 21.3 20.9 20.3
6 13.6 13.4 13.8 13.2 12.9
7 12.4 12.7 12.2 12.1 12.3
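The difference between $ and [[ ]] that caused the error in the first loop can be checked directly on a small data frame. The names df and i below are made up for this illustration.

```r
# A tiny two-column data frame, just for this check
df <- data.frame(s1 = c(12.4, 13.5), s2 = c(12.3, 13.9))
i <- "s1"

# $ takes the name literally: it looks for a column called "i",
# finds none, and returns NULL
df$i

# [[ ]] evaluates its argument first, so this looks up column "s1"
df[[i]]

# A quoted name works as well
df[["s2"]]
```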
A shorter way would be to use the lapply() function, but a for loop shows what is happening much more clearly.
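For reference, here is a minimal sketch of what the lapply() version might look like. It rebuilds the data so it runs on its own; raw, cleaned and cleaned2 are names invented for this example.

```r
# Same sensor data as above, including the -100 reading
raw <- data.frame(
  s1 = c(12.4, 13.5, 15.6, 20, 21.5, 13.6, 12.4),
  s2 = c(12.3, 13.9, 15.3, 20.4, 21.9, 13.4, 12.7),
  s3 = c(11.9, 12.8, 15.4, 19.8, 21.3, 13.8, 12.2),
  s4 = c(12.7, 13.2, 14.7, 20.9, 20.9, 13.2, 12.1),
  s5 = c(12.1, 13.1, -100, 19.8, 20.3, 12.9, 12.3)
)

# lapply() visits each column; the anonymous function returns the
# column with -100 replaced by NA. Assigning into cleaned[] keeps
# the result a data frame instead of a plain list.
cleaned <- raw
cleaned[] <- lapply(cleaned, function(col) {
  col[col == -100] <- NA
  col
})
cleaned$s5

# An even shorter alternative: index the data frame with a logical matrix
cleaned2 <- raw
cleaned2[cleaned2 == -100] <- NA
cleaned2$s5
```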
So far so good. Let’s go one step further and replace NA with the average value of the remaining sensors for the day. That way, we can better use the available data (otherwise, we would probably have to throw away that row).
Replace NA with the average of the rows
Step 1 – Find out the mean for each row.
# Find the mean for each row using the rowMeans() function
> row_means = rowMeans(temp_week, na.rm=TRUE)
> row_means
[1] 12.28 13.30 15.25 20.18 21.18 13.38 12.34
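The na.rm=TRUE argument is what makes this work: without it, any row containing an NA would have a mean of NA. A quick check on the third day's readings, with row3 as a made-up name for this example:

```r
# Third day's readings, with the failed sensor already set to NA
row3 <- c(15.6, 15.3, 15.4, 14.7, NA)

# Without na.rm the missing value propagates into the result
mean(row3)

# With na.rm = TRUE the mean is taken over the four working sensors
mean(row3, na.rm = TRUE)
```

The second call averages only the four valid readings, which is where the 15.25 in row_means comes from.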
Step 2 – Replace each NA with the mean of its row
# column by column manually
temp_week$s5[is.na(temp_week$s5)] = row_means[is.na(temp_week$s5)]
# or via a loop (this looks scary at first because of the brackets,
# but it is just the statement above with $s5 generalized into a column name)
for (name in colnames(temp_week)) {
  temp_week[[name]][is.na(temp_week[[name]])] =
    row_means[is.na(temp_week[[name]])]
}
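The same fill can also be done without a loop by working on a matrix copy and indexing it with row and column positions. This is only a sketch of one alternative, not a replacement for the loop above; m and idx are names invented for this example, and the data is rebuilt with the NA already in place.

```r
# Sensor data with the failed reading already replaced by NA
temp_week <- data.frame(
  s1 = c(12.4, 13.5, 15.6, 20, 21.5, 13.6, 12.4),
  s2 = c(12.3, 13.9, 15.3, 20.4, 21.9, 13.4, 12.7),
  s3 = c(11.9, 12.8, 15.4, 19.8, 21.3, 13.8, 12.2),
  s4 = c(12.7, 13.2, 14.7, 20.9, 20.9, 13.2, 12.1),
  s5 = c(12.1, 13.1, NA, 19.8, 20.3, 12.9, 12.3)
)
row_means <- rowMeans(temp_week, na.rm = TRUE)

# Work on a numeric matrix so that (row, col) indexing is straightforward
m <- as.matrix(temp_week)

# Positions of all missing values, one (row, col) pair per line
idx <- which(is.na(m), arr.ind = TRUE)

# Each NA receives the mean of the row it sits in
m[idx] <- row_means[idx[, "row"]]

# Back to a data frame with the original column names
temp_week <- as.data.frame(m)
temp_week$s5
```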