What strategies did you use to eliminate NAs?


Substituting with feature average

Step 1 – Get the column mean

> s_data
   s_1 s_2 s_3
1  103 102 113
2  101 108 122
3   98 106  88
4  101 102 106
5   98  NA 101
6  108 103  95
7  100  99 106
8   94  80 107
9   97 104  99
10  93  96  97
> mean(s_data$s_2,na.rm = TRUE)
[1] 100

Step 2 – Replace the NA with the column mean

> s_data[5,"s_2"] = mean(s_data$s_2,na.rm = TRUE)
> s_data
   s_1 s_2 s_3
1  103 102 113
2  101 108 122
3   98 106  88
4  101 102 106
5   98 100 101
6  108 103  95
7  100  99 106
8   94  80 107
9   97 104  99
10  93  96  97

To do this programmatically (for any number of rows and columns), use a function like this:

replaceNA = function(data) {
  # Loop through each column
  for (var in seq_len(ncol(data))) {
    # Get the mean of the column, ignoring NAs
    col_mean = mean(data[, var], na.rm = TRUE)
    # Replace the NAs with the mean
    # is.na(data[, var]) gives a TRUE/FALSE vector marking the NA values
    data[is.na(data[, var]), var] = col_mean
  }
  # Return the dataset
  data
}
> replaceNA(s_data)
   s_1 s_2 s_3
1  103 102 113
2  101 108 122
3   98 106  88
4  101 102 106
5   98 100 101
6  108 103  95
7  100  99 106
8   94  80 107
9   97 104  99
10  93  96  97
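
The loop above assumes every column is numeric, since a mean is only meaningful for numbers. As a minimal sketch of one way to relax that assumption, the helper below imputes only the numeric columns and leaves the others untouched. The name impute_mean is hypothetical (it is not from any package), and s_data is rebuilt here from the values printed above so the snippet runs on its own.

```r
# Hypothetical helper: mean-impute numeric columns only, leave the rest as-is
impute_mean <- function(data) {
  # TRUE for each column that holds numbers
  numeric_cols <- vapply(data, is.numeric, logical(1))
  # Replace the NAs in each numeric column with that column's mean
  data[numeric_cols] <- lapply(data[numeric_cols], function(col) {
    col[is.na(col)] <- mean(col, na.rm = TRUE)
    col
  })
  data
}

# Same values as the s_data shown above
s_data <- data.frame(
  s_1 = c(103, 101, 98, 101, 98, 108, 100, 94, 97, 93),
  s_2 = c(102, 108, 106, 102, NA, 103, 99, 80, 104, 96),
  s_3 = c(113, 122, 88, 106, 101, 95, 106, 107, 99, 97)
)

imputed <- impute_mean(s_data)
imputed[5, "s_2"]
```

Note that if a column consists entirely of NAs, mean(..., na.rm = TRUE) returns NaN, so those entries would become NaN rather than a usable value.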

Eliminate the entire observation

If the number of observations with NAs is quite small, the easy way is to get rid of the entire observation.

For example, getting rid of the 5th row above, which contains the NA, is very easy. Just use the function na.omit()

> s_data
   s_1 s_2 s_3
1  103 102 113
2  101 108 122
3   98 106  88
4  101 102 106
5   98  NA 101
6  108 103  95
7  100  99 106
8   94  80 107
9   97 104  99
10  93  96  97
> na.omit(s_data)
   s_1 s_2 s_3
1  103 102 113
2  101 108 122
3   98 106  88
4  101 102 106
6  108 103  95
7  100  99 106
8   94  80 107
9   97 104  99
10  93  96  97

As you can see, the 5th row is gone.
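
Whether the number of affected observations really is "quite small" can be checked before dropping anything. The following is a rough sketch using base R's is.na() and complete.cases(). The variable names rows_with_na and cleaned are made up for illustration, and s_data is rebuilt from the values printed above.

```r
# Same values as the s_data shown above
s_data <- data.frame(
  s_1 = c(103, 101, 98, 101, 98, 108, 100, 94, 97, 93),
  s_2 = c(102, 108, 106, 102, NA, 103, 99, 80, 104, 96),
  s_3 = c(113, 122, 88, 106, 101, 95, 106, 107, 99, 97)
)

# Number of NAs in each column
colSums(is.na(s_data))

# Number of rows that contain at least one NA
rows_with_na <- sum(!complete.cases(s_data))
rows_with_na

# Keep only the complete rows; this selects the same rows that na.omit() keeps
cleaned <- s_data[complete.cases(s_data), ]
nrow(cleaned)
```

If rows_with_na turns out to be a large share of nrow(s_data), dropping rows discards a lot of data, and substituting the column mean as shown earlier may be the better option.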
