How to subset a vector in R

Subsetting a vector in R is simple. You just have to understand how to address the individual elements of the vector. The same principle applies to subsetting a data frame in R, except that a vector's elements are much easier to address because a vector has only one dimension.

Being able to address the elements of a vector is the key to subsetting it. Consider the following vector, which is part of base R.

There are 50 state abbreviations in this vector, and subsetting them is just a matter of addressing them.

> # state.abb vector that comes with base R
> state.abb
 [1] "AL" "AK" "AZ" "AR" "CA" "CO" "CT" "DE" "FL" "GA" "HI" "ID" "IL" "IN" "IA" "KS" "KY" "LA" "ME"
[20] "MD" "MA" "MI" "MN" "MS" "MO" "MT" "NE" "NV" "NH" "NJ" "NM" "NY" "NC" "ND" "OH" "OK" "OR" "PA"
[39] "RI" "SC" "SD" "TN" "TX" "UT" "VT" "VA" "WA" "WV" "WI" "WY"

The output wraps across several lines. The number in square brackets at the start of each line is the index of the first element printed on that line.

Subsetting by range

> # First 5 states (by index range)
> state.abb[1:5]
[1] "AL" "AK" "AZ" "AR" "CA"
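
As a small sketch that goes slightly beyond the example above, the same range idea works at the other end of the vector. The variable names here (n, first_five, last_five) are only illustrative, and head() and tail() are base R helpers that give the same results.

```r
# state.abb ships with base R (datasets package)
n <- length(state.abb)               # number of elements in the vector

first_five <- state.abb[1:5]         # same result as head(state.abb, 5)

# Parentheses matter: n - 4:n would be read as n - (4:n)
last_five <- state.abb[(n - 4):n]    # same result as tail(state.abb, 5)

print(first_five)
print(last_five)
```

The parentheses around n - 4 are needed because the colon operator binds more tightly than subtraction.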

Subsetting with negative indices and range

> # First 5 states (by negative index range)
> state.abb[-6:-50]
[1] "AL" "AK" "AZ" "AR" "CA"
> 
> # The other way is fine too
> state.abb[-50:-6]
[1] "AL" "AK" "AZ" "AR" "CA"
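
Negative indices drop the addressed elements rather than select them. The sketch below, with illustrative variable names, shows a form with explicit parentheses and one pitfall worth knowing: -6:50 is read as (-6):50, which mixes negative and positive indices, and R rejects that with an error.

```r
# Drop elements 6 through 50; parentheses make the intent explicit
keep_first_five <- state.abb[-(6:50)]

# -6:50 expands to -6, -5, ..., 0, 1, ..., 50 (mixed signs).
# R does not allow that, so we catch the error to show what happens.
mixed <- tryCatch(
  state.abb[-6:50],
  error = function(e) "error"
)

print(keep_first_five)
print(mixed)
```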

Subsetting with specific indices

Get the 5th, 10th and 15th states

> # Get the 5th, 10th and 15th states
> state.abb[ c(5, 10, 15)]
[1] "CA" "GA" "IA"
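
The index vector does not have to be typed by hand. As a minimal sketch with illustrative names, seq() can generate it, and the order and repetition of the indices carry over to the result.

```r
# Build the indices 5, 10, 15 instead of typing them
idx <- seq(5, 15, by = 5)
every_fifth <- state.abb[idx]

# The result follows the index vector: order and duplicates are preserved
reordered <- state.abb[c(15, 5, 5)]

print(every_fifth)
print(reordered)
```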

Subsetting with names

There is another vector, state.name, that lists all the state names. Create a new vector, us_states, whose values are the state names and whose names are the state abbreviations.

> # set names to the states vector
> us_states = setNames(state.name, state.abb)
> us_states
              AL               AK               AZ               AR               CA 
       "Alabama"         "Alaska"        "Arizona"       "Arkansas"     "California" 
              CO               CT               DE               FL               GA 
      "Colorado"    "Connecticut"       "Delaware"        "Florida"        "Georgia" 
              HI               ID               IL               IN               IA 
        "Hawaii"          "Idaho"       "Illinois"        "Indiana"           "Iowa" 
              KS               KY               LA               ME               MD 
        "Kansas"       "Kentucky"      "Louisiana"          "Maine"       "Maryland" 
              MA               MI               MN               MS               MO 
 "Massachusetts"       "Michigan"      "Minnesota"    "Mississippi"       "Missouri" 
              MT               NE               NV               NH               NJ 
       "Montana"       "Nebraska"         "Nevada"  "New Hampshire"     "New Jersey" 
              NM               NY               NC               ND               OH 
    "New Mexico"       "New York" "North Carolina"   "North Dakota"           "Ohio" 
              OK               OR               PA               RI               SC 
      "Oklahoma"         "Oregon"   "Pennsylvania"   "Rhode Island" "South Carolina" 
              SD               TN               TX               UT               VT 
  "South Dakota"      "Tennessee"          "Texas"           "Utah"        "Vermont" 
              VA               WA               WV               WI               WY 
      "Virginia"     "Washington"  "West Virginia"      "Wisconsin"        "Wyoming" 

Now, let’s get the state names for the abbreviations AZ, CO and DE.

> # Get the states with abbreviation AZ, CO and DE
> us_states[c("AZ","CO","DE")]
        AZ         CO         DE 
 "Arizona" "Colorado" "Delaware" 

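One thing to watch with name-based subsetting is a name that does not exist in the vector. The sketch below uses a made-up abbreviation, "ZZ", purely for illustration. The lookup does not fail; it returns NA for that entry. Filtering the requested names first is one way to avoid that.

```r
us_states <- setNames(state.name, state.abb)

# "ZZ" is a hypothetical abbreviation that is not in the vector
wanted <- c("AZ", "ZZ", "DE")

# The unknown name yields NA rather than an error
looked_up <- us_states[wanted]

# Keep only the requested names that actually exist
known <- us_states[intersect(wanted, names(us_states))]

print(looked_up)
print(known)
```
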
Subsetting with filter conditions

Old Faithful is a geyser in the US that erupts periodically. The waiting time between eruptions and the duration of each eruption are recorded in the data frame “faithful”, which ships with base R.

> waiting = faithful$waiting
> waiting
  [1] 79 54 74 62 85 55 88 85 51 85 54 84 78 47 83 52 62 84 52 79 51 47 78 69 74 83 55 76 78 79 73
 [32] 77 66 80 74 52 48 80 59 90 80 58 84 58 73 83 64 53 82 59 75 90 54 80 54 83 71 64 77 81 59 84
 [63] 48 82 60 92 78 78 65 73 82 56 79 71 62 76 60 78 76 83 75 82 70 65 73 88 76 80 48 86 60 90 50
 [94] 78 63 72 84 75 51 82 62 88 49 83 81 47 84 52 86 81 75 59 89 79 59 81 50 85 59 87 53 69 77 56
[125] 88 81 45 82 55 90 45 83 56 89 46 82 51 86 53 79 81 60 82 77 76 59 80 49 96 53 77 77 65 81 71
[156] 70 81 93 53 89 45 86 58 78 66 76 63 88 52 93 49 57 77 68 81 81 73 50 85 74 55 77 83 83 51 78
[187] 84 46 83 55 81 57 76 84 77 81 87 77 51 78 60 82 91 53 78 46 77 84 49 83 71 80 49 75 64 76 53
[218] 94 55 76 50 82 54 75 78 79 78 78 70 79 70 54 86 50 90 54 54 77 79 64 75 47 86 63 85 82 57 82
[249] 67 74 54 83 73 73 88 80 71 83 56 79 78 84 58 83 43 60 75 81 46 90 46 74

Let’s pick out all the wait times longer than 90 minutes.

> # Get all the wait times > 90 minutes
> waiting[ waiting > 90 ]
[1] 92 96 93 93 91 94

Simple enough, right? The comparison waiting > 90 produces a logical vector with one TRUE or FALSE per element, and that vector is used as the index. Only the elements at the TRUE positions are returned. You can combine as many logical tests as you like.
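
To make the logical vector visible, here is a minimal sketch that stores it in a variable before using it, then combines two tests. The variable names and the lower threshold of 45 minutes are illustrative choices, not part of the original example.

```r
waiting <- faithful$waiting

# One TRUE/FALSE per element of waiting
is_long <- waiting > 90
long_waits <- waiting[is_long]

# Combine tests with | (or) and & (and); 45 is an arbitrary lower cutoff
extremes <- waiting[waiting > 90 | waiting < 45]

# which() converts the logical vector into the matching positions
long_positions <- which(is_long)

print(long_waits)
print(extremes)
print(long_positions)
```

Use the single & and | operators here, since they work element by element. The doubled forms && and || are meant for single TRUE/FALSE values.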
