How do you search/replace text in a data frame

How do you search/replace text in a data frame


  R Interview Questions

Finding out if a data frame contains a particular piece of text or feature value is not possible with the standard filters. For example, take the iris data set ( abridged )

> iris_a = iris[c(1:3,51:53,101:103),]
> iris_a
    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1            5.1         3.5          1.4         0.2     setosa
2            4.9         3.0          1.4         0.2     setosa
3            4.7         3.2          1.3         0.2     setosa
51           7.0         3.2          4.7         1.4 versicolor
52           6.4         3.2          4.5         1.5 versicolor
53           6.9         3.1          4.9         1.5 versicolor
101          6.3         3.3          6.0         2.5  virginica
102          5.8         2.7          5.1         1.9  virginica
103          7.1         3.0          5.9         2.1  virginica

Say the “Species” field has so many different kinds of flower species. How do you find out if there is a flower that starts with “vir” ?

> iris_a[iris$Species == "vir*",]
[1] Sepal.Length Sepal.Width  Petal.Length Petal.Width  Species     
<0 rows> (or 0-length row.names)

Obviously, this does not work. That is because the regular search and filter features of a data frame are not based on regular expressions. Try grep() instead

# grep is looking for text at a particular column ( a vector ) 
# and returning all the rows in the vector with a match
> rows = grep("vir",iris_a$Species)
> rows
[1] 7 8 9
> iris_a[rows,]
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
101          6.3         3.3          6.0         2.5 virginica
102          5.8         2.7          5.1         1.9 virginica
103          7.1         3.0          5.9         2.1 virginica

Since the first parameter in the grep function is a regular expression, this works too.

# It does not matter where in the word the regular expression sub string is
> rows = grep("gin",iris_a$Species)
> iris_a[rows,]
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
101          6.3         3.3          6.0         2.5 virginica
102          5.8         2.7          5.1         1.9 virginica
103          7.1         3.0          5.9         2.1 virginica
# If you want to ignore case, use option ignore.case = TRUE
# For ex., this would still pick up all the virginica species
> rows = grep("gIn",iris_a$Species)
> iris_a[rows,]
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
101          6.3         3.3          6.0         2.5 virginica
102          5.8         2.7          5.1         1.9 virginica
103          7.1         3.0          5.9         2.1 virginica

There are many ways in which this search can happen

  • “^vir” – Find all the Species starting with “vir” and not in the middle
  • “.irg” – Find all the species starting with any first letter, but the next set of letters should be irg.

All of these variations can be captured through the use of regular expressions – which is a subject in itself.

grep () only works on vectors – that is the reason why we are supplying just the species column as a parameter. grep() does not work on multiple vectors at once ( like a data frame ). 

How to replace text in a data frame

Once you search for data, you might want to replace some of the text. How do you do it ? use the R function – gsub() . For example, let’s look at a subset of iris data.

> iris_small
    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1            5.1         3.5          1.4         0.2     setosa
2            4.9         3.0          1.4         0.2     setosa
3            4.7         3.2          1.3         0.2     setosa
51           7.0         3.2          4.7         1.4 versicolor
52           6.4         3.2          4.5         1.5 versicolor
53           6.9         3.1          4.9         1.5 versicolor
101          6.3         3.3          6.0         2.5  virginica
102          5.8         2.7          5.1         1.9  virginica
103          7.1         3.0          5.9         2.1  virginica

Say, you want to replace all species starting with v with a capital “V”. Try this.

> iris_small$Species = gsub("^v","V",iris_small$Species)
> iris_small
    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1            5.1         3.5          1.4         0.2     setosa
2            4.9         3.0          1.4         0.2     setosa
3            4.7         3.2          1.3         0.2     setosa
51           7.0         3.2          4.7         1.4 Versicolor
52           6.4         3.2          4.5         1.5 Versicolor
53           6.9         3.1          4.9         1.5 Versicolor
101          6.3         3.3          6.0         2.5  Virginica
102          5.8         2.7          5.1         1.9  Virginica
103          7.1         3.0          5.9         2.1  Virginica

You can very well use any regular expression in the first argument to gsub(). There is another function called sub() that only substitutes the first occurrence, but it is rarely used.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.

%d bloggers like this: