How to remove duplicate rows from a data frame in R
1 – duplicated() function
The examples use iris_dup, a 12-row sample of R's built-in iris data set. The last 3 rows (511, 521 and 531) are duplicates of rows 4, 5 and 6 (51, 52 and 53).
> iris_dup
    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1            5.1         3.5          1.4         0.2     setosa
2            4.9         3.0          1.4         0.2     setosa
3            4.7         3.2          1.3         0.2     setosa
51           7.0         3.2          4.7         1.4 versicolor
52           6.4         3.2          4.5         1.5 versicolor
53           6.9         3.1          4.9         1.5 versicolor
101          6.3         3.3          6.0         2.5  virginica
102          5.8         2.7          5.1         1.9  virginica
103          7.1         3.0          5.9         2.1  virginica
511          7.0         3.2          4.7         1.4 versicolor
521          6.4         3.2          4.5         1.5 versicolor
531          6.9         3.1          4.9         1.5 versicolor
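The article does not show how iris_dup was created. The sketch below is one way to build a data frame like it, assuming it comes from R's built-in iris data set; the exact construction is a guess, not the author's code.

```r
# Assumption: iris_dup is derived from the built-in iris data set.
# Take rows 1-3, 51-53 and 101-103, then append rows 51-53 a second time.
iris_dup <- rbind(iris[c(1:3, 51:53, 101:103), ], iris[51:53, ])

# rbind() renames the repeated row names so that they stay unique,
# which is why the appended rows do not show up as 51, 52 and 53 again.
print(iris_dup)
```

Any data frame with repeated rows would work equally well for the steps that follow.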
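Step 1 – Find the duplicates
duplicated() returns a logical vector with one element per row. An element is TRUE when the row is identical to a row that appears earlier in the data frame.
> duplicated(iris_dup)
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE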
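Because the result is a plain logical vector, it can be summarised before any rows are removed. A minimal sketch, assuming iris_dup is built from the iris data set as in the comment below (that construction is an assumption, not taken from the article):

```r
# Assumed construction of the example data frame.
iris_dup <- rbind(iris[c(1:3, 51:53, 101:103), ], iris[51:53, ])

is_dup <- duplicated(iris_dup)

n_dup     <- sum(is_dup)             # how many rows are duplicates
dup_pos   <- which(is_dup)           # positions of the duplicate rows
first_dup <- anyDuplicated(iris_dup) # position of the first duplicate, 0 if none

print(n_dup)
print(dup_pos)
print(first_dup)
```

anyDuplicated() is a quick check when all that matters is whether any duplicate exists.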
Step 2 – Use this vector to select only the duplicate rows
> iris_dup[duplicated(iris_dup), ]
    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
511          7.0         3.2          4.7         1.4 versicolor
521          6.4         3.2          4.5         1.5 versicolor
531          6.9         3.1          4.9         1.5 versicolor
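The selection above returns only the later copies. To see every copy of a repeated row, including the first occurrence, one option is to combine a forward scan with a backward scan using the fromLast argument of duplicated(). A sketch, again assuming iris_dup is built from iris as shown in the comment:

```r
# Assumed construction of the example data frame.
iris_dup <- rbind(iris[c(1:3, 51:53, 101:103), ], iris[51:53, ])

later_copy   <- duplicated(iris_dup)                   # TRUE for the repeats
earlier_copy <- duplicated(iris_dup, fromLast = TRUE)  # TRUE for the originals

# A row is involved in duplication if either scan flags it.
all_copies <- iris_dup[later_copy | earlier_copy, ]
print(all_copies)
```

This is useful for inspecting the repeated rows side by side before deciding which copy to keep.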
Or negate the vector with ! to keep only the non-duplicate rows, that is, the first occurrence of each row.
> iris_dup[!duplicated(iris_dup), ]
    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1            5.1         3.5          1.4         0.2     setosa
2            4.9         3.0          1.4         0.2     setosa
3            4.7         3.2          1.3         0.2     setosa
51           7.0         3.2          4.7         1.4 versicolor
52           6.4         3.2          4.5         1.5 versicolor
53           6.9         3.1          4.9         1.5 versicolor
101          6.3         3.3          6.0         2.5  virginica
102          5.8         2.7          5.1         1.9  virginica
103          7.1         3.0          5.9         2.1  virginica
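The same pattern can treat rows as duplicates when only some columns match, by passing just those columns to duplicated(). The sketch below is an illustration that goes slightly beyond the article's example; the choice of key columns is arbitrary, and iris_dup is assumed to be built from iris as in the comment.

```r
# Assumed construction of the example data frame.
iris_dup <- rbind(iris[c(1:3, 51:53, 101:103), ], iris[51:53, ])

# Keep the first row seen for each species.
one_per_species <- iris_dup[!duplicated(iris_dup$Species), ]

# Keep the first row seen for each Sepal.Width / Species combination.
key_cols <- c("Sepal.Width", "Species")  # illustrative choice of columns
first_per_key <- iris_dup[!duplicated(iris_dup[, key_cols]), ]

print(one_per_species)
print(first_per_key)
```

All columns of the kept rows are returned; only the comparison is restricted to the key columns.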
2 – unique() function
unique() is more straightforward: a single call returns the data frame with the duplicate rows removed.
> unique(iris_dup)
    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
1            5.1         3.5          1.4         0.2     setosa
2            4.9         3.0          1.4         0.2     setosa
3            4.7         3.2          1.3         0.2     setosa
51           7.0         3.2          4.7         1.4 versicolor
52           6.4         3.2          4.5         1.5 versicolor
53           6.9         3.1          4.9         1.5 versicolor
101          6.3         3.3          6.0         2.5  virginica
102          5.8         2.7          5.1         1.9  virginica
103          7.1         3.0          5.9         2.1  virginica
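For a data frame, unique() and the !duplicated() subset should give the same result, which can be checked directly. The sketch below also renumbers the rows afterwards, an optional tidy-up that the article does not perform; iris_dup is assumed to be built from iris as in the comment.

```r
# Assumed construction of the example data frame.
iris_dup <- rbind(iris[c(1:3, 51:53, 101:103), ], iris[51:53, ])

deduped <- unique(iris_dup)

# Compare against the two-step approach from section 1.
same <- identical(deduped, iris_dup[!duplicated(iris_dup), ])
print(same)

# Optional: replace the inherited row names with 1, 2, 3, ...
rownames(deduped) <- NULL
print(deduped)
```

Resetting the row names is a matter of taste; keeping the originals makes it easier to trace rows back to the source data.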