How to Remove Duplicate Rows by Column in R
- Use the distinct Function of the dplyr Package to Remove Duplicate Rows by Column in R
- Use group_by, filter and duplicated Functions to Remove Duplicate Rows by Column in R
- Use group_by and slice Functions to Remove Duplicate Rows by Column in R
This article will introduce how to remove duplicate rows by column in R.
Use the distinct Function of the dplyr Package to Remove Duplicate Rows by Column in R
The dplyr package, one of the most common data manipulation libraries in R, provides the distinct function. distinct selects the unique rows of a data frame. It takes the data frame as the first argument, followed by the variables to consider when determining uniqueness. Multiple columns can be supplied, in which case rows are compared on the combination of their values; the following code snippet demonstrates single-variable examples. The optional .keep_all argument defaults to FALSE, which returns only the columns named in the call. If the user explicitly passes TRUE, the function keeps all variables in the data frame, taking the first row found for each distinct value. Note that dplyr makes the pipe operator %>% available, which supplies the value on its left as the first argument of the function on its right. Namely, the x %>% f(y) notation becomes f(x, y).
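As a quick check of the pipe rule described above, the following sketch compares a piped call with the equivalent direct call. The data frame toy and its columns are made up for illustration and are not part of the article's data.

```r
library(dplyr)

# Hypothetical toy data, used only to illustrate the pipe rule.
toy <- data.frame(id = c(1, 1, 2), value = c("a", "b", "c"))

# Piped form: toy becomes the first argument of distinct().
piped <- toy %>% distinct(id)

# Direct form of the same call.
direct <- distinct(toy, id)

# Compare the two results; both hold the unique ids only.
identical(piped, direct)
```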
library(dplyr)

# Sample data frame with repeated values in id and gender
df1 <- data.frame(id = c(1, 2, 2, 3, 3, 4, 5, 5),
                  gender = c("F", "F", "M", "F", "B", "B", "F", "M"),
                  variant = c("a", "b", "c", "d", "e", "f", "g", "h"))

# Keep the first row for each distinct value of the named column
t1 <- df1 %>% distinct(id, .keep_all = TRUE)
t2 <- df1 %>% distinct(gender, .keep_all = TRUE)
t3 <- df1 %>% distinct(variant, .keep_all = TRUE)

# Same approach on the built-in mtcars data set
df2 <- mtcars
tmp1 <- df2 %>% distinct(cyl, .keep_all = TRUE)
tmp2 <- df2 %>% distinct(mpg, .keep_all = TRUE)
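The text above mentions the .keep_all default and multi-column calls without showing them. The following is a minimal sketch of both, reusing the article's df1; the result names ids_only, first_per_id and by_pair are chosen here for illustration.

```r
library(dplyr)

df1 <- data.frame(id = c(1, 2, 2, 3, 3, 4, 5, 5),
                  gender = c("F", "F", "M", "F", "B", "B", "F", "M"),
                  variant = c("a", "b", "c", "d", "e", "f", "g", "h"))

# Default .keep_all = FALSE: only the column named in the call is returned.
ids_only <- df1 %>% distinct(id)

# .keep_all = TRUE: the first row for each id is kept with all its columns.
first_per_id <- df1 %>% distinct(id, .keep_all = TRUE)

# Two columns: a row is dropped only if its (id, gender) pair was seen before.
# Every (id, gender) pair in df1 is unique, so all 8 rows remain.
by_pair <- df1 %>% distinct(id, gender, .keep_all = TRUE)

dim(ids_only)      # 5 rows, 1 column
dim(first_per_id)  # 5 rows, 3 columns
dim(by_pair)       # 8 rows, 3 columns
```

Adding columns to the call makes the uniqueness test stricter, so the result can only keep the same number of rows or more.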
Use group_by, filter and duplicated Functions to Remove Duplicate Rows by Column in R
Another way to remove duplicate rows by column values is to group the data frame by the column variable and then filter rows using the filter and duplicated functions. The first step is done with the group_by function, which is part of the dplyr package. The grouped output is then piped to the filter function, where !duplicated(column) keeps only the first row of each group and drops the rest. The result is still a grouped data frame; call ungroup() to drop the grouping if it is not needed.
library(dplyr)

# Sample data frame with repeated values in id and gender
df1 <- data.frame(id = c(1, 2, 2, 3, 3, 4, 5, 5),
                  gender = c("F", "F", "M", "F", "B", "B", "F", "M"),
                  variant = c("a", "b", "c", "d", "e", "f", "g", "h"))

# Group by the column, then keep only the first row within each group
t1 <- df1 %>% group_by(id) %>% filter(!duplicated(id))
t2 <- df1 %>% group_by(gender) %>% filter(!duplicated(gender))
t3 <- df1 %>% group_by(variant) %>% filter(!duplicated(variant))

# Same approach on the built-in mtcars data set
df2 <- mtcars
tmp3 <- df2 %>% group_by(cyl) %>% filter(!duplicated(cyl))
tmp4 <- df2 %>% group_by(mpg) %>% filter(!duplicated(mpg))
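To see what duplicated contributes on its own, the following sketch compares the grouped pipeline with a plain base R subset and with distinct. It reuses the article's df1; the names grouped, base_r and via_distinct, and the added ungroup() call, are choices made for this illustration rather than part of the original examples.

```r
library(dplyr)

df1 <- data.frame(id = c(1, 2, 2, 3, 3, 4, 5, 5),
                  gender = c("F", "F", "M", "F", "B", "B", "F", "M"),
                  variant = c("a", "b", "c", "d", "e", "f", "g", "h"))

# Grouped pipeline, with ungroup() so the result carries no grouping.
grouped <- df1 %>%
  group_by(id) %>%
  filter(!duplicated(id)) %>%
  ungroup()

# Base R: duplicated() marks every repeat of an id after its first occurrence,
# so negating it keeps the first row per id.
base_r <- df1[!duplicated(df1$id), ]

# The distinct() call from the first section, for comparison.
via_distinct <- df1 %>% distinct(id, .keep_all = TRUE)

# All three keep the same rows; print the variant column of each to compare.
grouped$variant
base_r$variant
via_distinct$variant
```

The base R result keeps the original row names (1, 2, 4, 6, 7), while the dplyr results are renumbered from 1.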
Use group_by and slice Functions to Remove Duplicate Rows by Column in R
Alternatively, one can use the group_by function together with slice to remove duplicate rows by column values. slice is also part of the dplyr package, and it selects rows by position. When the data frame is grouped, slice selects the rows at the given position within each group, so slice(1) keeps the first row of every group, as demonstrated in the following sample code.
library(dplyr)

# Sample data frame with repeated values in id and gender
df1 <- data.frame(id = c(1, 2, 2, 3, 3, 4, 5, 5),
                  gender = c("F", "F", "M", "F", "B", "B", "F", "M"),
                  variant = c("a", "b", "c", "d", "e", "f", "g", "h"))

# Group by the column, then take the row at position 1 in each group
t1 <- df1 %>% group_by(id) %>% slice(1)
t2 <- df1 %>% group_by(gender) %>% slice(1)
t3 <- df1 %>% group_by(variant) %>% slice(1)

# Same approach on the built-in mtcars data set
df2 <- mtcars
tmp5 <- df2 %>% group_by(cyl) %>% slice(1)
tmp6 <- df2 %>% group_by(mpg) %>% slice(1)
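Because slice works by position inside each group, the same pattern can keep a row other than the first. The following sketch, which reuses the article's df1, keeps the last row per id with slice(n()), where n() is dplyr's helper for the current group size. The names first_rows and last_rows and the ungroup() calls are added here for illustration.

```r
library(dplyr)

df1 <- data.frame(id = c(1, 2, 2, 3, 3, 4, 5, 5),
                  gender = c("F", "F", "M", "F", "B", "B", "F", "M"),
                  variant = c("a", "b", "c", "d", "e", "f", "g", "h"))

# slice(1) keeps the first row of each id group, as in the code above.
first_rows <- df1 %>% group_by(id) %>% slice(1) %>% ungroup()

# slice(n()) keeps the last row of each id group instead.
last_rows <- df1 %>% group_by(id) %>% slice(n()) %>% ungroup()

first_rows$variant  # "a" "b" "d" "f" "g"
last_rows$variant   # "a" "c" "e" "f" "h"
```

One difference worth checking on your own data: a grouped slice appears to return rows ordered by the grouping column, so when the column is not already sorted (gender in df1, for example) the row order may differ from what distinct or filter would give.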