How to Drop Multiple Columns From a Data Frame Using Dplyr
- How to Set Up the R Session
- Use dplyr to Drop Multiple Columns by Name Directly in R
- Use dplyr to Drop Multiple Columns Using a Character Vector in R
- Use dplyr to Drop Consecutive Columns in R
- Use dplyr to Drop Columns Using Pattern Matching Functions in R
- Use dplyr to Drop Column Names in a Numeric Range in R
- Use dplyr to Drop Multiple Columns Using a Function in R
- Conclusion
When working with tabular data, we often need to select columns for display. We can either select the columns we want to display or remove the columns that we do not want to display.
This article shows various ways to use the select() function of the dplyr package to drop multiple columns from a data frame.
How to Set Up the R Session
dplyr is an R package for performing common data manipulation tasks. Its select() function is designed to select columns from a data frame.
The ! operator is used to take the complement of a set of variables. It will help us drop columns using the select() function.
In the following code, we will load the dplyr package, create a data frame, and then select two particular columns from this data frame. The dplyr package can be loaded directly or by loading the tidyverse package.
We will create a data frame with eight columns and three rows.
We will use the pipe operator %>% to make our code readable. This operator helps us avoid nesting functions and creating/saving intermediate results as objects.
The select() function takes the data frame’s name followed by the columns’ names (or positions) to select. In the example code in this article, we will supply the data frame’s name using the pipe operator.
Example Code:
# Load the dplyr package directly.
# Alternately, load the entire tidyverse by running the following one line of code.
# library(tidyverse) # Un-comment to run.
library(dplyr)
# We will create a small data frame for this article.
Col1 = c(10, 11, 12)
Col2 = c(20, 21, 22)
Col7 = c(70, 71, 72)
Col8 = c(80, 81, 82)
dplyrA = c('dA1', 'dA2', 'dA3')
dplyrAA = c('AA1', 'AA2', 'AA3')
Bdplyr = c('dB1', 'dB2', 'dB3')
BBdplyr = c('BB1', 'BB2', 'BB3')
dplyr_df = data.frame(Col1, Col2, Col7, Col8, dplyrA, dplyrAA, Bdplyr, BBdplyr)
# Check the type of object that we created.
class(dplyr_df)
# Display the data frame.
dplyr_df
# Select two columns using their names.
dplyr_df %>% select(Col2, BBdplyr)
Output of the last command:
> dplyr_df %>% select(Col2, BBdplyr)
Col2 BBdplyr
1 20 BB1
2 21 BB2
3 22 BB3
When column names are listed directly in the select() function, they are specified like variables. Unlike strings, they are not given in quotes.
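The pipe form used in this article and the direct form, where the data frame is passed as the first argument to select(), should give the same result. The following is a minimal sketch of that comparison; the data frame small_df and its columns are made up for this illustration and are not part of the article's data.

```r
library(dplyr)

# A hypothetical data frame used only for this illustration.
small_df <- data.frame(a = 1:3, b = 4:6, c = c("x", "y", "z"))

# Piped form: the data frame is supplied by the pipe operator.
piped <- small_df %>% select(a, c)

# Direct form: the data frame is the first argument to select().
direct <- select(small_df, a, c)

# Compare the two results.
identical(piped, direct)
```

The piped form tends to read better once several steps are chained together.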
Use dplyr to Drop Multiple Columns by Name Directly in R
There are three equivalent ways to drop multiple columns by name directly.
In the first method, we will combine column names into a vector of variables using the c() function. To drop all the columns in this vector, we will use the ! operator. It gives the complement of those variables.
In the second method, we take the intersection of the complement of each column that we want to drop. The & operator gives us an intersection.
In the third method, we complement a union of column names. The | operator gives us a union.
Example Code:
# Select the complement of a vector of column names.
dplyr_df %>% select(!c(Col1, dplyrA, BBdplyr))
# Select the intersection of the complement of each column.
dplyr_df %>% select(!Col1 & !dplyrA & !BBdplyr)
# Select the complement of the union of column names.
dplyr_df %>% select(!(Col1 | dplyrA | BBdplyr))
Output (identical for all three methods):
Col2 Col7 Col8 dplyrAA Bdplyr
1 20 70 80 AA1 dB1
2 21 71 81 AA2 dB2
3 22 72 82 AA3 dB3
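The select() function also takes column positions. Using positions is equivalent to using column names directly. Positions 1, 5, and 8 correspond to Col1, dplyrA, and BBdplyr, so the following commands give the same output as above.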
Example Code:
# Select the complement of a vector of column positions.
dplyr_df %>% select(!c(1, 5, 8))
# Select the intersection of the complement of each column.
dplyr_df %>% select(!1 & !5 & !8)
# Select the complement of the union of column positions.
dplyr_df %>% select(!(1 | 5 | 8))
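One way to confirm that names and positions are interchangeable is to compare the two results with identical(). This is only a sketch; the data frame small_df is a hypothetical example, not the data frame used elsewhere in this article.

```r
library(dplyr)

# A hypothetical data frame used only for this illustration.
small_df <- data.frame(a = 1:3, b = 4:6, c = 7:9, d = 10:12)

# Drop the first and third columns by name.
by_name <- small_df %>% select(!c(a, c))

# Drop the same columns by position.
by_position <- small_df %>% select(!c(1, 3))

# Inspect the remaining column names and compare the two results.
names(by_name)
identical(by_name, by_position)
```

Names are usually the safer choice, because positions change silently when columns are added or reordered.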
Use dplyr to Drop Multiple Columns Using a Character Vector in R
Rather than directly specify column names in the select() function, we can save the column names in an object and use that object in the function.
However, there are two key differences when this approach is used.
- The column names need to be stored as a character vector, not a vector of variable names. In other words, the names have to be strings surrounded by quotes.
- We will need to use a selection helper function, either all_of() or any_of(). We will use all_of() in the example code.
Example Code:
# Create a character vector using the names of the columns to remove.
# Note the quotes around the column names.
to_remove = c('Col2', 'Col7', 'dplyrAA', 'Bdplyr')
# Select the complement of the column names in the vector 'to_remove'.
dplyr_df %>% select(!all_of(to_remove))
Output:
> dplyr_df %>% select(!all_of(to_remove))
Col1 Col8 dplyrA BBdplyr
1 10 80 dA1 BB1
2 11 81 dA2 BB2
3 12 82 dA3 BB3
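The two helpers differ in how they treat names that are not in the data frame: all_of() is strict and raises an error, while any_of() silently ignores missing names. The sketch below illustrates this with a hypothetical data frame small_df and a made-up name, not_a_column, that does not exist in it.

```r
library(dplyr)

# A hypothetical data frame used only for this illustration.
small_df <- data.frame(a = 1:3, b = 4:6, c = 7:9)

# 'not_a_column' is deliberately absent from small_df.
to_remove <- c("b", "not_a_column")

# any_of() ignores names that are not present and drops only 'b'.
kept <- small_df %>% select(!any_of(to_remove))
names(kept)

# all_of() is strict: the missing name raises an error, caught here with tryCatch().
strict_result <- tryCatch(
  small_df %>% select(!all_of(to_remove)),
  error = function(e) "all_of() raised an error"
)
strict_result
```

any_of() is convenient when a column may already have been removed; all_of() is the better choice when a missing column should be treated as a mistake.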
Use dplyr to Drop Consecutive Columns in R
To drop consecutive columns, we will use the : operator. We can use column names or column positions. Both give the same output.
We will remove columns 2 to 7 from our data frame, that is, the columns from Col2 to Bdplyr. We will be left with the first and last columns, Col1 and BBdplyr.
Example Code:
# Drop a range of columns specified by column numbers.
dplyr_df %>% select(!(2:7))
# Drop a range of columns specified by column names.
# Note that the variable names are not in quotes.
dplyr_df %>% select(!(Col2:Bdplyr))
Output is identical for both commands:
Col1 BBdplyr
1 10 BB1
2 11 BB2
3 12 BB3
Use dplyr to Drop Columns Using Pattern Matching Functions in R
We can use pattern matching functions to drop multiple columns. These functions take a string or a vector of strings as an argument.
They return all columns that match the pattern. To drop those columns, we use the ! operator.
It is important to note that, by default, these functions are not case-sensitive. So the string cat is matched by cat, Cat, CAT, etc.
- The starts_with() function matches column names from the start of the names.
- The ends_with() function matches column names from the end of the names.
- The contains() function matches any part of the column names.
In the example code, we will use strings that are expected to match at least two column names. We can check the output to verify that the function worked as expected.
Example Code:
# Look at the column names in our data frame.
names(dplyr_df)
# Four columns start with 'Col'. We will drop them.
dplyr_df %>% select(!starts_with('Col'))
# There are two column names that end with 'A'. We will drop them.
dplyr_df %>% select(!ends_with('A'))
# There are four column names that contain the string 'dplyr'.
# We will drop these four columns.
dplyr_df %>% select(!contains('dplyr'))
# We can give a vector of strings as an argument to these functions.
# We will drop columns that start with 'Co' or 'B'.
# 6 columns should get dropped.
dplyr_df %>% select(!starts_with(c('Co', 'B')))
The output of the first and last commands:
> # Look at the column names in our data frame.
> names(dplyr_df)
[1] "Col1" "Col2" "Col7" "Col8" "dplyrA" "dplyrAA" "Bdplyr" "BBdplyr"
> dplyr_df %>% select(!starts_with(c('Co', 'B')))
dplyrA dplyrAA
1 dA1 AA1
2 dA2 AA2
3 dA3 AA3
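The case-insensitive default can be switched off with the ignore.case argument of these helper functions. The following sketch shows the difference using a hypothetical data frame small_df whose column names were chosen for this illustration.

```r
library(dplyr)

# A hypothetical data frame used only for this illustration.
small_df <- data.frame(
  Col1 = 1:3,
  Bdplyr = c("x", "y", "z"),
  dplyrA = c("p", "q", "r")
)

# Default: case is ignored, so lowercase 'b' matches 'Bdplyr' and that column is dropped.
default_match <- small_df %>% select(!starts_with("b"))
names(default_match)

# With ignore.case = FALSE, lowercase 'b' matches no column name, so nothing is dropped.
exact_match <- small_df %>% select(!starts_with("b", ignore.case = FALSE))
names(exact_match)
```

Setting ignore.case = FALSE is worth considering when column names differ only by case.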
Besides these three functions, dplyr provides another pattern matching helper function for regular expressions.
The matches() function takes a regular expression as an argument. It’s not case-sensitive by default.
For example, we will drop columns with one or more l characters followed immediately by 7 or y anywhere in their name. Users need to be familiar with regular expressions to take advantage of this function.
Example Code:
dplyr_df %>% select(!matches('l+[7y]'))
Output:
> dplyr_df %>% select(!matches('l+[7y]'))
Col1 Col2 Col8
1 10 20 80
2 11 21 81
3 12 22 82
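Before dropping columns with a regular expression, it can help to preview which names the pattern matches. The sketch below uses the base R function grepl() on the column names from this article; it is assumed here to mirror what matches() selects with its default settings, so treat it as a rough check rather than a guarantee.

```r
# The column names of the data frame used in this article.
col_names <- c("Col1", "Col2", "Col7", "Col8",
               "dplyrA", "dplyrAA", "Bdplyr", "BBdplyr")

# Test each name against the pattern, ignoring case as matches() does by default.
is_match <- grepl("l+[7y]", col_names, ignore.case = TRUE)

# Names that match the pattern (these are the columns that get dropped).
col_names[is_match]

# Names that do not match the pattern (these are the columns that remain).
col_names[!is_match]
```

The names that do not match are Col1, Col2, and Col8, which agrees with the output shown above.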
Use dplyr to Drop Column Names in a Numeric Range in R
Sometimes, we may have a data frame with column names that begin with a fixed string and end with numbers. dplyr provides the num_range() selection helper function to help us select and drop columns that share a common prefix and end in a specified numeric range.
To illustrate, we will first create a data frame with six columns. The first argument to num_range() is the prefix, and the second is the numeric range specified with the : operator.
The ! operator (complement) helps us drop the selected columns.
Example Code:
# Create vectors of the same length.
MyVar10 = seq(1, 5)
MyVar11 = seq(6, 10)
MyVar12 = seq(11, 15)
MyVar13 = seq(16, 20)
MyVar14 = seq(21, 25)
MyVar15 = seq(26, 30)
# Combine the vectors into a data frame.
num_df = data.frame(MyVar10, MyVar11, MyVar12, MyVar13, MyVar14, MyVar15)
num_df
# Drop columns that end in the range 12 to 14.
num_df %>% select(!num_range('MyVar', 12:14))
The output of the last two commands:
> num_df
MyVar10 MyVar11 MyVar12 MyVar13 MyVar14 MyVar15
1 1 6 11 16 21 26
2 2 7 12 17 22 27
3 3 8 13 18 23 28
4 4 9 14 19 24 29
5 5 10 15 20 25 30
> # Drop columns that end in the range 12 to 14.
> num_df %>% select(!num_range('MyVar', 12:14))
MyVar10 MyVar11 MyVar15
1 1 6 26
2 2 7 27
3 3 8 28
4 4 9 29
5 5 10 30
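When the numbers in the column names are padded with leading zeros, the width argument of num_range() can be used to match them. The sketch below assumes a hypothetical data frame padded_df with names such as x01 and x02; it is not part of the article's data.

```r
library(dplyr)

# A hypothetical data frame with zero-padded numeric suffixes.
padded_df <- data.frame(x01 = 1:2, x02 = 3:4, x03 = 5:6, x10 = 7:8)

# width = 2 pads the numbers with leading zeros, so 1:2 matches 'x01' and 'x02'.
kept <- padded_df %>% select(!num_range("x", 1:2, width = 2))
names(kept)
```

Without the width argument, the range 1:2 would look for names such as x1 and x2, which this data frame does not contain.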
Use dplyr to Drop Multiple Columns Using a Function in R
The where() helper function applies a function that returns TRUE or FALSE to the column data. The columns for which the function returns TRUE are selected.
As usual, to drop columns, we use the ! operator.
In the example, we use a simple custom function to select all columns whose mean is greater than 10. The code drops these and returns the remaining columns.
This example code works because all columns in the data frame are numeric. With real data, the function will have to be more comprehensive.
Example Code:
# Since all columns are numeric, there is no error.
# Otherwise, calculate the mean only for numeric columns.
num_df %>% select(!where(function(y) {mean(y)>10}))
Output:
> num_df %>% select(!where(function(y) {mean(y)>10}))
MyVar10 MyVar11
1 1 6
2 2 7
3 3 8
4 4 9
5 5 10
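For a data frame that mixes numeric and non-numeric columns, one possible approach is to check the column type before computing the mean. The following is a minimal sketch under that assumption; the data frame mixed_df and its column names are made up for this illustration.

```r
library(dplyr)

# A hypothetical data frame with numeric and character columns.
mixed_df <- data.frame(
  small = c(1, 2, 3),
  large = c(100, 200, 300),
  label = c("a", "b", "c")
)

# is.numeric() is checked first; && short-circuits, so mean() is never
# called on the character column.
kept <- mixed_df %>%
  select(!where(function(y) is.numeric(y) && mean(y) > 10))
names(kept)
```

Only the numeric column with a mean above 10 is dropped; the character column is kept because the function returns FALSE for it.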
References and Help
The dplyr package is part of the Tidyverse collection of packages.
The select() function is documented at the web page Subset columns using their names and types. The selection helper functions are all linked to this web page.
The tidyselect package forms the backend of the dplyr selection functions. Its Selection Language web page gives more details and examples.
The pipe operator, %>%, is provided by the magrittr package of the tidyverse.
If the select() function is not working as expected, we must verify that no other loaded package has a select() function. A quick way to check if this is the case is to use the package name as a prefix when using the function: dplyr::select().
If it works with the package prefix, we have two options: always use the prefix or load dplyr (or tidyverse) last. Functions in packages loaded later mask functions of the same name in packages loaded earlier.
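As a brief illustration of the prefix approach, the sketch below calls select() through the dplyr:: prefix without attaching the package. The data frame small_df is a hypothetical example, and the code assumes that dplyr is installed.

```r
# A hypothetical data frame used only for this illustration.
small_df <- data.frame(a = 1:3, b = 4:6, c = 7:9)

# The dplyr:: prefix calls dplyr's select() regardless of which other
# packages are loaded.
kept <- dplyr::select(small_df, !c(a, b))
names(kept)
```

Because the prefix names the package explicitly, the call is not affected by the order in which packages were loaded.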
For help with R functions in RStudio, click Help > Search R Help and type the function name in the search box without parentheses.
Alternately, type a question mark followed by the function name at the command prompt in the R Console. For example, ?select.
Conclusion
The dplyr package provides many selection helper functions and operators which allow us to drop multiple columns from a data frame using a single line of code.
We use the complement operator ! to drop the selected columns in all cases.