How to Create and Interpret Dummy Variables in R
- Install the fastDummies Package in R
- Use the dummy_cols() Function to Create Dummy Columns in R
- Interpret Dummy Variables
This article shows how to create dummy variables using the dummy_cols() function of the fastDummies package in R. The terms dummy variable and dummy column are used interchangeably.
Install the fastDummies Package in R
We need to install the fastDummies package and load it.
Example Code:
# Install the fastDummies package.
install.packages("fastDummies")
# Load the fastDummies package.
library(fastDummies)
We will now create a small data frame with a categorical variable.
Example Code:
# Vectors.
cv = c("Bd", "Ba", "F", NA, "F", "F", "Ba")
nv = 1:7
# Data frame.
orig_datf = data.frame(Num_V = nv, Cat_V = as.factor(cv))
# View the data frame.
orig_datf
str(orig_datf)
Output:
> str(orig_datf)
'data.frame': 7 obs. of 2 variables:
$ Num_V: int 1 2 3 4 5 6 7
$ Cat_V: Factor w/ 3 levels "Ba","Bd","F": 2 1 3 NA 3 3 1
As displayed, our data frame has a categorical variable with three factor levels.
By default, R orders factor levels by sorting the unique values, which gives alphabetical order here. This detail matters when we create dummy variables.
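As a quick check of the default level ordering, the following sketch builds a factor from the same values and prints its levels. The variable name chk_factor is ours and is used only for illustration.

```r
# Build a factor from the same values used in the article.
chk_factor = factor(c("Bd", "Ba", "F", NA, "F", "F", "Ba"))

# Levels are sorted, not kept in order of first appearance ("Bd" came first).
levels(chk_factor)

# NA is not counted as a level by default.
nlevels(chk_factor)
```

If a different order is needed, it can be set explicitly with the levels argument of factor() or changed later with relevel().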
Use the dummy_cols() Function to Create Dummy Columns in R
If we do not specify the columns from which to create dummy variables, the function creates dummy columns from all factor or character type columns.
Example Code:
new_datf_default_all = dummy_cols(orig_datf)
new_datf_default_all
names(new_datf_default_all)
Output:
> names(new_datf_default_all)
[1] "Num_V" "Cat_V" "Cat_V_Ba" "Cat_V_Bd" "Cat_V_F" "Cat_V_NA"
Observe the following in the list of columns.
- Because the categorical variable had 3 categories, we see 3 new columns.
- Because our categorical column had a missing value (NA), we also have one column that marks the NA row with the value 1. All the other dummy columns have NA where the original column had an NA.
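To make the two observations concrete, here is a minimal base R sketch that builds the same kind of columns by hand. It is only an illustration of what the dummy columns contain, not how fastDummies is implemented; the names cat_v and manual_dummies are ours.

```r
# Same categorical values as in the article.
cat_v = factor(c("Bd", "Ba", "F", NA, "F", "F", "Ba"))

# One 0/1 indicator per level. Comparing NA with a level yields NA,
# so the row with the missing value is NA in every level column.
manual_dummies = sapply(levels(cat_v), function(lv) as.integer(cat_v == lv))
colnames(manual_dummies) = paste0("Cat_V_", levels(cat_v))

# A separate indicator that marks where the original value was missing.
manual_dummies = cbind(manual_dummies, Cat_V_NA = as.integer(is.na(cat_v)))

# Row 4 is the row with the missing value.
manual_dummies
```

Comparing this matrix with new_datf_default_all is a simple way to confirm how the installed version of the package treats missing values.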
Create Dummy Variables From Selected Columns in R
To create dummy variables from only selected columns, we can use the select_columns argument. We can pass a single column name as a string or multiple column names as a character vector.
Example Code:
# Pass a single column.
new_datf_select_cols = dummy_cols(orig_datf, select_columns = "Cat_V")
# Pass column names as a character vector (only one name here; add more as needed).
new_datf_select_cols = dummy_cols(orig_datf, select_columns = c("Cat_V"))
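Because orig_datf has only one categorical column, the vector in the example above holds a single name. The sketch below adds a hypothetical second column, Grp_V (not part of the original example), to show a vector with two names. It assumes the fastDummies package is installed.

```r
library(fastDummies)

# Same data as before, plus a hypothetical second categorical column (Grp_V).
two_cat_datf = data.frame(
  Num_V = 1:7,
  Cat_V = as.factor(c("Bd", "Ba", "F", NA, "F", "F", "Ba")),
  Grp_V = c("x", "y", "x", "y", "x", "y", "x")
)

# Create dummy columns from both columns named in the vector.
two_cat_dummies = dummy_cols(two_cat_datf, select_columns = c("Cat_V", "Grp_V"))

# Inspect the resulting column names.
names(two_cat_dummies)
```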
Remove One Column to Avoid Multicollinearity in R
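When we create dummy variables using all levels of a factor column, the new columns are linearly dependent: in each row with a non-missing value exactly one of them is 1, so together they add up to the constant (intercept) column. In other words, for each row, given the values of all the other dummy columns, we can determine the value of the remaining one.
This perfect multicollinearity affects the results of statistical analysis (such as linear regression with an intercept), because the coefficients can no longer be estimated uniquely. Therefore, we need to remove one of the dummy columns for each original column from which we are creating dummy variables.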
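The dependence can be checked numerically. The following base R sketch (our own illustration, using the non-missing values from the example data) compares the rank of a design matrix that keeps every dummy column with one that drops a column.

```r
# Non-missing values from the example data.
cat_v = factor(c("Bd", "Ba", "F", "F", "F", "Ba"))

# Full set of dummy columns, one per level.
full_dummies = sapply(levels(cat_v), function(lv) as.integer(cat_v == lv))

# Every row sums to 1, which is exactly the intercept column.
rowSums(full_dummies)

# Intercept plus all dummies: 4 columns, but the rank is lower.
design_full = cbind(Intercept = 1, full_dummies)
rank_full = qr(design_full)$rank

# Dropping one dummy column (here the first level) restores full column rank.
design_reduced = design_full[, -2]
rank_reduced = qr(design_reduced)$rank

c(columns_full = ncol(design_full), rank_full = rank_full,
  columns_reduced = ncol(design_reduced), rank_reduced = rank_reduced)
```

A rank smaller than the number of columns means at least one column is redundant, which is the situation the options described next are meant to avoid.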
The dummy_cols() function gives us two options. We can set either remove_first_dummy = TRUE, or remove_most_frequent_dummy = TRUE.
The following code examines both options.
Example Code:
# Remove first.
new_datf_remove_first = dummy_cols(orig_datf, remove_first_dummy = TRUE)
# After removing first.
names(new_datf_remove_first)
# Remove most frequent.
new_datf_remove_most_frequent = dummy_cols(orig_datf, remove_most_frequent_dummy = TRUE)
# After removing most frequent
names(new_datf_remove_most_frequent)
Output:
> # After removing first.
> names(new_datf_remove_first)
[1] "Num_V" "Cat_V" "Cat_V_Bd" "Cat_V_F" "Cat_V_NA"
> # After removing most frequent
> names(new_datf_remove_most_frequent)
[1] "Num_V" "Cat_V" "Cat_V_Ba" "Cat_V_Bd"
Notice the following in the output of the two commands.
- The argument remove_first_dummy = TRUE removed the column corresponding to the first level of the factor.
- The argument remove_most_frequent_dummy = TRUE dropped the column corresponding to the level that appeared most frequently in the original column. However, it also had the effect of dropping the column that showed where the NAs were. Even setting ignore_na = FALSE did not affect the output.
We can use the following workaround if we want to keep the NA column and drop the most frequent factor.
- First, relevel the factor column using the relevel() function. Make the most frequent value the first level.
- Then use remove_first_dummy = TRUE.
Example Code:
releveled_datf = orig_datf
# Relevel the desired column manually.
releveled_datf$Cat_V = relevel(releveled_datf$Cat_V, ref = "F")
# View the new levels.
levels(releveled_datf$Cat_V)
# NOW, remove first.
releveled_datf_remove_first = dummy_cols(releveled_datf, remove_first_dummy = TRUE)
# After removing first.
names(releveled_datf_remove_first)
Output:
> levels(releveled_datf$Cat_V)
[1] "F" "Ba" "Bd"
> # After removing first.
> names(releveled_datf_remove_first)
[1] "Num_V" "Cat_V" "Cat_V_Ba" "Cat_V_Bd" "Cat_V_NA"
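In the example above the reference level "F" is typed in by hand. As a possible variation, the most frequent level can be looked up from the data first. This is a base R sketch with our own variable names (cat_v, most_frequent, cat_v_releveled); if two levels tie, which.max() picks the one that comes first in the level order.

```r
# Same categorical values as in the article.
cat_v = factor(c("Bd", "Ba", "F", NA, "F", "F", "Ba"))

# table() leaves out NA by default; which.max() returns the first
# level with the highest count, and names() extracts its label.
most_frequent = names(which.max(table(cat_v)))
most_frequent

# Make that level the first (reference) level.
cat_v_releveled = relevel(cat_v, ref = most_frequent)
levels(cat_v_releveled)
```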
Interpret Dummy Variables
In the linear regression setting, the intercept includes the effect of the base level (the level whose dummy column was removed) of the original column. Remember that we removed one column when we created the dummy columns.
An observation at the removed level has the value 0 in all the dummy columns created from the same original column. Therefore, its effect is included in the intercept.
The coefficient for each dummy column is the estimated difference in the response between that factor level and the base level, holding the other predictors fixed. This difference can be positive or negative, depending on the sign of the coefficient.
Because of this interpretation, it is often useful to drop the column corresponding to the most frequent level, so that every coefficient is a comparison against the most common category.
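The interpretation can be verified on a small example. The data below is made up purely for illustration (the response y and the data frame toy are not from the original example), and the dummy columns are built by hand so that the sketch runs with base R only. "F" is the base level, so it has no dummy column.

```r
# Made-up toy data: a categorical column and a numeric response.
toy = data.frame(
  Cat_V = c("F", "F", "F", "Ba", "Ba", "Bd", "Bd"),
  y = c(10, 12, 11, 15, 17, 7, 9)
)

# Dummy columns for every level except the base level "F".
toy$Cat_V_Ba = as.integer(toy$Cat_V == "Ba")
toy$Cat_V_Bd = as.integer(toy$Cat_V == "Bd")

# Linear regression on the dummy columns.
fit = lm(y ~ Cat_V_Ba + Cat_V_Bd, data = toy)
coefs = coef(fit)
coefs

# Group means for comparison.
group_means = tapply(toy$y, toy$Cat_V, mean)
group_means

# Differences between each level's mean and the base level's mean.
group_means[c("Ba", "Bd")] - group_means[["F"]]
```

With a single categorical predictor, the intercept equals the mean of the base level, and each dummy coefficient equals the difference between that level's mean and the base level's mean.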