How to Get Dummies in Pandas
- pandas.get_dummies() Method
- Create DataFrame With Dummy Variable Columns Using pandas.get_dummies() Method
- Set columns to Create Dummy Variables for Specified Columns Only
- Set prefix to Change the Default Name of Dummy Columns
This tutorial explains how to generate a DataFrame with dummy (indicator) variables from a DataFrame that contains categorical columns.
pandas.get_dummies() Method
pandas.get_dummies(
data,
prefix=None,
prefix_sep="_",
dummy_na=False,
columns=None,
sparse=False,
drop_first=False,
dtype=None,
)
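The signature above lists several optional parameters. As a rough sketch of what a few of them do, the example below calls get_dummies() on a small made-up DataFrame (the data and variable names are assumptions for illustration only) and prints the resulting column names. dtype=int is passed so the dummy values are integers on any pandas version.

```python
import pandas as pd

# Made-up data for illustration; one value is missing on purpose.
df = pd.DataFrame({"Sex": ["Male", "Female", None, "Male"]})

# Default behaviour: one dummy column per observed category.
default_dummies = pd.get_dummies(df, dtype=int)

# dummy_na=True adds one more indicator column for missing values.
with_na = pd.get_dummies(df, dummy_na=True, dtype=int)

# drop_first=True drops the first category, leaving k-1 columns for k categories.
dropped = pd.get_dummies(df, drop_first=True, dtype=int)

# prefix_sep changes the separator between the prefix and the category value.
custom_sep = pd.get_dummies(df, prefix_sep="-", dtype=int)

print(list(default_dummies.columns))
print(with_na.shape[1])
print(list(dropped.columns))
print(list(custom_sep.columns))
```

With the default settings, the row holding the missing value gets 0 in every dummy column.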
Create DataFrame With Dummy Variable Columns Using pandas.get_dummies() Method
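import pandas as pd

# Sample data: Id is numeric, Name and Sex hold strings.
students_df = pd.DataFrame(
    {
        "Id": [302, 504, 708, 103, 303],
        "Name": ["Mike", "Christine", "Rob", "Daniel", "Jennifer"],
        "Sex": ["Male", "Female", "Male", "Male", "Female"],
    }
)

# Encode every string column. dtype=int gives 0/1 values on pandas 2.0+,
# where the default dummy dtype is bool (True/False).
students_df_dummies = pd.get_dummies(students_df, dtype=int)

print("The original DataFrame is:")
print(students_df, "\n")
print("DataFrame with Dummies:")
print(students_df_dummies)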
Output:
The original DataFrame is:
    Id       Name     Sex
0  302       Mike    Male
1  504  Christine  Female
2  708        Rob    Male
3  103     Daniel    Male
4  303   Jennifer  Female

DataFrame with Dummies:
    Id  Name_Christine  Name_Daniel  Name_Jennifer  Name_Mike  Name_Rob  Sex_Female  Sex_Male
0  302               0            0              0          1         0           0         1
1  504               1            0              0          0         0           1         0
2  708               0            0              0          0         1           0         1
3  103               0            1              0          0         0           0         1
4  303               0            0              1          0         0           1         0
get_dummies() returns a DataFrame in which each dummy column name is formed by joining the original column name and one of its unique values with an underscore (the default prefix_sep). The numeric Id column is left unchanged.
The Name column has five unique values, so it is split into five columns named Name_ plus each unique name in the DataFrame. Each dummy column holds 1 or 0 depending on the value in the original DataFrame. In pandas 2.0 and later, the dummy columns are Boolean by default, so these values print as True and False unless dtype=int is passed to get_dummies().
For example, the row whose Name is Daniel in students_df has the value 1 in the Name_Daniel column of students_df_dummies, while every other row has 0 in that column.
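To make the relationship between the original values and the dummy columns concrete, the sketch below rebuilds the Name_Daniel column by hand with a plain equality comparison and checks it against the result of get_dummies(). The variable manual_daniel is a name introduced here for illustration.

```python
import pandas as pd

students_df = pd.DataFrame(
    {
        "Id": [302, 504, 708, 103, 303],
        "Name": ["Mike", "Christine", "Rob", "Daniel", "Jennifer"],
        "Sex": ["Male", "Female", "Male", "Male", "Female"],
    }
)

# dtype=int so the dummy values are 0/1 on any pandas version.
dummies = pd.get_dummies(students_df, dtype=int)

# Build the indicator by hand: 1 where Name equals "Daniel", otherwise 0.
manual_daniel = (students_df["Name"] == "Daniel").astype(int)

print(manual_daniel.tolist())           # [0, 0, 0, 1, 0]
print(dummies["Name_Daniel"].tolist())  # [0, 0, 0, 1, 0]

# Every row has exactly one 1 among the Name_* columns.
print(dummies.filter(like="Name_").sum(axis=1).tolist())  # [1, 1, 1, 1, 1]
```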
Set columns to Create Dummy Variables for Specified Columns Only
By default, get_dummies() creates dummy columns for every column with dtype object, string, or category. To encode only particular columns, pass a list of their names as the columns argument.
import pandas as pd

students_df = pd.DataFrame(
    {
        "Id": [302, 504, 708, 103, 303],
        "Name": ["Mike", "Christine", "Rob", "Daniel", "Jennifer"],
        "Sex": ["Male", "Female", "Male", "Male", "Female"],
    }
)

# Encode only the Sex column. dtype=int gives 0/1 values on pandas 2.0+,
# where the default dummy dtype is bool (True/False).
students_df_dummies = pd.get_dummies(students_df, columns=["Sex"], dtype=int)

print("The original DataFrame is:")
print(students_df, "\n")
print("DataFrame with Dummies:")
print(students_df_dummies)
Output:
The original DataFrame is:
    Id       Name     Sex
0  302       Mike    Male
1  504  Christine  Female
2  708        Rob    Male
3  103     Daniel    Male
4  303   Jennifer  Female

DataFrame with Dummies:
    Id       Name  Sex_Female  Sex_Male
0  302       Mike           0         1
1  504  Christine           1         0
2  708        Rob           0         1
3  103     Daniel           0         1
4  303   Jennifer           1         0
Only the Sex column is converted to dummy variables; the Id and Name columns are left unchanged.
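The columns argument also works in the other direction: a numeric column, which is skipped by default, is encoded when it is listed explicitly. The sketch below uses a made-up Grade column (an assumption for illustration, not part of the tutorial's data) to compare the two calls.

```python
import pandas as pd

# Hypothetical data: Grade is an integer column, Sex holds strings.
df = pd.DataFrame(
    {
        "Grade": [1, 2, 1],
        "Sex": ["Male", "Female", "Male"],
    }
)

# Default: only the string column is encoded; Grade stays numeric.
default_dummies = pd.get_dummies(df, dtype=int)

# Listing Grade in columns forces it to be encoded as well.
explicit_dummies = pd.get_dummies(df, columns=["Grade", "Sex"], dtype=int)

print(list(default_dummies.columns))   # ['Grade', 'Sex_Female', 'Sex_Male']
print(list(explicit_dummies.columns))  # ['Grade_1', 'Grade_2', 'Sex_Female', 'Sex_Male']
```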
Set prefix to Change the Default Name of Dummy Columns
import pandas as pd

students_df = pd.DataFrame(
    {
        "Id": [302, 504, 708, 103, 303],
        "Name": ["Mike", "Christine", "Rob", "Daniel", "Jennifer"],
        "Sex": ["Male", "Female", "Male", "Male", "Female"],
    }
)

# Encode only the Sex column and name the dummy columns Column_<value>.
# dtype=int gives 0/1 values on pandas 2.0+, where the default is bool.
students_df_dummies = pd.get_dummies(
    students_df, columns=["Sex"], prefix="Column", dtype=int
)

print("The original DataFrame is:")
print(students_df, "\n")
print("DataFrame with Dummies:")
print(students_df_dummies)
Output:
The original DataFrame is:
    Id       Name     Sex
0  302       Mike    Male
1  504  Christine  Female
2  708        Rob    Male
3  103     Daniel    Male
4  303   Jennifer  Female

DataFrame with Dummies:
    Id       Name  Column_Female  Column_Male
0  302       Mike              0            1
1  504  Christine              1            0
2  708        Rob              0            1
3  103     Daniel              0            1
4  303   Jennifer              1            0
Setting prefix="Column" replaces the default prefix (the original column name, Sex) for the dummy columns generated from the Sex column, so their names become Column_Female and Column_Male.
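When more than one column is encoded, prefix can also be given as a dictionary that maps each column name to its own prefix. The sketch below assumes the short prefixes N and S, which are chosen only for illustration.

```python
import pandas as pd

students_df = pd.DataFrame(
    {
        "Id": [302, 504, 708, 103, 303],
        "Name": ["Mike", "Christine", "Rob", "Daniel", "Jennifer"],
        "Sex": ["Male", "Female", "Male", "Male", "Female"],
    }
)

# Map each encoded column to its own prefix (N and S are arbitrary choices).
dummies = pd.get_dummies(
    students_df,
    columns=["Name", "Sex"],
    prefix={"Name": "N", "Sex": "S"},
    dtype=int,
)

print(list(dummies.columns))
# ['Id', 'N_Christine', 'N_Daniel', 'N_Jennifer', 'N_Mike', 'N_Rob', 'S_Female', 'S_Male']
```

A list such as prefix=["N", "S"] should give the same result, provided it has one entry per encoded column, in the same order as columns.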
Suraj Joshi is a backend software engineer at Matrice.ai.