How to Convert Categorical Variable to Numeric in Pandas
- Convert Categorical Variable to Numeric Variable in Pandas
- Use the apply Function to Convert Categorical Variable to Numeric Variable in Pandas
This tutorial explores the concept of converting categorical variables to numeric variables in Pandas.
Convert Categorical Variable to Numeric Variable in Pandas
This section covers how and why to convert a variable from one data type to another, in particular a categorical variable to a numeric one.
Such a conversion is often needed because a given data type may not suit the analysis or interpretation task at hand. In that situation, Pandas can convert a column from one type to another.
Let us walk through the operation step by step.
Before we begin, we create a dummy data frame named df to work with. It has three columns with a few values each, and we can create it using the following code.
import pandas as pd
df = pd.DataFrame(
    {"col1": [1, 2, 3, 4, 5], "col2": list("abcab"), "col3": list("ababb")}
)
The above code creates a data frame along with a few entries. To view the entries in the data, we use the following code.
print(df)
The above code gives the following output.
   col1 col2 col3
0     1    a    a
1     2    b    b
2     3    c    a
3     4    a    b
4     5    b    b
As we can see, the data frame has three columns and five rows indexed from 0 to 4. col1 holds numeric values, while col2 and col3 hold letters.
Our job now is to convert these letter values into numeric values.
Use the apply Function to Convert Categorical Variable to Numeric Variable in Pandas
With the data set up, let us get straight to the task. The first step is to store the letter columns as categorical data and then inspect the data type (dtype) of each column.
The following code converts col2 and col3 to the category dtype using astype and then prints the dtype associated with each column.
df["col2"] = df["col2"].astype("category")
df["col3"] = df["col3"].astype("category")
print(df.dtypes)
The code produces the following output.
col1       int64
col2    category
col3    category
dtype: object
As we can see, the output above lists the data type of each column. col1 has the data type int64, while col2 and col3 both have the data type category.
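To make the category dtype more concrete, the following minimal sketch inspects what a categorical column stores. It builds a standalone Series named s (a name chosen only for this illustration) with the same values as col2 and reads the cat accessor that Pandas exposes on categorical data.

```python
import pandas as pd

# Same values as col2 in the tutorial's data frame; `s` is an illustrative name.
s = pd.Series(list("abcab")).astype("category")

# The distinct labels. For plain strings they are stored in sorted order.
labels = list(s.cat.categories)
print(labels)  # ['a', 'b', 'c']

# Each value is stored internally as the integer position of its label.
codes = s.cat.codes.tolist()
print(codes)  # [0, 1, 2, 0, 1]
```

These integer codes are what the final conversion step relies on.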
Now that we know the data types for each column, we can move on to the next step.
The next step is to find the categorical columns and list them together. This step is not difficult, but it is important because it tells us which columns need to be converted to numeric variables.
cat_columns = df.select_dtypes(["category"]).columns
As shown in the code, select_dtypes keeps only the columns whose dtype is category, and the columns attribute returns their names. Similarly, we can select columns of any other dtype as required.
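As a rough illustration of selecting other dtypes, the sketch below uses the include and exclude parameters of select_dtypes on the same example data. Only col2 is converted to category here so that the three selections differ; the variable names are chosen for this example only.

```python
import pandas as pd

df = pd.DataFrame(
    {"col1": [1, 2, 3, 4, 5], "col2": list("abcab"), "col3": list("ababb")}
)
df["col2"] = df["col2"].astype("category")

# Columns with any numeric dtype.
numeric_columns = list(df.select_dtypes(include="number").columns)

# Columns with the category dtype.
category_columns = list(df.select_dtypes(include="category").columns)

# Everything except the category columns.
other_columns = list(df.select_dtypes(exclude="category").columns)

print(numeric_columns)   # ['col1']
print(category_columns)  # ['col2']
print(other_columns)     # ['col1', 'col3']
```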
Now that we have found all the categorical columns, let's print them. We can perform this operation using the following code.
print(cat_columns)
The code fetches the following output.
Index(['col2', 'col3'], dtype='object')
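The output lists the names of the categorical columns, col2 and col3. Note that dtype='object' here describes the Index of column names, which are strings, and not the dtype of the columns themselves.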
The last step is to convert these categorical variables to numeric variables. The following code applies a function to each categorical column and replaces its values with the integer codes stored in cat.codes.
df[cat_columns] = df[cat_columns].apply(lambda x: x.cat.codes)
The data frame now looks as follows.
   col1  col2  col3
0     1     0     0
1     2     1     1
2     3     2     0
3     4     0     1
4     5     1     1
We can display this output with print(df).
As shown in the output above, the letters have been replaced by integer codes: a becomes 0, b becomes 1, and c becomes 2. The categorical variables are now numeric variables.
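Because the conversion overwrites the original letters, it can be useful to record which code stands for which label before converting. The following is a minimal sketch of one way to do that, not part of the tutorial's own steps; code_to_label and restored are hypothetical names introduced for this example.

```python
import pandas as pd

df = pd.DataFrame(
    {"col1": [1, 2, 3, 4, 5], "col2": list("abcab"), "col3": list("ababb")}
)
cat_columns = ["col2", "col3"]
for col in cat_columns:
    df[col] = df[col].astype("category")

# Record code -> label for every categorical column before overwriting it.
code_to_label = {
    col: dict(enumerate(df[col].cat.categories)) for col in cat_columns
}

# The conversion step from the tutorial.
df[cat_columns] = df[cat_columns].apply(lambda x: x.cat.codes)

# Map the codes back to the original letters when needed.
restored = df["col2"].map(code_to_label["col2"])
print(code_to_label["col2"])  # {0: 'a', 1: 'b', 2: 'c'}
print(restored.tolist())      # ['a', 'b', 'c', 'a', 'b']
```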
Thus, by selecting the categorical columns and using the apply function, we have converted variables from categorical to numeric in our data frame.
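One caveat when applying this approach to real data: the example data frame has no missing values, but a missing value in a categorical column receives the code -1 rather than a regular category code. The short sketch below demonstrates this on a standalone Series; the names s, codes, and has_missing are used only for this illustration.

```python
import pandas as pd

# A categorical column with one missing value.
s = pd.Series(["a", None, "b"]).astype("category")

codes = s.cat.codes
print(codes.tolist())  # [0, -1, 1]

# A simple check for missing values after the conversion.
has_missing = bool((codes == -1).any())
print(has_missing)  # True
```

Checking for -1 after the conversion helps avoid treating missing entries as a real category.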