How to Factorize Data Values in Pandas
In this tutorial, we will learn how to factorize data values in Pandas using the pandas.factorize() function.
By identifying the distinct values in an array, the pandas.factorize() function encodes the array as integers, giving each distinct value its own code.
First, we will import the Pandas and NumPy libraries, along with the other required libraries.
import numpy as np
import pandas as pd
from pandas.api.types import CategoricalDtype
Use the pandas.factorize() Function in Pandas
Now we will pass a list of characters to the factorize() function, which returns two things: the labels (one integer code per element) and the unique values. We will store and print them separately.
labels, uniques = pd.factorize(["b", "d", "d", "c", "a", "c", "a", "b"])
The above code returns two arrays: the numeric representation of each character, and the unique values in the order in which they first appear.
Let us print both to see the output.
print("Numeric Representation : \n", labels)
print("Unique Values : \n", uniques)
Numeric Representation :
[0 1 1 2 3 2 3 0]
Unique Values :
['b' 'd' 'c' 'a']
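Each label is simply a position in the array of unique values, so the original data can be rebuilt from the two results. The following is a minimal sketch of that round trip; the names values and restored are chosen here for illustration and are not part of the tutorial's code.

```python
import numpy as np
import pandas as pd

values = ["b", "d", "d", "c", "a", "c", "a", "b"]
labels, uniques = pd.factorize(values)

# Each label indexes into `uniques`, so fancy indexing rebuilds the input.
restored = np.asarray(uniques)[labels]

print(restored.tolist())
```

The printed list should match the list that was passed to factorize().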
We can also sort the unique values by passing sort=True. The labels are then renumbered to match the sorted order.
labels, uniques = pd.factorize(["b", "d", "d", "c", "a", "c", "a", "b"], sort=True)
Printing the labels and unique values again gives the below output.
Numeric Representation :
[1 3 3 2 0 2 0 1]
Unique Values :
['a' 'b' 'c' 'd']
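Sorting changes which integer each value receives, but not the data the labels describe. As a small sketch (the variable names below are illustrative only), we can factorize the same list both ways and check that each result decodes back to the same values.

```python
import pandas as pd

values = ["b", "d", "d", "c", "a", "c", "a", "b"]

# Default: codes follow the order in which values first appear.
codes_seen, uniques_seen = pd.factorize(values)
# sort=True: unique values are sorted and the codes are renumbered to match.
codes_sorted, uniques_sorted = pd.factorize(values, sort=True)

# Both (codes, uniques) pairs decode to the same original values.
decoded_seen = [uniques_seen[i] for i in codes_seen]
decoded_sorted = [uniques_sorted[i] for i in codes_sorted]

print(decoded_seen == decoded_sorted == values)
```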
We can also factorize categorical data, and in this case, the unique values are returned differently. For this purpose, we will use the pd.Categorical() constructor to declare the set of allowed categories for our data values.
a = pd.Categorical(["a", "a", "c"], categories=["a", "b", "c"])
label3, unique3 = pd.factorize(a)
print("Numeric Representation : \n", label3)
print("Unique Values : \n", unique3)
Let us now see the output of the above code.
Numeric Representation :
[0 0 1]
Unique Values :
['a', 'c']
Categories (3, object): ['a', 'b', 'c']
We can see in the above output that the unique values contain only the values that actually occur in the data ('a' and 'c'), even though 'b' is still listed among the categories.
Therefore, we can factorize the data values in Pandas using the above approaches.
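To make the difference between observed values and declared categories concrete, the sketch below inspects the factorized result more closely. It also prints the Categorical's own codes attribute for comparison; that comparison is an addition for illustration and is not part of the tutorial's original code.

```python
import pandas as pd

a = pd.Categorical(["a", "a", "c"], categories=["a", "b", "c"])
codes, uniques = pd.factorize(a)

# Values that actually occur in the data.
print(list(uniques))
# All declared categories are kept on the result, including the unused one.
print(list(uniques.categories))

# factorize() numbers only the observed values, while the Categorical's
# own codes are positions within the full category list.
print(codes.tolist())
print(a.codes.tolist())
```

Here 'c' is coded as 1 by factorize() because it is the second observed value, whereas a.codes stores it as 2 because it is the third declared category.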