Pandas Fuzzy Match
This article educates how to merge data frames and see how to apply the fuzzy match to compare two pandas’ data frames in python.
Apply Fuzzy Match on Pandas Data Frame in Python
Suppose we have the following use case with two different tables, and we want to merge them into a common column; look at an example.
We have df1
, the first data frame, and df2
, the second data frame, and both contain the column Company_Name
. For example, see the following code block.
import pandas as pd
df1 = [
"Amazing Art Inc.",
"Amazing Ants Ltd.",
"Beautiful Buildings Inc.",
"Clean Communication Inc.",
"Dramatic Dramas Ltd.",
]
df2 = [
"Amazing art gnc.",
"Amazing Aant Ltd.",
"Beautiful Buildings Inc",
"Clear Communication Inc.",
"Pramatic Dramas Inc.",
]
df1 = pd.DataFrame(df1, columns=["Company_Name"])
df2 = pd.DataFrame(reversed(df2), columns=["Company_Name"])
When we compare them, we will see that they are pretty similar names, but they are not precisely identical such as we have Amazing Art Inc.
, and in another data frame, we have Amazing Art gnc.
. Few changes are made to other items.
When working with real-life data, we cannot match these items; fortunately, there is a solution using the fuzzywuzzy
library. Now we will see how to install this library in Jupyter because we are using the Jupyter notebook.
We will import the sys
library and use the executable
inside the curly brackets, and in front of this, we will write a command.
import sys
!{sys.executable} -m pip install fuzzywuzzy
Output:
Collecting fuzzywuzzy
Downloading fuzzywuzzy-0.18.0-py2.py3-none-any.whl (18 kB)
Installing collected packages: fuzzywuzzy
Successfully installed fuzzywuzzy-0.18.0
After installing successfully, we load the required function and module from the fuzzywuzzy
library.
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
Now we will use the extractOne()
function from the process
to show how it works, and it will only return the most likely match that it found. So, for example, we want to match Beautiful Bldgfs
and then compare it to df1
using the column name.
process.extractOne("Beautiful Bldgfs", df1["Company_Name"])
As we can see, it returns Beautiful Buildings Inc.
, and 77
is an estimation or confidence that this is the correct match. And in the last numerical value, 2
shows which position it has in the data frame.
('Beautiful Buildings Inc.', 77, 2)
To compare the entire data frame for all the columns, we would create a new column, load the Company_Name
column from df1
, and map it using the map()
function. Using the lambda
function we call extractOne()
.
Inside this function, we put df2
and check the column to find the closest match, and afterwards, we could use indexes because it returns a tuple, and we only want to return the first entry of the tuple.
After calling df1
, we see we got the closest matches for all entries.
df1["Best_Match"] = df1["Company_Name"].map(
lambda x: process.extractOne(x, df2["Company_Name"])[0]
)
df1
Output:
Here is the complete Python code for the above example.
from fuzzywuzzy import fuzz
from fuzzywuzzy import process
import pandas as pd
df1 = [
"Amazing Art Inc.",
"Amazing Ants Ltd.",
"Beautiful Buildings Inc.",
"Clean Communication Inc.",
"Dramatic Dramas Ltd.",
]
df2 = [
"Amazing art gnc.",
"Amazing Aant Ltd.",
"Beautiful Buildings Inc",
"Clear Communication Inc.",
"Pramatic Dramas Inc.",
]
df1 = pd.DataFrame(df1, columns=["Company_Name"])
df2 = pd.DataFrame(reversed(df2), columns=["Company_Name"])
print(process.extractOne("Beautiful Bldgfs", df1["Company_Name"]))
df1["Best_Match"] = df1["Company_Name"].map(
lambda x: process.extractOne(x, df2["Company_Name"])[0]
)
print(df1)
Hello! I am Salman Bin Mehmood(Baum), a software developer and I help organizations, address complex problems. My expertise lies within back-end, data science and machine learning. I am a lifelong learner, currently working on metaverse, and enrolled in a course building an AI application with python. I love solving problems and developing bug-free software for people. I write content related to python and hot Technologies.
LinkedIn