How to Compare Pandas DataFrame Objects
This tutorial explains how we can compare Pandas DataFrame objects in Python. We can compare DataFrames using the == operator.
import pandas as pd
data_season1 = {
"Player": ["Lewandowski", "Haland", "Ronaldo", "Messi", "Mbappe"],
"Goals": [10, 8, 6, 5, 4],
}
data_season2 = {
"Player": ["Lewandowski", "Haland", "Ronaldo", "Messi", "Mbappe"],
"Goals": [7, 8, 6, 7, 4],
}
df_1 = pd.DataFrame(data_season1)
df_2 = pd.DataFrame(data_season2)
print("df_1:")
print(df_1)
print("")
print("df_2:")
print(df_2)
Output:
df_1:
Player Goals
0 Lewandowski 10
1 Haland 8
2 Ronaldo 6
3 Messi 5
4 Mbappe 4
df_2:
Player Goals
0 Lewandowski 7
1 Haland 8
2 Ronaldo 6
3 Messi 7
4 Mbappe 4
We will use the DataFrames df_1 and df_2 to demonstrate the comparison of DataFrames in this article.
Compare Pandas DataFrame Objects Using the == Operator
import pandas as pd
data_season1 = {
"Player": ["Lewandowski", "Haland", "Ronaldo", "Messi", "Mbappe"],
"Goals": [10, 8, 6, 5, 4],
}
data_season2 = {
"Player": ["Lewandowski", "Haland", "Ronaldo", "Messi", "Mbappe"],
"Goals": [7, 8, 6, 7, 4],
}
df_1 = pd.DataFrame(data_season1)
df_2 = pd.DataFrame(data_season2)
print(df_1 == df_2)
Output:
Player Goals
0 True False
1 True True
2 True True
3 True False
4 True True
It compares the corresponding elements of df_1 and df_2 and returns True where the elements at a position are the same; otherwise it returns False.
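When a single yes/no answer is enough, a minimal sketch using pandas.DataFrame.equals() may help. The variable names same_as_copy and same_as_other are illustrative and not part of the original example.

```python
import pandas as pd

players = ["Lewandowski", "Haland", "Ronaldo", "Messi", "Mbappe"]
df_1 = pd.DataFrame({"Player": players, "Goals": [10, 8, 6, 5, 4]})
df_2 = pd.DataFrame({"Player": players, "Goals": [7, 8, 6, 7, 4]})

# equals() collapses the whole comparison into one boolean:
# True only when shape, labels, and all values match.
same_as_copy = df_1.equals(df_1.copy())
same_as_other = df_1.equals(df_2)

print(same_as_copy)
print(same_as_other)
```

Unlike ==, equals() treats NaN values in the same position as equal, so it can be the safer choice when the data may contain missing values.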
We can use the pandas.DataFrame.all() method with axis=1 to find which rows are the same in both df_1 and df_2.
import pandas as pd
data_season1 = {
"Player": ["Lewandowski", "Haland", "Ronaldo", "Messi", "Mbappe"],
"Goals": [10, 8, 6, 5, 4],
}
data_season2 = {
"Player": ["Lewandowski", "Haland", "Ronaldo", "Messi", "Mbappe"],
"Goals": [7, 8, 6, 7, 4],
}
df_1 = pd.DataFrame(data_season1)
df_2 = pd.DataFrame(data_season2)
print((df_1 == df_2).all(axis=1))
Output:
0 False
1 True
2 True
3 False
4 True
dtype: bool
Rows with a True value in the output are identical in both DataFrames. Rows with a False value differ in at least one column.
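The same idea can be applied column-wise. The following sketch, which assumes the same sample data, checks whether each column is identical and counts the mismatched cells per column. The names matches, columns_identical, and mismatches_per_column are illustrative.

```python
import pandas as pd

players = ["Lewandowski", "Haland", "Ronaldo", "Messi", "Mbappe"]
df_1 = pd.DataFrame({"Player": players, "Goals": [10, 8, 6, 5, 4]})
df_2 = pd.DataFrame({"Player": players, "Goals": [7, 8, 6, 7, 4]})

# Element-wise comparison: a DataFrame of booleans
matches = df_1 == df_2

# axis=0 reduces down each column: True if every row matches in that column
columns_identical = matches.all(axis=0)

# Invert the mask and sum to count differing cells per column
mismatches_per_column = (~matches).sum()

print(columns_identical)
print(mismatches_per_column)
```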
We can use boolean indexing to list all the rows whose values differ in df_1 and df_2.
import pandas as pd
data_season1 = {
"Player": ["Lewandowski", "Haland", "Ronaldo", "Messi", "Mbappe"],
"Goals": [10, 8, 6, 5, 4],
}
data_season2 = {
"Player": ["Lewandowski", "Haland", "Ronaldo", "Messi", "Mbappe"],
"Goals": [7, 8, 6, 7, 4],
}
df_1 = pd.DataFrame(data_season1)
df_2 = pd.DataFrame(data_season2)
print(df_1[~(df_1 == df_2).all(axis=1)])
Output:
Player Goals
0 Lewandowski 10
3 Messi 5
It lists all the rows of df_1 that have different values than the corresponding rows in df_2.
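To see the differing values from both DataFrames side by side, one option is pandas.DataFrame.compare(). This is a sketch that assumes pandas 1.1 or newer, where that method is available. The name diff is illustrative.

```python
import pandas as pd

players = ["Lewandowski", "Haland", "Ronaldo", "Messi", "Mbappe"]
df_1 = pd.DataFrame({"Player": players, "Goals": [10, 8, 6, 5, 4]})
df_2 = pd.DataFrame({"Player": players, "Goals": [7, 8, 6, 7, 4]})

# compare() keeps only the rows and columns that differ.
# The result has two-level columns: (column name, "self") holds the
# value from df_1 and (column name, "other") the value from df_2.
diff = df_1.compare(df_2)

print(diff)
```

Like the == operator, compare() expects both DataFrames to be identically labeled.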
If df_1 and df_2 have different indexes, we get an error saying ValueError: Can only compare identically-labeled DataFrame objects.
import pandas as pd
data_season1 = {
"Player": ["Lewandowski", "Haland", "Ronaldo", "Messi", "Mbappe"],
"Goals": [10, 8, 6, 5, 4],
}
data_season2 = {
"Player": ["Lewandowski", "Haland", "Ronaldo", "Messi", "Mbappe"],
"Goals": [7, 8, 6, 7, 4],
}
df_1 = pd.DataFrame(data_season1)
df_2 = pd.DataFrame(data_season2, index=["a", "b", "c", "d", "e"])
print(df_1 == df_2)
Output:
Traceback (most recent call last):
...
ValueError: Can only compare identically-labeled DataFrame objects
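If the comparison runs inside a larger script, the error can be caught rather than letting the program stop. The following is a minimal sketch. The variable result is illustrative, and the exact wording of the error message may vary between pandas versions.

```python
import pandas as pd

players = ["Lewandowski", "Haland", "Ronaldo", "Messi", "Mbappe"]
df_1 = pd.DataFrame({"Player": players, "Goals": [10, 8, 6, 5, 4]})
df_2 = pd.DataFrame(
    {"Player": players, "Goals": [7, 8, 6, 7, 4]},
    index=["a", "b", "c", "d", "e"],
)

try:
    result = df_1 == df_2
except ValueError as err:
    # Raised because the row labels of df_1 and df_2 do not match
    result = None
    print(f"Comparison failed: {err}")
```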
We can use the [pandas.DataFrame.reset_index() method](/api/python-pandas/pandas-dataframe-dataframe.reset_index-function/) to reset the indices to overcome the above issue.
import pandas as pd
data_season1 = {
"Player": ["Lewandowski", "Haland", "Ronaldo", "Messi", "Mbappe"],
"Goals": [10, 8, 6, 5, 4],
}
data_season2 = {
"Player": ["Lewandowski", "Haland", "Ronaldo", "Messi", "Mbappe"],
"Goals": [7, 8, 6, 7, 4],
}
df_1 = pd.DataFrame(data_season1)
df_2 = pd.DataFrame(data_season2, index=["a", "b", "c", "d", "e"])
df_2.reset_index(drop=True, inplace=True)
print(df_1 == df_2)
Output:
Player Goals
0 True False
1 True True
2 True True
3 True False
4 True True
It resets the index of df_2 before comparing df_1 and df_2 so that the two DataFrames have the same indices, which makes the comparison possible.
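An alternative that leaves df_2 unmodified is to compare the underlying NumPy arrays, which ignores row and column labels entirely. This is a sketch that assumes both DataFrames have the same shape and the same column order. The names mask and cell_matches are illustrative.

```python
import pandas as pd

players = ["Lewandowski", "Haland", "Ronaldo", "Messi", "Mbappe"]
df_1 = pd.DataFrame({"Player": players, "Goals": [10, 8, 6, 5, 4]})
df_2 = pd.DataFrame(
    {"Player": players, "Goals": [7, 8, 6, 7, 4]},
    index=["a", "b", "c", "d", "e"],
)

# to_numpy() drops the labels, so the comparison is purely positional
mask = df_1.to_numpy() == df_2.to_numpy()

# Wrap the boolean array back into a DataFrame for readable output
cell_matches = pd.DataFrame(mask, columns=df_1.columns)

print(cell_matches)
```

Because labels are ignored, this approach only makes sense when the rows of both DataFrames are known to be in the same order.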
We must also make sure both DataFrames have the same number of rows and the same column labels before comparing them.
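One way to guard against a size mismatch is to check the shape attribute first. The helper same_shape below is hypothetical and not a pandas function, and the shortened df_2 is made up purely for illustration.

```python
import pandas as pd


def same_shape(left: pd.DataFrame, right: pd.DataFrame) -> bool:
    """Hypothetical helper: True when both frames have equal (rows, columns)."""
    return left.shape == right.shape


df_1 = pd.DataFrame({"Goals": [10, 8, 6, 5, 4]})
df_2 = pd.DataFrame({"Goals": [7, 8, 6]})  # illustrative: only three rows

if same_shape(df_1, df_2):
    print(df_1 == df_2)
else:
    print(f"Cannot compare: shapes {df_1.shape} and {df_2.shape} differ")
```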
Suraj Joshi is a backend software engineer at Matrice.ai.