How to Drop Duplicate Pandas Rows
-
DataFrame.drop_duplicates()
Syntax -
Remove Duplicate Rows Using the
DataFrame.drop_duplicates()
Method -
Set
keep='last'
in thedrop_duplicates()
Method
This tutorial explains how we can remove all the duplicate rows from a Pandas DataFrame using the DataFrame.drop_duplicates()
method.
DataFrame.drop_duplicates()
Syntax
DataFrame.drop_duplicates(subset=None, keep="first", inplace=False, ignore_index=False)
It returns a DataFrame removing all the repeated rows in the DataFrame.
Remove Duplicate Rows Using the DataFrame.drop_duplicates()
Method
import pandas as pd
df_with_duplicates = pd.DataFrame(
{
"Id": [302, 504, 708, 103, 303, 302],
"Name": ["Watch", "Camera", "Phone", "Shoes", "Watch", "Watch"],
"Cost": ["300", "400", "350", "100", "300", "300"],
}
)
df_without_duplicates = df_with_duplicates.drop_duplicates()
print("DataFrame with duplicates:")
print(df_with_duplicates, "\n")
print("DataFrame without duplicates:")
print(df_without_duplicates, "\n")
Output:
DataFrame with duplicates:
Id Name Cost
0 302 Watch 300
1 504 Camera 400
2 708 Phone 350
3 103 Shoes 100
4 303 Watch 300
5 302 Watch 300
DataFrame without duplicates:
Id Name Cost
0 302 Watch 300
1 504 Camera 400
2 708 Phone 350
3 103 Shoes 100
4 303 Watch 300
It removes the rows having the same values all for all the columns. By default, only the rows having the same values for each column in the DataFrame are considered as duplicates. In the df_with_duplicates
DataFrame, the first and fifth row have the same values for all the columns, s that the fifth row is removed.
Set subset
Parameter to Remove Duplicates Based on Specific Columns Only
import pandas as pd
df_with_duplicates = pd.DataFrame(
{
"Id": [302, 504, 708, 103, 303, 302],
"Name": ["Watch", "Camera", "Phone", "Shoes", "Watch", "Watch"],
"Cost": ["300", "400", "350", "100", "300", "300"],
}
)
df_without_duplicates = df_with_duplicates.drop_duplicates(subset=["Name"])
print("DataFrame with duplicates:")
print(df_with_duplicates, "\n")
print("DataFrame without duplicates:")
print(df_without_duplicates, "\n")
Output:
DataFrame with duplicates:
Id Name Cost
0 302 Watch 300
1 504 Camera 400
2 708 Phone 350
3 103 Shoes 100
4 303 Watch 300
5 302 Watch 300
DataFrame without duplicates:
Id Name Cost
0 302 Watch 300
1 504 Camera 400
2 708 Phone 350
3 103 Shoes 100
Here, we pass Name
as a subset
argument to the drop_duplicates()
method. The fourth and fifth rows are removed as they have the same value of the Name
column as the first column.
Set keep='last'
in the drop_duplicates()
Method
import pandas as pd
df_with_duplicates = pd.DataFrame(
{
"Id": [302, 504, 708, 103, 303, 302],
"Name": ["Watch", "Camera", "Phone", "Shoes", "Watch", "Watch"],
"Cost": ["300", "400", "350", "100", "300", "300"],
}
)
df_without_duplicates = df_with_duplicates.drop_duplicates(subset=["Name"], keep="last")
print("DataFrame with duplicates:")
print(df_with_duplicates, "\n")
print("DataFrame without duplicates:")
print(df_without_duplicates, "\n")
Output:
DataFrame with duplicates:
Id Name Cost
0 302 Watch 300
1 504 Camera 400
2 708 Phone 350
3 103 Shoes 100
4 303 Watch 300
5 302 Watch 300
DataFrame without duplicates:
Id Name Cost
1 504 Camera 400
2 708 Phone 350
3 103 Shoes 100
5 302 Watch 300
It removes all the rows except the last row having the same value as the Name
column.
We set keep=False
to remove all the rows having the same value of any column.
import pandas as pd
df_with_duplicates = pd.DataFrame(
{
"Id": [302, 504, 708, 103, 303, 302],
"Name": ["Watch", "Camera", "Phone", "Shoes", "Watch", "Watch"],
"Cost": ["300", "400", "350", "100", "300", "300"],
}
)
df_without_duplicates = df_with_duplicates.drop_duplicates(subset=["Name"], keep=False)
print("DataFrame with duplicates:")
print(df_with_duplicates, "\n")
print("DataFrame without duplicates:")
print(df_without_duplicates, "\n")
Output:
DataFrame with duplicates:
Id Name Cost
0 302 Watch 300
1 504 Camera 400
2 708 Phone 350
3 103 Shoes 100
4 303 Watch 300
5 302 Watch 300
DataFrame without duplicates:
Id Name Cost
1 504 Camera 400
2 708 Phone 350
3 103 Shoes 100
It removes the first, fifth, and sixth row as they all have the same value for the Name
column.
Suraj Joshi is a backend software engineer at Matrice.ai.
LinkedInRelated Article - Pandas DataFrame Row
- How to Get the Row Count of a Pandas DataFrame
- How to Randomly Shuffle DataFrame Rows in Pandas
- How to Filter Dataframe Rows Based on Column Values in Pandas
- How to Iterate Through Rows of a DataFrame in Pandas
- How to Get Index of All Rows Whose Particular Column Satisfies Given Condition in Pandas
- How to Find Duplicate Rows in a DataFrame Using Pandas