Pandas DataFrame.drop_duplicates() Function

Suraj Joshi Jan 30, 2023
  1. Syntax of pandas.DataFrame.drop_duplicates()
  2. Example Codes: Remove Duplicate Rows Using Pandas DataFrame.drop_duplicates() Method
  3. Example Codes: Set subset Parameter in Pandas DataFrame.drop_duplicates() Method
  4. Example Codes: Set keep Parameter in Pandas DataFrame.drop_duplicates() Method
  5. Example Codes: Set ignore_index Parameter in Pandas DataFrame.drop_duplicates() Method

The Python Pandas DataFrame.drop_duplicates() function removes duplicate rows from a DataFrame. By default, it compares all columns and keeps the first occurrence of each duplicated row.

Syntax of pandas.DataFrame.drop_duplicates():

DataFrame.drop_duplicates(subset: Union[Hashable, Sequence[Hashable], NoneType]=None,
                          keep: Union[str, bool]='first',
                          inplace: bool=False,
                          ignore_index: bool=False)

Parameters

subset Column label or sequence of labels. Columns to consider when identifying duplicates. By default, all columns are used.
keep 'first', 'last', or False. Keep the first occurrence and drop the other duplicates (keep='first'), keep the last occurrence and drop the other duplicates (keep='last'), or drop all duplicates (keep=False).
inplace Boolean. If True, modify the caller DataFrame in place.
ignore_index Boolean. If True, the index labels of the original DataFrame are ignored and the result is relabeled 0, 1, ..., n-1. The default value is False, which means the original index labels are kept.

Return

If inplace is False (the default), a new DataFrame with the duplicate rows removed; if inplace is True, None.
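The difference between the two modes can be seen in a minimal sketch. The three-row DataFrame below is made up for illustration and is not one of the article's datasets.

```python
import pandas as pd

# Illustrative data: rows 0 and 2 are identical.
df = pd.DataFrame({"Name": ["Orange", "Mango", "Orange"],
                   "Price": [34, 24, 34]})

# Default (inplace=False): a new DataFrame is returned and df is left untouched.
result = df.drop_duplicates()
rows_before = len(df)       # df still has all 3 rows here
rows_in_result = len(result)

# inplace=True: df itself is modified, so the return value is not needed.
df.drop_duplicates(inplace=True)
rows_after = len(df)

print(rows_before, rows_in_result, rows_after)
```

Assigning the result to a new name, as in the first call, is usually easier to follow than modifying the DataFrame in place.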

Example Codes: Remove Duplicate Rows Using Pandas DataFrame.drop_duplicates() Method

import pandas as pd
fruit_list = [ ('Orange', 34, 'Yes' ,'ABC') ,
             ('Mango', 24, 'No','XYZ' ) ,
             ('banana', 14, 'No','BCD' ) ,
            ('Orange', 34, 'Yes' ,'ABC') ]

df = pd.DataFrame(fruit_list, 
                  columns = ['Name',
                             'Price',
                             'In_Stock',
                             'Supplier'])

print("DataFrame:")
print(df)

df_unique=df.drop_duplicates() 

print("DataFrame with Unique Rows:")
print(df_unique)

Output:

DataFrame:
     Name  Price In_Stock Supplier
0  Orange     34      Yes      ABC
1   Mango     24       No      XYZ
2  banana     14       No      BCD
3  Orange     34      Yes      ABC
DataFrame with Unique Rows:
     Name  Price In_Stock Supplier
0  Orange     34      Yes      ABC
1   Mango     24       No      XYZ
2  banana     14       No      BCD

The first and fourth rows of the original DataFrame are identical.

Calling drop_duplicates() with no arguments compares all columns, keeps the first of the two identical rows, and removes the fourth.

Example Codes: Set subset Parameter in Pandas DataFrame.drop_duplicates() Method

import pandas as pd
fruit_list = [ ('Orange', 34, 'Yes' ,'ABC') ,
             ('Mango', 24, 'No','XYZ' ) ,
             ('banana', 14, 'No','ABC' ) ,
            ('Orange', 34, 'Yes' ,'ABC') ]

df = pd.DataFrame(fruit_list, 
                  columns = ['Name',
                             'Price',
                             'In_Stock',
                             'Supplier'])

print("DataFrame:")
print(df)

df_unique = df.drop_duplicates(subset="Supplier")

print("DataFrame with Unique Values of Supplier Column:")
print(df_unique)

Output:

DataFrame:
     Name  Price In_Stock Supplier
0  Orange     34      Yes      ABC
1   Mango     24       No      XYZ
2  banana     14       No      ABC
3  Orange     34      Yes      ABC
DataFrame with Unique Values of Supplier Column:
     Name  Price In_Stock Supplier
0  Orange     34      Yes      ABC
1   Mango     24       No      XYZ

With subset="Supplier", only the Supplier column is compared when identifying duplicates. For each Supplier value, the first row is kept and every later row with the same value is removed.

Here, the first, third, and fourth rows share the Supplier value ABC. Because keep defaults to 'first', the first row is kept and the third and fourth rows are removed.
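The subset parameter also accepts a sequence of labels. As a small sketch on the same data, the choice of In_Stock and Supplier as the column pair is only an illustration; rows are treated as duplicates when both columns match.

```python
import pandas as pd

fruit_list = [("Orange", 34, "Yes", "ABC"),
              ("Mango", 24, "No", "XYZ"),
              ("banana", 14, "No", "ABC"),
              ("Orange", 34, "Yes", "ABC")]

df = pd.DataFrame(fruit_list,
                  columns=["Name", "Price", "In_Stock", "Supplier"])

# A row is a duplicate only if BOTH In_Stock and Supplier match an earlier row.
df_unique = df.drop_duplicates(subset=["In_Stock", "Supplier"])

print(df_unique)
```

The banana row survives here because its In_Stock value differs from the first row's, even though both have the supplier ABC. Only the fourth row is dropped.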

Example Codes: Set keep Parameter in Pandas DataFrame.drop_duplicates() Method

import pandas as pd
fruit_list = [ ('Orange', 34, 'Yes' ,'ABC') ,
             ('Mango', 24, 'No','XYZ' ) ,
             ('banana', 14, 'No','ABC' ) ,
            ('Orange', 34, 'Yes' ,'ABC') ]

df = pd.DataFrame(fruit_list, 
                  columns = ['Name',
                             'Price',
                             'In_Stock',
                             'Supplier'])

print("DataFrame:")
print(df)

df_unique = df.drop_duplicates(subset="Supplier", keep="last")

print("DataFrame with Unique Values of Supplier Column:")
print(df_unique)

Output:

DataFrame:
     Name  Price In_Stock Supplier
0  Orange     34      Yes      ABC
1   Mango     24       No      XYZ
2  banana     14       No      ABC
3  Orange     34      Yes      ABC
DataFrame with Unique Values of Supplier Column:
     Name  Price In_Stock Supplier
1   Mango     24       No      XYZ
3  Orange     34      Yes      ABC

With keep="last", the last row for each Supplier value is kept and every earlier row with the same value is removed.

Here, the first, third, and fourth rows share the Supplier value ABC. So the first and third rows are removed from the DataFrame, and the fourth row is kept.
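The third option listed under Parameters, keep=False, is not covered by the example above. A minimal sketch on the same data shows that it drops every row whose Supplier value occurs more than once, keeping none of them.

```python
import pandas as pd

fruit_list = [("Orange", 34, "Yes", "ABC"),
              ("Mango", 24, "No", "XYZ"),
              ("banana", 14, "No", "ABC"),
              ("Orange", 34, "Yes", "ABC")]

df = pd.DataFrame(fruit_list,
                  columns=["Name", "Price", "In_Stock", "Supplier"])

# keep=False: no occurrence of a repeated Supplier value is kept.
df_unique = df.drop_duplicates(subset="Supplier", keep=False)

print(df_unique)
```

Only the Mango row remains, because XYZ is the only Supplier value that appears exactly once.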

Example Codes: Set ignore_index Parameter in Pandas DataFrame.drop_duplicates() Method

import pandas as pd
fruit_list = [ ('Orange', 34, 'Yes' ,'ABC') ,
             ('Mango', 24, 'No','XYZ' ) ,
             ('banana', 14, 'No','ABC' ) ,
            ('Orange', 34, 'Yes' ,'ABC') ]

df = pd.DataFrame(fruit_list, 
                  columns = ['Name',
                             'Price',
                             'In_Stock',
                             'Supplier'])

print("DataFrame:")
print(df)

df.drop_duplicates(subset="Supplier", keep="last", inplace=True, ignore_index=True)

print("DataFrame with Unique Values of Supplier Column:")
print(df)

Output:

DataFrame:
     Name  Price In_Stock Supplier
0  Orange     34      Yes      ABC
1   Mango     24       No      XYZ
2  banana     14       No      ABC
3  Orange     34      Yes      ABC
DataFrame with Unique Values of Supplier Column:
     Name  Price In_Stock Supplier
0   Mango     24       No      XYZ
1  Orange     34      Yes      ABC

Here, as ignore_index is set to True, the index labels from the original DataFrame are ignored, and the remaining rows are relabeled starting from 0.

Because inplace=True is passed, the drop_duplicates() call modifies the original DataFrame instead of returning a new one.
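The ignore_index parameter does not depend on inplace. As a sketch, assuming a pandas version that supports ignore_index, the non-inplace call below returns a relabeled copy, and it gives the same result as dropping duplicates and then calling reset_index(drop=True). The names relabeled and via_reset are illustrative.

```python
import pandas as pd

fruit_list = [("Orange", 34, "Yes", "ABC"),
              ("Mango", 24, "No", "XYZ"),
              ("banana", 14, "No", "ABC"),
              ("Orange", 34, "Yes", "ABC")]

df = pd.DataFrame(fruit_list,
                  columns=["Name", "Price", "In_Stock", "Supplier"])

# Returns a new DataFrame with a fresh 0..n-1 index; df is not modified.
relabeled = df.drop_duplicates(subset="Supplier", keep="last", ignore_index=True)

# Equivalent two-step form: drop duplicates, then discard the old index labels.
via_reset = df.drop_duplicates(subset="Supplier", keep="last").reset_index(drop=True)

print(relabeled)
print(relabeled.equals(via_reset))
```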

Author: Suraj Joshi

Suraj Joshi is a backend software engineer at Matrice.ai.
