Pandas DataFrame DataFrame.drop_duplicates() Function

Suraj Joshi Jan 30, 2023


Syntax of pandas.DataFrame.drop_duplicates():
Example Codes: Remove Duplicate Rows Using Pandas DataFrame.drop_duplicates() Method
Example Codes: Set subset Parameter in Pandas DataFrame.drop_duplicates() Method
Example Codes: Set keep Parameter in Pandas DataFrame.drop_duplicates() Method
Example Codes: Set ignore_index Parameter in Pandas DataFrame.drop_duplicates() Method


The Python Pandas DataFrame.drop_duplicates() function removes duplicate rows from a DataFrame. By default, it compares all columns and keeps the first occurrence of each duplicated row.

Syntax of `pandas.DataFrame.drop_duplicates()`:

DataFrame.drop_duplicates(subset: Union[Hashable, Sequence[Hashable], NoneType]=None,
                          keep: Union[str, bool]='first',
                          inplace: bool=False,
                          ignore_index: bool=False)

Parameters


`subset`	Column label or sequence of labels. Columns to consider when identifying duplicates; by default, all columns are used.
`keep`	`'first'`, `'last'`, or `False`. Drop all duplicates except the first occurrence (`keep='first'`), drop all duplicates except the last occurrence (`keep='last'`), or drop all duplicates (`keep=False`).
`inplace`	Boolean. If `True`, modify the caller `DataFrame` in place instead of returning a new one.
`ignore_index`	Boolean. If `True`, the index labels of the original `DataFrame` are ignored and the result is labeled 0, 1, ..., n - 1. The default value is `False`, which means the original index labels are kept.

Return

If inplace is False (the default), a new DataFrame with the duplicate rows removed; if inplace is True, None, because the caller DataFrame is modified directly.
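The following is a minimal sketch of the return behavior, using a small made-up DataFrame (the column values are illustrative only). It shows that the default call returns a new DataFrame and leaves the caller unchanged, while `inplace=True` modifies the caller and returns `None`.

```python
import pandas as pd

# Illustrative data: the first and third rows are identical.
df = pd.DataFrame({"Name": ["Orange", "Mango", "Orange"],
                   "Price": [34, 24, 34]})

# Default (inplace=False): a new DataFrame is returned and df is untouched.
result = df.drop_duplicates()
print(len(result), len(df))  # 2 3

# inplace=True: df itself is modified and the call returns None.
returned = df.drop_duplicates(inplace=True)
print(returned, len(df))  # None 2
```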

Example Codes: Remove Duplicate Rows Using Pandas `DataFrame.drop_duplicates()` Method

import pandas as pd

# Sample data: the first and fourth rows are identical.
fruit_list = [
    ("Orange", 34, "Yes", "ABC"),
    ("Mango", 24, "No", "XYZ"),
    ("banana", 14, "No", "BCD"),
    ("Orange", 34, "Yes", "ABC"),
]

df = pd.DataFrame(fruit_list, columns=["Name", "Price", "In_Stock", "Supplier"])

print("DataFrame:")
print(df)

# Drop rows that repeat an earlier row in every column.
df_unique = df.drop_duplicates()

print("DataFrame with Unique Rows:")
print(df_unique)

Output:

DataFrame:
     Name  Price In_Stock Supplier
0  Orange     34      Yes      ABC
1   Mango     24       No      XYZ
2  banana     14       No      BCD
3  Orange     34      Yes      ABC
DataFrame with Unique Rows:
     Name  Price In_Stock Supplier
0  Orange     34      Yes      ABC
1   Mango     24       No      XYZ
2  banana     14       No      BCD

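The first and fourth rows of the original DataFrame are identical.

Calling drop_duplicates() with no arguments compares all columns, keeps the first occurrence (index 0), and removes the later copy (index 3). The remaining rows keep their original index labels.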
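To see which rows would be dropped before removing them, one option is the related `DataFrame.duplicated()` method, which returns a boolean Series. The sketch below reuses the fruit data from the example above and checks that filtering on the inverted mask gives the same result as `drop_duplicates()`; the variable names `mask` and `manual` are only illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [("Orange", 34, "Yes", "ABC"),
     ("Mango", 24, "No", "XYZ"),
     ("banana", 14, "No", "BCD"),
     ("Orange", 34, "Yes", "ABC")],
    columns=["Name", "Price", "In_Stock", "Supplier"],
)

# True marks a row that repeats an earlier row (keep='first' is the default).
mask = df.duplicated()
print(mask.tolist())  # [False, False, False, True]

# Selecting the unmarked rows gives the same rows as drop_duplicates().
manual = df[~mask]
print(manual.equals(df.drop_duplicates()))  # True
```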

Example Codes: Set `subset` Parameter in Pandas `DataFrame.drop_duplicates()` Method

import pandas as pd

# Sample data: the first, third, and fourth rows share the supplier "ABC".
fruit_list = [
    ("Orange", 34, "Yes", "ABC"),
    ("Mango", 24, "No", "XYZ"),
    ("banana", 14, "No", "ABC"),
    ("Orange", 34, "Yes", "ABC"),
]

df = pd.DataFrame(fruit_list, columns=["Name", "Price", "In_Stock", "Supplier"])

print("DataFrame:")
print(df)

# Compare only the Supplier column when looking for duplicates.
df_unique = df.drop_duplicates(subset="Supplier")

print("DataFrame with Unique Values of Supplier Column:")
print(df_unique)

Output:

DataFrame:
     Name  Price In_Stock Supplier
0  Orange     34      Yes      ABC
1   Mango     24       No      XYZ
2  banana     14       No      ABC
3  Orange     34      Yes      ABC
DataFrame with Unique Values of Supplier Column:
     Name  Price In_Stock Supplier
0  Orange     34      Yes      ABC
1   Mango     24       No      XYZ

With subset="Supplier", only the Supplier column is compared, so the method keeps one row for each distinct Supplier value and removes the others.

Here, the first, third, and fourth rows share the same Supplier value, ABC. The third and fourth rows are removed because, by default, the first occurrence is the one that is kept.
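The `subset` parameter also accepts a sequence of labels, in which case rows count as duplicates only when they match in every listed column. The sketch below contrasts a single-column subset with a two-column subset on the same fruit data; the choice of `Name` and `Supplier` as the pair is just for illustration.

```python
import pandas as pd

df = pd.DataFrame(
    [("Orange", 34, "Yes", "ABC"),
     ("Mango", 24, "No", "XYZ"),
     ("banana", 14, "No", "ABC"),
     ("Orange", 34, "Yes", "ABC")],
    columns=["Name", "Price", "In_Stock", "Supplier"],
)

# One column: banana (index 2) is dropped because its supplier repeats "ABC".
by_supplier = df.drop_duplicates(subset="Supplier")
print(by_supplier.index.tolist())  # [0, 1]

# Two columns: banana survives because the (Name, Supplier) pair is new.
by_name_and_supplier = df.drop_duplicates(subset=["Name", "Supplier"])
print(by_name_and_supplier.index.tolist())  # [0, 1, 2]
```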

Example Codes: Set `keep` Parameter in Pandas `DataFrame.drop_duplicates()` Method

import pandas as pd

# Sample data: the first, third, and fourth rows share the supplier "ABC".
fruit_list = [
    ("Orange", 34, "Yes", "ABC"),
    ("Mango", 24, "No", "XYZ"),
    ("banana", 14, "No", "ABC"),
    ("Orange", 34, "Yes", "ABC"),
]

df = pd.DataFrame(fruit_list, columns=["Name", "Price", "In_Stock", "Supplier"])

print("DataFrame:")
print(df)

# Keep the last row for each Supplier value instead of the first.
df_unique = df.drop_duplicates(subset="Supplier", keep="last")

print("DataFrame with Unique Values of Supplier Column:")
print(df_unique)

Output:

DataFrame:
     Name  Price In_Stock Supplier
0  Orange     34      Yes      ABC
1   Mango     24       No      XYZ
2  banana     14       No      ABC
3  Orange     34      Yes      ABC
DataFrame with Unique Values of Supplier Column:
     Name  Price In_Stock Supplier
1   Mango     24       No      XYZ
3  Orange     34      Yes      ABC

With keep="last", the method again keeps one row for each distinct Supplier value, but it keeps the last occurrence rather than the first.

Here, the first, third, and fourth rows share the same Supplier value, ABC. So the first and third rows are removed from the DataFrame, and the fourth row is kept.
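The third documented option, `keep=False`, keeps none of the duplicated rows. A minimal sketch on the same fruit data follows; only the Mango row remains, because XYZ is the only Supplier value that appears exactly once.

```python
import pandas as pd

df = pd.DataFrame(
    [("Orange", 34, "Yes", "ABC"),
     ("Mango", 24, "No", "XYZ"),
     ("banana", 14, "No", "ABC"),
     ("Orange", 34, "Yes", "ABC")],
    columns=["Name", "Price", "In_Stock", "Supplier"],
)

# keep=False drops every row whose Supplier value occurs more than once.
df_no_repeats = df.drop_duplicates(subset="Supplier", keep=False)
print(df_no_repeats["Name"].tolist())  # ['Mango']
print(df_no_repeats.index.tolist())    # [1]
```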

Example Codes: Set `ignore_index` Parameter in Pandas `DataFrame.drop_duplicates()` Method

import pandas as pd

# Sample data: the first, third, and fourth rows share the supplier "ABC".
fruit_list = [
    ("Orange", 34, "Yes", "ABC"),
    ("Mango", 24, "No", "XYZ"),
    ("banana", 14, "No", "ABC"),
    ("Orange", 34, "Yes", "ABC"),
]

df = pd.DataFrame(fruit_list, columns=["Name", "Price", "In_Stock", "Supplier"])

print("DataFrame:")
print(df)

# Modify df directly and relabel the remaining rows starting from 0.
df.drop_duplicates(subset="Supplier", keep="last", inplace=True, ignore_index=True)

print("DataFrame with Unique Values of Supplier Column:")
print(df)

Output:

DataFrame:
     Name  Price In_Stock Supplier
0  Orange     34      Yes      ABC
1   Mango     24       No      XYZ
2  banana     14       No      ABC
3  Orange     34      Yes      ABC
DataFrame with Unique Values of Supplier Column:
     Name  Price In_Stock Supplier
0   Mango     24       No      XYZ
1  Orange     34      Yes      ABC

Here, as ignore_index is set to True, the index labels from the original DataFrame are discarded, and the remaining rows are renumbered starting from 0.

Because inplace=True is passed, drop_duplicates() modifies the original DataFrame directly and returns None, which is why df itself is printed afterwards.
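As a point of comparison, the same relabeling can be obtained by chaining `reset_index(drop=True)` after a call that keeps the original labels. The sketch below is one way to check that equivalence on the fruit data; the names `relabeled` and `chained` are illustrative, and neither call uses `inplace=True`, so `df` is left unchanged.

```python
import pandas as pd

df = pd.DataFrame(
    [("Orange", 34, "Yes", "ABC"),
     ("Mango", 24, "No", "XYZ"),
     ("banana", 14, "No", "ABC"),
     ("Orange", 34, "Yes", "ABC")],
    columns=["Name", "Price", "In_Stock", "Supplier"],
)

# Relabel the result directly with ignore_index=True.
relabeled = df.drop_duplicates(subset="Supplier", keep="last", ignore_index=True)

# Keep the original labels (1 and 3), then discard them in a second step.
chained = df.drop_duplicates(subset="Supplier", keep="last").reset_index(drop=True)

print(relabeled.index.tolist())   # [0, 1]
print(relabeled.equals(chained))  # True
```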

Author: Suraj Joshi

Suraj Joshi is a backend software engineer at Matrice.ai.
