Pandas DataFrame.drop_duplicates() Function
- Syntax of pandas.DataFrame.drop_duplicates()
- Example Codes: Remove Duplicate Rows Using Pandas DataFrame.drop_duplicates() Method
- Example Codes: Set subset Parameter in Pandas DataFrame.drop_duplicates() Method
- Example Codes: Set keep Parameter in Pandas DataFrame.drop_duplicates() Method
- Example Codes: Set ignore_index Parameter in Pandas DataFrame.drop_duplicates() Method
The Python Pandas DataFrame.drop_duplicates() function removes duplicate rows from the DataFrame. By default, it keeps the first occurrence of each duplicated row.
Syntax of pandas.DataFrame.drop_duplicates():
DataFrame.drop_duplicates(subset: Union[Hashable, Sequence[Hashable], NoneType]=None,
                          keep: Union[str, bool]='first',
                          inplace: bool=False,
                          ignore_index: bool=False)
Parameters
subset | Column label or sequence of labels. Columns to consider when identifying duplicates. By default, all columns are used. |
keep | 'first', 'last', or False. Drop all duplicates except the first occurrence (keep='first'), drop all duplicates except the last occurrence (keep='last'), or drop all duplicates (keep=False). |
inplace | Boolean. If True, modify the caller DataFrame in place instead of returning a new one. |
ignore_index | Boolean. If True, the rows of the result are relabeled 0, 1, ..., n - 1. The default value is False, which keeps the original index labels. |
Return
If inplace is False (the default), a new DataFrame with the duplicate rows removed; if inplace is True, None, and the caller DataFrame is modified in place.
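The return behavior can be checked with a short sketch. The three-row DataFrame and the variable names result and returned are made up for illustration and are not part of the original examples.

```python
import pandas as pd

# Hypothetical three-row frame; the first and third rows are identical.
df = pd.DataFrame({"Name": ["Orange", "Mango", "Orange"],
                   "Price": [34, 24, 34]})

# Default (inplace=False): a new DataFrame is returned and df is unchanged.
result = df.drop_duplicates()
print(len(result), len(df))

# inplace=True: df itself is modified and the call returns None.
returned = df.drop_duplicates(inplace=True)
print(returned, len(df))
```

Because the in-place call returns None, assigning its result back to df would discard the data.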
Example Codes: Remove Duplicate Rows Using Pandas DataFrame.drop_duplicates() Method
import pandas as pd

# The first and fourth tuples are identical in every column.
fruit_list = [('Orange', 34, 'Yes', 'ABC'),
              ('Mango', 24, 'No', 'XYZ'),
              ('banana', 14, 'No', 'BCD'),
              ('Orange', 34, 'Yes', 'ABC')]

df = pd.DataFrame(fruit_list,
                  columns=['Name', 'Price', 'In_Stock', 'Supplier'])
print("DataFrame:")
print(df)

# Compare all columns and keep the first occurrence of each duplicated row.
df_unique = df.drop_duplicates()
print("DataFrame with Unique Rows:")
print(df_unique)
Output:
DataFrame:
     Name  Price In_Stock Supplier
0  Orange     34      Yes      ABC
1   Mango     24       No      XYZ
2  banana     14       No      BCD
3  Orange     34      Yes      ABC
DataFrame with Unique Rows:
     Name  Price In_Stock Supplier
0  Orange     34      Yes      ABC
1   Mango     24       No      XYZ
2  banana     14       No      BCD
In the original DataFrame, the first and fourth rows are identical.
Calling drop_duplicates() with no arguments compares all columns and removes the fourth row, keeping the first occurrence.
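To see which rows would be dropped before removing them, one option is the related DataFrame.duplicated() method, which returns a Boolean Series marking the duplicate rows. This is a minimal sketch using the same fruit data; the names mask and same are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [("Orange", 34, "Yes", "ABC"),
     ("Mango", 24, "No", "XYZ"),
     ("banana", 14, "No", "BCD"),
     ("Orange", 34, "Yes", "ABC")],
    columns=["Name", "Price", "In_Stock", "Supplier"],
)

# True marks a row that repeats an earlier row (keep='first' is the default).
mask = df.duplicated()
print(mask.tolist())  # [False, False, False, True]

# Filtering out the marked rows gives the same result as drop_duplicates().
same = df[~mask].equals(df.drop_duplicates())
print(same)  # True
```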
Example Codes: Set subset Parameter in Pandas DataFrame.drop_duplicates() Method
import pandas as pd

# Three rows share the Supplier value 'ABC'.
fruit_list = [('Orange', 34, 'Yes', 'ABC'),
              ('Mango', 24, 'No', 'XYZ'),
              ('banana', 14, 'No', 'ABC'),
              ('Orange', 34, 'Yes', 'ABC')]

df = pd.DataFrame(fruit_list,
                  columns=['Name', 'Price', 'In_Stock', 'Supplier'])
print("DataFrame:")
print(df)

# Only the Supplier column is compared when identifying duplicates.
df_unique = df.drop_duplicates(subset="Supplier")
print("DataFrame with Unique Values of Supplier Column:")
print(df_unique)
Output:
DataFrame:
     Name  Price In_Stock Supplier
0  Orange     34      Yes      ABC
1   Mango     24       No      XYZ
2  banana     14       No      ABC
3  Orange     34      Yes      ABC
DataFrame with Unique Values of Supplier Column:
     Name  Price In_Stock Supplier
0  Orange     34      Yes      ABC
1   Mango     24       No      XYZ
With subset="Supplier", only the Supplier column is compared when identifying duplicates, so each Supplier value appears once in the result.
Here, the first, third, and fourth rows share the Supplier value ABC. The third and fourth rows are removed from the DataFrame because, by default (keep='first'), the first occurrence is kept.
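The subset parameter also accepts a sequence of labels, in which case rows count as duplicates only when they match in every listed column. A minimal sketch with the same data, choosing In_Stock and Supplier purely for illustration:

```python
import pandas as pd

df = pd.DataFrame(
    [("Orange", 34, "Yes", "ABC"),
     ("Mango", 24, "No", "XYZ"),
     ("banana", 14, "No", "ABC"),
     ("Orange", 34, "Yes", "ABC")],
    columns=["Name", "Price", "In_Stock", "Supplier"],
)

# Rows are duplicates only if both In_Stock and Supplier match.
df_unique = df.drop_duplicates(subset=["In_Stock", "Supplier"])
print(df_unique.index.tolist())  # [0, 1, 2]
```

The banana row survives here because its (In_Stock, Supplier) pair, ('No', 'ABC'), does not match any earlier row, even though its Supplier alone does.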
Example Codes: Set keep Parameter in Pandas DataFrame.drop_duplicates() Method
import pandas as pd

# Three rows share the Supplier value 'ABC'.
fruit_list = [('Orange', 34, 'Yes', 'ABC'),
              ('Mango', 24, 'No', 'XYZ'),
              ('banana', 14, 'No', 'ABC'),
              ('Orange', 34, 'Yes', 'ABC')]

df = pd.DataFrame(fruit_list,
                  columns=['Name', 'Price', 'In_Stock', 'Supplier'])
print("DataFrame:")
print(df)

# keep="last" retains the last occurrence of each Supplier value.
df_unique = df.drop_duplicates(subset="Supplier", keep="last")
print("DataFrame with Unique Values of Supplier Column:")
print(df_unique)
Output:
DataFrame:
     Name  Price In_Stock Supplier
0  Orange     34      Yes      ABC
1   Mango     24       No      XYZ
2  banana     14       No      ABC
3  Orange     34      Yes      ABC
DataFrame with Unique Values of Supplier Column:
     Name  Price In_Stock Supplier
1   Mango     24       No      XYZ
3  Orange     34      Yes      ABC
With keep="last", the method keeps only the last row for each value of the Supplier column and removes the earlier duplicates.
Here, the first, third, and fourth rows share the Supplier value ABC, so the first and third rows are removed from the DataFrame and the fourth row is kept.
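The third option for keep, False, is not covered by the examples in this article. As a minimal sketch with the same data, keep=False drops every row whose Supplier value occurs more than once, so no copy of a duplicated value is kept:

```python
import pandas as pd

df = pd.DataFrame(
    [("Orange", 34, "Yes", "ABC"),
     ("Mango", 24, "No", "XYZ"),
     ("banana", 14, "No", "ABC"),
     ("Orange", 34, "Yes", "ABC")],
    columns=["Name", "Price", "In_Stock", "Supplier"],
)

# keep=False removes all rows that share a Supplier value with another row.
df_unique = df.drop_duplicates(subset="Supplier", keep=False)
print(df_unique["Name"].tolist())  # ['Mango']
```

Only the Mango row remains, because XYZ is the only Supplier value that appears exactly once.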
Example Codes: Set ignore_index Parameter in Pandas DataFrame.drop_duplicates() Method
import pandas as pd

# Three rows share the Supplier value 'ABC'.
fruit_list = [('Orange', 34, 'Yes', 'ABC'),
              ('Mango', 24, 'No', 'XYZ'),
              ('banana', 14, 'No', 'ABC'),
              ('Orange', 34, 'Yes', 'ABC')]

df = pd.DataFrame(fruit_list,
                  columns=['Name', 'Price', 'In_Stock', 'Supplier'])
print("DataFrame:")
print(df)

# Modify df in place and relabel the remaining rows 0, 1, ...
df.drop_duplicates(subset="Supplier", keep="last",
                   inplace=True, ignore_index=True)
print("DataFrame with Unique Values of Supplier Column:")
print(df)
Output:
DataFrame:
     Name  Price In_Stock Supplier
0  Orange     34      Yes      ABC
1   Mango     24       No      XYZ
2  banana     14       No      ABC
3  Orange     34      Yes      ABC
DataFrame with Unique Values of Supplier Column:
     Name  Price In_Stock Supplier
0   Mango     24       No      XYZ
1  Orange     34      Yes      ABC
Here, as ignore_index is set to True, the index labels from the original DataFrame are discarded, and the remaining rows are relabeled 0 and 1.
Because inplace=True is passed, the original DataFrame itself is modified by the drop_duplicates() call, which returns None.
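One way to read ignore_index=True is as shorthand for dropping duplicates and then resetting the index. The sketch below compares the two forms on the same data; it assumes pandas 1.0 or later, where the ignore_index parameter is available, and the names a and b are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    [("Orange", 34, "Yes", "ABC"),
     ("Mango", 24, "No", "XYZ"),
     ("banana", 14, "No", "ABC"),
     ("Orange", 34, "Yes", "ABC")],
    columns=["Name", "Price", "In_Stock", "Supplier"],
)

# Relabel the surviving rows directly with ignore_index=True.
a = df.drop_duplicates(subset="Supplier", keep="last", ignore_index=True)

# Same result in two steps: drop duplicates, then discard the old index.
b = df.drop_duplicates(subset="Supplier", keep="last").reset_index(drop=True)

print(a.index.tolist())  # [0, 1]
print(a.equals(b))       # True
```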
Suraj Joshi is a backend software engineer at Matrice.ai.