Python Pandas Percentile
In Python Pandas, there are some inbuilt functions such as mean()
, mode()
, median()
, and quantile()
function.
Developers consider this function a crucial step in the data cleaning process.
Calculate Pandas Percentile in Python
A luxurious car company wants to develop its branches, for that it is trying to locate the best location thru getting the people’s income.
import pandas as pd
import numpy as np
df = pd.read_excel("C:\\Users\\banga\\Downloads\\Record.xlsx")
df
Output:
# A Dataframe with `Name` and `Monthly Income (in INR)` as columns present in the Record.xlsx file. These columns determine the income of people present in the respective locality.
index Name Monthly Income (in INR)
0 John 30000
1 Joel 35000
2 Albert 45000
3 Ali 50000
4 Jerin 70000
5 Robert 90000
6 Sampath 1000000
We have imported the dataset into a dataframe using pandas. Let us filter the data frame with our requirements.
The company needs the average income of that locality, affecting its growth.
df.describe()
Output:
index Monthly Income (in INR)
count 7.000000
mean 188571.428571
std 358407.522774
min 30000.000000
25% 40000.000000
50% 50000.000000
75% 80000.000000
max 1000000.000000
We can infer that the mean is higher than the average income of the people. We use the percentile
concept to deal with these types of real-time situations.
percentile = df["Monthly Income (in INR)"].quantile(0.99)
percentile
Output:
945399.9999999995
new_df = df[df["Monthly Income (in INR)"] <= percentile]
new_df
Output:
index Name Monthly Income (in INR)
0 John 30000
1 Joel 35000
2 Albert 45000
3 Ali 50000
4 Jerin 70000
5 Robert 90000
We have found one solution, and now we have to fill the null values in a dataframe, mostly average values.
df
Output:
index Name Monthly Income (in INR)
0 John 30000.0
1 Joel 35000.0
2 Albert 45000.0
3 Ali NaN
4 Jerin 70000.0
5 Robert 90000.0
6 Sampath 1000000.0
avg = df["Monthly Income (in INR)"].mean()
df.fillna(avg)
Output:
index Name Monthly Income (in INR)
0 John 30000.000000
1 Joel 35000.000000
2 Albert 45000.000000
3 Ali 211666.666667
4 Jerin 70000.000000
5 Robert 90000.000000
6 Sampath 1000000.000000
We can see that the null value is filled with the average incomes
, but it is not ideal. To overcome this problem, we use another method.
med = df["Monthly Income (in INR)"].median()
df.fillna(med)
Output:
index Name Monthly Income (in INR)
0 John 30000.0
1 Joel 35000.0
2 Albert 45000.0
3 Ali 57500.0
4 Jerin 70000.0
5 Robert 90000.0
6 Sampath 1000000.0
In this way, we can determine the proper value for the company based on its growth.