How to Split Pandas DataFrame
- Split DataFrame Using the Row Indexing
- Split DataFrame Using the groupby() Method
- Split DataFrame Using the sample() Method
This tutorial explains how to split a DataFrame into multiple smaller DataFrames using row indexing, the DataFrame.groupby() method, and the DataFrame.sample() method.
We will use the apprix_df DataFrame below to demonstrate each approach.
import pandas as pd
apprix_df = pd.DataFrame(
    {
        "Name": ["Anish", "Rabindra", "Manish", "Samir", "Binam"],
        "Post": ["CEO", "CTO", "System Admin", "Consultant", "Engineer"],
        "Qualification": ["MBA", "MS", "MCA", "PhD", "BE"],
    }
)
print("Apprix Team DataFrame:")
print(apprix_df, "\n")
Output:
Apprix Team DataFrame:
       Name          Post  Qualification
0     Anish           CEO            MBA
1  Rabindra           CTO             MS
2    Manish  System Admin            MCA
3     Samir    Consultant            PhD
4     Binam      Engineer             BE
Split DataFrame Using the Row Indexing
import pandas as pd
apprix_df = pd.DataFrame(
    {
        "Name": ["Anish", "Rabindra", "Manish", "Samir", "Binam"],
        "Post": ["CEO", "CTO", "System Admin", "Consultant", "Engineer"],
        "Qualification": ["MBA", "MS", "MCA", "PhD", "BE"],
    }
)
print("Apprix Team DataFrame:")
print(apprix_df, "\n")
# Rows at positions 0 and 1, all columns
apprix_1 = apprix_df.iloc[:2, :]
# Rows from position 2 to the end, all columns
apprix_2 = apprix_df.iloc[2:, :]
print("The DataFrames formed by splitting the Apprix Team DataFrame are: ", "\n")
print(apprix_1, "\n")
print(apprix_2, "\n")
Output:
Apprix Team DataFrame:
       Name          Post  Qualification
0     Anish           CEO            MBA
1  Rabindra           CTO             MS
2    Manish  System Admin            MCA
3     Samir    Consultant            PhD
4     Binam      Engineer             BE
The DataFrames formed by splitting the Apprix Team DataFrame are:
       Name  Post  Qualification
0     Anish   CEO            MBA
1  Rabindra   CTO             MS
     Name          Post  Qualification
2  Manish  System Admin            MCA
3   Samir    Consultant            PhD
4   Binam      Engineer             BE
The code splits apprix_df into two parts using row indexing. The first part contains the first two rows of apprix_df, while the second part contains the remaining three rows.
We specify the rows for each part with the iloc indexer, which selects by integer position. The slice [:2, :] selects all rows up to, but not including, position 2, together with all columns. Hence, apprix_df.iloc[:2, :] selects the first two rows of apprix_df, at positions 0 and 1, and apprix_df.iloc[2:, :] selects the rest.
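The same half-open slicing extends to more than two parts. The following is a minimal sketch rather than part of the original example: the helper name split_into_chunks and its chunk_size parameter are hypothetical, and it assumes that consecutive positional chunks of a fixed number of rows are what is wanted.

```python
import pandas as pd

apprix_df = pd.DataFrame(
    {
        "Name": ["Anish", "Rabindra", "Manish", "Samir", "Binam"],
        "Post": ["CEO", "CTO", "System Admin", "Consultant", "Engineer"],
        "Qualification": ["MBA", "MS", "MCA", "PhD", "BE"],
    }
)


def split_into_chunks(df, chunk_size):
    """Hypothetical helper: split df into consecutive positional chunks."""
    # iloc slices are half-open, so consecutive [start:start + chunk_size]
    # ranges cover every row exactly once; the last chunk may be shorter.
    return [
        df.iloc[start:start + chunk_size, :]
        for start in range(0, len(df), chunk_size)
    ]


chunks = split_into_chunks(apprix_df, 2)
for chunk in chunks:
    print(chunk, "\n")
```

With five rows and a chunk size of 2, this should give three DataFrames holding two, two, and one row respectively.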
Split DataFrame Using the groupby() Method
import pandas as pd
apprix_df = pd.DataFrame(
    {
        "Name": ["Anish", "Rabindra", "Manish", "Samir", "Binam"],
        "Post": ["CEO", "CTO", "System Admin", "Consultant", "Engineer"],
        "Qualification": ["MBA", "MS", "MS", "PhD", "MS"],
    }
)
print("Apprix Team DataFrame:")
print(apprix_df, "\n")
# Group the rows by the value in the Qualification column
groups = apprix_df.groupby("Qualification")
# Extract each group as its own DataFrame
ms_df = groups.get_group("MS")
mba_df = groups.get_group("MBA")
phd_df = groups.get_group("PhD")
print("Group with Qualification MS:")
print(ms_df, "\n")
print("Group with Qualification MBA:")
print(mba_df, "\n")
print("Group with Qualification PhD:")
print(phd_df, "\n")
Output:
Apprix Team DataFrame:
       Name          Post  Qualification
0     Anish           CEO            MBA
1  Rabindra           CTO             MS
2    Manish  System Admin             MS
3     Samir    Consultant            PhD
4     Binam      Engineer             MS
Group with Qualification MS:
       Name          Post  Qualification
1  Rabindra           CTO             MS
2    Manish  System Admin             MS
4     Binam      Engineer             MS
Group with Qualification MBA:
    Name  Post  Qualification
0  Anish   CEO            MBA
Group with Qualification PhD:
    Name        Post  Qualification
3  Samir  Consultant            PhD
The code splits apprix_df into three parts based on the value of the Qualification column. Rows with the same Qualification value are placed in the same group.
The groupby() method forms the groups from the values in the Qualification column. We then extract the rows of each group using the get_group() method.
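When the group values are not known in advance, calling get_group() once per value is impractical. As a hedged alternative, the sketch below iterates over the GroupBy object, which yields pairs of group key and sub-DataFrame, and collects them in a dictionary. The variable name parts is an illustrative choice, not something from the original example.

```python
import pandas as pd

apprix_df = pd.DataFrame(
    {
        "Name": ["Anish", "Rabindra", "Manish", "Samir", "Binam"],
        "Post": ["CEO", "CTO", "System Admin", "Consultant", "Engineer"],
        "Qualification": ["MBA", "MS", "MS", "PhD", "MS"],
    }
)

# Iterating over a GroupBy object yields (group key, sub-DataFrame) pairs,
# so a dict comprehension gives one DataFrame per distinct Qualification.
parts = {
    qualification: group
    for qualification, group in apprix_df.groupby("Qualification")
}

for qualification, group in parts.items():
    print(f"Group with Qualification {qualification}:")
    print(group, "\n")
```

Each sub-DataFrame keeps the original row labels, so parts["MS"] holds the same rows as groups.get_group("MS") in the example above.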
Split DataFrame Using the sample() Method
We can form a DataFrame by randomly sampling rows from another DataFrame using the sample() method. The frac parameter sets the fraction of rows to sample from the parent DataFrame.
import pandas as pd
apprix_df = pd.DataFrame(
    {
        "Name": ["Anish", "Rabindra", "Manish", "Samir", "Binam"],
        "Post": ["CEO", "CTO", "System Admin", "Consultant", "Engineer"],
        "Qualification": ["MBA", "MS", "MS", "PhD", "MS"],
    }
)
print("Apprix Team DataFrame:")
print(apprix_df, "\n")
# Sample 40% of the rows; random_state makes the selection repeatable
random_df = apprix_df.sample(frac=0.4, random_state=60)
print("Random split from the Apprix Team DataFrame:")
print(random_df)
Output:
Apprix Team DataFrame:
       Name          Post  Qualification
0     Anish           CEO            MBA
1  Rabindra           CTO             MS
2    Manish  System Admin             MS
3     Samir    Consultant            PhD
4     Binam      Engineer             MS
Random split from the Apprix Team DataFrame:
    Name      Post  Qualification
0  Anish       CEO            MBA
4  Binam  Engineer             MS
The code randomly samples 40% of the rows from apprix_df and displays the DataFrame formed from the sampled rows. Setting random_state ensures that the same rows are sampled on every run.
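On its own, sample() returns only one part of the split. One common way to obtain the complementary part is to drop the sampled row labels from the parent DataFrame. The sketch below assumes the index labels are unique, as they are with the default index used here; the names sampled_df and remaining_df are illustrative.

```python
import pandas as pd

apprix_df = pd.DataFrame(
    {
        "Name": ["Anish", "Rabindra", "Manish", "Samir", "Binam"],
        "Post": ["CEO", "CTO", "System Admin", "Consultant", "Engineer"],
        "Qualification": ["MBA", "MS", "MS", "PhD", "MS"],
    }
)

# First part: a random 40% of the rows (2 of the 5 rows).
sampled_df = apprix_df.sample(frac=0.4, random_state=60)

# Second part: every row whose label was not sampled.
# Assumption: index labels are unique, so drop() removes exactly the sampled rows.
remaining_df = apprix_df.drop(sampled_df.index)

print("Sampled rows:")
print(sampled_df, "\n")
print("Remaining rows:")
print(remaining_df)
```

Together the two DataFrames contain every row of apprix_df exactly once, which is the usual shape of a random train and test style split.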
Suraj Joshi is a backend software engineer at Matrice.ai.