Split a Pandas DataFrame

Suraj Joshi, January 30, 2023
  1. Split a DataFrame Using Row Indices
  2. Split a DataFrame Using the groupby() Method
  3. Split a DataFrame Using the sample() Method

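This tutorial explains how to split a DataFrame into multiple smaller DataFrames using row indices, the DataFrame.groupby() method, and the DataFrame.sample() method.

We will use the apprix_df DataFrame below to explain how to split a DataFrame into multiple smaller DataFrames.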

import pandas as pd

apprix_df = pd.DataFrame(
    {
        "Name": ["Anish", "Rabindra", "Manish", "Samir", "Binam"],
        "Post": ["CEO", "CTO", "System Admin", "Consultant", "Engineer"],
        "Qualification": ["MBA", "MS", "MCA", "PhD", "BE"],
    }
)

print("Apprix Team DataFrame:")
print(apprix_df, "\n")

Output:

Apprix Team DataFrame:
       Name          Post Qualification
0     Anish           CEO           MBA
1  Rabindra           CTO            MS
2    Manish  System Admin           MCA
3     Samir    Consultant           PhD
4     Binam      Engineer            BE

Split a DataFrame Using Row Indices

import pandas as pd

apprix_df = pd.DataFrame(
    {
        "Name": ["Anish", "Rabindra", "Manish", "Samir", "Binam"],
        "Post": ["CEO", "CTO", "System Admin", "Consultant", "Engineer"],
        "Qualification": ["MBA", "MS", "MCA", "PhD", "BE"],
    }
)

print("Apprix Team DataFrame:")
print(apprix_df, "\n")

apprix_1 = apprix_df.iloc[:2, :]
apprix_2 = apprix_df.iloc[2:, :]

print("The DataFrames formed by splitting the Apprix Team DataFrame are: ", "\n")
print(apprix_1, "\n")
print(apprix_2, "\n")

Output:

Apprix Team DataFrame:
       Name          Post Qualification
0     Anish           CEO           MBA
1  Rabindra           CTO            MS
2    Manish  System Admin           MCA
3     Samir    Consultant           PhD
4     Binam      Engineer            BE

The DataFrames formed by splitting the Apprix Team DataFrame are:

       Name Post Qualification
0     Anish  CEO           MBA
1  Rabindra  CTO            MS

     Name          Post Qualification
2  Manish  System Admin           MCA
3   Samir    Consultant           PhD
4   Binam      Engineer            BE

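The code splits the apprix_df DataFrame into two parts by row index. The first part contains the first two rows of apprix_df, and the second part contains the last three rows.

We specify the rows for each part in the iloc indexer. [:2, :] selects the rows before position 2 (the row at position 2 is excluded) and all columns of the DataFrame. Therefore, apprix_df.iloc[:2, :] selects the first two rows of apprix_df, those at positions 0 and 1.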
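The same iloc slicing can be wrapped in a small function so the split position becomes a parameter. The sketch below is only an illustration: split_at is a hypothetical helper name, not part of pandas, and it assumes the split position counts rows from the top of the DataFrame.

```python
import pandas as pd

# Same sample data as the apprix_df used in this section.
apprix_df = pd.DataFrame(
    {
        "Name": ["Anish", "Rabindra", "Manish", "Samir", "Binam"],
        "Post": ["CEO", "CTO", "System Admin", "Consultant", "Engineer"],
        "Qualification": ["MBA", "MS", "MCA", "PhD", "BE"],
    }
)


def split_at(df, position):
    """Hypothetical helper: return (rows before position, rows from position on)."""
    # iloc slices by integer position; the stop value is excluded.
    return df.iloc[:position, :], df.iloc[position:, :]


first_part, second_part = split_at(apprix_df, 2)

print(first_part, "\n")
print(second_part)
```

Because iloc works on positions rather than index labels, this split behaves the same way even if the DataFrame has a non-default index.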

Split a DataFrame Using the groupby() Method

import pandas as pd

apprix_df = pd.DataFrame(
    {
        "Name": ["Anish", "Rabindra", "Manish", "Samir", "Binam"],
        "Post": ["CEO", "CTO", "System Admin", "Consultant", "Engineer"],
        "Qualification": ["MBA", "MS", "MS", "PhD", "MS"],
    }
)

print("Apprix Team DataFrame:")
print(apprix_df, "\n")

groups = apprix_df.groupby(apprix_df.Qualification)
ms_df = groups.get_group("MS")
mba_df = groups.get_group("MBA")
phd_df = groups.get_group("PhD")

print("Group with Qualification MS:")
print(ms_df, "\n")

print("Group with Qualification MBA:")
print(mba_df, "\n")

print("Group with Qualification PhD:")
print(phd_df, "\n")

Output:

Apprix Team DataFrame:
       Name          Post Qualification
0     Anish           CEO           MBA
1  Rabindra           CTO            MS
2    Manish  System Admin            MS
3     Samir    Consultant           PhD
4     Binam      Engineer            MS

Group with Qualification MS:
       Name          Post Qualification
1  Rabindra           CTO            MS
2    Manish  System Admin            MS
4     Binam      Engineer            MS

Group with Qualification MBA:
    Name Post Qualification
0  Anish  CEO           MBA

Group with Qualification PhD:
    Name        Post Qualification
3  Samir  Consultant           PhD

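The code splits apprix_df into three parts based on the values in the Qualification column. Rows with the same Qualification value are placed in the same group.

The groupby() method forms groups based on the values of the Qualification column. We then use the get_group() method to extract the rows that belong to each group.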
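When the group values are not known in advance, calling get_group() once per value is impractical. One possible alternative, sketched here, is to iterate over the GroupBy object and collect every group into a dictionary. The variable name parts is illustrative only.

```python
import pandas as pd

# Same sample data as the groupby() example in this section.
apprix_df = pd.DataFrame(
    {
        "Name": ["Anish", "Rabindra", "Manish", "Samir", "Binam"],
        "Post": ["CEO", "CTO", "System Admin", "Consultant", "Engineer"],
        "Qualification": ["MBA", "MS", "MS", "PhD", "MS"],
    }
)

# Iterating over a GroupBy object yields (group key, sub-DataFrame) pairs,
# so a dict comprehension collects one DataFrame per Qualification value.
parts = {key: group for key, group in apprix_df.groupby("Qualification")}

for qualification, part in parts.items():
    print(f"Group with Qualification {qualification}:")
    print(part, "\n")
```

Each value in parts is a regular DataFrame, so parts["MS"] holds the same rows that groups.get_group("MS") returns in the example above.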

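Split a DataFrame Using the sample() Method

We can form a DataFrame by randomly sampling rows from a DataFrame with the sample() method. The frac parameter sets the fraction of rows to draw from the parent DataFrame.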

import pandas as pd

apprix_df = pd.DataFrame(
    {
        "Name": ["Anish", "Rabindra", "Manish", "Samir", "Binam"],
        "Post": ["CEO", "CTO", "System Admin", "Consultant", "Engineer"],
        "Qualification": ["MBA", "MS", "MS", "PhD", "MS"],
    }
)

print("Apprix Team DataFrame:")
print(apprix_df, "\n")

random_df = apprix_df.sample(frac=0.4, random_state=60)

print("Random split from the Apprix Team DataFrame:")
print(random_df)

Output:

Apprix Team DataFrame:
       Name          Post Qualification
0     Anish           CEO           MBA
1  Rabindra           CTO            MS
2    Manish  System Admin            MS
3     Samir    Consultant           PhD
4     Binam      Engineer            MS

Random split from the Apprix Team DataFrame:
    Name      Post Qualification
0  Anish       CEO           MBA
4  Binam  Engineer            MS

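The code randomly samples 40% of the rows from apprix_df (2 of the 5 rows) and displays the DataFrame formed by the sampled rows. Setting random_state makes the sampling reproducible, so the same rows are drawn each time the code runs.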
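sample() returns only the sampled rows. To get a full two-way split, the remaining rows can be recovered by dropping the sampled index labels from the parent DataFrame. This is a minimal sketch that assumes the index labels are unique (as with the default index here). The name rest_df is illustrative.

```python
import pandas as pd

# Same sample data as the sample() example in this section.
apprix_df = pd.DataFrame(
    {
        "Name": ["Anish", "Rabindra", "Manish", "Samir", "Binam"],
        "Post": ["CEO", "CTO", "System Admin", "Consultant", "Engineer"],
        "Qualification": ["MBA", "MS", "MS", "PhD", "MS"],
    }
)

# Randomly draw 40% of the rows (2 of 5); random_state fixes the draw.
random_df = apprix_df.sample(frac=0.4, random_state=60)

# The complement: every row whose index label was not sampled.
# Assumes index labels are unique, otherwise drop() removes all duplicates too.
rest_df = apprix_df.drop(random_df.index)

print("Sampled rows:")
print(random_df, "\n")
print("Remaining rows:")
print(rest_df)
```

Together, random_df and rest_df contain every row of apprix_df exactly once, which is the usual shape of a random two-way split.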

Author: Suraj Joshi

Suraj Joshi is a backend software engineer at Matrice.ai.
