Pandas Merge on Multiple Columns

Suraj Joshi, January 30, 2023
  1. Default Merge of Pandas DataFrames Without Any Key Column
  2. Set the on Parameter to Specify the Merge Key in Pandas
  3. Merge DataFrames Using left_on and right_on

This tutorial explains how to merge two DataFrames in Pandas using the DataFrame.merge() method.

import pandas as pd

student_df = pd.DataFrame(
    {
        "Roll No": [500, 501, 503, 504, 505, 506],
        "Name": ["Jennifer", "Travis", "Bob", "Emma", "Luna", "Anish"],
        "Gender": ["Female", "Male", "Male", "Female", "Female", "Male"],
        "Age": [17, 18, 17, 16, 18, 16],
    }
)

grades_df = pd.DataFrame(
    {
        "Roll No": [501, 502, 503, 504, 505, 506],
        "Name": ["Jennifer", "Travis", "Bob", "Emma", "Luna", "Anish"],
        "Grades": ["A", "B+", "A-", "A", "B", "A+"],
    }
)

print("1st DataFrame:")
print(student_df, "\n")

print("2nd DataFrame:")
print(grades_df, "\n")


Output:

1st DataFrame:
   Roll No      Name  Gender  Age
0      500  Jennifer  Female   17
1      501    Travis    Male   18
2      503       Bob    Male   17
3      504      Emma  Female   16
4      505      Luna  Female   18
5      506     Anish    Male   16 

2nd DataFrame:
   Roll No      Name Grades
0      501  Jennifer      A
1      502    Travis     B+
2      503       Bob     A-
3      504      Emma      A
4      505      Luna      B
5      506     Anish     A+ 

We will use the DataFrames student_df and grades_df to demonstrate how DataFrame.merge() works.

Default Merge of Pandas DataFrames Without Any Key Column

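If we pass only the two DataFrames to merge(), the method joins on every column the two DataFrames have in common and keeps a single copy of each common column in the result. The default join type is an inner join, so only rows whose values match in all common columns are kept.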

import pandas as pd

student_df = pd.DataFrame(
    {
        "Roll No": [500, 501, 503, 504, 505, 506],
        "Name": ["Jennifer", "Travis", "Bob", "Emma", "Luna", "Anish"],
        "Gender": ["Female", "Male", "Male", "Female", "Female", "Male"],
        "Age": [17, 18, 17, 16, 18, 16],
    }
)

grades_df = pd.DataFrame(
    {
        "Roll No": [501, 502, 503, 504, 505, 506],
        "Name": ["Jennifer", "Travis", "Bob", "Emma", "Luna", "Anish"],
        "Grades": ["A", "B+", "A-", "A", "B", "A+"],
    }
)

merged_df = pd.merge(student_df, grades_df)

print("1st DataFrame:")
print(student_df, "\n")

print("2nd DataFrame:")
print(grades_df, "\n")

print("Merged df:")
print(merged_df)

Output:

1st DataFrame:
   Roll No      Name  Gender  Age
0      500  Jennifer  Female   17
1      501    Travis    Male   18
2      503       Bob    Male   17
3      504      Emma  Female   16
4      505      Luna  Female   18
5      506     Anish    Male   16 

2nd DataFrame:
   Roll No      Name Grades
0      501  Jennifer      A
1      502    Travis     B+
2      503       Bob     A-
3      504      Emma      A
4      505      Luna      B
5      506     Anish     A+ 

Merged df:
   Roll No   Name  Gender  Age Grades
0      503    Bob    Male   17     A-
1      504   Emma  Female   16      A
2      505   Luna  Female   18      B
3      506  Anish    Male   16     A+

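This merges student_df and grades_df and assigns the result to merged_df. The two columns Roll No and Name are common to both DataFrames, so merge() uses both of them as join keys and keeps one copy of each. Only the rows where both Roll No and Name match (503 through 506) appear in the result; roll number 501, for example, is dropped because its Name is Travis in student_df but Jennifer in grades_df.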
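The default behaviour described above can be written out explicitly. The following is a minimal sketch, assuming the same two DataFrames as in the example: it passes both common columns as a list to on, which is the multi-column form of the merge, and then uses how="outer" with indicator=True to show which rows did not find a partner. The variable names default_merge, explicit_merge, and outer_merge are chosen here for illustration only.

```python
import pandas as pd

student_df = pd.DataFrame(
    {
        "Roll No": [500, 501, 503, 504, 505, 506],
        "Name": ["Jennifer", "Travis", "Bob", "Emma", "Luna", "Anish"],
        "Gender": ["Female", "Male", "Male", "Female", "Female", "Male"],
        "Age": [17, 18, 17, 16, 18, 16],
    }
)

grades_df = pd.DataFrame(
    {
        "Roll No": [501, 502, 503, 504, 505, 506],
        "Name": ["Jennifer", "Travis", "Bob", "Emma", "Luna", "Anish"],
        "Grades": ["A", "B+", "A-", "A", "B", "A+"],
    }
)

# Default merge: joins on every column name the two frames share.
default_merge = pd.merge(student_df, grades_df)

# The same merge with the keys and the join type spelled out.
explicit_merge = pd.merge(
    student_df, grades_df, on=["Roll No", "Name"], how="inner"
)
print(default_merge.equals(explicit_merge))

# An outer merge keeps unmatched rows; the _merge column added by
# indicator=True reports whether each row came from one side or both.
outer_merge = pd.merge(
    student_df, grades_df, on=["Roll No", "Name"], how="outer", indicator=True
)
print(outer_merge[["Roll No", "Name", "_merge"]])
```

Listing the keys explicitly is usually easier to read than relying on the default, because the join no longer changes silently when a new shared column is added to either DataFrame.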

Set the on Parameter to Specify the Merge Key in Pandas

import pandas as pd

student_df = pd.DataFrame(
    {
        "Roll No": [500, 501, 503, 504, 505, 506],
        "Name": ["Jennifer", "Travis", "Bob", "Emma", "Luna", "Anish"],
        "Gender": ["Female", "Male", "Male", "Female", "Female", "Male"],
        "Age": [17, 18, 17, 16, 18, 16],
    }
)

grades_df = pd.DataFrame(
    {
        "Roll No": [501, 502, 503, 504, 505, 506],
        "Name": ["Jennifer", "Travis", "Bob", "Emma", "Luna", "Anish"],
        "Grades": ["A", "B+", "A-", "A", "B", "A+"],
    }
)

merged_df = pd.merge(student_df, grades_df, on="Roll No")

print("1st DataFrame:")
print(student_df, "\n")

print("2nd DataFrame:")
print(grades_df, "\n")

print("Merged df:")
print(merged_df)

Output:

1st DataFrame:
   Roll No      Name  Gender  Age
0      500  Jennifer  Female   17
1      501    Travis    Male   18
2      503       Bob    Male   17
3      504      Emma  Female   16
4      505      Luna  Female   18
5      506     Anish    Male   16 

2nd DataFrame:
   Roll No      Name Grades
0      501  Jennifer      A
1      502    Travis     B+
2      503       Bob     A-
3      504      Emma      A
4      505      Luna      B
5      506     Anish     A+ 

Merged df:
   Roll No  Name_x  Gender  Age    Name_y Grades
0      501  Travis    Male   18  Jennifer      A
1      503     Bob    Male   17       Bob     A-
2      504    Emma  Female   16      Emma      A
3      505    Luna  Female   18      Luna      B
4      506   Anish    Male   16     Anish     A+

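Here, we set on="Roll No", so merge() joins on the column named Roll No in both DataFrames, and merged_df contains a single Roll No column. The Name column is also common to both DataFrames, but because Name is not passed in on, the result keeps a separate Name column from each side: Name_x from the left DataFrame and Name_y from the right DataFrame. Roll number 501 now appears in the result as well, because only Roll No has to match.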
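The _x and _y suffixes are the defaults and can be replaced through the suffixes parameter of merge(). The sketch below assumes the same DataFrames as above; the suffix strings "_student" and "_grades" and the variable mismatched are illustrative choices, not part of the original example. It also filters the rows where the two Name columns disagree, which is one way to spot inconsistent records after merging on a single key.

```python
import pandas as pd

student_df = pd.DataFrame(
    {
        "Roll No": [500, 501, 503, 504, 505, 506],
        "Name": ["Jennifer", "Travis", "Bob", "Emma", "Luna", "Anish"],
        "Gender": ["Female", "Male", "Male", "Female", "Female", "Male"],
        "Age": [17, 18, 17, 16, 18, 16],
    }
)

grades_df = pd.DataFrame(
    {
        "Roll No": [501, 502, 503, 504, 505, 506],
        "Name": ["Jennifer", "Travis", "Bob", "Emma", "Luna", "Anish"],
        "Grades": ["A", "B+", "A-", "A", "B", "A+"],
    }
)

# Merge on Roll No only and label the overlapping Name columns
# with descriptive suffixes instead of the default _x and _y.
merged_df = pd.merge(
    student_df, grades_df, on="Roll No", suffixes=("_student", "_grades")
)
print(merged_df.columns.tolist())

# Rows where the two sources disagree about the name.
mismatched = merged_df[merged_df["Name_student"] != merged_df["Name_grades"]]
print(mismatched)
```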

Merge DataFrames Using left_on and right_on

import pandas as pd

student_df = pd.DataFrame(
    {
        "Roll No": [500, 501, 503, 504, 505, 506],
        "Name": ["Jennifer", "Travis", "Bob", "Emma", "Luna", "Anish"],
        "Gender": ["Female", "Male", "Male", "Female", "Female", "Male"],
        "Age": [17, 18, 17, 16, 18, 16],
    }
)

grades_df = pd.DataFrame(
    {"Id": [501, 502, 503, 504, 505, 506], "Grades": ["A", "B+", "A-", "A", "B", "A+"]}
)

merged_df = pd.merge(student_df, grades_df, left_on="Roll No", right_on="Id")

print("1st DataFrame:")
print(student_df, "\n")

print("2nd DataFrame:")
print(grades_df, "\n")

print("Merged df:")
print(merged_df)

Output:

1st DataFrame:
   Roll No      Name  Gender  Age
0      500  Jennifer  Female   17
1      501    Travis    Male   18
2      503       Bob    Male   17
3      504      Emma  Female   16
4      505      Luna  Female   18
5      506     Anish    Male   16 

2nd DataFrame:
    Id Grades
0  501      A
1  502     B+
2  503     A-
3  504      A
4  505      B
5  506     A+ 

Merged df:
   Roll No    Name  Gender  Age   Id Grades
0      501  Travis    Male   18  501      A
1      503     Bob    Male   17  503     A-
2      504    Emma  Female   16  504      A
3      505    Luna  Female   18  505      B
4      506   Anish    Male   16  506     A+

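If the column we want to merge on has different names in the two DataFrames, we can use the left_on and right_on parameters. left_on takes the column name in the left DataFrame, and right_on takes the column name in the right DataFrame. Both key columns, Roll No and Id, are kept in the result.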
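left_on and right_on also accept lists, which allows merging on multiple columns whose names differ between the two DataFrames. The following is a minimal sketch under that assumption: the "Student Name" column in grades_df is hypothetical and was added here only so that there is a second key to pair with Name. After the merge, the right-hand key columns duplicate the left-hand ones, so the sketch drops them.

```python
import pandas as pd

student_df = pd.DataFrame(
    {
        "Roll No": [500, 501, 503, 504, 505, 506],
        "Name": ["Jennifer", "Travis", "Bob", "Emma", "Luna", "Anish"],
        "Gender": ["Female", "Male", "Male", "Female", "Female", "Male"],
        "Age": [17, 18, 17, 16, 18, 16],
    }
)

# "Student Name" is a hypothetical column used for illustration only.
grades_df = pd.DataFrame(
    {
        "Id": [501, 502, 503, 504, 505, 506],
        "Student Name": ["Jennifer", "Travis", "Bob", "Emma", "Luna", "Anish"],
        "Grades": ["A", "B+", "A-", "A", "B", "A+"],
    }
)

# Pair the keys by position: Roll No with Id, Name with Student Name.
merged_df = pd.merge(
    student_df,
    grades_df,
    left_on=["Roll No", "Name"],
    right_on=["Id", "Student Name"],
)

# The right-hand keys repeat the left-hand ones, so remove them.
merged_df = merged_df.drop(columns=["Id", "Student Name"])
print(merged_df)
```

The two lists must have the same length, since each entry in left_on is matched with the entry at the same position in right_on.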
