Pandas Merge on Multiple Columns
Suraj Joshi
January 30, 2023
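This tutorial explains how to merge two DataFrames in Pandas with the DataFrame.merge() method. The examples call the equivalent top-level function pd.merge().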
import pandas as pd

# Student records, keyed by roll number
student_df = pd.DataFrame(
    {
        "Roll No": [500, 501, 503, 504, 505, 506],
        "Name": ["Jennifer", "Travis", "Bob", "Emma", "Luna", "Anish"],
        "Gender": ["Female", "Male", "Male", "Female", "Female", "Male"],
        "Age": [17, 18, 17, 16, 18, 16],
    }
)
# Grades, keyed by roll number; the roll numbers only partly overlap with student_df
grades_df = pd.DataFrame(
    {
        "Roll No": [501, 502, 503, 504, 505, 506],
        "Name": ["Jennifer", "Travis", "Bob", "Emma", "Luna", "Anish"],
        "Grades": ["A", "B+", "A-", "A", "B", "A+"],
    }
)
print("1st DataFrame:")
print(student_df, "\n")
print("2nd DataFrame:")
print(grades_df, "\n")
Output:
1st DataFrame:
Roll No Name Gender Age
0 500 Jennifer Female 17
1 501 Travis Male 18
2 503 Bob Male 17
3 504 Emma Female 16
4 505 Luna Female 18
5 506 Anish Male 16
2nd DataFrame:
Roll No Name Grades
0 501 Jennifer A
1 502 Travis B+
2 503 Bob A-
3 504 Emma A
4 505 Luna B
5 506 Anish A+
We will use the DataFrames student_df and grades_df to demonstrate how DataFrame.merge() works.
Default Merge of Pandas DataFrames Without Specifying Key Columns
If we pass only the two DataFrames to merge(), the method joins on every column that the two DataFrames have in common, and each common column appears only once in the result.
import pandas as pd

student_df = pd.DataFrame(
    {
        "Roll No": [500, 501, 503, 504, 505, 506],
        "Name": ["Jennifer", "Travis", "Bob", "Emma", "Luna", "Anish"],
        "Gender": ["Female", "Male", "Male", "Female", "Female", "Male"],
        "Age": [17, 18, 17, 16, 18, 16],
    }
)
grades_df = pd.DataFrame(
    {
        "Roll No": [501, 502, 503, 504, 505, 506],
        "Name": ["Jennifer", "Travis", "Bob", "Emma", "Luna", "Anish"],
        "Grades": ["A", "B+", "A-", "A", "B", "A+"],
    }
)
# No key given: merge() joins on all columns common to both DataFrames
merged_df = pd.merge(student_df, grades_df)
print("1st DataFrame:")
print(student_df, "\n")
print("2nd DataFrame:")
print(grades_df, "\n")
print("Merged df:")
print(merged_df)
Output:
1st DataFrame:
Roll No Name Gender Age
0 500 Jennifer Female 17
1 501 Travis Male 18
2 503 Bob Male 17
3 504 Emma Female 16
4 505 Luna Female 18
5 506 Anish Male 16
2nd DataFrame:
Roll No Name Grades
0 501 Jennifer A
1 502 Travis B+
2 503 Bob A-
3 504 Emma A
4 505 Luna B
5 506 Anish A+
Merged df:
Roll No Name Gender Age Grades
0 503 Bob Male 17 A-
1 504 Emma Female 16 A
2 505 Luna Female 18 B
3 506 Anish Male 16 A+
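This merges student_df and grades_df and assigns the result to merged_df. The columns Roll No and Name are common to both DataFrames, so merge() uses both of them as join keys and keeps a single copy of each in the result. Because merge() performs an inner join by default, only the rows whose Roll No and Name both match in the two DataFrames (roll numbers 503 to 506) are kept.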
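The default behavior described above can also be written out explicitly by passing a list of column names to on. The following is a minimal sketch using the same sample data; the variable names default_merge and explicit_merge are introduced here only for illustration.

```python
import pandas as pd

student_df = pd.DataFrame(
    {
        "Roll No": [500, 501, 503, 504, 505, 506],
        "Name": ["Jennifer", "Travis", "Bob", "Emma", "Luna", "Anish"],
        "Gender": ["Female", "Male", "Male", "Female", "Female", "Male"],
        "Age": [17, 18, 17, 16, 18, 16],
    }
)
grades_df = pd.DataFrame(
    {
        "Roll No": [501, 502, 503, 504, 505, 506],
        "Name": ["Jennifer", "Travis", "Bob", "Emma", "Luna", "Anish"],
        "Grades": ["A", "B+", "A-", "A", "B", "A+"],
    }
)

# Default merge: joins on every column shared by both DataFrames
default_merge = pd.merge(student_df, grades_df)

# Explicit multi-column merge: the join keys are listed by name
explicit_merge = pd.merge(student_df, grades_df, on=["Roll No", "Name"])

print(explicit_merge)
# Check whether both calls give the same result for this data
print(default_merge.equals(explicit_merge))
```

Listing the keys explicitly is usually easier to read, and it keeps the merge from silently changing if a new shared column is later added to either DataFrame.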
Set the on Parameter in Pandas to Specify the Merge Key
import pandas as pd

student_df = pd.DataFrame(
    {
        "Roll No": [500, 501, 503, 504, 505, 506],
        "Name": ["Jennifer", "Travis", "Bob", "Emma", "Luna", "Anish"],
        "Gender": ["Female", "Male", "Male", "Female", "Female", "Male"],
        "Age": [17, 18, 17, 16, 18, 16],
    }
)
grades_df = pd.DataFrame(
    {
        "Roll No": [501, 502, 503, 504, 505, 506],
        "Name": ["Jennifer", "Travis", "Bob", "Emma", "Luna", "Anish"],
        "Grades": ["A", "B+", "A-", "A", "B", "A+"],
    }
)
# Join on "Roll No" only; the shared "Name" column is not used as a key
merged_df = pd.merge(student_df, grades_df, on="Roll No")
print("1st DataFrame:")
print(student_df, "\n")
print("2nd DataFrame:")
print(grades_df, "\n")
print("Merged df:")
print(merged_df)
Output:
1st DataFrame:
Roll No Name Gender Age
0 500 Jennifer Female 17
1 501 Travis Male 18
2 503 Bob Male 17
3 504 Emma Female 16
4 505 Luna Female 18
5 506 Anish Male 16
2nd DataFrame:
Roll No Name Grades
0 501 Jennifer A
1 502 Travis B+
2 503 Bob A-
3 504 Emma A
4 505 Luna B
5 506 Anish A+
Merged df:
Roll No Name_x Gender Age Name_y Grades
0 501 Travis Male 18 Jennifer A
1 503 Bob Male 17 Bob A-
2 504 Emma Female 16 Emma A
3 505 Luna Female 18 Luna B
4 506 Anish Male 16 Anish A+
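Here we set on="Roll No", so merge() uses only the Roll No column, which must exist in both DataFrames, as the join key, and merged_df contains a single Roll No column. Name is also common to both DataFrames, but because it is not passed to on, the result keeps a separate Name column from each side: Name_x from the left DataFrame and Name_y from the right DataFrame. In row 0 the two columns disagree (Travis and Jennifer) because the sample data assigns roll number 501 to a different name in each DataFrame.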
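The _x and _y suffixes are the defaults of the suffixes parameter of merge(). As a small sketch on the same data, more descriptive suffixes can be supplied; the strings "_student" and "_grades" are arbitrary choices made for this example.

```python
import pandas as pd

student_df = pd.DataFrame(
    {
        "Roll No": [500, 501, 503, 504, 505, 506],
        "Name": ["Jennifer", "Travis", "Bob", "Emma", "Luna", "Anish"],
        "Gender": ["Female", "Male", "Male", "Female", "Female", "Male"],
        "Age": [17, 18, 17, 16, 18, 16],
    }
)
grades_df = pd.DataFrame(
    {
        "Roll No": [501, 502, 503, 504, 505, 506],
        "Name": ["Jennifer", "Travis", "Bob", "Emma", "Luna", "Anish"],
        "Grades": ["A", "B+", "A-", "A", "B", "A+"],
    }
)

# Join on "Roll No" and label the overlapping "Name" columns by their origin
merged_df = pd.merge(
    student_df,
    grades_df,
    on="Roll No",
    suffixes=("_student", "_grades"),  # illustrative suffixes, not required names
)
print(merged_df.columns.tolist())
print(merged_df)
```

With these suffixes the two name columns show which DataFrame each value came from, which makes mismatches such as the one for roll number 501 easier to spot.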
Merge DataFrames Using left_on and right_on
import pandas as pd

student_df = pd.DataFrame(
    {
        "Roll No": [500, 501, 503, 504, 505, 506],
        "Name": ["Jennifer", "Travis", "Bob", "Emma", "Luna", "Anish"],
        "Gender": ["Female", "Male", "Male", "Female", "Female", "Male"],
        "Age": [17, 18, 17, 16, 18, 16],
    }
)
# Here the key column is called "Id" instead of "Roll No"
grades_df = pd.DataFrame(
    {"Id": [501, 502, 503, 504, 505, 506], "Grades": ["A", "B+", "A-", "A", "B", "A+"]}
)
# Match "Roll No" in the left DataFrame against "Id" in the right DataFrame
merged_df = pd.merge(student_df, grades_df, left_on="Roll No", right_on="Id")
print("1st DataFrame:")
print(student_df, "\n")
print("2nd DataFrame:")
print(grades_df, "\n")
print("Merged df:")
print(merged_df)
Output:
1st DataFrame:
Roll No Name Gender Age
0 500 Jennifer Female 17
1 501 Travis Male 18
2 503 Bob Male 17
3 504 Emma Female 16
4 505 Luna Female 18
5 506 Anish Male 16
2nd DataFrame:
Id Grades
0 501 A
1 502 B+
2 503 A-
3 504 A
4 505 B
5 506 A+
Merged df:
Roll No Name Gender Age Id Grades
0 501 Travis Male 18 501 A
1 503 Bob Male 17 503 A-
2 504 Emma Female 16 504 A
3 505 Luna Female 18 505 B
4 506 Anish Male 16 506 A+
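If the column we want to merge on has a different name in each DataFrame, we can use the left_on and right_on parameters. left_on is set to the key column name in the left DataFrame, and right_on is set to the key column name in the right DataFrame. Both key columns are kept in the result, which is why merged_df contains Roll No as well as Id.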
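left_on and right_on also accept lists, which allows merging on several columns whose names differ between the DataFrames. The sketch below assumes a variant of grades_df with the hypothetical column names Id and Student Name, and drops the redundant right-hand key columns after the merge.

```python
import pandas as pd

student_df = pd.DataFrame(
    {
        "Roll No": [500, 501, 503, 504, 505, 506],
        "Name": ["Jennifer", "Travis", "Bob", "Emma", "Luna", "Anish"],
        "Gender": ["Female", "Male", "Male", "Female", "Female", "Male"],
        "Age": [17, 18, 17, 16, 18, 16],
    }
)
# Hypothetical variant of grades_df: both key columns are named differently
grades_df = pd.DataFrame(
    {
        "Id": [501, 502, 503, 504, 505, 506],
        "Student Name": ["Jennifer", "Travis", "Bob", "Emma", "Luna", "Anish"],
        "Grades": ["A", "B+", "A-", "A", "B", "A+"],
    }
)

# Pair the keys positionally: "Roll No" with "Id", "Name" with "Student Name"
merged_df = pd.merge(
    student_df,
    grades_df,
    left_on=["Roll No", "Name"],
    right_on=["Id", "Student Name"],
)

# The right-hand key columns duplicate the left-hand ones, so drop them
merged_df = merged_df.drop(columns=["Id", "Student Name"])
print(merged_df)
```

The two lists must have the same length, since each entry in left_on is matched with the entry at the same position in right_on.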