Pandas Anti-Join
- Pandas Anti-Join and Its Types
- Perform the Left Anti-Join in Pandas
- Perform the Right Anti-Join in Pandas
This tutorial describes Pandas’ anti-joins, briefly describes their types and demonstrates each using example codes.
Pandas Anti-Join and Its Types
By using anti-join, we can return all the rows (also known as records and documents) in one dataset that do not find the matching value in the other dataset. These are used to manipulate datasets as per the project requirements.
There are two types of anti-joins in Pandas that are listed below:
- Left Anti-Join - It returns the records in the left dataset that have no matching records in the right dataset.
- Right Anti-Join - It returns the records in the right dataset that have not matched those in the left dataset.
See the above tabular representation. We are using the left anti-join, which returns the left dataset, excluding the intersection.
Note that it only returns columns from the left dataset and not from the right.
Similarly, using the right anti-join will return the right dataset, excluding the intersection. Like left anti-join, it will also return columns from the right dataset only and not from the left.
Let’s learn how we can use both of these anti-joins in Pandas. Note that you must have enough knowledge of SQL joins to grasp the anti-joins firmly.
Perform the Left Anti-Join in Pandas
-
Import the library.
import pandas as pd
First, we import the
pandas
library to play with data frames. -
Create two data frames.
# first DataFrame df1 = pd.DataFrame( {"sections": ["A", "B", "C", "D", "E"], "points": [19, 23, 20, 15, 31]} ) print("First DataFrame:") print(df1) # second DataFrame df2 = pd.DataFrame( {"sections": ["A", "B", "C", "F", "G"], "points": [19, 23, 20, 24, 30]} ) print("\n\nSecond DataFrame:") print(df2)
We create two data frames containing sample
points
for differentsections
that you can see below.OUTPUT:
First DataFrame: sections points 0 A 19 1 B 23 2 C 20 3 D 15 4 E 31 Second DataFrame: sections points 0 A 19 1 B 23 2 C 20 3 F 24 4 G 30
-
Perform the outer join.
outer_join = df1.merge(df2, how="outer", indicator=True) print(outer_join)
To perform anti-join, we need to go through the outer join, which returns the matched and unmatched values from either or both datasets. We are getting matched and unmatched values from both datasets (you can see this in the following output).
Here, the
merge()
method is used to update the data of two data frames by using a particular method(s) to merge them. We are using a few parameters to control what values to replace and what to keep.df2
- It is another data frame to merge with.how
- It specifies how to merge. It is an optional parameter whose values can beleft
,right
,outer
,inner
, orcross
; by default, it isinner
.indicator
- It can be set toTrue
,False
, or a string type value. If we set it toTrue
, it will add the_merge
column to the output data frame having the information on a source of every row.
We can also give a different name to the
_merge
column by specifying a string argument. This column will contain the categorical type values, for instance,left_only
,right_only
, andboth
, as we have in the following output.Here,
both
means if the merge key of the observation (row) is found in both datasets,left_only
shows that the merge key of the observation is found in the left dataset only, whileright_only
indicates that the observation’s merge key is found in right dataset only.OUTPUT:
sections points _merge 0 A 19 both 1 B 23 both 2 C 20 both 3 D 15 left_only 4 E 31 left_only 5 F 24 right_only 6 G 30 right_only
-
Perform the left anti-join.
lef_anti_join = outer_join[(outer_join._merge == "left_only")].drop("_merge", axis=1) print(lef_anti_join)
First, we use
outer_join[(outer_join._merge=='left_only')]
to retrieve all rows having aleft_only
value in the_merge
column, then chain it with the.drop()
method, which drops the_merge
column from the output data frame.OUTPUT:
sections points 3 D 15 4 E 31
Now see, we get columns from the left data frame (
df1
), excluding the intersection. -
See the complete source code to perform the left anti-join in Pandas.
import pandas as pd # first DataFrame df1 = pd.DataFrame( {"sections": ["A", "B", "C", "D", "E"], "points": [19, 23, 20, 15, 31]} ) print("First DataFrame:") print(df1) # second DataFrame df2 = pd.DataFrame( {"sections": ["A", "B", "C", "F", "G"], "points": [19, 23, 20, 24, 30]} ) print("\n\nSecond DataFrame:") print(df2) # outer join outer_join = df1.merge(df2, how="outer", indicator=True) # left anti join lef_anti_join = outer_join[(outer_join._merge == "left_only")].drop("_merge", axis=1) print("\n\nLeft Anti-join:") print(lef_anti_join)
OUTPUT:
First DataFrame: sections points 0 A 19 1 B 23 2 C 20 3 D 15 4 E 31 Second DataFrame: sections points 0 A 19 1 B 23 2 C 20 3 F 24 4 G 30 Left Anti-join: sections points 3 D 15 4 E 31
Perform the Right Anti-Join in Pandas
We have learned step-by-step how to perform left anti-join in Pandas. The right anti-join can also be done similarly, but here, we will select those rows with a right_only
value in the _merge
column.
Example Code:
import pandas as pd
# first DataFrame
df1 = pd.DataFrame(
{"sections": ["A", "B", "C", "D", "E"], "points": [19, 23, 20, 15, 31]}
)
print("First DataFrame:")
print(df1)
# second DataFrame
df2 = pd.DataFrame(
{"sections": ["A", "B", "C", "F", "G"], "points": [19, 23, 20, 24, 30]}
)
print("\n\nSecond DataFrame:")
print(df2)
# outer join
outer_join = df1.merge(df2, how="outer", indicator=True)
# right anti join
right_anti_join = outer_join[(outer_join._merge == "right_only")].drop("_merge", axis=1)
print("\n\nRight Anti-join:")
print(right_anti_join)
OUTPUT:
First DataFrame:
sections points
0 A 19
1 B 23
2 C 20
3 D 15
4 E 31
Second DataFrame:
sections points
0 A 19
1 B 23
2 C 20
3 F 24
4 G 30
Right Anti-join:
sections points
5 F 24
6 G 30
This time, we get columns from the right data frame (df2
), excluding the intersection.