How to Compare Two CSV Files and Print Differences Using Python
- Method 1: Compare Two CSV Files Using the Most Pythonic Solution
-
Method 2: Compare Two CSV Files Using
csv-diff
- An External Module - Method 3: Compare Two CSV Files Using Pandas DataFrames
This article will discuss various methods of comparing two CSV files. We will include the most “Pythonic” way of performing this operation and an external Python module that can help simplify this task.
Lastly, we will include a method using Pandas DataFrames to identify differences in the CSV files.
We will assume that the two CSV files we need to compare are titled file1.csv
and file2.csv
. You can rename the files as you see fit.
Please also replace the file names appropriately in the code snippets given below.
For example purposes, we have our files setup as follows:
file1.csv
:
1,2,3,4,5,6
4,5,6,7,8,9
1,3,4,5,6,1
file2.csv
:
1,2,3,4,5,6
4,5,6,7,8,9
2,3,1,4,1,5
Method 1: Compare Two CSV Files Using the Most Pythonic Solution
In this method, we read the file’s contents into two lists, iterate over one of the lists and check whether or not each of the lines exists in the second list. Logically, this is a very simple solution.
Python’s underlying efficiencies make this comparison fairly efficient, despite what it looks like.
with open("file1.csv", "r") as file1, open("file2.csv", "r") as file2:
f1_contents = file1.readlines()
f2_contents = file2.readlines()
for line in f1_contents:
if line not in f2_contents:
print(line)
for line in f2_contents:
if line not in f1_contents:
print(line)
The above code snippet will print the differing lines to your terminal.
In our test case, we get the following as output.
1,3,4,5,6,1
2,3,1,4,1,5
Method 2: Compare Two CSV Files Using csv-diff
- An External Module
Firstly, install the module using the following command in your terminal.
python3 -m pip install csv-diff
Once installed, you do not need to write a Python script. You can run this directly in the terminal with the following command.
csv-diff file1.csv file2.csv --key=id
Running this command will display the differences on your terminal.
In our test case, we get the following as output.
1 row added, 1 row removed
1 row added
1: 2
2: 3
3: 1
4: 4
5: 1
6: 5
1 row removed
1: 1
2: 3
3: 4
4: 5
5: 6
6: 1
To use this module as part of a Python script, you can write a script similar to the following.
from csv_diff import load_csv, compare
difference = compare(load_csv(open("file1.csv")), load_csv(open("file2.csv")))
print(difference)
The output for this will be the following.
{'added': [{'1': '2', '2': '3', '3': '1', '4': '4', '5': '1', '6': '5'}], 'removed': [{'1': '1', '2': '3', '3': '4', '4': '5', '5': '6', '6': '1'}], 'changed': [], 'columns_added': [], 'columns_removed': []}
Method 3: Compare Two CSV Files Using Pandas DataFrames
The following script can perform this task for you.
import pandas as pd
import sys
import csv
def dataframe_difference(df1: pd.DataFrame, df2: pd.DataFrame, which=None):
comparison_df = df1.merge(df2, indicator=True, how="outer")
if which is None:
diff_df = comparison_df[comparison_df["_merge"] != "both"]
else:
diff_df = comparison_df[comparison_df["_merge"] == which]
return diff_df
if __name__ == "__main__":
df1 = pd.read_csv("file1.csv", header=None)
df2 = pd.read_csv("file2.csv", header=None)
print(dataframe_difference(df1, df2))
Please note that in the read_csv
method, the argument header=None
is entered because our test file does not have any header. If your file has a header, you can read it using: pd.read_csv("file1.csv")
, where file1.csv
will be replaced by your file instead.
In case your file is not present in the same directory as your script, please provide the full path to your CSV files.
The above Python scripts should generate an output like:
0 1 2 3 4 5 _merge
2 1 3 4 5 6 1 left_only
3 2 3 1 4 1 5 right_only
The lines next to left_only
and right_only
contain all the differences. The line next to _merge
only represents indices.