How to Read GZ File in Pandas
If you use Python for data analysis and processing, you may want to read a gz file as a Pandas data frame. This tutorial shows one possible way to read a gz file as a data frame using the Python library pandas.
Read GZ File in Pandas
gz is the file extension for files compressed with GNU zip (gzip), which uses the DEFLATE compression algorithm. It is a widely used compression format on Linux and Unix operating systems; for instance, if you have a file to send by email, you can use the gz file format to compress it into a smaller file.
Large data files are often stored compressed. To use the data, the content has to be decompressed and read into an organized structure.
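To make the idea of gzip compression concrete, here is a minimal sketch that uses Python's standard gzip module to compress some bytes and restore them. The sample text is made up for illustration and only stands in for a larger data file.

```python
import gzip

# Hypothetical sample text standing in for a larger data file.
raw = ("R&D Spend,Administration,State\n" + "1000.0,2000.0,New York\n" * 200).encode("utf-8")

compressed = gzip.compress(raw)         # bytes in gzip format
restored = gzip.decompress(compressed)  # back to the original bytes

# Repetitive data compresses well, so the second number is much smaller.
print(len(raw), len(compressed))
print(restored == raw)
```

Compression is lossless: the restored bytes are identical to the input, which is why a compressed CSV can be read back into exactly the same rows.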
The Python library pandas provides a data type called a data frame. It is built on NumPy and is an integral part of the Python data ecosystem, and data frames are often faster, easier to use, and more powerful than plain tables and spreadsheets.
A data frame is a data structure used to represent two-dimensional, resizable, potentially heterogeneous tabular data. It contains labeled axes (rows and columns).
Arithmetic operations align on both row and column labels. A data frame can be thought of as a dict-like container for Series objects, similar to a spreadsheet or an SQL table.
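The properties described above can be illustrated with a small sketch. All names and values here (df, spend, state, other) are hypothetical and chosen only for the example.

```python
import pandas as pd

# Hypothetical values, made up for illustration.
df = pd.DataFrame(
    {"spend": [100.0, 200.0, 300.0], "state": ["New York", "California", "Florida"]},
    index=["a", "b", "c"],
)

# Columns can hold different types (heterogeneous data).
print(df.dtypes)

# Arithmetic aligns on labels, not positions: label "a" has no partner
# in `other`, so the result for "a" is NaN.
other = pd.Series([1.0, 2.0], index=["b", "c"])
total = df["spend"] + other
print(total)

# The frame is resizable: adding a column works like a dict assignment.
df["double_spend"] = df["spend"] * 2
print(df)
```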
So, if we want to read a gz file as a Pandas data frame using Python, note that a .gz file holds compressed bytes rather than plain text. Its contents must be decompressed and arranged into an organized format before we can work with them in Python.
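One way to picture the two jobs involved, decompressing and then parsing, is to do them by hand. The sketch below is only an illustration: it writes a tiny made-up CSV to a temporary .gz file (the file name and contents are assumptions), decompresses it with the standard gzip module, and hands the resulting text to pandas.

```python
import gzip
import os
import tempfile

import pandas as pd

# Write a tiny, made-up CSV into a temporary .gz file so the sketch runs anywhere.
gz_path = os.path.join(tempfile.mkdtemp(), "sample.csv.gz")
with gzip.open(gz_path, "wt", encoding="utf-8", newline="") as f:
    f.write("name,value\nalpha,1\nbeta,2\n")

# Step 1: gzip.open decompresses the bytes and exposes them as text.
# Step 2: read_csv parses that text into rows and columns.
with gzip.open(gz_path, "rt", encoding="utf-8") as f:
    manual_df = pd.read_csv(f)

print(manual_df)
```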
So, how do we read the .gz file? We follow the steps given below.
- State the absolute path of the gz file and subsequent attributes for file reading.
- Use the read_csv() method from the pandas module and pass the parameters.
- Use a pandas DataFrame to view and manipulate the data of the gz file.
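The three steps above can be sketched end to end on a small stand-in file. The file name steps_demo.csv.gz and its contents are hypothetical; the sketch creates the file itself so that it runs without any existing data.

```python
import gzip
import os
import tempfile

import pandas as pd

# Create a small, made-up gzip-compressed CSV (a stand-in for a real file).
gz_path = os.path.join(tempfile.mkdtemp(), "steps_demo.csv.gz")
with gzip.open(gz_path, "wt", encoding="utf-8", newline="") as f:
    f.write("city,sales\nParis,10\nRome,20\nOslo,30\n")

# Step 1: the absolute path of the gz file.
abs_path = os.path.abspath(gz_path)

# Step 2: read_csv decompresses and parses the file in one call.
steps_df = pd.read_csv(abs_path, compression="gzip")

# Step 3: view and manipulate the data like any other data frame.
big_sales = steps_df[steps_df["sales"] > 15]
print(big_sales)
```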
Use Pandas Data Frame to Read gz File
Suppose we want to read the gz compressed version of a CSV file named 50_Startups.csv.
path_gzip_file = "F:/50_Startups.csv.gz"
Let’s run the following code to do it.
Example Code (saved in demo.py):
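import pandas as pd

# Absolute path of the gzip-compressed CSV file
path_gzip_file = "F:/50_Startups.csv.gz"

# read_csv decompresses the file and parses it into a data frame
gzip_file_data_frame = pd.read_csv(
    path_gzip_file, compression="gzip", header=0, sep=",", quotechar='"'
)
# Show the first five rows
print(gzip_file_data_frame.head(5))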
First, we import the pandas module and alias it as pd to work with data frames and to read files. Next, we specify the absolute path of our gz file.
After that, we call the pd.read_csv() function of the pandas module and pass it parameters. pd.read_csv() accepts many parameters and returns a pandas data frame.
We pass five parameters that are listed below.
- The first one is a string holding the path to the file.
- The second is the string compression, the compression type (in this case, gzip).
- The third parameter is header, an integer giving the row number to use for the column names. It is explicitly passed as header=0 to allow for the replacement of existing names. The header can also be a list of integers specifying row positions for a multi-index on the columns, for example [0,1,3].
- The fourth is the string sep, the delimiter (in this case, a comma).
- The fifth one is quotechar, an optional string of length 1. It is the character used to mark the beginning and end of a quoted item. Quoted items can contain the delimiter, which is then ignored.
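To see what these parameters do, here is a minimal sketch on a made-up file whose fields contain commas inside quotes. The file name, column names, and the replacement names full_name and town are all hypothetical. The compression argument is left at its default, "infer", which detects gzip from the .gz suffix.

```python
import gzip
import os
import tempfile

import pandas as pd

# A made-up gzip-compressed CSV whose name fields contain the delimiter.
gz_path = os.path.join(tempfile.mkdtemp(), "quoted.csv.gz")
with gzip.open(gz_path, "wt", encoding="utf-8", newline="") as f:
    f.write('name,city\n"Doe, Jane",Paris\n"Roe, Rick",Rome\n')

# quotechar='"' keeps "Doe, Jane" as one field even though it contains a comma.
# compression defaults to "infer", so the ".gz" suffix is enough here.
quoted_df = pd.read_csv(gz_path, header=0, sep=",", quotechar='"')
print(quoted_df)

# header=0 combined with names= replaces the column names found in the file.
renamed_df = pd.read_csv(gz_path, header=0, names=["full_name", "town"])
print(renamed_df)
```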
Finally, we call the head() method on the data frame. It takes one parameter, n, and returns the first n rows of data, which we then print.
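As a quick illustration of head(), the sketch below uses a hypothetical ten-row data frame rather than the startup data.

```python
import pandas as pd

# Hypothetical ten-row frame with values 0..9.
numbers_df = pd.DataFrame({"n": range(10)})

first_default = numbers_df.head()       # n defaults to 5
first_three = numbers_df.head(3)        # the first three rows
all_but_last_two = numbers_df.head(-2)  # a negative n drops the last |n| rows

print(first_three)
```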
Now, we run the above code as follows:
PS F:\> & C:/Python310/python.exe f:/demo.py
Our 50_Startups.csv.gz file is read successfully. See the first 5 rows from the file content below.
   R&D Spend  Administration  Marketing Spend       State     Profit
0  165349.20       136897.80        471784.10    New York  192261.83
1  162597.70       151377.59        443898.53  California  191792.06
2  153441.51       101145.55        407934.54     Florida  191050.39
3  144372.41       118671.85        383199.62    New York  182901.99
4  142107.34        91391.77        366168.42     Florida  166187.94