How to Read XLSX Files Using Python and Pandas
Pandas is a very robust and optimized library that helps handle data efficiently, perform operations over that data with ease, and analyze it with all sorts of plots and graphs. Common operations such as merging, concatenation, and reshaping can be performed very easily using pandas.
Pandas also supports reading files and storing the data inside the files into various objects such as arrays and DataFrames
. One such file is .xlsx
. .xlsx
files are Microsoft Excel Open XML Format Spreadsheet files that are compressed and XML based. This article will talk about how to read .xlsx
files using pandas.
Reading .xlsx
Files Using pandas
in Python
To read .xlsx
files using pandas
, we can use the read_excel()
function. This function reads an excel file into a pandas Dataframe
. And, we can use this function to read xlsx
, xls
, xlsm
, xlsb
, odf
, ods
, and odt
files. Since excel files can contain multiple sheets, this function can read a single and multiple sheets.
The read_excel()
function has a couple of parameters. We will discuss some main ones.
io
: This parameter can bestr
,bytes
,ExcelFile
,xlrd.Book
,path object
, orfile-like object
. It is essentially the file or the object that has to be read. One can also pass a valid file path in the form of a string to this parameter.sheet_name
: This parameter can bestr
,int
,list
, orNone
. The default value is0
. Since excel files are made up of sheets, we can specify a single sheet or a bunch of sheets to read. We can not only use integer indexes (2
,0
,[1, 3, 7]
) but also string names to specify the sheets (Sheet1
,Sheet3
,["Sheet4", "Sheet5", "Sheet8"]
).header
: The row that should be considered as the header for the excel file. The default value is0
because the first row is usually the header and considers zero-based indexing. If there is no header, set it toNone
.names
: We can specify what columns we wish to read from the excel file. It accepts a list of column names. The default value isNone
. Since column names are found inside the header, the file should have one. Otherwise, just setheader = None
.
Now let us see how we can use this function to read a .xlxs
file. Refer to the following code.
file_name = "my_file.xlsx" # File name
sheet_name = 3 # 4th sheet
header = 1 # The header is the 2nd row
df = pd.read_excel(file_name=file_name, sheet_name=sheet_name, header=header)
print(df.head()) # Prints first 5 rows from the top along with the header
print(df.tail()) # Prints first 5 rows from the bottom along with the header