How to Read an Excel File Using Python
-
Read an Excel File Using
pandas
Package in Python -
Read an Excel File Using the
xlrd
Package in Python - Examples of Tasks Performed Over Excel Files in Python
Python programming language is well known for its use in the field of data science. Data science generally involves dealing with data and analyzing it with the help of graphs and plots such as line plots, violin plots, histograms, and heat maps, and mathematical computations such as mean, median, mode, probability, variance, etc. What makes Python even more suitable is the fact that it makes file reading and manipulation very seamless. Since data is generally represented in popular file formats such as xls
, xlsx
, csv
, txt
, etc., handling them with Python is a piece of cake.
This article will introduce how to read excel files using Python with the help of some examples. For instance, we will consider a sample excel file that you can download from here so that we all are on the same page. Just rename it to sample.xls
for the following code snippets to work, or change the file name in the following code snippets itself.
Read an Excel File Using pandas
Package in Python
In Python, we can use the pandas
library to read an excel file. The pandas
module is a robust, powerful, fast, and flexible open-source data analysis and manipulation library written in Python. If you don’t have it installed on your machine or virtual environment, use the following command.
- To install
pandas
:pip install pandas
orpip3 install pandas
Refer to the following code for reading an excel file using the pandas
module.
import xlrd
import pandas
df = pandas.read_excel("sample.xls")
print("Columns")
print(df.columns)
Output:
Columns
Index(['Segment', 'Country', 'Product', 'Discount Band', 'Units Sold',
'Manufacturing Price', 'Sale Price', 'Gross Sales', 'Discounts',
' Sales', 'COGS', 'Profit', 'Date', 'Month Number', 'Month Name',
'Year'],
dtype='object')
Read an Excel File Using the xlrd
Package in Python
In Python, we can use the xlrd
package to read excel files. The xlrd
module is a Python package used for reading and formatting excel files. In case you don’t have it installed on your machine or virtual environment, use the following command.
- To install
xlrd
, use the following command.
pip install xlrd
Or,
pip3 install xlrd
Refer to the following code for reading excel files using xlrd
.
from xlrd import open_workbook
wb = open_workbook("sample.xls")
sheet = wb.sheet_by_index(0)
sheet.cell_value(0, 0)
columns = []
print("Columns")
for i in range(sheet.ncols):
columns.append(sheet.cell_value(0, i))
print(columns)
Output:
Columns
['Segment', 'Country', 'Product', 'Discount Band', 'Units Sold', 'Manufacturing Price', 'Sale Price', 'Gross Sales', 'Discounts', ' Sales', 'COGS', 'Profit', 'Date', 'Month Number', 'Month Name', 'Year']
Here’s a brief explanation of what the above code does. It first creates a file descriptor for the excel file with the help of the open_workbook()
function. Then it resets the file pointer to the (0,0)
position or the top-left cell. Next, it iterates over the first row and stores all the column names in a variable. Generally, columns name are present in the first row; that is why the code considers that location. In case the column names are on some different row, one can change the 0
value in the statement sheet.cell_value(0, i)
to whatever row number they wish to. Essentially, (0, i)
represents y
and x
coordinates, where y
is 0
, and x
is i
, considering origin (0, 0)
to be present at the top-left corner of the file.
Examples of Tasks Performed Over Excel Files in Python
Let’s look at some simple tasks that we can perform over excel files to understand these two libraries better.
Printing the First 3 Rows of an Excel File
Using the pandas
package
import pandas
df = pandas.read_excel("sample.xls")
count = 3
for index, row in df.iterrows():
print(row, end="\n\n")
if index == count - 1:
break
Output:
Segment Government
Country Canada
Product Carretera
Discount Band None
Units Sold 1618.5
Manufacturing Price 3
Sale Price 20
Gross Sales 32370.0
Discounts 0.0
Sales 32370.0
COGS 16185.0
Profit 16185.0
Date 2014-01-01 00:00:00
Month Number 1
Month Name January
Year 2014
Name: 0, dtype: object
Segment Government
Country Germany
Product Carretera
Discount Band None
Units Sold 1321.0
Manufacturing Price 3
Sale Price 20
Gross Sales 26420.0
Discounts 0.0
Sales 26420.0
COGS 13210.0
Profit 13210.0
Date 2014-01-01 00:00:00
Month Number 1
Month Name January
Year 2014
Name: 1, dtype: object
Segment Midmarket
Country France
Product Carretera
Discount Band None
Units Sold 2178.0
Manufacturing Price 3
Sale Price 15
Gross Sales 32670.0
Discounts 0.0
Sales 32670.0
COGS 21780.0
Profit 10890.0
Date 2014-06-01 00:00:00
Month Number 6
Month Name June
Year 2014
Name: 2, dtype: object
Using the xlrd
package
from xlrd import open_workbook
wb = open_workbook("sample.xls")
sheet = wb.sheet_by_index(0)
sheet.cell_value(0, 0)
count = 3
for i in range(1, count + 1):
for j in range(sheet.ncols):
print(sheet.cell_value(i, j), end=", ")
print()
Output:
Government, Canada, Carretera, None, 1618.5, 3.0, 20.0, 32370.0, 0.0, 32370.0, 16185.0, 16185.0, 41640.0, 1.0, January, 2014,
Government, Germany, Carretera, None, 1321.0, 3.0, 20.0, 26420.0, 0.0, 26420.0, 13210.0, 13210.0, 41640.0, 1.0, January, 2014,
Midmarket, France, Carretera, None, 2178.0, 3.0, 15.0, 32670.0, 0.0, 32670.0, 21780.0, 10890.0, 41791.0, 6.0, June, 2014,
Printing Values of a Specific Column
Using the pandas
package
import pandas
df = pandas.read_excel("sample.xls")
column = df.columns[4]
print(column)
print("-" * len(column))
for index, row in df.iterrows():
print(row[column])
Output:
Units Sold
----------
1618.5
1321.0
2178.0
888.0
2470.0
1513.0
921.0
2518.0
1899.0
1545.0
2470.0
2665.5
958.0
2146.0
345.0
615.0
292.0
974.0
2518.0
1006.0
367.0
883.0
549.0
788.0
2472.0
1143.0
1725.0
912.0
2152.0
1817.0
1513.0
1493.0
1804.0
2161.0
1006.0
1545.0
2821.0
345.0
2001.0
2838.0
2178.0
888.0
...
Using the xlrd
package
from xlrd import open_workbook
wb = open_workbook("sample.xls")
sheet = wb.sheet_by_index(0)
sheet.cell_value(0, 0)
column_index = 4
column = sheet.cell_value(0, column_index)
print(column)
print("-" * len(column))
for row in range(1, sheet.nrows):
print(sheet.cell_value(row, column_index))
Output:
Units Sold
----------
1618.5
1321.0
2178.0
888.0
2470.0
1513.0
921.0
2518.0
1899.0
1545.0
2470.0
2665.5
958.0
2146.0
345.0
615.0
292.0
974.0
2518.0
1006.0
367.0
883.0
549.0
788.0
2472.0
1143.0
1725.0
912.0
2152.0
1817.0
1513.0
1493.0
1804.0
2161.0
1006.0
1545.0
2821.0
345.0
2001.0
2838.0
2178.0
888.0
...