How to Extract Table From Webpage in Python
- Python Extract Table From Webpage
- Python Extract Table From Webpage Using Pandas
- Python Extract Table From Webpage Using lxml
The main aim of this article is to demonstrate how tables can be extracted from a webpage using Pandas and lxml in Python.
Python Extract Table From Webpage
Data holds significant importance in the modern era, where vast amounts of information are processed, stored, and extracted every day. A project may therefore require us to extract data from a certain online location, repository, or webpage.
This comes up in numerous use cases: perhaps we need to access a public record-keeping site to extract, process, and store data, or pull tables out of our own repository. There may be better ways to obtain such data, but for the sake of simplicity, let's assume that reading the webpage directly is the required approach.
To tackle this situation, we must devise a solution that establishes a connection to the webpage, reads through the page, finds any tables present, extracts them correctly, and stores them in a format suitable for further processing in our program.
This problem can be approached in multiple ways, two of which are described below.
Python Extract Table From Webpage Using Pandas
Before proceeding further, make sure you have the following modules/packages installed:
- lxml
- html5lib
- BeautifulSoup4
You can install the mentioned packages with pip using the following command:
pip install lxml html5lib BeautifulSoup4
This gives the following output:
Collecting lxml
Downloading lxml-4.9.1-cp310-cp310-win_amd64.whl (3.6 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 3.6/3.6 MB 37.9 kB/s eta 0:00:00
Requirement already satisfied: html5lib in c:\program files\python310\lib\site-packages (1.1)
Collecting BeautifulSoup4
Downloading beautifulsoup4-4.11.1-py3-none-any.whl (128 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 128.2/128.2 kB 29.0 kB/s eta 0:00:00
Requirement already satisfied: webencodings in c:\program files\python310\lib\site-packages (from html5lib) (0.5.1)
Requirement already satisfied: six>=1.9 in c:\program files\python310\lib\site-packages (from html5lib) (1.16.0)
Collecting soupsieve>1.2
Downloading soupsieve-2.3.2.post1-py3-none-any.whl (37 kB)
Collecting requests
Downloading requests-2.28.1-py3-none-any.whl (62 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 62.8/62.8 kB 420.8 kB/s eta 0:00:00
Collecting urllib3<1.27,>=1.21.1
Downloading urllib3-1.26.12-py2.py3-none-any.whl (140 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 140.4/140.4 kB 1.0 MB/s eta 0:00:00
Collecting idna<4,>=2.5
Downloading idna-3.4-py3-none-any.whl (61 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 61.5/61.5 kB 121.6 kB/s eta 0:00:00
Collecting charset-normalizer<3,>=2
Downloading charset_normalizer-2.1.1-py3-none-any.whl (39 kB)
Collecting certifi>=2017.4.17
Downloading certifi-2022.9.24-py3-none-any.whl (161 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 161.1/161.1 kB 1.4 MB/s eta 0:00:00
Installing collected packages: urllib3, idna, charset-normalizer, certifi, requests, soupsieve, lxml, BeautifulSoup4
Successfully installed BeautifulSoup4-4.11.1 lxml-4.9.1 soupsieve-2.3.2.post1 certifi-2022.9.24 charset-normalizer-2.1.1 idna-3.4 requests-2.28.1 urllib3-1.26.12
After the necessary packages have been installed, it's time to move on to the implementation. Note that the code below also uses the requests and pandas packages; if they are not already present, install them with pip in the same way.
Consider the following code:
import requests
import pandas as pd

url = "https://www.ffiec.gov/census/report.aspx?year=2020&county=009&state=09&report=demographic"
# Fetch the raw HTML of the page.
html = requests.get(url).content
# Parse every table on the page into a list of DataFrames.
df_list = pd.read_html(html)
# The demographic report is the last table on the page.
df = df_list[-1]
print(df)
This gives the following output:
Tract Code Tract Income Level Distressed or Under-served Tract ... Minority Population Owner Occupied Units 1- to 4- Family Units
0 1201.0 Middle No ... 1590 1862 2248
1 1202.0 Moderate No ... 3040 918 2010
2 1251.0 Middle No ... 551 1400 1555
3 1252.0 Moderate No ... 2088 1139 1992
4 1253.0 Moderate No ... 2443 728 1814
.. ... ... ... ... ... ... ...
95 1714.0 Moderate No ... 1278 141 638
96 1715.0 Moderate No ... 2241 396 1274
97 1716.0 Middle No ... 1466 1378 1803
98 1717.0 Middle No ... 820 1456 1647
99 1751.0 Middle No ... 851 669 1240
[100 rows x 12 columns]
As demonstrated, it is easy to extract tables from a webpage using the Pandas library. The read_html method reads all the tables it finds in the given HTML and returns them as a list of DataFrame objects, which makes further processing smooth. In this example, the demographic report is the last table on the page, which is why df_list[-1] is used.
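If a page contains several tables, a specific one can also be targeted directly instead of indexing into the returned list. As a minimal sketch (assuming the table keeps the id Report1_dgReportDemographic, the same id used in the lxml section below), read_html accepts an attrs argument that filters tables by their HTML attributes:
import requests
import pandas as pd

url = "https://www.ffiec.gov/census/report.aspx?year=2020&county=009&state=09&report=demographic"
html = requests.get(url).content
# Keep only tables whose id attribute matches; read_html raises a
# ValueError if no matching table is found.
tables = pd.read_html(html, attrs={"id": "Report1_dgReportDemographic"})
df = tables[0]
print(df.head())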
The extracted table can also be saved to a CSV file using the to_csv method, which writes the DataFrame object to a CSV file.
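For example, continuing from the code above (the file name demographics.csv is only illustrative):
# Write the DataFrame to disk, omitting the numeric row index.
df.to_csv("demographics.csv", index=False)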
Python Extract Table From Webpage Using lxml
Consider the following code:
from lxml import etree
import urllib.request

site = "https://www.ffiec.gov/census/report.aspx?year=2020&county=009&state=09&report=demographic"

# Custom headers; without them the server rejects the request with 403: Forbidden.
hdr = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Charset": "ISO-8859-1,utf-8;q=0.7,*;q=0.3",
    "Accept-Encoding": "none",
    "Accept-Language": "en-US,en;q=0.8",
    "Connection": "keep-alive",
}

# Fetch the page and parse the HTML into an element tree.
request = urllib.request.Request(site, headers=hdr)
web = urllib.request.urlopen(request)
s = web.read()
html = etree.HTML(s)
# Get all 'tr' nodes of the report table
tr_nodes = html.xpath('//table[@id="Report1_dgReportDemographic"]/tr')
# The 'th' header cells are inside the first 'tr'
headers = [i.text for i in tr_nodes[0].xpath("th")]
# Get the text of the 'td' cells from all remaining 'tr' rows
td_content = [[td.text for td in tr.xpath("td")] for tr in tr_nodes[1:]]
# Print the header row, then each data row
for head in headers:
    print(head, end=" ")
print("")
for content in td_content:
    print(content)
This gives the output below:
Tract Code Tract Income Level Distressed or Under Tract Median Family Income % 2020 FFIEC Est. MSA/MD non-MSA/MD Median Family Income 2020 Est. Tract Median Family Income 2015 Tract Median Family Income Tract Population Tract Minority % Minority Population Owner Occupied Units 1- to 4- Family Units
['1201.00', 'Middle', 'No', '93.64', '$91,800', '$85,962', '$75,611', '6013', '26.44', '1590', '1862', '2248']
['1202.00', 'Moderate', 'No', '68.12', '$91,800', '$62,534', '$55,000', '6783', '44.82', '3040', '918', '2010']
['1251.00', 'Middle', 'No', '109.80', '$91,800', '$100,796', '$88,654', '4477', '12.31', '551', '1400', '1555']
['1252.00', 'Moderate', 'No', '62.55', '$91,800', '$57,421', '$50,506', '5912', '35.32', '2088', '1139', '1992']
['1253.00', 'Moderate', 'No', '57.28', '$91,800', '$52,583', '$46,250', '5164', '47.31', '2443', '728', '1814']
...
It is also possible to use lxml and urllib to extract tables from a webpage and process them. As the code shows, we must provide a custom User-Agent header with the request; otherwise, the site responds with a 403: Forbidden error.
After the request succeeds, we search for the table (using an XPath query specific to this site) and then manually extract the headers and content rows from it.
Although this approach is lengthier and more complex than the Pandas one, it is useful when more control is required over exactly what is extracted and how.
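Once the headers and rows have been extracted manually, they can be fed into any downstream tool. As a minimal sketch (reusing the headers and td_content lists built above; report.csv is only an illustrative file name), the table can be written out with Python's standard csv module:
import csv

# Write the manually extracted header row and data rows to a CSV file.
with open("report.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(headers)
    writer.writerows(td_content)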