How to Solve Urllib HTTP Error 403 Forbidden Message in Python
- The urllib Module in Python
- Check robots.txt to Prevent urllib HTTP Error 403 Forbidden Message
- Adding Cookie to the Request Headers to Solve urllib HTTP Error 403 Forbidden Message
- Use Session Object to Solve urllib HTTP Error 403 Forbidden Message
Today’s article explains how to deal with the error message (exception) urllib.error.HTTPError: HTTP Error 403: Forbidden, which the error module raises on behalf of the request module when it encounters a forbidden resource.
The urllib Module in Python
The urllib module handles URLs for Python over different protocols. It is popular among web scrapers who want to obtain data from a particular website.
urllib contains classes, methods, and functions for operations such as opening URLs, parsing them, and reading robots.txt files. It is split into four submodules: request, error, parse, and robotparser.
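As a quick illustration, here is a minimal sketch of the request and parse submodules (https://example.com is only a placeholder target):
from urllib import parse, request

# Parse a URL into its components with urllib.parse.
url = "https://example.com"
parts = parse.urlparse(url)
print(parts.scheme, parts.netloc)  # https example.com

# Open the URL with urllib.request and read the status code.
with request.urlopen(url) as response:
    print(response.status)  # 200 when the request succeeds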
Check robots.txt to Prevent urllib HTTP Error 403 Forbidden Message
When using the urllib module to interact with servers via the request submodule, we might experience specific errors. One of those errors is the HTTP 403 error.
We get the urllib.error.HTTPError: HTTP Error 403: Forbidden error message from the urllib package while reading a URL. HTTP 403, the Forbidden error, is an HTTP status code indicating that the server refuses to authorize access to the requested resource.
Therefore, when we see this kind of error message, urllib.error.HTTPError: HTTP Error 403: Forbidden, the server understands the request but decides not to process or authorize it.
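Instead of letting this exception crash the script, we can catch it and inspect the status code. A minimal sketch (the URL is only a placeholder):
import urllib.error
import urllib.request

# Sketch: catch HTTPError and print the status code and reason.
try:
    urllib.request.urlopen("https://example.com/forbidden-path")  # placeholder URL
except urllib.error.HTTPError as e:
    print(e.code, e.reason)  # e.g. 403 Forbidden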
To understand why the website we are accessing is not processing our request, we need to check an important file, robots.txt. Before web scraping or otherwise interacting with a website, it is often advisable to review this file to know what to expect and avoid further trouble.
To check it on any website, we can follow the format below.
https://<website.com>/robots.txt
For example, check the YouTube, Amazon, and Google robots.txt files.
https://www.youtube.com/robots.txt
https://www.amazon.com/robots.txt
https://www.google.com/robots.txt
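We can also fetch one of these files from Python itself. A small sketch (robots.txt files are normally publicly readable, though their exact content can change over time):
import urllib.request

# Download YouTube's robots.txt and print its first few lines.
with urllib.request.urlopen("https://www.youtube.com/robots.txt") as response:
    robots_txt = response.read().decode("utf-8")

print("\n".join(robots_txt.splitlines()[:10]))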
Checking YouTube's robots.txt gives the following result.
# robots.txt file for YouTube
# Created in the distant future (the year 2000) after
# the robotic uprising of the mid-'90s wiped out all humans.
User-agent: Mediapartners-Google*
Disallow:
User-agent: *
Disallow: /channel/*/community
Disallow: /comment
Disallow: /get_video
Disallow: /get_video_info
Disallow: /get_midroll_info
Disallow: /live_chat
Disallow: /login
Disallow: /results
Disallow: /signup
Disallow: /t/terms
Disallow: /timedtext_video
Disallow: /user/*/community
Disallow: /verify_age
Disallow: /watch_ajax
Disallow: /watch_fragments_ajax
Disallow: /watch_popup
Disallow: /watch_queue_ajax
Sitemap: https://www.youtube.com/sitemaps/sitemap.xml
Sitemap: https://www.youtube.com/product/sitemap.xml
We can notice a lot of Disallow directives there. Each Disallow directive marks a part of the website that crawlers are not allowed to access; therefore, any automated request to those areas will not be processed and is forbidden. In other robots.txt files, we might also see an Allow directive. For example, http://youtube.com/comment is forbidden to any external request, even one made with the urllib module.
Let’s write code to scrape data from a website that returns an HTTP 403 error when accessed.
Example Code:
import urllib.request
import re

# The server rejects this request and raises HTTP Error 403
# before any of the parsing below ever runs.
webpage = urllib.request.urlopen(
    "https://www.cmegroup.com/markets/products.html?redirect=/trading/products/#cleared=Options&sortField=oi"
).read()

# urlopen().read() returns bytes, so the patterns are bytes as well.
findrows = re.compile(b'<tr class="- banding(?:On|Off)">(.*?)</tr>')
findlink = re.compile(b'<a href=">(.*)</a>')

row_array = re.findall(findrows, webpage)
links = re.findall(findlink, webpage)

print(len(row_array))
Output:
Traceback (most recent call last):
File "c:\Users\akinl\Documents\Python\index.py", line 7, in <module>
webpage = urllib.request.urlopen('https://www.cmegroup.com/markets/products.html?redirect=/trading/products/#cleared=Options&sortField=oi').read()
File "C:\Python310\lib\urllib\request.py", line 216, in urlopen
return opener.open(url, data, timeout)
File "C:\Python310\lib\urllib\request.py", line 525, in open
response = meth(req, response)
File "C:\Python310\lib\urllib\request.py", line 634, in http_response
response = self.parent.error(
File "C:\Python310\lib\urllib\request.py", line 563, in error
return self._call_chain(*args)
File "C:\Python310\lib\urllib\request.py", line 496, in _call_chain
result = func(*args)
File "C:\Python310\lib\urllib\request.py", line 643, in http_error_default
raise HTTPError(req.full_url, code, msg, hdrs, fp)
urllib.error.HTTPError: HTTP Error 403: Forbidden
The reason is that we are forbidden from accessing the website. If we check the robots.txt file, we will notice that https://www.cmegroup.com/markets/ is not listed under a Disallow directive. However, if we go further down the robots.txt file for the website we wanted to scrape, we will find the following.
User-agent: Python-urllib/1.17
Disallow: /
The above rules mean that the user agent named Python-urllib is not allowed to crawl any URL on the site. In other words, crawling the site with the default user agent of the Python urllib module is not allowed.
Therefore, check or parse robots.txt to know which resources we have access to. We can parse the robots.txt file using the robotparser module, as sketched below. This can prevent our code from running into the urllib.error.HTTPError: HTTP Error 403: Forbidden error message.
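A minimal sketch with urllib.robotparser, using the YouTube rules shown earlier (the exact results depend on the live robots.txt):
import urllib.robotparser

# Load and parse the robots.txt file, then ask whether a user agent may fetch a URL.
rp = urllib.robotparser.RobotFileParser()
rp.set_url("https://www.youtube.com/robots.txt")
rp.read()

# /results is listed under Disallow for all user agents, so this should be False.
print(rp.can_fetch("Python-urllib", "https://www.youtube.com/results"))
# /feed is not disallowed in the rules shown above, so this should be True.
print(rp.can_fetch("Python-urllib", "https://www.youtube.com/feed"))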
Adding Cookie to the Request Headers to Solve urllib HTTP Error 403 Forbidden Message
Passing a valid user agent as a header parameter will often fix the problem quickly. In addition, the website may use cookies as an anti-scraping measure: it may set cookies and expect them to be echoed back with every request, rejecting requests that do not include them, because scraping may be against its policy.
from urllib.request import Request, urlopen


def get_page_content(url, head):
    # Build a Request with browser-like headers and open it.
    req = Request(url, headers=head)
    return urlopen(req)


url = "https://example.com"
head = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.84 Safari/537.36",
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "Accept-Charset": "ISO-8859-1,utf-8;q=0.7,*;q=0.3",
    "Accept-Encoding": "none",
    "Accept-Language": "en-US,en;q=0.8",
    "Connection": "keep-alive",
    "Referer": "https://example.com",
    "Cookie": """your cookie value ( you can get that from your web page) """,
}

data = get_page_content(url, head).read()
print(data)
Output:
<!doctype html>\n<html>\n<head>\n <title>Example Domain</title>\n\n <meta
...
<p><a href="https://www.iana.org/domains/example">More information...</a></p>\n</div>\n</body>\n</html>\n'
As shown, passing a valid user agent (and a cookie when the site requires one) in the request headers quickly fixes the problem.
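If no cookie is required, a smaller sketch that sends only a browser-like User-Agent header is often enough (https://example.com is again just a placeholder):
from urllib.request import Request, urlopen

# Sketch: only a User-Agent header, no cookies.
req = Request("https://example.com", headers={"User-Agent": "Mozilla/5.0"})
with urlopen(req) as response:
    print(response.status)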
Use Session Object to Solve urllib HTTP Error 403 Forbidden Message
Sometimes, even using a user agent won’t stop this error from occurring. The Session object of the requests module can then be used.
import requests

url = "https://stackoverflow.com/search?q=html+error+403"

# A Session object persists cookies and connection state across requests,
# which lets some sites accept the request where a bare urlopen is rejected.
session_obj = requests.Session()
response = session_obj.get(url, headers={"User-Agent": "Mozilla/5.0"})
print(response.status_code)
Output:
200
This article examined the cause of the urllib.error.HTTPError: HTTP Error 403: Forbidden error and ways to handle it. The error is typically triggered by server-side protections such as mod_security, as different websites use different security mechanisms to distinguish human visitors from automated clients (bots).