How to Remove HTML Tags From a String in Python
- Use Regex to Remove HTML Tags From a String in Python
- Use BeautifulSoup to Remove HTML Tags From a String in Python
- Use xml.etree.ElementTree to Remove HTML Tags From a String in Python
In this guide, we will learn and apply a few methods to remove HTML tags from a string: a regular expression, BeautifulSoup, and the XML element tree (xml.etree.ElementTree).
Use Regex to Remove HTML Tags From a String in Python
HTML tags are always enclosed in angle brackets (<>), so a regular expression can find them. We import the built-in re module (regular expressions) and use its compile() function to compile the pattern into a reusable pattern object.
Here, the pattern <.*?> matches a <, then zero or more characters, then a >. The ? makes the match non-greedy, so it consumes as few characters as possible and stops at the first > it finds.
The sub() function replaces every match of the pattern with another string. Here, it replaces each tag with an empty string.
Example Code:
# Python 3.x
import re
string = "<h1>Delftstack</h1>"
print("String before cleaning:", string)
to_clean = re.compile("<.*?>")
cleantext = re.sub(to_clean, "", string)
print("String after cleaning:", cleantext)
Output:
String before cleaning: <h1>Delftstack</h1>
String after cleaning: Delftstack
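To see why the non-greedy ? in the pattern matters, the short sketch below compares <.*?> with its greedy counterpart <.*> on a string that holds more than one tag. The sample string and the variable names are made up for illustration.

```python
import re

html = "<p>Hello <b>world</b></p>"

# Non-greedy: each match stops at the first ">", so only the tags are removed.
lazy = re.sub(r"<.*?>", "", html)

# Greedy: one match runs from the first "<" to the last ">",
# which swallows the text between the tags as well.
greedy = re.sub(r"<.*>", "", html)

print(repr(lazy))    # 'Hello world'
print(repr(greedy))  # ''
```

Keep in mind that this pattern is a heuristic and not an HTML parser: it removes anything between a < and the next >, so plain text such as "1 < 2 and 3 > 2" would lose its middle part too.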
Use BeautifulSoup to Remove HTML Tags From a String in Python
BeautifulSoup is a Python library for extracting data from HTML and XML. It relies on a parser to read the markup; lxml is the recommended one.
We need to install both before proceeding, using the following commands:
pip install beautifulsoup4
pip install lxml
In the following code, we import the BeautifulSoup class from the bs4 package and parse the given HTML string. We then read the text from the parsed HTML using the text attribute.
Example Code:
# Python 3.x
from bs4 import BeautifulSoup
string = "<h1>Delftstack</h1>"
print("String before cleaning:", string)
cleantext = BeautifulSoup(string, "lxml").text
print("String after cleaning:", cleantext)
Output:
String before cleaning: <h1>Delftstack</h1>
String after cleaning: Delftstack
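When the markup holds several elements, the text attribute joins their text with nothing in between. The sketch below shows one way to handle that with get_text(), which accepts a separator and a strip flag. It assumes beautifulsoup4 is installed and uses Python's built-in "html.parser" instead of lxml so that no extra parser is needed; the sample HTML is made up for illustration.

```python
from bs4 import BeautifulSoup

html = "<div><h1>Delftstack</h1><p>Remove <b>HTML</b> tags</p></div>"

# "html.parser" ships with Python; "lxml" would work here as well if installed.
soup = BeautifulSoup(html, "html.parser")

# Default call: text fragments are concatenated with no separator.
plain = soup.get_text()

# separator is placed between fragments; strip=True trims each fragment first.
spaced = soup.get_text(separator=" ", strip=True)

print(plain)   # DelftstackRemove HTML tags
print(spaced)  # Delftstack Remove HTML tags
```

Choosing a separator is a judgment call: a space keeps words from running together, while a newline may suit block-level content better.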
Use xml.etree.ElementTree to Remove HTML Tags From a String in Python
ElementTree is a standard-library module that parses and navigates XML. The fromstring() function parses XML directly from a string into an element, which is the root element of the parse tree. Because it is an XML parser, the input must be well-formed.
The itertext() method produces a text iterator that loops over this element and all its sub-elements in document order, returning all inner text. The join() method then concatenates these text pieces, separated by the string it is called on (here, an empty string), and returns a string that is free of HTML tags.
Example Code:
# Python 3.x
import xml.etree.ElementTree as ET
string = "<h1>Delftstack</h1>"
print("String before cleaning:", string)
tree = ET.fromstring(string)
print("String after cleaning:", "".join(tree.itertext()))
Output:
String before cleaning: <h1>Delftstack</h1>
String after cleaning: Delftstack
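The sketch below applies the same fromstring() and itertext() calls to nested markup. It also shows what happens when the input is not well-formed XML, such as HTML with an unclosed <br> tag. The helper name strip_tags is hypothetical, and returning None on a parse error is only one possible way to handle the failure.

```python
import xml.etree.ElementTree as ET


def strip_tags(markup):
    """Return the inner text of well-formed markup, or None if it is not valid XML."""
    try:
        root = ET.fromstring(markup)
    except ET.ParseError:
        # Raised for input that is not well-formed, e.g. an unclosed tag.
        return None
    # itertext() yields the text and tail of every element in document order.
    return "".join(root.itertext())


print(strip_tags("<div><h1>Delftstack</h1> <p>Hello <b>world</b></p></div>"))
# Delftstack Hello world
print(strip_tags("<p>line one<br>line two</p>"))
# None
```

For real-world HTML, which is often not well-formed, a lenient parser such as BeautifulSoup is usually the safer choice.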
I am Fariba Laiq from Pakistan. An android app developer, technical content writer, and coding instructor. Writing has always been one of my passions. I love to learn, implement and convey my knowledge to others.