How to Create an XML Parser in Python
- Use the ElementTree API to Parse an XML Document in Python
- Use the minidom Module to Parse an XML Document in Python
- Use the Beautiful Soup Library to Parse an XML Document in Python
- Use the xmltodict Library to Parse an XML Document in Python
- Use the lxml Library to Parse an XML Document in Python
- Use the untangle Module to Parse an XML Document in Python
- Use the declxml Library to Parse an XML Document in Python
XML (eXtensible Markup Language) is a self-descriptive markup language used to store and transport data. Python offers several standard-library modules and third-party libraries for parsing and modifying XML documents.
This tutorial demonstrates different ways to parse an XML document in Python.
Use the ElementTree API to Parse an XML Document in Python
The xml.etree.ElementTree module provides a simple and efficient API for parsing and creating XML data.
The following code uses the xml.etree.ElementTree module to parse an XML document in Python.
# >= Python 3.3 code
import xml.etree.ElementTree as ET
file1 = """<foo>
<bar>
<type foobar="Hello"/>
<type foobar="God"/>
</bar>
</foo>"""
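# fromstring() returns the root element (<foo>), not an ElementTree object.
root = ET.fromstring(file1)
# Find every <type> element nested inside <bar>.
for item in root.findall("bar/type"):
    print(item.get("foobar"))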
Output:
Hello
God
Here, we pass the XML data as a string within triple quotes. We can also load an actual XML file with the help of the parse() function of the ElementTree module.
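As a minimal sketch of the parse() route mentioned above, the snippet below writes the same sample to a temporary file and reads it back. The file name sample.xml and the variable names are assumptions made for illustration; in practice you would pass the path of your own XML file.

```python
import os
import tempfile
import xml.etree.ElementTree as ET

XML_TEXT = """<foo>
  <bar>
    <type foobar="Hello"/>
    <type foobar="God"/>
  </bar>
</foo>"""

# Write the sample to a temporary file so the example is self-contained.
# "sample.xml" is a hypothetical name; any path to an XML file would do.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "sample.xml")
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(XML_TEXT)

    tree = ET.parse(path)  # parse() returns an ElementTree object
    root = tree.getroot()  # getroot() gives the <foo> element

    # Collect the foobar attribute of every <type> element inside <bar>.
    values = [item.get("foobar") for item in root.findall("bar/type")]

print(root.tag)
print(values)
```

Unlike fromstring(), which returns the root element directly, parse() returns an ElementTree object, so getroot() is needed before searching the tree.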
The cElementTree module was the C implementation of the ElementTree API. It offered the same interface but was optimized: it could parse about 15-20 times faster than the pure-Python ElementTree module and used considerably less memory.
However, cElementTree has been deprecated since Python 3.3, where xml.etree.ElementTree itself uses the fast C implementation whenever it is available. The cElementTree module was removed entirely in Python 3.9.
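This section's introduction notes that the same API can also create XML data. The following is a small sketch of that direction, rebuilding the sample document from the earlier example; the variable names are chosen here for illustration only.

```python
import xml.etree.ElementTree as ET

# Build <foo><bar><type .../><type .../></bar></foo> element by element.
root = ET.Element("foo")
bar = ET.SubElement(root, "bar")
for word in ("Hello", "God"):
    # Keyword arguments become XML attributes on the new element.
    ET.SubElement(bar, "type", foobar=word)

# encoding="unicode" makes tostring() return a str instead of bytes.
xml_text = ET.tostring(root, encoding="unicode")
print(xml_text)

# Round trip: parse the generated text and read the attributes back.
round_trip = [item.get("foobar") for item in ET.fromstring(xml_text).findall("bar/type")]
print(round_trip)
```

The round trip at the end is only a sanity check that the generated text parses back to the same attribute values.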
Use the minidom Module to Parse an XML Document in Python
The xml.dom.minidom module is a minimal implementation of the Document Object Model (DOM) interface. A DOM application typically begins by parsing XML into a DOM tree, so minidom is a quick way to get started if you already know the DOM API, although it is not necessarily the fastest parser.
The following code uses the parse() function from the minidom module to parse an XML document in Python.
XML File (sample1.xml):
<data>
<strings>
<string name="Hello"></string>
<string name="God"></string>
</strings>
</data>
Python Code:
from xml.dom import minidom
xmldoc = minidom.parse("sample1.xml")
stringlist = xmldoc.getElementsByTagName("string")
print(len(stringlist))
print(stringlist[0].attributes["name"].value)
# Print the name attribute of every <string> element.
for x in stringlist:
    print(x.attributes["name"].value)
Output:
2
Hello
God
This module also allows the XML to be passed as a string, similar to the ElementTree API. However, it uses the parseString() function to achieve this.
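A short sketch of the parseString() variant follows, using the contents of sample1.xml as an inline string. It reads the attribute with getAttribute(), which is an alternative to the attributes["name"].value lookup used above; the variable names are illustrative.

```python
from xml.dom import minidom

XML_TEXT = """<data>
  <strings>
    <string name="Hello"></string>
    <string name="God"></string>
  </strings>
</data>"""

# parseString() builds the DOM tree from a string instead of a file.
xmldoc = minidom.parseString(XML_TEXT)

# getAttribute() returns the attribute value as a plain string.
names = [node.getAttribute("name") for node in xmldoc.getElementsByTagName("string")]
print(names)
```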
The Python documentation warns that both the xml.etree.ElementTree and xml.dom.minidom modules are not secure against maliciously constructed data, so take care when parsing XML from untrusted sources.
Use the Beautiful Soup Library to Parse an XML Document in Python
The Beautiful Soup library is designed for web scraping projects and for pulling data out of XML and HTML files. It is tolerant of malformed markup and can make sense of most documents it encounters.
The library also handles tree traversal for the program, and it can prettify the given source.
Beautiful Soup must be installed manually, for example with pip install beautifulsoup4, and then imported into the Python code. Beautiful Soup 4 is the current major version; its recent releases support Python 3 only, while older releases also ran on Python 2.7. Parsing XML with Beautiful Soup additionally requires the lxml library.
The following code uses the Beautiful Soup library to parse an XML document in Python.
from bs4 import BeautifulSoup
file1 = """<foo>
<bar>
<type foobar="Hello"/>
<type foobar="God"/>
</bar>
</foo>"""
# The "xml" feature selects the XML parser, which requires lxml to be installed.
a = BeautifulSoup(file1, "xml")
print(a.foo.bar.type["foobar"])
print(a.foo.bar.find_all("type"))
Output:
Hello
[<type foobar="Hello"/>, <type foobar="God"/>]
Beautiful Soup is convenient and forgiving, but it is generally slower than using a parser such as lxml directly, and its API can take some getting used to.
Use the xmltodict Library to Parse an XML Document in Python
The xmltodict library makes working with XML feel like working with JSON. It parses an XML document into a dictionary (an ordered dictionary in older versions of the library), with attributes stored under keys prefixed by @.
The xmltodict library must be installed manually, for example with pip install xmltodict, and then imported into the Python code.
The following code uses the xmltodict library to parse an XML document in Python.
import xmltodict
file1 = """<foo>
<bar>
<type foobar="Hello"/>
<type foobar="God"/>
</bar>
</foo> """
result = xmltodict.parse(file1)
print(result)
Output:
{'foo': {'bar': {'type': [{'@foobar': 'Hello'}, {'@foobar': 'God'}]}}}
Older versions of xmltodict return nested OrderedDict objects with the same keys and values.
Use the lxml Library to Parse an XML Document in Python
The lxml library provides a simple yet very powerful Python API for parsing XML and HTML files. It combines the ElementTree API with the libxml2 and libxslt libraries.
In simpler words, lxml extends the ElementTree API with support for features such as XML Schema, XPath, and XSLT.
Here, we will use the lxml.objectify module. The following code uses the lxml library to parse an XML document in Python.
from collections import defaultdict
from lxml import objectify
file1 = """<foo>
<bar>
<type foobar="1"/>
<type foobar="2"/>
</bar>
</foo>"""
c = defaultdict(int)
root = objectify.fromstring(file1)
# Iterating over root.bar.type visits every <type> sibling under <bar>.
for item in root.bar.type:
    c[item.attrib.get("foobar")] += 1
print(dict(c))
Output:
{'1': 1, '2': 1}
In this program, the c variable is a defaultdict that counts how many times each foobar attribute value occurs.
Use the untangle Module to Parse an XML Document in Python
The untangle module is an easy-to-use module that focuses on converting XML into a Python object. It can be installed using the pip command. This module works with Python 2.7 and above.
The following code uses the untangle module to parse an XML document in Python.
XML File (sample1.xml):
<foo>
<bar>
<type foobar="Hello"/>
</bar>
</foo>
Python code:
import untangle
x = untangle.parse("/path_to_xml_file/sample1.xml")
print(x.foo.bar.type["foobar"])
Output:
Hello
Use the declxml Library to Parse an XML Document in Python
The declxml library, short for Declarative XML Processing, provides a simple API for serializing and parsing XML documents. It aims to reduce the programmer's workload by replacing the long stretches of parsing logic required when using other popular APIs, such as minidom or ElementTree.
The declxml module can be installed using the pip or the pipenv command. The following code uses the declxml library to parse an XML document in Python.
import declxml as xml
xml_string = """
<foo>
<bar>
<type foobar="1"/>
<type foobar="3"/>
<type foobar="5"/>
</bar>
</foo>
"""
processor = xml.dictionary(
"foo", [xml.dictionary("bar", [xml.array(xml.integer("type", attribute="foobar"))])]
)
# Print the parsed result; without print(), a script would show no output.
print(xml.parse_from_string(processor, xml_string))
Output:
{'bar': {'foobar': [1, 3, 5]}}
In this method, we use processors to declaratively describe the structure of the given XML document and to map between XML and Python data structures.
Vaibhhav is an IT professional with a strong command of Python programming and various projects under his belt. He is eager to discover new things and is a quick learner.