How to Convert XML File to Pandas DataFrame
This tutorial will introduce how an XML file is converted into Python Pandas DataFrame. We will use the xml.etree.ElementTree
library.
Convert XML File to Pandas DataFrame
XML is a markup language used to express and disseminate data structures that are sometimes difficult to generate using more traditional tabular formats.
Below are the steps to convert an XML file into a Pandas dataframe.
Create an XML File
<data>
<student name="John">
<email>john@mail.com</email>
<grade>A</grade>
<age>16</age>
</student>
<student name="Alice">
<email>alice@mail.com</email>
<grade>B</grade>
<age>17</age>
</student>
<student name="Bob">
<email>bob@mail.com</email>
<grade>C</grade>
<age>16</age>
</student>
<student name="Hannah">
<email>hannah@mail.com</email>
<grade>A</grade>
<age>17</age>
</student>
</data>
Each student is represented in this example by a <student>
element with a name attribute containing the student’s name.
Each of these elements is further subdivided into sub-elements defined by the <email>
, <grade>
, and <age>
tags, with the actual data content referring to the given student falling between these tags.
Assume that this data is saved in an XML file called students.xml
.
import xml.etree.ElementTree as et
xtree = et.parse("/content/drive/MyDrive/ABC/student.xml")
xroot = xtree.getroot()
Loop Over the Tree
We can now loop over the tree, grabbing each student element, its name property, and its sub-elements to generate our dataframe.
for node in xroot:
s_name = node.attrib.get("name")
s_mail = node.find("email").text
s_grade = node.find("grade").text
s_age = node.find("age").text
The attrib.get()
function is used to obtain the name attribute, while the text content of each element may be acquired using the node’s find()
function.
Each iteration will produce a data set that can be considered an observation in a Pandas dataframe. This process may be built as follows:
import pandas as pd
import xml.etree.ElementTree as et
xtree = et.parse("/content/drive/MyDrive/ABC/student.xml")
xroot = xtree.getroot()
df_cols = ["name", "email", "grade", "age"]
rows = []
for node in xroot:
s_name = node.attrib.get("name")
s_mail = node.find("email").text if node is not None else None
s_grade = node.find("grade").text if node is not None else None
s_age = node.find("age").text if node is not None else None
rows.append({"name": s_name, "email": s_mail, "grade": s_grade, "age": s_age})
out_df = pd.DataFrame(rows, columns=df_cols)
The disadvantage of using this technique is that you must know the structure of the XML file ahead of time and hard-code column names accordingly. We can try to modify this code to a more practical and adaptable function that does not require any hard-coded values:
out_df
When we use parse XML("students.xml", ["name", "email", "grade", "age"])
on the students.xml
file, we get the table shown below.
name email grade age
0 John john@mail.com A 16
1 Alice alice@mail.com B 17
2 Bob bob@mail.com C 16
3 Hannah hannah@mail.com A 17
Conclusion
Although we still need to know the fundamental structure of the input XML document we wish to parse, this is a more efficient implementation of the XML parsing function.