How to Convert XML File to Pandas DataFrame

Luqman Khan Feb 02, 2024
  1. Convert XML File to Pandas DataFrame
  2. Conclusion
How to Convert XML File to Pandas DataFrame

This tutorial will introduce how an XML file is converted into Python Pandas DataFrame. We will use the xml.etree.ElementTree library.

Convert XML File to Pandas DataFrame

XML is a markup language used to express and disseminate data structures that are sometimes difficult to generate using more traditional tabular formats.

Below are the steps to convert an XML file into a Pandas dataframe.

Create an XML File

<data>
    <student name="John">
        <email>john@mail.com</email>
        <grade>A</grade>
        <age>16</age>
    </student>
    <student name="Alice">
        <email>alice@mail.com</email>
        <grade>B</grade>
        <age>17</age>
    </student>
    <student name="Bob">
        <email>bob@mail.com</email>
        <grade>C</grade>
        <age>16</age>
    </student>
    <student name="Hannah">
        <email>hannah@mail.com</email>
        <grade>A</grade>
        <age>17</age>
    </student>
</data>

Each student is represented in this example by a <student> element with a name attribute containing the student’s name.

Each of these elements is further subdivided into sub-elements defined by the <email>, <grade>, and <age> tags, with the actual data content referring to the given student falling between these tags.

Assume that this data is saved in an XML file called students.xml.

import xml.etree.ElementTree as et

xtree = et.parse("/content/drive/MyDrive/ABC/student.xml")
xroot = xtree.getroot()

Loop Over the Tree

We can now loop over the tree, grabbing each student element, its name property, and its sub-elements to generate our dataframe.

for node in xroot:
    s_name = node.attrib.get("name")
    s_mail = node.find("email").text
    s_grade = node.find("grade").text
    s_age = node.find("age").text

The attrib.get() function is used to obtain the name attribute, while the text content of each element may be acquired using the node’s find() function.

Each iteration will produce a data set that can be considered an observation in a Pandas dataframe. This process may be built as follows:

import pandas as pd
import xml.etree.ElementTree as et

xtree = et.parse("/content/drive/MyDrive/ABC/student.xml")
xroot = xtree.getroot()

df_cols = ["name", "email", "grade", "age"]
rows = []

for node in xroot:
    s_name = node.attrib.get("name")
    s_mail = node.find("email").text if node is not None else None
    s_grade = node.find("grade").text if node is not None else None
    s_age = node.find("age").text if node is not None else None

    rows.append({"name": s_name, "email": s_mail, "grade": s_grade, "age": s_age})

out_df = pd.DataFrame(rows, columns=df_cols)

The disadvantage of using this technique is that you must know the structure of the XML file ahead of time and hard-code column names accordingly. We can try to modify this code to a more practical and adaptable function that does not require any hard-coded values:

out_df

When we use parse XML("students.xml", ["name", "email", "grade", "age"]) on the students.xml file, we get the table shown below.

     name     email             grade   age
0    John     john@mail.com     A       16
1    Alice    alice@mail.com    B       17
2    Bob      bob@mail.com      C       16
3    Hannah   hannah@mail.com   A       17

Conclusion

Although we still need to know the fundamental structure of the input XML document we wish to parse, this is a more efficient implementation of the XML parsing function.

Related Article - Pandas DataFrame