How to Convert XML to CSV Using Python
-
Convert
XML
toCSV
Using Pythonxml
Module -
Convert
XML
toCSV
Using Pythonxmltodict
Library -
Convert XML to CSV Using Python
lxml
andCSV
Libraries
XML
or Extensible Markup Language
is a markup language that stores data in a document in both human-readable and machine-readable format. It has a file extension, .xml
.
CSV
or Comma Separated Values
stores data in a document separated by a comma. It has a file extension, .csv
. In this article, we will learn how to convert XML
data to CSV
data using Python.
Convert XML
to CSV
Using Python xml
Module
To convert XML
to CSV
, we can use the in-built xml
module in Python.
This module has yet another module, ElementTree
, that we can use to represent an XML
document as a tree. It has a method parse()
that accepts an XML
document as a parameter.
It parses the whole XML
document and stores it in the form of a tree. The tree has a getroot()
method that returns the root element of the tree.
Using a for
loop, we can iterate over the tree and access the data using the tag names (name
, rollnumber
, and age
).
To understand the conversion, we need some sample XML
data. You can use the following XML
data and store it in a file named input.xml
.
The following code snippet will use this data, and you should too so that we are on the same page.
<students>
<student>
<name>Rick Grimes</name>
<rollnumber>1</rollnumber>
<age>15</age>
</student>
<student>
<name>Lori Grimes</name>
<rollnumber>2</rollnumber>
<age>16</age>
</student>
<student>
<name>Carl Grimes</name>
<rollnumber>3</rollnumber>
<age>14</age>
</student>
<student>
<name>Judith Grimes</name>
<rollnumber>4</rollnumber>
<age>13</age>
</student>
</students>
Below is the Python code that performs the conversion from XML to CSV:
import xml.etree.ElementTree as ET
import pandas as pd
# Parse the XML file
tree = ET.parse("students.xml")
root = tree.getroot()
# Extracting the data and headers dynamically
data = []
headers = []
for student in root:
student_data = []
for detail in student:
if detail.tag not in headers:
headers.append(detail.tag)
student_data.append(detail.text)
data.append(student_data)
# Creating a DataFrame
df = pd.DataFrame(data, columns=headers)
# Writing the DataFrame to a CSV file
df.to_csv("students.csv", index=False)
CSV Content:
name,rollnumber,age
Rick Grimes,1,15
Lori Grimes,2,16
Carl Grimes,3,14
Judith Grimes,4,13
Detailed Explanation
1. Importing Libraries
import xml.etree.ElementTree as ET
import pandas as pd
We begin by importing the ElementTree library, which is included in Python’s standard library and is used for parsing and creating XML data. We also import the Pandas library, a powerful tool for data manipulation and analysis.
2. Parsing the XML File
tree = ET.parse("students.xml")
root = tree.getroot()
The ET.parse()
function is used to parse the XML file. It reads the file, parses the XML data, and returns an ElementTree object. We then obtain the root element of the XML document using the getroot()
method.
3. Extracting Data and Headers Dynamically
data = []
headers = []
for student in root:
student_data = []
for detail in student:
if not headers:
headers.append(detail.tag)
student_data.append(detail.text)
data.append(student_data)
We initialize two empty lists, data
and headers
. We then iterate over each student element in the root of the XML document. For each student, we initialize an empty list named student_data
to hold the data for that particular student.
We then iterate over each child element of the student element, extracting the tag names to form the headers and the text content to form the data. We check if the headers
list is empty; if it is, we append the tag names. This ensures that we only populate the headers list once. The text content of each child element is appended to the student_data
list, which is then appended to the data
list to form a nested list of all students’ data.
4. Creating a DataFrame
df = pd.DataFrame(data, columns=headers)
We use Pandas to convert the extracted data into a DataFrame. The pd.DataFrame()
function is called with the data
list, and the columns
parameter is set to the headers
list. This organizes the data into a tabular format, making it easy to manipulate and analyze.
5. Writing the DataFrame to a CSV File
df.to_csv("students.csv", index=False)
Finally, we use the to_csv()
method of the DataFrame object to write the data to a CSV file named ‘students.csv’. The index=False
parameter ensures that the index is not written into the CSV file.
Convert XML
to CSV
Using Python xmltodict
Library
xmltodict
is a Python library that provides a convenient way to work with XML data in a more Pythonic and dictionary-like manner. It allows you to parse XML documents and convert them into easily navigable Python dictionaries. This makes it simpler to extract and manipulate data from XML files without the need for complex parsing or traversal code.
With xmltodict
, you can convert XML data into a nested dictionary structure, where XML elements become dictionary keys, and their corresponding values are either nested dictionaries or lists, depending on the XML structure. This library streamlines the process of extracting specific data from XML files and simplifies working with XML-based APIs.
Before we proceed, ensure that you have both xmltodict
and pandas
installed. If not, you can install them via pip:
pip install xmltodict
pip install pandas
We will use the same XML file as above and will parse this XML data, convert it to a dictionary using xmltodict
, transform it into a pandas DataFrame, and finally write it to a CSV file.
Python Code
Here is the Python code that performs the conversion:
import xmltodict
import pandas as pd
import json
# Load and parse the XML file
with open("students.xml", "r") as file:
xml_content = file.read()
# Convert XML to dictionary
data_dict = xmltodict.parse(xml_content)
# Convert dictionary to JSON
json_data = json.dumps(data_dict["students"]["student"])
# Convert JSON to DataFrame
df = pd.read_json(json_data)
# Write DataFrame to CSV
df.to_csv("students.csv", index=False)
Explanation
-
Loading and Parsing the XML File:
with open("students.xml", "r") as file: xml_content = file.read()
- Here, we open the ‘students.xml’ file in read mode (‘r’) and read its content into the
xml_content
variable. This XML file likely contains information about students.
-
Converting XML to Dictionary:
data_dict = xmltodict.parse(xml_content)
- The
xmltodict.parse()
function is used to parse the XML content stored inxml_content
and convert it into a Python dictionary. This dictionary represents the structured data from the XML file.
-
Converting Dictionary to JSON:
json_data = json.dumps(data_dict["students"]["student"])
- We extract the ‘student’ data from the parsed dictionary using
data_dict['students']['student']
, which represents a list of student records. - Then, we use
json.dumps()
to convert this extracted data into a JSON formatted string, stored in thejson_data
variable.
-
Converting JSON to DataFrame:
df = pd.read_json(json_data)
- With the JSON-formatted data in
json_data
, we utilizepd.read_json()
to convert it into a pandas DataFrame. This DataFrame will now hold the student data in a tabular format, making it easier to work with.
-
Writing DataFrame to CSV:
df.to_csv("students.csv", index=False)
- Finally, we take advantage of the DataFrame’s
to_csv()
method to export the student data into a CSV file named ‘students.csv’. Theindex=False
parameter ensures that the DataFrame’s index is not included in the CSV file.
Output
The resulting CSV file, students.csv
, will contain the following data:
name,rollnumber,age
Rick Grimes,1,15
Lori Grimes,2,16
Carl Grimes,3,14
Judith Grimes,4,13
Convert XML to CSV Using Python lxml
and CSV
Libraries
Next, we will delve into the process of converting an XML file to a CSV file using the lxml
and csv
libraries in Python.
The lxml
library is a Python library that provides a way to work with XML and HTML documents. It is a high-performance, production-quality library that is easy to use and implements the ElementTree API, a Pythonic binding for the C libraries libxml2 and libxslt. It is feature-rich and includes support for XPath, XSLT, Relax NG, and more.
Ensure that you have the lxml
library installed. If not, you can install it via pip:
pip install lxml
We will parse the same XML data using lxml
, extract the required data, and then use the csv
library to write this data into a CSV file.
Python Code
Here is the Python code that performs the conversion:
from lxml import etree
import csv
# Load and parse the XML file
tree = etree.parse("students.xml")
root = tree.getroot()
# Open a CSV file in write mode
with open("students.csv", mode="w", newline="") as file:
writer = csv.writer(file)
# Write the header dynamically
headers = [element.tag for element in root.find(".//student")]
writer.writerow(headers)
# Extract data from XML and write to CSV dynamically
for student in root.findall("student"):
row_data = [element.text for element in student]
writer.writerow(row_data)
Explanation
1. Loading and Parsing the XML File
tree = etree.parse("students.xml")
root = tree.getroot()
In these lines, the etree.parse()
function from the lxml
library is used to parse the XML file named ‘students.xml’. The parse()
function reads the file, parses the XML data, and returns an ElementTree object. The getroot()
method is then called on this object to get the root element of the XML document. The root element is the parent element that holds all other elements in the XML file.
2. Opening the CSV File in Write Mode
with open("students.csv", mode="w", newline="") as file:
writer = csv.writer(file)
Here, a CSV file named ‘students.csv’ is opened in write mode. The with
statement ensures that the file is properly closed after its suite finishes. The csv.writer()
function is used to create a writer object for writing into the CSV file.
3. Dynamically Writing the Headers
headers = [element.tag for element in root.find(".//student/*")]
writer.writerow(headers)
This part is crucial as it ensures the code’s flexibility to handle any XML structure. The headers for the CSV file are dynamically extracted from the XML file. The find()
method, combined with an XPath expression, is used to locate the first ‘student’ element in the XML file, and a list comprehension retrieves the tag names of all its child elements. These tag names become the headers of the CSV file, ensuring that no hardcoding of header names is necessary.
4. Extracting Data and Writing Rows Dynamically
for student in root.findall("student"):
row_data = [element.text for element in student]
writer.writerow(row_data)
In this segment, a for
loop iterates over all ‘student’ elements found in the root of the XML document. For each ‘student’ element, another list comprehension is used to extract the text content of all its child elements. This data is stored in the row_data
list, which is then written into the CSV file as a new row using the writerow()
method of the writer object.
Output
The resulting CSV file, students.csv
, will contain the following data:
Name,Roll Number,Age
Rick Grimes,1,15
Lori Grimes,2,16
Carl Grimes,3,14
Judith Grimes,4,13
Related Article - Python XML
- How to Pretty Print XML Output Pretty in Python
- How to Create an XML Parser in Python
- How to Convert XML to Dictionary in Python
- How to Convert XML to JSON in Python
Related Article - Python CSV
- How to Import Multiple CSV Files Into Pandas and Concatenate Into One DataFrame
- How to Split CSV Into Multiple Files in Python
- How to Compare Two CSV Files and Print Differences Using Python
- How to Convert XLSX to CSV File in Python
- How to Write List to CSV Columns in Python
- How to Write to CSV Line by Line in Python