How to Read CSV Data From String in Pandas
The main aim of this article is to demonstrate how to read CSV text data from a string or package data using Pandas in Python.
Python Pandas read_csv
From String
the Problem
Data can be stored in multiple forms; two of the main forms are defined as structured and unstructured. These main forms can then be further divided into many forms, CSV (comma-separated values) being one of them.
While reading data from a CSV file, be it for some complex queries or processing, depending on the situation, we may need to read data from a string or a certain package.
Consider the following code:
import pandas as pd
import pkgutil
from io import StringIO
def get_data_file(pkg, path):
f = StringIO()
contents = unicode(pkgutil.get_data("pymc.examples", "data/wells.dat"))
f.write(contents)
return f
wells = get_data_file("pymc.examples", "data/wells.dat")
data = pd.read_csv(wells, delimiter=" ", index_col="id", dtype={"switch": np.int8})
Output:
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 401, in parser_f
return _read(filepath_or_buffer, kwds)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 209, in _read
parser = TextFileReader(filepath_or_buffer, **kwds)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 509, in __init__
self._make_engine(self.engine)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 611, in _make_engine
self._engine = CParserWrapper(self.f, **self.options)
File "/usr/local/lib/python2.7/dist-packages/pandas/io/parsers.py", line 893, in __init__
self._reader = _parser.TextReader(src, **kwds)
File "parser.pyx", line 441, in pandas._parser.TextReader.__cinit__ (pandas/src/parser.c:3940)
File "parser.pyx", line 551, in pandas._parser.TextReader._get_header (pandas/src/parser.c:5096)
pandas._parser.CParserError: Passed header=0 but only 0 lines in file
In the following code, a function named get_data_file
is created to obtain data from the package. To perform this operation, it takes two methods named pkg
and path
, which store the package’s name and the file’s path, respectively.
The returned value of get_data_file
is then passed on to the read_csv
method to read the data, but as seen in the output, an error is returned of type CParserError
with the description Passed header=0 but only 0 lines in file
.
the Solution
The solution to this problem can be approached in two different ways based on the requirement.
Approach 1 - For String With io.StringIO
Consider the following code:
import pandas as pd
from io import StringIO
df = pd.read_csv(
StringIO(
"id,switch, arsenic, dist, assoc, educ\n"
"1, 2.36, 16.826000, 0, 0, 0\n"
"2, 1, 0.71, 47.321999, 0, 0\n"
"3, 0, 2.07, 20.966999, 0, 10\n"
"4, 1, 1.15, 21.486000, 0, 12\n"
"5, 1, 1.10, 40.874001, 1, 14\n"
"6, 2, 1.22, 23.123131, 1, 23\n"
"7, 5, 5.33, 29.232322, 0, 38\n"
"8, 1, 2.38, 90.222221, 1, 10\n"
"9, 9, 2.01, 10.222334, 0, 0\n"
"10, 0, 9.12, 20.324252, 0, 10\n"
),
index_col=0,
)
print(df)
Output:
switch arsenic dist assoc educ
id
1 2.36 16.826 0.000000 0 0
2 1.00 0.710 47.321999 0 0
3 0.00 2.070 20.966999 0 10
4 1.00 1.150 21.486000 0 12
5 1.00 1.100 40.874001 1 14
6 2.00 1.220 23.123131 1 23
7 5.00 5.330 29.232322 0 38
8 1.00 2.380 90.222221 1 10
9 9.00 2.010 10.222334 0 0
10 0.00 9.120 20.324252 0 10
Using the StringIO
module, it is possible to pass a string to the read_csv
method to read CSV text data from a string.
The StringIO
module helps in executing methods that normally require “file-like” objects; by passing a string to its constructor, it creates a “file-like” object which can then be passed to methods requiring a “file-like” object.
It is important to note that in Python 2.7, the StringIO
module was defined as StringIO
, while in Python 3+, it has shifted to io.StringIO
.
Proper implementation must be implemented to handle the cases where the code is supposed to be flexible and is targeted to run on both Python2 and Python3.
Approach 2 - For Package With io.BytesIO
Consider the following code:
import numpy as np
import pandas as pd
import io
import pkgutil
wells = pkgutil.get_data("pymc.examples", "data/wells.dat")
print("Data Type: ", str(type(wells)))
df = pd.read_csv(
io.BytesIO(wells),
encoding="utf8",
sep=" ",
index_col="id",
dtype={"switch": np.int8},
)
print(df.head())
Output:
Data Type: <class 'bytes'>
switch arsenic dist assoc educ
id
1 1 2.36 16.826000 0 0
2 1 0.71 47.321999 0 0
3 0 2.07 20.966999 0 10
4 1 1.15 21.486000 0 12
5 1 1.10 40.874001 1 14
In the case of reading CSV data from a package, we can first use the pkgutil.get_data
method to fetch the data, then to read the CSV data from the fetched data, we can use the read_csv
method.
It is important to note that we must use the io.BytesIO
method to pass the data fetched from the pkgutil.get_data
method. And to mention the encoding of data, which is utf8
in our case.
Separators are defined using the sep
argument, and the index column should be denoted using the index_col
argument.
All in all, the read_csv
method takes the following arguments in the code mentioned above.
io.BytesIO(wells)
corresponds to the data fetched from the desired package and must be passed in bytes form using theio.BytesIO
method to convert it into bytes and pass it as a “file-like” object due to the “file-like” object requirement ofread_csv
.encoding='utf8'
refers to the text encoding of the input, which can be different values based on the type of data passed.sep=' '
denotes what type of separators the data consists of. It can be different based on how the supposed CSV data was created in the first place.index_col="id"
refers to the name of the index column of data; again, it depends on the data and can vary from file to file.
Hello! I am Salman Bin Mehmood(Baum), a software developer and I help organizations, address complex problems. My expertise lies within back-end, data science and machine learning. I am a lifelong learner, currently working on metaverse, and enrolled in a course building an AI application with python. I love solving problems and developing bug-free software for people. I write content related to python and hot Technologies.
LinkedIn