Chunksize in Pandas
The pandas library in Python allows us to work with DataFrames, in which data is organized into rows and columns.
We can read data into a DataFrame from multiple sources.
In real-life situations, we often deal with datasets that contain thousands or even millions of rows and columns. Depending on the source, such a dataset can be read into a DataFrame in different ways.
Chunksize in Pandas
Sometimes, we use the chunksize parameter while reading large datasets to divide the dataset into chunks. We specify the size of these chunks with the chunksize parameter.
This reduces memory consumption and improves the efficiency of the code.
First, let us read a CSV file without using the chunksize parameter in the read_csv() function. In our example, we will read a sample dataset containing movie ratings.
import pandas as pd

# read the entire CSV file into a single DataFrame
df = pd.read_csv("ratings.csv")
# shape returns a (rows, columns) tuple
print(df.shape)
# df.info without parentheses is the bound method, so print shows its repr
print(df.info)
Output:
(25000095, 4)
<bound method DataFrame.info of userId movieId rating timestamp
0 1 296 5.0 1147880044
1 1 306 3.5 1147868817
2 1 307 5.0 1147868828
3 1 665 5.0 1147878820
4 1 899 3.5 1147868510
... ... ... ... ...
25000090 162541 50872 4.5 1240953372
25000091 162541 55768 2.5 1240951998
25000092 162541 56176 2.0 1240950697
25000093 162541 58559 4.0 1240953434
25000094 162541 63876 5.0 1240952515
[25000095 rows x 4 columns]>
In the above example, we read the given dataset and display its details. The shape attribute returns the number of rows and columns, 25000095 and 4, respectively.
We also print df.info. Note that info is a method, not an attribute; because we did not call it, print displays the bound method's representation, which includes the DataFrame itself. Calling df.info() instead would print a concise summary of the columns, their data types, and memory usage.
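To see how much memory the fully loaded DataFrame occupies, we can ask pandas directly. The following is a minimal sketch, assuming the same ratings.csv file has already been read into df as above.

# info() with memory_usage="deep" prints a column summary along with
# the exact memory footprint of the DataFrame
df.info(memory_usage="deep")
# memory_usage(deep=True) returns the per-column usage in bytes;
# summing it gives the total size of the DataFrame in memory
print(df.memory_usage(deep=True).sum())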
We can see that this dataset contains 25000095 rows, and it takes a lot of the computer's memory to process such a large dataset. In such cases, we can use the chunksize parameter.
For this, let us first understand what iterators are in Python.
An iterable sequence can be looped over using a for loop. Internally, the for loop calls the iter() function on such objects to create an iterator.
We can then access the elements of the sequence one at a time with the next() function.
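As a quick illustration, the sketch below builds an iterator from a plain Python list and steps through it manually with next().

numbers = [10, 20, 30]
# iter() returns an iterator object for the list
it = iter(numbers)
# next() fetches one element at a time
print(next(it))  # 10
print(next(it))  # 20
print(next(it))  # 30
# one more call to next(it) would raise StopIteration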
When we use the chunksize parameter, we get an iterator. We can iterate through this object to get the values.
import pandas as pd

# read the CSV file in chunks of 10000000 rows each
df = pd.read_csv("ratings.csv", chunksize=10000000)
# each iteration yields one chunk as a DataFrame
for i in df:
    print(i.shape)
Output:
(10000000, 4)
(10000000, 4)
(5000095, 4)
In the above example, we specify the chunksize parameter, and pandas reads the dataset in chunks with the given number of rows. For our dataset, we get three chunks when we set the chunksize parameter to 10000000.
The returned object is not a DataFrame but a pandas.io.parsers.TextFileReader object.
We can iterate through this object and access the chunks. Note that the number of columns is the same for every chunk, which means that the chunksize parameter only splits the data by rows.
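Because each chunk is an ordinary DataFrame, we can process the file piece by piece and combine the partial results. The sketch below computes the overall mean of the rating column without holding the full dataset in memory; it assumes the same ratings.csv file and column names used above.

import pandas as pd

total = 0.0
count = 0
# each chunk is a DataFrame with at most 10000000 rows
for chunk in pd.read_csv("ratings.csv", chunksize=10000000):
    total += chunk["rating"].sum()
    count += len(chunk)
print(total / count)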
This parameter is also available in other functions that read data from other sources, such as pandas.read_json, pandas.read_stata, pandas.read_sql_table, pandas.read_sas, and more. It is recommended to check the official documentation before using this parameter to confirm its availability.
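For example, pandas.read_json accepts the chunksize parameter only for line-delimited JSON, i.e. when lines=True is also passed. The sketch below assumes a hypothetical reviews.jsonl file with one JSON record per line.

import pandas as pd

# chunksize with read_json requires lines=True (newline-delimited JSON)
reader = pd.read_json("reviews.jsonl", lines=True, chunksize=100000)
# the reader yields one DataFrame per chunk, just like read_csv
for chunk in reader:
    print(chunk.shape)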