How to Fix Memory Error in Pandas
This tutorial explores the concept of memory error in Pandas.
What is the Memory Error in Pandas
While working with Pandas, an analyst may encounter a wide range of errors raised by the interpreter. Reading these errors carefully is the first step toward diagnosing the underlying issue.
In this tutorial, we aim to understand the memory error thrown by Pandas, the reason it occurs, and the potential ways in which it can be resolved.
Firstly, let us understand what this error means. A memory error means that the machine running your code does not have enough available memory to complete the operation or task you wish to perform.
This error is generally associated with files and CSV data on the order of hundreds of gigabytes, whose in-memory representation exceeds the available RAM. Understanding what causes this error helps us avoid it and make better use of the memory we have.
Solving this error can also help in designing an efficient, well-managed data pipeline.
Suppose we try to load a CSV file containing hundreds of gigabytes of data into a data frame; we will naturally face the memory error discussed above. The error looks like this:
MemoryError
Press any key to continue . . .
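One common way to sidestep this error when reading a huge CSV is to stream it in chunks instead of loading it all at once. The sketch below illustrates the idea with a small in-memory CSV standing in for a huge file on disk; the chunk size of 250 is an arbitrary illustrative value.

```python
import io

import pandas as pd

# A small in-memory CSV acting as a stand-in for a huge file on disk.
csv_buf = io.StringIO("\n".join(["a,b"] + [f"{i},{i * 2}" for i in range(1000)]))

total = 0
# chunksize makes read_csv yield DataFrames of at most 250 rows each,
# so only one chunk needs to fit in memory at a time.
for chunk in pd.read_csv(csv_buf, chunksize=250):
    # Aggregate each chunk instead of keeping all rows around.
    total += chunk["b"].sum()

print(total)
```

The aggregation step is the key design choice: as long as each chunk is reduced (summed, counted, filtered, appended to a database) before the next one is read, memory usage stays bounded regardless of the file size.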
There is a method by which we can potentially avoid this memory error. Before we get to it, let us create a dummy data frame to work with.
We will call this data frame dat1 and create it using the following code.
import numpy as np
import pandas as pd

# Note: pd.np was removed in recent Pandas versions; import NumPy directly.
dat1 = pd.DataFrame(np.random.choice(["1.0", "0.6666667", "150000.1"], (100000, 10)))
The code creates a data frame with 100000 rows and 10 columns (indexed from 0 to 9), filled with randomly chosen string values. To view the entries in the data, we use the following code.
print(dat1)
The above code gives the following output.
0 1 2 ... 7 8 9
0 1.0 1.0 1.0 ... 150000.1 0.6666667 0.6666667
1 0.6666667 0.6666667 1.0 ... 0.6666667 150000.1 0.6666667
2 1.0 1.0 150000.1 ... 150000.1 1.0 150000.1
3 150000.1 0.6666667 0.6666667 ... 1.0 150000.1 1.0
4 150000.1 0.6666667 150000.1 ... 150000.1 0.6666667 0.6666667
... ... ... ... ... ... ... ...
99995 150000.1 150000.1 1.0 ... 150000.1 1.0 0.6666667
99996 1.0 1.0 150000.1 ... 0.6666667 0.6666667 150000.1
99997 150000.1 150000.1 1.0 ... 0.6666667 150000.1 0.6666667
99998 1.0 0.6666667 0.6666667 ... 0.6666667 1.0 150000.1
99999 1.0 0.6666667 150000.1 ... 1.0 150000.1 1.0
[100000 rows x 10 columns]
How to Avoid the Memory Error in Pandas
Now let us see the total space that this data frame occupies using the following code.
import resource

# ru_maxrss reports the process's peak memory usage, in kilobytes on Linux.
# (The resource module is only available on Unix-like systems.)
print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
The code gives the following output.
# 224544 (~224 MB)
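Note that resource.getrusage measures the whole process, not the data frame alone, and works only on Unix-like systems. A portable, Pandas-native alternative is DataFrame.memory_usage(deep=True), sketched below on a data frame built the same way as dat1 above.

```python
import numpy as np
import pandas as pd

dat1 = pd.DataFrame(np.random.choice(["1.0", "0.6666667", "150000.1"], (100000, 10)))

# deep=True counts the actual bytes of each Python string object,
# not just the 8-byte object pointers stored in the columns.
bytes_used = dat1.memory_usage(deep=True).sum()
print(f"{bytes_used / 1e6:.1f} MB")
```

Because every cell here is a separate Python string object, the deep measurement is far larger than the 8 bytes per cell that the shallow (default) measurement would report.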
To avoid spending so much memory on a single data frame, let us specify exactly the data type we are dealing with.
Storing each value as a Python string object carries significant per-object overhead; storing the same values as native floating-point numbers packs them into a compact numeric array, so far less memory is required for the same data.
We can do this using the following query.
import resource

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.choice([1.0, 0.6666667, 150000.1], (100000, 10)))
print(resource.getrusage(resource.RUSAGE_SELF).ru_maxrss)
The output of the code is below.
# 79560 (~79 MB)
Since we have supplied native floating-point numbers rather than strings, Pandas stores each column as a compact float64 array instead of a column of Python string objects, and we have successfully reduced the memory required for our data.
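If the full precision of float64 is not needed, memory can be cut roughly in half again by downcasting to float32. This is a sketch of the idea, not something the tutorial above requires; float32 keeps only about seven significant decimal digits, so it is only appropriate when that precision suffices.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.choice([1.0, 0.6666667, 150000.1], (100000, 10)))

before = df.memory_usage(deep=True).sum()

# float32 halves per-value storage from 8 bytes to 4 bytes.
df32 = df.astype(np.float32)
after = df32.memory_usage(deep=True).sum()

print(before, after)
```

The same trick applies to integer columns (e.g. int64 to int32 or int16), and pd.to_numeric with its downcast parameter can pick the smallest safe type automatically.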
Thus, in this tutorial, we have learned the meaning and cause of the memory error thrown in Pandas, along with potential solutions.