
pandas.read_csv leaks memory while opening massive files with chunksize & iterator=True #21516

Closed
Shirui816 opened this issue Jun 18, 2018 · 6 comments
Labels
IO CSV read_csv, to_csv

Comments

@Shirui816

Shirui816 commented Jun 18, 2018

I am using Anaconda and my pandas version is 0.23.1. When dealing with a single large file, setting chunksize or iterator=True works fine and memory usage is low. The problem arises when I try to deal with 5000+ files (the file names are in filelist):

trajectory = [pd.read_csv(f, delim_whitespace=True, header=None, chunksize=10000) for f in filelist]

Memory usage rises very quickly and soon exceeds 20 GB. However, trajectory = [open(f, 'r')....] and reading 10000 lines from each file works fine.
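For reference, this is roughly the plain-open version I mean (a sketch; the exact line parsing is assumed here, my files are whitespace-delimited numeric columns):

from itertools import islice
import numpy as np

# one plain file handle per trajectory file
handles = [open(f, 'r') for f in filelist]

# read the next 10000 lines from each handle and parse them by hand
X = np.asarray([
    [[float(x) for x in line.split()] for line in islice(fh, 10000)]
    for fh in handles
])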

I also tried the low_memory=True option but it doesn't help. Both the engine='python' and memory_map=<some file> options solve the memory problem, but when I use the data with

X = np.asarray([f.get_chunk().values for f in trajectory])
FX = np.fft.fft(X, axis=0)

the multi-threading of MKL FFT no longer works.

@gfyoung gfyoung added IO CSV read_csv, to_csv Low-Memory labels Jun 18, 2018
@gfyoung
Member

gfyoung commented Jun 18, 2018

  • This might be related to Memory leak in pd.read_csv or DataFrame #21353
  • When you say you tried low_memory=True, and it's not working, what do you mean?
  • You might need to check your concatenation when using engine='python' and memory_map=...

@Shirui816
Author

Shirui816 commented Jun 19, 2018

Thanks for replying :) @gfyoung

I mean that after adding the low_memory=True option, as in

trajectory = [pd.read_csv(f, delim_whitespace=True, header=None, chunksize=10000, low_memory=True) for f in filelist]

the memory usage does not change compared to the case without this option.

@Shirui816
Author

Shirui816 commented Jun 19, 2018

The environment is:
CentOS Linux release 7.4.1708 (Core)
Python 3.6.5 :: Anaconda custom (64-bit)
with pandas version 0.23.1

From #21353 , I tracked the memory usage:

import psutil
import pandas as pd
from sys import argv

traj = []
i = 0
for f in argv[1:]:
    a = pd.read_csv(f, squeeze=0, header=None, delim_whitespace=1, chunksize=10000, comment='#')
    traj.append(a)
    if not i % 100:
        print('%s th file, memory: ' % (i),psutil.Process().memory_info().rss / 1024**2)
    i += 1

and the output:

0 th file, memory:  61.96484375
100 th file, memory:  214.66015625
200 th file, memory:  367.32421875
300 th file, memory:  520.046875
400 th file, memory:  674.76953125
500 th file, memory:  829.5
600 th file, memory:  982.22265625
700 th file, memory:  1134.9453125
800 th file, memory:  1287.66796875
900 th file, memory:  1442.3828125
1000 th file, memory:  1597.109375
1100 th file, memory:  1749.84765625
1200 th file, memory:  1932.57421875
1300 th file, memory:  2122.796875
1400 th file, memory:  2313.01953125
1500 th file, memory:  2503.2421875
...
4600 th file, memory:  8414.0234375
4700 th file, memory:  8604.24609375
4800 th file, memory:  8794.4765625
4900 th file, memory:  8984.6953125
5000 th file, memory:  9174.921875
5100 th file, memory:  9367.14453125
5200 th file, memory:  9557.37109375
5300 th file, memory:  9747.59375
5400 th file, memory:  9937.81640625
5500 th file, memory:  10128.04296875
5600 th file, memory:  10320.26953125

It turns out that the memory increases by ~1.9 MB per file. The files used in this test are about 800 kB each.

I also tried malloc_trim(0) from #2659:

import psutil
import pandas as pd
from sys import argv
from ctypes import cdll, CDLL

cdll.LoadLibrary("libc.so.6")
libc = CDLL("libc.so.6")


traj = []
i = 0
for f in argv[1:]:
    libc.malloc_trim(0)
    a = pd.read_csv(f, squeeze=0, header=None, delim_whitespace=1, chunksize=10000, comment='#')
    traj.append(a)
    if not i % 100:
        print('%s th file, memory: ' % (i),psutil.Process().memory_info().rss / 1024**2)
    i += 1

The results are the same as above; the memory usage still increases quickly.

@gfyoung
Member

gfyoung commented Jun 19, 2018

Hmm...admittedly, this is the first time I've seen so many of these issues regarding memory leakage in read_csv, and I'm still uncertain whether it has to do with DataFrame or read_csv.

cc @jreback

@Liam3851
Contributor

Liam3851 commented Jul 3, 2018

@Shirui816 You're appending the result of pd.read_csv to a list:

traj = []
for f in argv[1:]:
    a = pd.read_csv(f, squeeze=0, header=None, delim_whitespace=1, chunksize=10000, comment='#')
    traj.append(a)

Adding objects to a list means they can't be garbage collected. Thus you're keeping thousands of file handles and the related iterator objects open, so we would expect memory use to grow. I've confirmed that memory does not grow if you remove the traj.append call.
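For what it's worth, this is essentially the check I ran (a sketch using your parameters; I substituted my own whitespace-delimited test files):

import psutil
import pandas as pd
from sys import argv

i = 0
for f in argv[1:]:
    # create the reader but do not keep a reference to it; the previous
    # reader (and its file handle) is garbage collected on each iteration
    a = pd.read_csv(f, squeeze=0, header=None, delim_whitespace=1, chunksize=10000, comment='#')
    if not i % 100:
        print('%s th file, memory: ' % (i), psutil.Process().memory_info().rss / 1024**2)
    i += 1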

If the issue is that memory use is growing faster than you expect based on the file sizes (per your comment, "It turns out that the memory increases by ~1.9 MB per file. The files used in this test are about 800 kB each."), note that you're not actually reading the file all the way in the above call. Because you're using the chunksize parameter, you're creating a persistent iterator and file handle on the file. If you only want the first 10000 lines of each file, use

a = next(pd.read_csv(f, squeeze=0, header=None, delim_whitespace=1, chunksize=10000, comment='#'))

This will throw away the handle and the rest of the iterator object and keep just your data.
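Applied to your original loop, that would look something like this (a sketch, assuming you really only need the first chunk of each file):

import numpy as np
import pandas as pd

# read only the first 10000-line chunk of each file; the reader object
# and its open file handle are not kept, so they are collected right away
trajectory = [
    next(pd.read_csv(f, delim_whitespace=True, header=None, chunksize=10000))
    for f in filelist
]

X = np.asarray([df.values for df in trajectory])
FX = np.fft.fft(X, axis=0)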

@Shirui816
Author

Shirui816 commented Jul 4, 2018

@Liam3851 Thank you very much for the explanation. I increased the file size and re-ran the test, and the memory gain per file was still about 1.9 MB, so this handle is much larger than the one from the open function... emmm... Does engine='python' mean that the iterator and file handle are held by Python, like with the open function? I am wondering why, after adding this option (and/or the memory_map=... option), the parallel acceleration of MKL doesn't work anymore. I have no clue about this problem. Are there any suggested tests to find the reason? The code is in my first post: after creating a list of iterators in trajectory, I take a chunk from each handle and then perform an FFT.

The environment is:
CentOS Linux release 7.4.1708 (Core)
Python 3.6.5 :: Anaconda custom (64-bit)
with pandas version 0.23.1
