
Memory leak in pd.read_csv or DataFrame #21353

Closed
kuraga opened this issue Jun 7, 2018 · 14 comments · Fixed by #23072
Labels: IO CSV read_csv, to_csv
Milestone: 0.24.0

Comments

kuraga commented Jun 7, 2018

Code Sample, a copy-pastable example if possible

import sys

# Number of rows (m) and columns (n) for the generated CSV.
m = int(sys.argv[1])
n = int(sys.argv[2])

# Write an m-by-n CSV of ones with header c0..c(n-1).
with open('df.csv', 'wt') as f:
    for i in range(n-1):
        f.write('c' + str(i) + ',')
    f.write('c' + str(n-1) + '\n')
    for j in range(m):
        for i in range(n-1):
            f.write('1,')
        f.write('1\n')


import psutil

# RSS in MiB before importing pandas.
print(psutil.Process().memory_info().rss / 1024**2)

import pandas as pd
df = pd.read_csv('df.csv')

# RSS in MiB after reading the CSV.
print(df.shape)
print(psutil.Process().memory_info().rss / 1024**2)

import gc
del df
gc.collect()

# RSS in MiB after deleting the DataFrame and forcing garbage collection.
print(psutil.Process().memory_info().rss / 1024**2)

Problem description

$ ~/miniconda3/bin/python3 g.py 1 1
11.60546875
(1, 1)
64.02734375
64.02734375

$ ~/miniconda3/bin/python3 g.py 5000000 15
11.58203125
(5000000, 15)
640.45703125
68.25

$ ~/miniconda3/bin/python3 g.py 5000000 20
11.84375
(5000000, 20)
1586.65625
823.71875 - !!!

$ ~/miniconda3/bin/python3 g.py 10000000 10
11.83984375
(10000000, 10)
830.92578125
67.984375

$ ~/miniconda3/bin/python3 g.py 10000000 15
11.89453125
(10000000, 15)
2344.3046875
1199.89453125 - !!!

Two issues:

  1. There is a "standard" leak of roughly 53 MB after reading any CSV, or even after just creating a DataFrame with pd.DataFrame().
  2. In some of the runs above (marked !!!) there is a much larger leak that survives del and gc.collect().

cc @gfyoung

Output of pd.show_versions()

(same for 0.21, 0.22, 0.23)

pandas: 0.23.0
pytest: None
pip: 9.0.3
setuptools: 39.0.1
Cython: None
numpy: 1.14.3
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.4.0
sphinx: None
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.4
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.2.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.9999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gfyoung added the IO CSV and Low-Memory labels on Jun 7, 2018
gfyoung (Member) commented Jun 7, 2018

@kuraga: Thanks for the updated issue!

cc @jreback @jorisvandenbossche

kuraga (Author) commented Jun 13, 2018

Seems like it's not a pd.read_csv-only issue...

[attached screenshot: memory_leak_2]

nynorbert commented Jun 13, 2018

I have a similar issue. I have tried to debug it with memory_profiler, but I don't see the source of the leak.
The output of the profiler:

 Line #    Mem usage    Increment   Line Contents
 ================================================
    187    261.3 MiB      0.0 MiB        if "history" in self.watch_list:
    188    491.9 MiB    230.6 MiB            self.history = pd.read_csv(self.path + '/' + self.history_files[self.current][1], delimiter=';', header=None)
    189    491.9 MiB      0.0 MiB            self.history_group = self.history.groupby([0])

This snippet of code is inside a loop, and each iteration increases the memory usage. I also tried deleting the history and history_group objects and calling gc.collect() manually, but nothing seems to work.
Is it possible that there is some cyclic dependency between history and history_group? And if so, why did deleting both history_group and history not solve the problem?

P.S.: My pandas version is 0.23.1.
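
Regarding the cyclic-dependency question above: a groupby result is not a cycle, but it does hold a reference back to its source frame, so both names must be dropped before the frame can be collected. A minimal sketch (the .obj attribute used to inspect this is an implementation detail, assumed to behave as in pandas of that era):

import pandas as pd

df = pd.DataFrame({'a': [0, 0, 1], 'b': [1, 2, 3]})
g = df.groupby('a')
print(g.obj is df)  # True: the groupby keeps a reference to df

del df  # the frame is still alive, reachable through g
del g   # only now are both objects unreferenced and collectable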

@nynorbert

Sorry, I was wrong. It is not read_csv that consumes the memory, but rather a drop:

Line #    Mem usage    Increment   Line Contents
================================================
   265   1425.1 MiB      9.6 MiB                        self.history.drop(self.history_group.get_group(self.current_timestamp).index)

And I think I found that malloc_trim solves the problem, similar to this: #2659

@kuraga Maybe you should try it.
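
For reference, a minimal sketch of calling malloc_trim from Python via ctypes (assumes Linux with glibc; malloc_trim is a glibc extension and not portable):

import ctypes
import gc

libc = ctypes.CDLL("libc.so.6")  # glibc only

def trim_memory():
    gc.collect()         # drop Python-level references first
    libc.malloc_trim(0)  # ask glibc to return freed arena pages to the OS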

@zhezherun (Contributor)

I also noticed a memory leak in read_csv and ran it through valgrind, which said that the result of the kset_from_list function was never freed. I was able to fix this leak locally by patching parsers.pyx and rebuilding pandas.
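
For anyone who wants to reproduce that kind of analysis, a typical invocation (illustrative; PYTHONMALLOC=malloc disables pymalloc so that valgrind sees every allocation), using the g.py repro script from above:

PYTHONMALLOC=malloc valgrind --leak-check=full python g.py 5000000 20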

@gfyoung, could you please review the patch below? It might also help with the leak discussed here, although I am not sure whether it is the same leak. The patch:

  • moves the allocation of na_hashset further down, closer to where it is used (otherwise it is not freed when continue is executed);
  • makes sure that na_hashset is deleted if there is an exception;
  • also cleans up the allocation inside kset_from_list before raising an exception.
--- parsers.pyx	2018-08-01 19:57:16.000000000 +0100
+++ parsers.pyx	2018-10-08 15:25:32.124526087 +0100
@@ -1054,18 +1054,6 @@
 
             conv = self._get_converter(i, name)
 
-            # XXX
-            na_flist = set()
-            if self.na_filter:
-                na_list, na_flist = self._get_na_list(i, name)
-                if na_list is None:
-                    na_filter = 0
-                else:
-                    na_filter = 1
-                    na_hashset = kset_from_list(na_list)
-            else:
-                na_filter = 0
-
             col_dtype = None
             if self.dtype is not None:
                 if isinstance(self.dtype, dict):
@@ -1090,13 +1078,26 @@
                                               self.c_encoding)
                 continue
 
-            # Should return as the desired dtype (inferred or specified)
-            col_res, na_count = self._convert_tokens(
-                i, start, end, name, na_filter, na_hashset,
-                na_flist, col_dtype)
+            # XXX
+            na_flist = set()
+            if self.na_filter:
+                na_list, na_flist = self._get_na_list(i, name)
+                if na_list is None:
+                    na_filter = 0
+                else:
+                    na_filter = 1
+                    na_hashset = kset_from_list(na_list)
+            else:
+                na_filter = 0
 
-            if na_filter:
-                self._free_na_set(na_hashset)
+            try:
+                # Should return as the desired dtype (inferred or specified)
+                col_res, na_count = self._convert_tokens(
+                    i, start, end, name, na_filter, na_hashset,
+                    na_flist, col_dtype)
+            finally:
+                if na_filter:
+                    self._free_na_set(na_hashset)
 
             if upcast_na and na_count > 0:
                 col_res = _maybe_upcast(col_res)
@@ -2043,6 +2044,7 @@
 
         # None creeps in sometimes, which isn't possible here
         if not PyBytes_Check(val):
+            kh_destroy_str(table)
             raise ValueError('Must be all encoded bytes')
 
         k = kh_put_str(table, PyBytes_AsString(val), &ret)

gfyoung (Member) commented Oct 8, 2018

@zhezherun : That's a good catch! Create a PR, and we can review.

jreback added this to the 0.24.0 milestone on Oct 10, 2018
kuraga (Author) commented Oct 23, 2018

Trying a patch is cool, but I fear it is the situation described in #2659 (comment)...

gfyoung pushed a commit to zhezherun/pandas that referenced this issue Nov 19, 2018
* Move allocation of na_hashset down to avoid a leak on continue
* Delete na_hashset if there is an exception
* Clean up table before raising an exception

Closes pandas-dev/pandas#21353.
TomAugspurger pushed a commit that referenced this issue Nov 19, 2018
* Move allocation of na_hashset down to avoid a leak on continue
* Delete na_hashset if there is an exception
* Clean up table before raising an exception

Closes gh-21353.
kuraga (Author) commented Nov 19, 2018

@zhezherun, @TomAugspurger, thanks very much!

But could you please describe the connection to @nynorbert's observation:

And I think I found out that malloc_trim solves the problem, similar to this: #2659

So we had a memory leak in pandas, in addition to glibc's behavior of not trimming memory after free?

Thanks.

@TomAugspurger (Contributor)

I don't know C, so no. Perhaps @nynorbert can clarify.

Pingviinituutti pushed a commit to Pingviinituutti/pandas that referenced this issue Feb 28, 2019
* Move allocation of na_hashset down to avoid a leak on continue
* Delete na_hashset if there is an exception
* Clean up table before raising an exception

Closes pandas-dev/pandas#21353.
kuraga (Author) commented Jun 6, 2020

The glibc.malloc.mxfast tunable has been introduced in glibc (https://www.gnu.org/software/libc/manual/html_node/Memory-Allocation-Tunables.html).
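
Glibc malloc tunables are set through the GLIBC_TUNABLES environment variable before the process starts; a sketch of combining them with the repro script (the values are illustrative, not recommendations):

GLIBC_TUNABLES=glibc.malloc.mxfast=0:glibc.malloc.trim_threshold=65536 python3 g.py 5000000 20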

@wasonkartik

Hi, I am facing this issue on Google Compute Engine (Windows Server 2012 R2 Datacenter, 64-bit). How do I fix it? I have installed the latest version of pandas.

gberth commented Aug 20, 2020

Theory: when reading large files with Python (pd.read_csv, csv.reader, plain Python I/O, or mmap), the thread doing the reading seems to hold on to memory. If the same thread does a new read, the already-allocated memory is reused; if a new thread reads, it acquires additional memory. With pandas on Google Compute Engine, reading 3 files of approx. 100 MB each has required approx. 3 GB that is not released; with csv.reader approx. 300 MB, and with plain read and mmap approx. 200 MB. So multithreaded reading of the 3 files can result in extensive memory use (25 GB+). This is not my home field, but it has been a frustrating week looking for leaks. If I'm wrong, sorry for the disturbance. (Python 3.7 and 3.8.)
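
A minimal sketch of that experiment (reusing df.csv from the repro script above), comparing RSS after repeated reads in one thread versus reads in fresh threads:

import threading

import pandas as pd
import psutil

def rss_mb():
    return psutil.Process().memory_info().rss / 1024**2

def read_once():
    df = pd.read_csv('df.csv')
    del df

# Same thread twice: the second read reuses already-allocated memory.
read_once()
read_once()
print('after two reads in the main thread:', rss_mb())

# Two fresh threads: each may hold on to additional memory.
for _ in range(2):
    t = threading.Thread(target=read_once)
    t.start()
    t.join()
print('after two reads in fresh threads:', rss_mb())

One plausible explanation (an assumption here, not confirmed in this thread) is glibc's per-thread malloc arenas; the MALLOC_ARENA_MAX environment variable is a known way to limit their number.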

@bashtage (Contributor)

@gberth If you use engine="python", do you see the same pattern?

gberth commented Aug 25, 2020

Sorry, no difference. If I ensure the files are read twice in the same thread, it does not consume or hold more memory. Read them in two different threads, and both hold 2 GB+ for as long as the threads live (at least it looks like that to me).
