
Cyclic GC issues #2659

Closed
wesm opened this issue Jan 8, 2013 · 20 comments


wesm commented Jan 8, 2013

A mystery to be debugged soon:

import pandas as pd
import numpy as np

arr = np.random.randn(100000, 5)

def leak():
    for i in xrange(10000):
        df = pd.DataFrame(arr.copy())
        result = df.xs(1000)
        # result = df.ix[5000]

if __name__ == '__main__':
    leak()

wesm commented Jan 8, 2013

Ok, this is, in a word, f*cked up. If I add gc.collect() to that for loop, it stops leaking memory:

import pandas as pd
import numpy as np
import gc

arr = np.random.randn(100000, 5)

def leak():
    pd.util.testing.set_trace()
    for i in xrange(10000):
        df = pd.DataFrame(arr.copy())
        result = df.xs(1000)
        gc.collect()
        # result = df.ix[5000]

if __name__ == '__main__':
    leak()

There are objects here that only get garbage collected when the cyclic GC runs. What's the solution here, break the cycles explicitly in __del__ so the Python memory allocator stops screwing us?
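One way to confirm that objects are only reclaimed by the cyclic collector is to watch the return value of gc.collect() inside the loop; it reports how many unreachable objects the collector found on that pass. A rough sketch along the lines of the reproduction above (not from the original comment):

# Diagnostic sketch: a consistently non-zero gc.collect() count inside the
# loop indicates that reference cycles are being created each iteration.
import gc

import numpy as np
import pandas as pd

arr = np.random.randn(100000, 5)

for i in range(100):
    df = pd.DataFrame(arr.copy())
    result = df.xs(1000)
    del df, result
    unreachable = gc.collect()
    if unreachable:
        print("iteration %d: cyclic GC found %d unreachable objects" % (i, unreachable))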


cournape commented Jan 8, 2013

Can you try this:

from ctypes import cdll, CDLL

import pandas as pd
import numpy as np

arr = np.random.randn(100000, 5)

cdll.LoadLibrary("libc.so.6")
libc = CDLL("libc.so.6")

def leak():
    for i in xrange(10000):
        libc.malloc_trim(0)
        df = pd.DataFrame(arr.copy())
        result = df.xs(1000)
        # result = df.ix[5000]

if __name__ == '__main__':
    leak()

I suspect this has nothing to do with python, but that would confirm it.


wesm commented Jan 8, 2013

Yeah, that seemed to do the trick. Memory usage was 450MB after running that in IPython; then malloc_trim freed 400MB. Very pernicious.


ghost commented Mar 18, 2013

Following the malloc_trim lead upstream, this looks like a glibc optimization gone awry.
xref:
http://sourceware.org/bugzilla/show_bug.cgi?id=14827

see "fastbins" comment.

In [1]: from ctypes import Structure, c_int, cdll, CDLL
   ...: class MallInfo(Structure):
   ...:     _fields_ = [
   ...:         ('arena',    c_int),  # Non-mmapped space allocated (bytes)
   ...:         ('ordblks',  c_int),  # Number of free chunks
   ...:         ('smblks',   c_int),  # Number of free fastbin blocks
   ...:         ('hblks',    c_int),  # Number of mmapped regions
   ...:         ('hblkhd',   c_int),  # Space allocated in mmapped regions (bytes)
   ...:         ('usmblks',  c_int),  # Maximum total allocated space (bytes)
   ...:         ('fsmblks',  c_int),  # Space in freed fastbin blocks (bytes)
   ...:         ('uordblks', c_int),  # Total allocated space (bytes)
   ...:         ('fordblks', c_int),  # Total free space (bytes)
   ...:         ('keepcost', c_int),  # Top-most, releasable space (bytes)
   ...:     ]
   ...:     def __repr__(self):
   ...:         return "\n".join(["%s:%d" % (k, getattr(self, k)) for k, v in self._fields_])
   ...:
   ...: cdll.LoadLibrary("libc.so.6")
   ...: libc = CDLL("libc.so.6")
   ...: mallinfo = libc.mallinfo
   ...: mallinfo.restype = MallInfo
   ...: libc.malloc_trim(0)
   ...: mallinfo().fsmblks
Out[1]: 0

In [2]: import numpy as np
   ...: import pandas as pd
   ...: arr = np.random.randn(100000, 5)
   ...: def leak():
   ...:     for i in xrange(10000):
   ...:         df = pd.DataFrame(arr.copy())
   ...:         result = df.xs(1000)
   ...: leak()
   ...: mallinfo().fsmblks
Out[2]: 128

In [3]: libc.malloc_trim(0)
   ...: mallinfo().fsmblks
Out[3]: 0


wesm commented Mar 28, 2013

Won't fix, then. Maybe we should add some helper functions to pandas someday to do the malloc trimming.
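A rough sketch of what such a helper could look like, assuming a glibc-based system where malloc_trim is available (this is not an existing pandas API, and the function name is made up for illustration):

# Hypothetical helper, not part of pandas: ask glibc to return freed arena
# memory to the OS. Silently does nothing where malloc_trim is unavailable
# (musl, Windows, macOS).
import ctypes
import ctypes.util

def trim_memory() -> int:
    libc_path = ctypes.util.find_library("c")
    if libc_path is None:
        return 0
    libc = ctypes.CDLL(libc_path)
    if not hasattr(libc, "malloc_trim"):
        return 0
    return libc.malloc_trim(0)  # glibc returns 1 if memory was actually released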


kuraga commented Jun 15, 2018

Entry in FAQ, maybe?


alanjds commented Aug 22, 2018

For the record, we (+@sbneto) have been using this in production for quite a while, and it is working very well:

# monkeypatches.py

# Solving memory leak problem in pandas
# https://github.com/pandas-dev/pandas/issues/2659#issuecomment-12021083
import sys
from ctypes import cdll, CDLL

import pandas as pd
try:
    cdll.LoadLibrary("libc.so.6")
    libc = CDLL("libc.so.6")
    libc.malloc_trim(0)
except (OSError, AttributeError):
    libc = None

__old_del = getattr(pd.DataFrame, '__del__', None)

def __new_del(self):
    if __old_del:
        __old_del(self)
    libc.malloc_trim(0)

if libc:
    print('Applying monkeypatch for pd.DataFrame.__del__', file=sys.stderr)
    pd.DataFrame.__del__ = __new_del
else:
    print('Skipping monkeypatch for pd.DataFrame.__del__: libc or malloc_trim() not found', file=sys.stderr)
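Assuming the file above is saved as monkeypatches.py, importing it once near the top of the program's entry point is enough to apply the patch; a hypothetical usage example:

# app.py (hypothetical entry point): the import applies the patch as a side effect
import monkeypatches  # noqa: F401
import pandas as pd

df = pd.DataFrame({"a": range(1000)})
del df  # when the frame is collected, __del__ now also calls malloc_trim(0)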


kuraga commented Oct 8, 2018

@alanjds thanks very much!

But there are other affected operations :-(

It's VERY strange that the issue above (the glibc issue) hasn't had any reaction. It affects EVERY Linux PC and server out there. And... nothing!!!

I know, you'll tell me: OK, write a patch! I'll do it (UPD: though that would be odd, since I know nothing about the glibc code). But nobody else seems to know about it either.

Everybody says: KDE leaks. Who knows why?! Nobody!

Open source? For shame! Sorry, but it's true in this situation.

P.S. http://sourceware.org/bugzilla/show_bug.cgi?id=14827


alanjds commented Oct 8, 2018

I do believe in you. Two years and no movement on their side :/

I say we fix it on our side and leave a huge comment assigning blame, because forking glibc looks unfeasible.

@tchristensenowlet

@alanjds Your code fixed a problem for me that was causing a major headache. Would you be willing to explain what the default pandas behavior is and how your code fixes it?


xhochy commented Jan 24, 2019

You can also work around this issue by switching to jemalloc as your default allocator. Instead of python script.py, run LD_PRELOAD=/usr/lib/libjemalloc.so python script.py. Note that the path to libjemalloc.so may be different on your system and that you first need to install it with your package manager.


sbneto commented Jan 24, 2019

@tchristensenowlet The problem seems to be in the malloc code of glibc. Apparently, the free implementation there does not respect a flag that should issue malloc_trim after a certain threshold, as you can see in @ghost's link. Therefore, malloc_trim is never called and memory leaks. What we did was just to manually call malloc_trim if the lib is available in the system. We call it in the __del__() method, that is executed when the object is garbage collected.


kuraga commented Jun 6, 2020

The glibc.malloc.mxfast tunable has been introduced in glibc (https://www.gnu.org/software/libc/manual/html_node/Memory-Allocation-Tunables.html).
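The tunable is set in the environment before the process starts (e.g. GLIBC_TUNABLES=glibc.malloc.mxfast=0); the same fastbin threshold has long been adjustable at runtime via mallopt. A hedged ctypes sketch, assuming glibc, where M_MXFAST is defined as 1 in malloc.h:

# Sketch only: disable glibc fastbins for the current process by setting
# M_MXFAST to 0 via mallopt. Has no effect (and may not resolve) on
# non-glibc allocators.
from ctypes import CDLL
from ctypes.util import find_library

M_MXFAST = 1  # constant from glibc's malloc.h (assumption: glibc)
libc_name = find_library("c")
if libc_name:
    libc = CDLL(libc_name)
    if hasattr(libc, "mallopt"):
        libc.mallopt(M_MXFAST, 0)  # 0 disables fastbins; returns 1 on success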


heetbeet commented Aug 18, 2020

I think this might be the culprit in one of our projects, but our users are running Windows with default Python 3.8 (from the official website) and with all dependencies installed via pip. Would this problem also occur on Windows? If so, what would be the equivalent of cdll.LoadLibrary("libc.so.6")?

Edit: I ran the tests described here, and the garbage collector did its job properly every time:
#21353
System: Windows 10
Python: 3.8.5
Pandas: 1.1.0

@bhargav-kansagara

Hi, I am also facing the same issue: libc.malloc_trim(0) frees up the memory in my local WSL2 setup, but the same does not seem to work in a Docker image built on top of python:3.8.8-slim-buster. Do I need to install any package separately for malloc_trim to work properly? Any pointers would be appreciated. Thank you!!


anonymouse-jj commented Nov 23, 2021

@bhargav-kansagara I'm having the same issue as you with a buster VM. I can run libc.malloc_trim(0) and it returns 1 (success), but no luck releasing memory. Did you find any solution?


oalfred commented Jan 31, 2022

I am also interested in whether something extra needs to be installed or done in order to make the malloc_trim(0) fix work in a python debian buster docker image.


xhochy commented Jan 31, 2022

> I am also interested in whether something extra needs to be installed or done in order to make the malloc_trim(0) fix work in a python debian buster docker image.

No, this should work out-of-the-box with Python itself.
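One thing worth probing inside the container is whether the C library actually exposes malloc_trim: Debian-based images such as python:3.8.8-slim-buster use glibc and should have it, while musl-based images (Alpine) do not. A small check, as a sketch:

# Quick probe: does this environment's C library expose malloc_trim?
from ctypes import CDLL
from ctypes.util import find_library

name = find_library("c") or "libc.so.6"
try:
    libc = CDLL(name)
    print("malloc_trim available:", hasattr(libc, "malloc_trim"))
except OSError as exc:
    print("could not load C library:", exc)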


isik-kaplan commented May 21, 2022

I've been trying to understand why my memory usage wasn't flat after using the chunksize argument. I'm using CPython, so I never thought about invoking the garbage collector manually, since CPython "guarantees" (AFAIK) deallocation as soon as an object's reference count hits zero. I spent 2 days on this; after finding this issue, a couple of simple gc.collect() calls seem to have fixed my problem (most of it, at least) right away.
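A rough sketch of the pattern described above ("big_file.csv" is a hypothetical path): read a large CSV in chunks and run the cyclic collector after each chunk is processed.

import gc

import pandas as pd

total_rows = 0
for chunk in pd.read_csv("big_file.csv", chunksize=1_000_000):
    total_rows += len(chunk)  # per-chunk work goes here
    del chunk
    gc.collect()  # force the cyclic collector between chunks
print(total_rows)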

I haven't seen anything related to this in the documentation (ignore this if there is a warning already), but I think the documentation should have a BIG warning about it.

@wesm @alanjds Thank you both!


I506dk commented Sep 26, 2022

For anyone else who sees this: the issue still exists as of 9/2022. I was reading in pieces of a CSV file (the file is 21 GB and contains a little over 600 million rows), and even using chunks helps but only delays the problem. Following each chunk read with del chunk and gc.collect() does help and will likely work for most people; however, due to the size of the dataset, it just took longer for the program to crash. As a solution, I moved to a dask.dataframe (at minimum just for the read_csv). This alleviated my issue and lets you break a dataframe into "partitions" to fit your memory constraints.

import dask.dataframe as dd

# Read csv file. Will look at all of it and break it into partitions. Does not read every partition into memory.
# I only had 2 columns, and I specified the dtypes as well.
Hash_Frame = dd.read_csv(Full_Path, sep=':', blocksize=Split_Limit, header=None, dtype={0:"string", 1:"int64"})
      
# Calculate the total number of partitions
total_partitions = int(Hash_Frame.npartitions)

i = 0
while i < total_partitions:
    # Get the current partition (which is just a dataframe)
    Current_Frame = Hash_Frame.partitions[i]
                
    # Convert dask dataframe to pandas dataframe (if need be)
    Current_Frame = Current_Frame.compute()

    # Do whatever else here
    i += 1
