
Cyclic GC issues #2659

Closed
wesm opened this issue Jan 8, 2013 · 20 comments


wesm commented Jan 8, 2013

A mystery to be debugged soon:

import pandas as pd
import numpy as np

arr = np.random.randn(100000, 5)

def leak():
    for i in xrange(10000):
        df = pd.DataFrame(arr.copy())
        result = df.xs(1000)
        # result = df.ix[5000]

if __name__ == '__main__':
    leak()

wesm commented Jan 8, 2013

Ok, this is, in a word, f*cked up. If I add gc.collect() to that for loop, it stops leaking memory:

import pandas as pd
import numpy as np
import gc

arr = np.random.randn(100000, 5)

def leak():
    pd.util.testing.set_trace()
    for i in xrange(10000):
        df = pd.DataFrame(arr.copy())
        result = df.xs(1000)
        gc.collect()
        # result = df.ix[5000]

if __name__ == '__main__':
    leak()

There are objects here that only get garbage collected when the cyclic GC runs. What's the solution here, break the cycles explicitly in __del__ so the Python memory allocator stops screwing us?
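One way to confirm that objects are only reclaimed by the cyclic collector is to watch the return value of gc.collect() inside the loop; it reports how many unreachable objects the collector found on that pass. A rough sketch along the lines of the reproduction above (not from the original comment):

# Diagnostic sketch: a consistently non-zero gc.collect() count inside the
# loop indicates that reference cycles are being created each iteration.
import gc

import numpy as np
import pandas as pd

arr = np.random.randn(100000, 5)

for i in range(100):
    df = pd.DataFrame(arr.copy())
    result = df.xs(1000)
    del df, result
    unreachable = gc.collect()
    if unreachable:
        print("iteration %d: cyclic GC found %d unreachable objects" % (i, unreachable))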


cournape commented Jan 8, 2013

Can you try this:

from ctypes import cdll, CDLL

import pandas as pd
import numpy as np

arr = np.random.randn(100000, 5)

cdll.LoadLibrary("libc.so.6")
libc = CDLL("libc.so.6")

def leak():
    for i in xrange(10000):
        libc.malloc_trim(0)
        df = pd.DataFrame(arr.copy())
        result = df.xs(1000)
        # result = df.ix[5000]

if __name__ == '__main__':
    leak()

I suspect this has nothing to do with python, but that would confirm it.


wesm commented Jan 8, 2013

Yeah, that seemed to do the trick. Memory usage was 450MB after running that in IPython; then malloc_trim freed 400MB. Very pernicious.


ghost commented Mar 18, 2013

Following the malloc_trim lead upstream, this looks like a glibc optimization gone awry.
xref:
http://sourceware.org/bugzilla/show_bug.cgi?id=14827

see "fastbins" comment.

In [1]: from ctypes import Structure, c_int, cdll, CDLL
   ...: class MallInfo(Structure):
   ...:     _fields_ = [
   ...:         ('arena',    c_int),  # Non-mmapped space allocated (bytes)
   ...:         ('ordblks',  c_int),  # Number of free chunks
   ...:         ('smblks',   c_int),  # Number of free fastbin blocks
   ...:         ('hblks',    c_int),  # Number of mmapped regions
   ...:         ('hblkhd',   c_int),  # Space allocated in mmapped regions (bytes)
   ...:         ('usmblks',  c_int),  # Maximum total allocated space (bytes)
   ...:         ('fsmblks',  c_int),  # Space in freed fastbin blocks (bytes)
   ...:         ('uordblks', c_int),  # Total allocated space (bytes)
   ...:         ('fordblks', c_int),  # Total free space (bytes)
   ...:         ('keepcost', c_int),  # Top-most, releasable space (bytes)
   ...:     ]
   ...:     def __repr__(self):
   ...:         return "\n".join(["%s:%d" % (k, getattr(self, k)) for k, v in self._fields_])
   ...:
   ...: cdll.LoadLibrary("libc.so.6")
   ...: libc = CDLL("libc.so.6")
   ...: mallinfo = libc.mallinfo
   ...: mallinfo.restype = MallInfo
   ...: libc.malloc_trim(0)
   ...: mallinfo().fsmblks
Out[1]: 0

In [2]: import numpy as np
   ...: import pandas as pd
   ...: arr = np.random.randn(100000, 5)
   ...: def leak():
   ...:     for i in xrange(10000):
   ...:         df = pd.DataFrame(arr.copy())
   ...:         result = df.xs(1000)
   ...: leak()
   ...: mallinfo().fsmblks
Out[2]: 128

In [3]: libc.malloc_trim(0)
   ...: mallinfo().fsmblks
Out[3]: 0


wesm commented Mar 28, 2013

Won't fix, then. Maybe we should add some helper functions to pandas someday to do the malloc trimming.
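A rough sketch of what such a helper could look like, assuming a glibc-based system where malloc_trim is available (this is not an existing pandas API, and the function name is made up for illustration):

# Hypothetical helper, not part of pandas: ask glibc to return freed arena
# memory to the OS. Silently does nothing where malloc_trim is unavailable
# (musl, Windows, macOS).
import ctypes
import ctypes.util

def trim_memory() -> int:
    libc_path = ctypes.util.find_library("c")
    if libc_path is None:
        return 0
    libc = ctypes.CDLL(libc_path)
    if not hasattr(libc, "malloc_trim"):
        return 0
    return libc.malloc_trim(0)  # glibc returns 1 if memory was actually released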


kuraga commented Jun 15, 2018

Entry in FAQ, maybe?


alanjds commented Aug 22, 2018

For the record, we (+@sbneto) have been using this in production for quite a while, and it is working very well:

# monkeypatches.py

# Solving memory leak problem in pandas
# https://github.com/pandas-dev/pandas/issues/2659#issuecomment-12021083
import sys
from ctypes import cdll, CDLL

import pandas as pd
try:
    cdll.LoadLibrary("libc.so.6")
    libc = CDLL("libc.so.6")
    libc.malloc_trim(0)
except (OSError, AttributeError):
    libc = None

__old_del = getattr(pd.DataFrame, '__del__', None)

def __new_del(self):
    if __old_del:
        __old_del(self)
    libc.malloc_trim(0)

if libc:
    print('Applying monkeypatch for pd.DataFrame.__del__', file=sys.stderr)
    pd.DataFrame.__del__ = __new_del
else:
    print('Skipping monkeypatch for pd.DataFrame.__del__: libc or malloc_trim() not found', file=sys.stderr)
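Assuming the file above is saved as monkeypatches.py, importing it once near the top of the program's entry point is enough to apply the patch; a hypothetical usage example:

# app.py (hypothetical entry point): the import applies the patch as a side effect
import monkeypatches  # noqa: F401
import pandas as pd

df = pd.DataFrame({"a": range(1000)})
del df  # when the frame is collected, __del__ now also calls malloc_trim(0)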


kuraga commented Oct 8, 2018

@alanjds thanks very much!

But there are other affected operations :-(

It's VERY strange that the issue above (the glibc issue) hasn't had any reaction. It affects EVERY Linux PC and server out there. And... nothing!!!

I know, you'll tell me: OK, write a patch! I'll do it (UPD: though that would be odd, since I know nothing about the glibc code). But nobody else seems to know about it either.

Everybody says: KDE leaks. Who knows why?! Nobody!

Open source? For shame! Sorry, but it's true in this situation.

P.S. http://sourceware.org/bugzilla/show_bug.cgi?id=14827


alanjds commented Oct 8, 2018

I do believe in you. Two years and no movement on their side :/

I say we fix it on our side and leave a huge comment assigning blame, because forking glibc looks unfeasible.

@tchristensenowlet

@alanjds Your code fixed a problem for me that was causing a major headache. Would you be willing to explain what the default pandas behavior is and how your code fixes it?


xhochy commented Jan 24, 2019

You can also work around this issue by switching to jemalloc as your default allocator. Instead of python script.py, run LD_PRELOAD=/usr/lib/libjemalloc.so python script.py. Note that the path to libjemalloc.so may be different on your system and that you first need to install it with your package manager.


sbneto commented Jan 24, 2019

@tchristensenowlet The problem seems to be in the malloc code of glibc. Apparently, the free implementation there does not respect a flag that should issue malloc_trim after a certain threshold, as you can see in @ghost's link. Therefore, malloc_trim is never called and memory leaks. What we did was just to manually call malloc_trim if the lib is available in the system. We call it in the __del__() method, that is executed when the object is garbage collected.


kuraga commented Jun 6, 2020

The glibc.malloc.mxfast tunable has been introduced in glibc (https://www.gnu.org/software/libc/manual/html_node/Memory-Allocation-Tunables.html).
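The tunable is set in the environment before the process starts (e.g. GLIBC_TUNABLES=glibc.malloc.mxfast=0); the same fastbin threshold has long been adjustable at runtime via mallopt. A hedged ctypes sketch, assuming glibc, where M_MXFAST is defined as 1 in malloc.h:

# Sketch only: disable glibc fastbins for the current process by setting
# M_MXFAST to 0 via mallopt. Has no effect (and may not resolve) on
# non-glibc allocators.
from ctypes import CDLL
from ctypes.util import find_library

M_MXFAST = 1  # constant from glibc's malloc.h (assumption: glibc)
libc_name = find_library("c")
if libc_name:
    libc = CDLL(libc_name)
    if hasattr(libc, "mallopt"):
        libc.mallopt(M_MXFAST, 0)  # 0 disables fastbins; returns 1 on success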


heetbeet commented Aug 18, 2020

I think this might be the culprit in one of our projects, but our users are running Windows with default Python 3.8 (from the official website) and with all dependencies installed via pip. Would this problem also occur on Windows? If so, what would be the equivalent of cdll.LoadLibrary("libc.so.6")?

Edit: I ran the tests described here, and the garbage collector did its job properly every time:
#21353
System: Windows 10
Python: 3.8.5
Pandas: 1.1.0

@bhargav-kansagara

Hi, I am also facing the same issue: libc.malloc_trim(0) frees up the memory in my local WSL2 setup, but the same does not seem to work in a Docker image built on top of python:3.8.8-slim-buster. Do I need to install any package separately for malloc_trim to work properly? Any pointers would be appreciated. Thank you!!


anonymouse-jj commented Nov 23, 2021

@bhargav-kansagara I'm having the same issue as you with a buster VM. I can run libc.malloc_trim(0) and it returns 1 (success), but no luck releasing memory. Did you find any solution?


oalfred commented Jan 31, 2022

I am also interested in whether something extra needs to be installed or done in order to make the malloc_trim(0) fix work in a python debian buster docker image.


xhochy commented Jan 31, 2022

> I am also interested in whether something extra needs to be installed or done in order to make the malloc_trim(0) fix work in a python debian buster docker image.

No, this should work out-of-the-box with Python itself.
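One thing worth probing inside the container is whether the C library actually exposes malloc_trim: Debian-based images such as python:3.8.8-slim-buster use glibc and should have it, while musl-based images (Alpine) do not. A small check, as a sketch:

# Quick probe: does this environment's C library expose malloc_trim?
from ctypes import CDLL
from ctypes.util import find_library

name = find_library("c") or "libc.so.6"
try:
    libc = CDLL(name)
    print("malloc_trim available:", hasattr(libc, "malloc_trim"))
except OSError as exc:
    print("could not load C library:", exc)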


isik-kaplan commented May 21, 2022

I've been trying to understand why my memory usage wasn't flat after using the chunksize argument. I'm using CPython, so I never thought about invoking the garbage collector manually, since CPython "guarantees" (AFAIK) deallocation as soon as an object's reference count hits zero. I spent 2 days on this; after finding this issue, a couple of simple gc.collect() calls seem to have fixed my problem (most of it, at least) right away.
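A rough sketch of the pattern described above ("big_file.csv" is a hypothetical path): read a large CSV in chunks and run the cyclic collector after each chunk is processed.

import gc

import pandas as pd

total_rows = 0
for chunk in pd.read_csv("big_file.csv", chunksize=1_000_000):
    total_rows += len(chunk)  # per-chunk work goes here
    del chunk
    gc.collect()  # force the cyclic collector between chunks
print(total_rows)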

I haven't seen anything related to this in the documentation (ignore this if there is a warning already), but I think the documentation should have a BIG warning about it.

@wesm @alanjds Thank you both!


I506dk commented Sep 26, 2022

For anyone else who sees this: the issue still exists as of 9/2022. I was reading in pieces of a CSV file (the file is 21 GB and contains a little over 600 million rows), and even using chunks helps but only delays the problem. Following each chunk read with del chunk and gc.collect() does help and will likely work for most people; however, due to the size of the dataset, it just took longer for the program to crash. As a solution, I moved to a dask.dataframe (at minimum just for the read_csv). This alleviated my issue and lets you break a dataframe into "partitions" to fit your memory constraints.

import dask.dataframe as dd

# Read csv file. Will look at all of it and break it into partitions. Does not read every partition into memory.
# I only had 2 columns, and I specified the dtypes as well.
Hash_Frame = dd.read_csv(Full_Path, sep=':', blocksize=Split_Limit, header=None, dtype={0:"string", 1:"int64"})
      
# Calculate the total number of partitions
total_partitions = int(Hash_Frame.npartitions)

i = 0
while i < total_partitions:
    # Get the current partition (which is just a dataframe)
    Current_Frame = Hash_Frame.partitions[i]
                
    # Convert dask dataframe to pandas dataframe (if need be)
    Current_Frame = Current_Frame.compute()

    # Do whatever else here
    i += 1
