Memory leak? #6046

Closed

Marigold opened this issue Jan 23, 2014 · 17 comments
@Marigold

I tried to run the following code (with master)

import pandas as pd
import psutil
import gc

print('Before', psutil.phymem_usage())
df = pd.DataFrame({'a': range(int(4e7))})
print('After', psutil.phymem_usage())
del df
gc.collect()
print('After deleting', psutil.phymem_usage())

and got these results

('Before', usage(total=6132502528L, used=2558156800L, free=3574345728L, percent=26.5))
('After', usage(total=6132502528L, used=4144177152L, free=1988325376L, percent=52.9))
('After deleting', usage(total=6132502528L, used=3503177728L, free=2629324800L, percent=42.5))

Is it a memory leak or am I doing something wrong?

@jreback
Contributor

jreback commented Jan 23, 2014

Doing this in a loop shows no problem:

('Before', usage(total=33802862592L, used=30500061184L, free=3302801408L, percent=28.7))
('After', usage(total=33802862592L, used=30663581696L, free=3139280896L, percent=29.2))
('After deleting', usage(total=33802862592L, used=30631075840L, free=3171786752L, percent=29.1))
('Before', usage(total=33802862592L, used=30631075840L, free=3171786752L, percent=29.1))
('After', usage(total=33802862592L, used=30695452672L, free=3107409920L, percent=29.2))
('After deleting', usage(total=33802862592L, used=30631075840L, free=3171786752L, percent=29.1))
('Before', usage(total=33802862592L, used=30631075840L, free=3171786752L, percent=29.1))
('After', usage(total=33802862592L, used=30695198720L, free=3107663872L, percent=29.2))
('After deleting', usage(total=33802862592L, used=30631424000L, free=3171438592L, percent=29.1))
('Before', usage(total=33802862592L, used=30631424000L, free=3171438592L, percent=29.1))
('After', usage(total=33802862592L, used=30695419904L, free=3107442688L, percent=29.2))
('After deleting', usage(total=33802862592L, used=30631424000L, free=3171438592L, percent=29.1))
('Before', usage(total=33802862592L, used=30631424000L, free=3171438592L, percent=29.1))
('After', usage(total=33802862592L, used=30695641088L, free=3107221504L, percent=29.2))
('After deleting', usage(total=33802862592L, used=30631645184L, free=3171217408L, percent=29.1))
('Before', usage(total=33802862592L, used=30631645184L, free=3171217408L, percent=29.1))
('After', usage(total=33802862592L, used=30695641088L, free=3107221504L, percent=29.2))
('After deleting', usage(total=33802862592L, used=30631645184L, free=3171217408L, percent=29.1))
('Before', usage(total=33802862592L, used=30631645184L, free=3171217408L, percent=29.1))
('After', usage(total=33802862592L, used=30695768064L, free=3107094528L, percent=29.2))
('After deleting', usage(total=33802862592L, used=30631391232L, free=3171471360L, percent=29.1))
('Before', usage(total=33802862592L, used=30631391232L, free=3171471360L, percent=29.1))
('After', usage(total=33802862592L, used=30695768064L, free=3107094528L, percent=29.2))
('After deleting', usage(total=33802862592L, used=30631518208L, free=3171344384L, percent=29.1))
('Before', usage(total=33802862592L, used=30631518208L, free=3171344384L, percent=29.1))
('After', usage(total=33802862592L, used=30695641088L, free=3107221504L, percent=29.2))
('After deleting', usage(total=33802862592L, used=30631645184L, free=3171217408L, percent=29.1))
('Before', usage(total=33802862592L, used=30631645184L, free=3171217408L, percent=29.1))
('After', usage(total=33802862592L, used=30695768064L, free=3107094528L, percent=29.2))
('After deleting', usage(total=33802862592L, used=30631772160L, free=3171090432L, percent=29.1))

@jreback
Contributor

jreback commented Jan 23, 2014

Python 'holds' onto memory even after objects are deleted; it reuses it for the next allocation.

A leak would show the usage steadily increasing.
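
For reference, a check loop along the lines of the output above might look like this (a sketch only; the exact loop isn't shown in the thread, and psutil.phymem_usage() is the older psutil API used elsewhere here - newer psutil versions use virtual_memory() instead):

import gc
import pandas as pd
import psutil

# Repeat the allocate / delete / collect cycle and watch whether 'used'
# keeps climbing; a true leak would show steady growth across iterations.
for _ in range(10):
    print('Before', psutil.phymem_usage())
    df = pd.DataFrame({'a': range(int(4e7))})
    print('After', psutil.phymem_usage())
    del df
    gc.collect()
    print('After deleting', psutil.phymem_usage())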

@Marigold
Author

Cool. Is it possible to release the memory somehow? I have huge memory problems when reading HDF files. When I am done reading a file, the final dataframe takes only about 20% of the memory that was used during reading.

@ghost

ghost commented Jan 23, 2014

Possibly #2659 has something for you, or the Python gc module.

Closing as not a bug.

@ghost ghost closed this as completed Jan 23, 2014
@jreback
Contributor

jreback commented Jan 23, 2014

Back to the OS; I don't think there is any way to do this (except by exiting the process). If you are processing HDF, use the chunk iterator if possible; that way memory won't increase too much. I process HDF in this way: I run a process to do a computation (and create an output / new HDF file), then exit the process. (I actually multi-process this, as the computations and output files are independent.)

see this recent question (the bottom of my answer) for a nice pattern: http://stackoverflow.com/questions/21295329/fastest-way-to-copy-columns-from-one-dataframe-to-another-using-pandas/21296133?noredirect=1#comment32114620_21296133
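
A minimal sketch of that chunk-iterator pattern (the file name data.h5, the key 'df', and the per-chunk filter on column 'a' are hypothetical; the store must be in table format for chunked selects):

import pandas as pd

# Iterate over the store in chunks so only a small slice is in memory at a
# time, keeping just the (small) per-chunk results and concatenating at the end.
pieces = []
with pd.HDFStore('data.h5') as store:
    for chunk in store.select('df', chunksize=500000):
        pieces.append(chunk[chunk['a'] > 0])

result = pd.concat(pieces, ignore_index=True)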

@jreback
Contributor

jreback commented Jan 23, 2014

@Marigold if you post your code for what you are doing I can take a look

@Marigold
Author

I am trying to merge HDF files on disk (a very painful experience). I ended up using smaller chunks too; for now it seems quite OK.

@jreback
Contributor

jreback commented Jan 23, 2014

@Marigold ok...lmk...as I said I do this a lot; it shouldn't be painful :)

You are using HDFStore, yes?

@Marigold
Author

I saw your post on SO, a very nice solution. However, I need to do an outer join, and I don't know if it can be done in a similar way. At the moment I have three types of joins: the first sorts the data first (sorting takes ages) and then iterates over both of them, the second merges the indices first and selects the data with where=Index (very slow too), and the third finds the min and max values of the index and then selects with where=[index] (which unfortunately cannot do an outer join).

@jreback
Contributor

jreback commented Jan 23, 2014

Sounds like an interesting problem... can you put up an in-memory version of what you are doing (e.g. an example which is all memory based)? I can think about it.

@Marigold
Author

Here is a slightly more complex example:

import pandas as pd

left = pd.DataFrame({'a': [0] * 5, 'b': [1] * 5})
right = pd.DataFrame({'a': range(5), 'c': ['a' * i for i in range(5)]})

left.merge(right, on='a', how='outer')

It shows all the possible problems: the outer join, NaN values for int columns (the column was int, but now it has to be float because of the NaNs) and min_itemsize for strings (although this can be found easily in the metadata).

I was thinking about how to do the outer join with the example you provided, and it seems pretty intuitive. Once you have the inner join, just go over both dataframes, look for index values not in the inner join, and append them to the inner join afterwards. Unfortunately, it takes three nested iterations over the dataframes (as in your example).

@jreback
Contributor

jreback commented Jan 23, 2014

you might be able to do this by just selecting the index values of the table (use store.select_column, see here: http://pandas.pydata.org/pandas-docs/dev/io.html#advanced-queries),

which gives you basically a frame with the index values AND an integer index (which is in fact the coordinates of those index values) - call these the coordinates.

Then you can do your joins in memory (keeping around those coordinates), then select the final result using those coordinates. That way you don't actually bring in any data until you need it.
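
Roughly, the pattern might look like this (a sketch only: the file names left.h5 / right.h5, the key 'df', and the inner join are assumptions - an outer join would also keep the non-matching coordinates):

import numpy as np
import pandas as pd

with pd.HDFStore('left.h5') as ls, pd.HDFStore('right.h5') as rs:
    # select_column pulls only the stored index; the Series' own integer
    # index gives the row coordinates of each value.
    left_index = ls.select_column('df', 'index')
    right_index = rs.select_column('df', 'index')

    # Do the join in memory on the index values alone, keeping the
    # coordinates of the matching rows.
    common = np.intersect1d(left_index.values, right_index.values)
    left_coords = left_index[left_index.isin(common)].index
    right_coords = right_index[right_index.isin(common)].index

    # Only now bring in the actual data, selecting rows by their coordinates.
    left = ls.select('df', where=left_coords)
    right = rs.select('df', where=right_coords)

result = left.merge(right, left_index=True, right_index=True,
                    suffixes=('_left', '_right'))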

@Marigold
Author

That's what I do now. It is the fastest method so far (still, the selection by index is very slow), slightly faster than looping over the two dataframes as in your example. I don't think it can get any better, so I would close this discussion for now until something "new" appears. Thanks a lot for the help!

@Marigold
Author

Back to the original question, now in relation to HDF itself. I have the following script:

import pandas as pd
import psutil
import gc

def create_hdf():
    d = pd.DataFrame({'a': range(int(2e7)), 'b': range(int(2e7))})
    d.to_hdf('test.h5', 'df', format='t')

def load_hdf(columns):
    before = psutil.phymem_usage()
    print('Before', before)
    d = pd.read_hdf('test.h5', 'df', columns=columns)
    gc.collect()
    print('MB used by dataframe itself: {:.2f}'.format(float(d.values.nbytes) / 2**20))
    after = psutil.phymem_usage()
    print('After', after)
    print('Memory change in MB {:.2f}'.format((after.used - before.used) / float(2**20)))

And here are the results for different values of columns:

columns = None

('Before', usage(total=6132502528L, used=3364777984L, free=2767724544L, percent=22.6))
MB used by dataframe itself: 305.18
('After', usage(total=6132502528L, used=4186906624L, free=1945595904L, percent=36.0))
Memory change in MB 784.04

columns = []

('Before', usage(total=6132502528L, used=3391778816L, free=2740723712L, percent=22.8))
MB used by dataframe itself: 0.00
('After', usage(total=6132502528L, used=3893465088L, free=2239037440L, percent=31.0))
Memory change in MB 478.45

columns = ['a']

('Before', usage(total=6132502528L, used=3393855488L, free=2738647040L, percent=22.8))
MB used by dataframe itself: 152.59
('After', usage(total=6132502528L, used=4056539136L, free=2075963392L, percent=33.6))
Memory change in MB 631.98

I can limit this "memory leak" to some extent by using chunksize and then concatenating the results. Anyway, should this be the default behavior?

@jreback
Contributor

jreback commented Jan 24, 2014

setting columns only causes a reindex

HDF is row oriented, so it will bring in ALL the columns no matter what you ask for, and just reindex to give you back what you want

if you want to limit peak memory, definitely use an iterator and concatenate

generally I try to work on smaller parts of my stores at once

if I need the entire thing, you can chunk by iterator or by looping over another axis of the data and selecting (e.g. you can select the unique values for a particular field then loop over those - see the sketch below)

if you do heavily column oriented stuff you really need a column store

see this / #4454 - want to contribute on this?
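
A sketch of that "loop over another axis" idea (assuming test.h5 holds a table-format frame under 'df' with 'b' written as a data column; all names here are placeholders):

import pandas as pd

with pd.HDFStore('test.h5') as store:
    parts = []
    # Loop over the unique values of an indexed field and select each slice
    # separately, so only one slice is in memory at a time.
    for value in store.select_column('df', 'b').unique():
        parts.append(store.select('df', where='b == {}'.format(value)))

result = pd.concat(parts)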

@Marigold
Author

Thanks for the explanation. I'll definitely look at it and see if I can contribute something.

This issue was closed.