Memory leak? #6046

Closed

Marigold opened this issue Jan 23, 2014 · 17 comments
@Marigold

I tried to run the following code (with master)

import pandas as pd
import psutil
import gc

print('Before', psutil.phymem_usage())
df = pd.DataFrame({'a': range(int(4e7))})
print('After', psutil.phymem_usage())
del df
gc.collect()
print('After deleting', psutil.phymem_usage())

and got these results

('Before', usage(total=6132502528L, used=2558156800L, free=3574345728L, percent=26.5))
('After', usage(total=6132502528L, used=4144177152L, free=1988325376L, percent=52.9))
('After deleting', usage(total=6132502528L, used=3503177728L, free=2629324800L, percent=42.5))

Is it a memory leak or am I doing something wrong?

@jreback
Contributor

jreback commented Jan 23, 2014

Doing this in a loop shows no problem:

('Before', usage(total=33802862592L, used=30500061184L, free=3302801408L, percent=28.7))
('After', usage(total=33802862592L, used=30663581696L, free=3139280896L, percent=29.2))
('After deleting', usage(total=33802862592L, used=30631075840L, free=3171786752L, percent=29.1))
('Before', usage(total=33802862592L, used=30631075840L, free=3171786752L, percent=29.1))
('After', usage(total=33802862592L, used=30695452672L, free=3107409920L, percent=29.2))
('After deleting', usage(total=33802862592L, used=30631075840L, free=3171786752L, percent=29.1))
('Before', usage(total=33802862592L, used=30631075840L, free=3171786752L, percent=29.1))
('After', usage(total=33802862592L, used=30695198720L, free=3107663872L, percent=29.2))
('After deleting', usage(total=33802862592L, used=30631424000L, free=3171438592L, percent=29.1))
('Before', usage(total=33802862592L, used=30631424000L, free=3171438592L, percent=29.1))
('After', usage(total=33802862592L, used=30695419904L, free=3107442688L, percent=29.2))
('After deleting', usage(total=33802862592L, used=30631424000L, free=3171438592L, percent=29.1))
('Before', usage(total=33802862592L, used=30631424000L, free=3171438592L, percent=29.1))
('After', usage(total=33802862592L, used=30695641088L, free=3107221504L, percent=29.2))
('After deleting', usage(total=33802862592L, used=30631645184L, free=3171217408L, percent=29.1))
('Before', usage(total=33802862592L, used=30631645184L, free=3171217408L, percent=29.1))
('After', usage(total=33802862592L, used=30695641088L, free=3107221504L, percent=29.2))
('After deleting', usage(total=33802862592L, used=30631645184L, free=3171217408L, percent=29.1))
('Before', usage(total=33802862592L, used=30631645184L, free=3171217408L, percent=29.1))
('After', usage(total=33802862592L, used=30695768064L, free=3107094528L, percent=29.2))
('After deleting', usage(total=33802862592L, used=30631391232L, free=3171471360L, percent=29.1))
('Before', usage(total=33802862592L, used=30631391232L, free=3171471360L, percent=29.1))
('After', usage(total=33802862592L, used=30695768064L, free=3107094528L, percent=29.2))
('After deleting', usage(total=33802862592L, used=30631518208L, free=3171344384L, percent=29.1))
('Before', usage(total=33802862592L, used=30631518208L, free=3171344384L, percent=29.1))
('After', usage(total=33802862592L, used=30695641088L, free=3107221504L, percent=29.2))
('After deleting', usage(total=33802862592L, used=30631645184L, free=3171217408L, percent=29.1))
('Before', usage(total=33802862592L, used=30631645184L, free=3171217408L, percent=29.1))
('After', usage(total=33802862592L, used=30695768064L, free=3107094528L, percent=29.2))
('After deleting', usage(total=33802862592L, used=30631772160L, free=3171090432L, percent=29.1))

@jreback
Contributor

jreback commented Jan 23, 2014

Python 'holds' onto memory even after objects are deleted; it reuses it for the next allocation.

A leak would show the usage steadily increasing.
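
For reference, a check loop along the lines of the output above might look like this (a sketch only; the exact loop isn't shown in the thread, and psutil.phymem_usage() is the older psutil API used elsewhere here - newer psutil versions use virtual_memory() instead):

import gc
import pandas as pd
import psutil

# Repeat the allocate / delete / collect cycle and watch whether 'used'
# keeps climbing; a true leak would show steady growth across iterations.
for _ in range(10):
    print('Before', psutil.phymem_usage())
    df = pd.DataFrame({'a': range(int(4e7))})
    print('After', psutil.phymem_usage())
    del df
    gc.collect()
    print('After deleting', psutil.phymem_usage())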

@Marigold
Author

Cool. Is it possible to release the memory somehow? I have huge memory problems when reading HDF files. When I am done reading a file, the final dataframe takes only about 20% of the memory that was used during reading.

@ghost

ghost commented Jan 23, 2014

Possibly #2659 has something for you, or the Python gc module.

Closing as not a bug.

@ghost ghost closed this as completed Jan 23, 2014
@jreback
Contributor

jreback commented Jan 23, 2014

Back to the OS; I don't think there is any way to do this (except by exiting the process). If you are processing HDF, use the chunk iterator if possible; that way memory won't increase too much. I process HDF in this way: I run a process to do a computation (and create an output / new HDF file), then exit the process. (I actually multi-process this, as the computations and output files are independent.)

see this recent question (the bottom of my answer) for a nice pattern: http://stackoverflow.com/questions/21295329/fastest-way-to-copy-columns-from-one-dataframe-to-another-using-pandas/21296133?noredirect=1#comment32114620_21296133
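
A minimal sketch of that chunk-iterator pattern (the file name data.h5, the key 'df', and the per-chunk filter on column 'a' are hypothetical; the store must be in table format for chunked selects):

import pandas as pd

# Iterate over the store in chunks so only a small slice is in memory at a
# time, keeping just the (small) per-chunk results and concatenating at the end.
pieces = []
with pd.HDFStore('data.h5') as store:
    for chunk in store.select('df', chunksize=500000):
        pieces.append(chunk[chunk['a'] > 0])

result = pd.concat(pieces, ignore_index=True)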

@jreback
Contributor

jreback commented Jan 23, 2014

@Marigold if you post your code for what you are doing I can take a look

@Marigold
Author

I am trying to merge HDF files on disk (a very painful experience). I ended up using smaller chunks too; for now it seems quite OK.

@jreback
Contributor

jreback commented Jan 23, 2014

@Marigold ok...lmk...as I said I do this a lot; it shouldn't be painful :)

You are using HDFStore, yes?

@Marigold
Author

I saw your post on SO, a very nice solution. However, I need to do an outer join, and I don't know if it can be done in a similar way. At the moment I have three types of joins: the first sorts the data first (sorting takes ages) and then iterates over both of them, the second merges the indices first and selects the data with where=Index (very slow too), and the third finds the min and max values of the index and then selects with where=[index] (which unfortunately cannot do an outer join).

@jreback
Contributor

jreback commented Jan 23, 2014

Sounds like an interesting problem... can you put up an in-memory version of what you are doing (e.g. an example which is all memory based)? I can think about it.

@Marigold
Author

Here is a slightly more complex example:

import pandas as pd

left = pd.DataFrame({'a': [0] * 5, 'b': [1] * 5})
right = pd.DataFrame({'a': range(5), 'c': ['a' * i for i in range(5)]})

left.merge(right, on='a', how='outer')

It shows all the possible problems: the outer join, NaN values for int columns (the column was int, but now it has to be float because of the NaNs) and min_itemsize for strings (although this can be found easily in the metadata).

I was thinking about how to do the outer join with the example you provided, and it seems pretty intuitive. Once you have the inner join, just go over both dataframes, look for index values not in the inner join, and append them to the inner join afterwards. Unfortunately, it takes three nested iterations over the dataframes (as in your example).

@jreback
Contributor

jreback commented Jan 23, 2014

you might be able to do this by just selecting the index values of the table (use store.select_column, see here: http://pandas.pydata.org/pandas-docs/dev/io.html#advanced-queries),

which gives you basically a frame with the index values AND an integer index (which is in fact the coordinates of those index values) - call these the coordinates.

Then you can do your joins in memory (keeping around those coordinates), then select the final result using those coordinates. That way you don't actually bring in any data until you need it.
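
Roughly, the pattern might look like this (a sketch only: the file names left.h5 / right.h5, the key 'df', and the inner join are assumptions - an outer join would also keep the non-matching coordinates):

import numpy as np
import pandas as pd

with pd.HDFStore('left.h5') as ls, pd.HDFStore('right.h5') as rs:
    # select_column pulls only the stored index; the Series' own integer
    # index gives the row coordinates of each value.
    left_index = ls.select_column('df', 'index')
    right_index = rs.select_column('df', 'index')

    # Do the join in memory on the index values alone, keeping the
    # coordinates of the matching rows.
    common = np.intersect1d(left_index.values, right_index.values)
    left_coords = left_index[left_index.isin(common)].index
    right_coords = right_index[right_index.isin(common)].index

    # Only now bring in the actual data, selecting rows by their coordinates.
    left = ls.select('df', where=left_coords)
    right = rs.select('df', where=right_coords)

result = left.merge(right, left_index=True, right_index=True,
                    suffixes=('_left', '_right'))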

@Marigold
Author

That's what I do now. It is the fastest method so far (still, the selection by index is very slow), slightly faster than looping over the two dataframes as in your example. I don't think it can get any better, so I would close this discussion for now until something "new" appears. Thanks a lot for the help!

@Marigold
Author

Back to the original question, now in relation to HDF itself. I have the following script:

import pandas as pd
import psutil
import gc

def create_hdf():
    d = pd.DataFrame({'a': range(int(2e7)), 'b': range(int(2e7))})
    d.to_hdf('test.h5', 'df', format='t')

def load_hdf(columns):
    before = psutil.phymem_usage()
    print('Before', before)
    d = pd.read_hdf('test.h5', 'df', columns=columns)
    gc.collect()
    print('MB used by dataframe itself: {:.2f}'.format(float(d.values.nbytes) / 2**20))
    after = psutil.phymem_usage()
    print('After', after)
    print('Memory change in MB {:.2f}'.format((after.used - before.used) / float(2**20)))

And here are the results for different values of columns:

columns = None

('Before', usage(total=6132502528L, used=3364777984L, free=2767724544L, percent=22.6))
MB used by dataframe itself: 305.18
('After', usage(total=6132502528L, used=4186906624L, free=1945595904L, percent=36.0))
Memory change in MB 784.04

columns = []

('Before', usage(total=6132502528L, used=3391778816L, free=2740723712L, percent=22.8))
MB used by dataframe itself: 0.00
('After', usage(total=6132502528L, used=3893465088L, free=2239037440L, percent=31.0))
Memory change in MB 478.45

columns = ['a']

('Before', usage(total=6132502528L, used=3393855488L, free=2738647040L, percent=22.8))
MB used by dataframe itself: 152.59
('After', usage(total=6132502528L, used=4056539136L, free=2075963392L, percent=33.6))
Memory change in MB 631.98

I can limit this "memory leak" to some extent by using chunksize and then concatenating the results. Anyway, should this be the default behavior?

@jreback
Contributor

jreback commented Jan 24, 2014

setting columns only causes a reindex

HDF is row oriented, so it will bring in ALL the columns no matter what you ask for, and just reindex to give you back what you want

if you want to limit peak memory, definitely use an iterator and concatenate

generally I try to work on smaller parts of my stores at once

if I need the entire thing, you can chunk by iterator or by looping over another axis of the data and selecting (e.g. you can select the unique values for a particular field then loop over those - see the sketch below)

if you do heavily column oriented stuff you really need a column store

see this / #4454 - want to contribute on this?
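
A sketch of that "loop over another axis" idea (assuming test.h5 holds a table-format frame under 'df' with 'b' written as a data column; all names here are placeholders):

import pandas as pd

with pd.HDFStore('test.h5') as store:
    parts = []
    # Loop over the unique values of an indexed field and select each slice
    # separately, so only one slice is in memory at a time.
    for value in store.select_column('df', 'b').unique():
        parts.append(store.select('df', where='b == {}'.format(value)))

result = pd.concat(parts)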

@Marigold
Author

Thanks for the explanation. I'll definitely look at it and see if I can contribute something.

This issue was closed.