
Zernike memory issue #52

Open
binarybottle opened this issue Feb 15, 2015 · 22 comments

@binarybottle
Member

Mindboggle is crashing on many subjects with the memory limit set to 2 GB:

Satrajit Ghosh (2/15/2015):
150215-15:17:45,874 workflow INFO:
Executing node Zernike_sulci.a1 in dir: /om/scratch/Tue/ps/MB_work/734db8e05f6be469df79c1419f253ad7/Mindboggle/Surface_feature_shapes/_hemi_rh/Zernike_sulci
Load "sulci" scalars from sulci.vtk
8329 vertices for label 1
Reduced 160076 to 15921 triangular faces
srun: Exceeded job memory limit

@brianthelion
Contributor

What kind of behavior would we expect or prefer here? I'll take a look while I'm on cleanup duty.

@satra
Member

satra commented Feb 20, 2015

the basic question is: what are the memory requirements of that function?

@brianthelion
Contributor

Sorry to get pedantic here, but @satra, can you give a few bullets or a shell script describing the workflow that produces the issue? For example, srun accepts several different memory-bounding arguments, and I want to make sure that I know what's what.

If I understand correctly, the "ask" here is for a new feature in the Zernike code that returns a "required memory" estimate for a particular mesh. Is that accurate?

@satra
Member

satra commented Feb 23, 2015

@brianthelion - actually we just want to know what the memory requirements are. for example, for a standard brain mesh size of about 2 hemi x (100k vertices + 100k faces), i'm seeing memory spikes up to 22 GB when running mindboggle. so we need to know/describe to users what the memory requirements are for the different steps. if i don't run the zernike step of mindboggle, the rest of mindboggle appears to be satisfied with about 2 GB of RAM.

@brianthelion
Contributor

My first-pass reading of the code is that the memory usage should only be a function of the moment order, N, and the number of worker threads. The size of the mesh is not a factor AFAICT.

More specifically, the thread pool iterates over the faces in the mesh, calculating the contribution of each face to the overall moment. Each worker in the pool initializes four arrays of size (N+1)^3 at a time. Each item of each array is 64 or 32 bits, depending on the CPU architecture. Assuming typical values of N=20 and a 64-bit machine, we're looking at 8 * 4 * 21^3 bytes, or roughly 300 KB per worker thread. That's nowhere close to 22 GB unless you're running something like 70,000 threads.

So it seems there are several possibilities:

  1. The code that I think is running isn't the code that's actually running.
  2. I've totally screwed up this back-of-napkin calculation somehow.
  3. I've failed to follow the code that I wrote several months ago.
  4. The user is setting a very high N.
  5. The user is running with O(100,000) worker threads.
  6. There is a very serious memory leak in the code.

If there is a very serious memory leak, we probably need to find it. @satra can you eliminate (4) and (5) as possibilities? Can you also tell us which function in the code is hitting the srun memory limit?
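
For reference, the back-of-napkin estimate above as a runnable snippet (a minimal sketch; the array count of four and the 8-byte item size come from the description above, not from profiling the actual code):

```python
# Rough per-worker memory footprint for the Zernike arrays described above:
# four arrays of shape (N+1)^3, 8 bytes per item on a 64-bit machine.
def worker_memory_bytes(order_n, n_arrays=4, itemsize=8):
    return n_arrays * itemsize * (order_n + 1) ** 3

print(worker_memory_bytes(20))  # 296352 bytes, i.e. roughly 300 KB
```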

@brianthelion
Contributor

@binarybottle Are you able to reproduce this issue? My time to deliver a potential fix is pretty limited, so the sooner we can get it through triage the better.

@binarybottle
Member Author

Satra ran thousands of images on Amazon and on the MIT cluster and saw this spike in many cases, so I would say Satra has adequately reproduced the problem.

@brianthelion
Contributor

I have absolute faith that @satra is seeing a real issue. I need a greater level of specificity, though, to determine whether there's an actual bug. @binarybottle, if you can provide some basic debug output under the failure mode, that would get me a long way down the road.

@satra
Member

satra commented Feb 25, 2015

@brianthelion - i'll try to get more specifics this weekend.

@brianthelion
Contributor

Thanks, @satra!

@binarybottle
Member Author

Thank you, Satra!

@brianthelion
Contributor

Any updates here? Thanks!

@satra
Member

satra commented Mar 5, 2015

yes and no. i can replicate the error when i run it on its own, but not when i'm memory profiling it.

i have narrowed it down to the zernike function. i had to step away from this, but i'll look at it again soon.

@binarybottle
Member Author

I will remove Zernike moments from the --all option until this is resolved...

@brianthelion
Contributor

@satra Any progress on this issue? Thanks!

@brianthelion
Contributor

Ping!

@satra
Member

satra commented Apr 23, 2015

pong! sorry, i have had no time to test this issue. unfortunately it will require me to carve out an hour or two, and those have been a little scarce! i do need to generate a bunch of mindboggle output next month, so i will try to test it then.

@satra
Member

satra commented May 9, 2015

ok - so here is the issue:

https://github.com/binarybottle/mindboggle/blob/master/mindboggle/shapes/zernike/pipelines.py#L232

multiprocessing doesn't use shared memory: it makes a copy of every piece of data passed through map_async, which is used in the line above and in a few other places.

the Pool is initialized without any arguments, so it spawns one process per core (40 or 64 in my case), each making a copy of the data. this is why it breaks under our slurm job controller, but not on machines with 4-8 cores.

to allow for shared memory, one would need to use sharedctypes:

https://docs.python.org/2/library/multiprocessing.html#module-multiprocessing.sharedctypes
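
For illustration, a minimal sketch of that approach (the helper names and toy data are invented, not Mindboggle's actual code): each worker re-wraps one shared buffer instead of receiving its own pickled copy through map_async.

```python
import numpy as np
from multiprocessing import Pool, cpu_count, sharedctypes

def _init_worker(buf, shape):
    # Re-wrap the shared buffer as an ndarray once in each worker process.
    global VERTS
    VERTS = np.frombuffer(buf, dtype=np.float64).reshape(shape)

def _face_term(face):
    # Stand-in for the per-face moment contribution; it only reads VERTS.
    return VERTS[list(face)].sum()

if __name__ == '__main__':
    verts = np.random.rand(100000, 3)
    # Allocate one shared block and copy the vertex data into it once.
    buf = sharedctypes.RawArray('d', verts.size)
    np.frombuffer(buf, dtype=np.float64)[:] = verts.ravel()
    faces = [(0, 1, 2), (3, 4, 5), (6, 7, 8)]  # toy face list
    pool = Pool(processes=cpu_count(),
                initializer=_init_worker, initargs=(buf, verts.shape))
    try:
        terms = pool.map_async(_face_term, faces).get()
    finally:
        pool.close()
        pool.join()
    print(sum(terms))
```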

@brianthelion
Contributor

Satra, thanks for the follow-up! Do you actually mean line 352? The default pipeline is now KoehlMultiproc; see line 367.

@satra
Member

satra commented May 12, 2015

@brianthelion - it's actually the same pattern that's used throughout that file. i just used one example to highlight it.

@binarybottle
Member Author

@satra and @brianthelion -- Do Zernike moments still exceed the 2 GB RAM limit, or were you able to find a resolution to this issue?

@brianthelion
Contributor

@binarybottle What's needed here is support for allocating ndarrays in shared memory. On other projects I'd used an existing library for this, but it seems it's currently unmaintained. I don't see other good implementations out there that support the standard numpy array interface, but maybe @satra can point at something.

Without shared-memory support, the options are: (1) continue to use KoehlMultiproc as the DefaultPipeline and assume the memory copies won't blow out the system RAM, or exit with a graceful warning if we can detect that a memory error is coming; (2) switch the DefaultPipeline to a SerialPipeline as in this line. A sketch of option (1)'s check follows.
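
A minimal sketch of the "graceful warning" check in option (1). The helper name, the psutil dependency, and the worst-case heuristic are my assumptions, not existing Mindboggle code:

```python
import warnings
from multiprocessing import cpu_count

import psutil  # assumed to be available; used only for the free-RAM check

def pick_pipeline(data_nbytes, n_workers=None):
    # Worst case under option (1): map_async hands each worker its own
    # copy of the mesh data, so the footprint scales with the pool size.
    n_workers = n_workers or cpu_count()
    worst_case = data_nbytes * n_workers
    if worst_case > psutil.virtual_memory().available:
        warnings.warn('estimated multiprocessing footprint (%d bytes) '
                      'exceeds available RAM; falling back to SerialPipeline'
                      % worst_case)
        return 'SerialPipeline'
    return 'KoehlMultiproc'
```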
