Memory errors with refactored code #1580

Open
keflavich opened this issue Jul 11, 2023 · 24 comments · Fixed by #1581

@keflavich
Contributor

As noted in another thread, I'm consistently getting out-of-memory errors when running the new PSFPhotometry fitter.

My fitting runs have died at the following stages, as identified by the progress bar:

Fit source/group:   6%|▋         | 11347/177215 [05:24<1:34:59, 29.10it/s]
Fit source/group:   5%|▍         | 11405/228909 [05:57<30:55:50,  1.95it/s]
Fit source/group:   4%|▍         | 11486/262107 [07:02<26:22:39,  2.64it/s]
Fit source/group:  11%|█         | 11379/102664 [06:45<2:06:09, 12.06it/s]
Fit source/group:   2%|▏         | 11396/591444 [06:51<8:59:34, 17.92it/s]

These are pretty consistent endpoints.

I suspect the problem is that fit_info is being stored in memory. IIRC, fit_info includes at least one, and maybe several, copies of the data. Can we minimize fit_info before storing it? I think only param_cov is used downstream?

Note that I have 256 GB of memory allocated for these runs, which IMO is a very large amount to dedicate to photometry of a single JWST field-of-view.

@larrybradley
Member

I also suspect that the fit_info dictionary is the cause. It doesn't store a copy of the input data, but it does store the output from the fitters, which includes things like the fit residual, Jacobian, etc. In general these should be small arrays (usually 5x5 is all that is needed for fitting since that is where most of the flux lies; the size is determined by the fit_shape keyword), but I can see how that can add up when you have ~200k stars!

I'll want to keep at least the fit residuals and the return status message. I'll remove the rest (perhaps as an option, since I think your use case is probably on the extreme end). Some people may want all the fit_info details.
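
As a rough, hedged back-of-the-envelope (assuming float64 arrays, an 11x11 fit_shape, and three free parameters per star), the per-star residual and Jacobian arrays are individually tiny but add up across a large catalog:

    # Rough size estimate only: the residual (fvec) and Jacobian-like (fjac)
    # arrays the fitter keeps per star, assuming float64 and 3 free parameters.
    n_pix = 11 * 11        # pixels in the fit_shape cutout
    n_params = 3           # flux, x_0, y_0
    n_stars = 200_000
    per_star = (n_pix + n_pix * n_params) * 8   # bytes for fvec + fjac
    print(f"{per_star / 1e3:.1f} kB per star, "
          f"{n_stars * per_star / 1e9:.2f} GB for {n_stars:,} stars")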

@larrybradley
Member

Just curious -- what fit_shape are you using?

@keflavich
Contributor Author

11x11. If I switch to 5x5, I'd roughly expect to get to 4x more sources...

@keflavich
Contributor Author

...assuming only one footprint, of course, which is probably an underestimate

@keflavich
Contributor Author

Reducing fit_shape to (5, 5) had no effect, which surprises me.

@keflavich
Contributor Author

I'm trying with a hack, changing:

                fit_info = self.fitter.fit_info.copy()

to

                fit_info = {key: self.fitter.fit_info.get(key)
                            for key in
                            ('param_cov', 'fvec', 'fun', 'ierr', 'status')
                           }

@larrybradley
Member

larrybradley commented Jul 11, 2023

I did some testing, and I don't think the fit_info dict is the cause. I fit 15,000 stars (your failures were at <12,000 stars) with fit_shape = (11, 11), and the fit_results size is only 194 MB. The PSF phot object total is 199 MB. The peak memory during the fitting was 7.7 GB. This was using an IntegratedGaussianPRF model, and I did not use grouping.

My next suspect is the PSF model. Are you using a GriddedPSFModel with very large (internal) PSF arrays and/or a large number of them?
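
For anyone who wants to repeat this kind of measurement, here is a minimal sketch using the standard-library tracemalloc (not the exact test above); data, init_params, and psf_model are placeholders for your own image, initial source table, and PSF model:

    # Minimal memory-profiling sketch; data, init_params, and psf_model are
    # placeholders for your own inputs.
    import tracemalloc
    from photutils.psf import PSFPhotometry

    psfphot = PSFPhotometry(psf_model, fit_shape=(11, 11), aperture_radius=5)

    tracemalloc.start()
    phot_table = psfphot(data, init_params=init_params)
    current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    print(f"current: {current / 1e9:.2f} GB, peak: {peak / 1e9:.2f} GB")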

@larrybradley
Member

Could you please send me your input PSF model?

@keflavich
Contributor Author

keflavich commented Jul 11, 2023

Yes, I'm using a webbpsf model. It can be reproduced with:

    import webbpsf
    obsdate = '2022-08-28'
    nrc = webbpsf.NIRCam()
    nrc.load_wss_opd_by_date(f'{obsdate}T00:00:00')
    nrc.filter = 'F405N'
    nrc.detector = 'NRCA5'
    grid = nrc.psf_grid(num_psfs=16, all_detectors=False, verbose=True, save=True)
    psf_model = grid

I think... I haven't tested this; in production, the obsdate and some other variables come from FITS headers.

EDIT: tested, this works now.

@larrybradley
Member

larrybradley commented Jul 11, 2023

Thanks. Your PSF model is ~20 MB. 12,000 of them is ~233 GB (just for the PSF models, not the data, results, etc.), so that seems to be the culprit. The code returns a copy of the fit models, but it's copying the entire model. For GriddedPSFModel that is unnecessary because the PSF grid is identical for each model. I can fix this.
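
The per-model size is consistent with the grid dimensions (a hedged estimate assuming webbpsf's default 101-pixel FOV with 4x oversampling, i.e. roughly 404x404 pixels per ePSF):

    # Hedged size estimate; the 404x404 per-ePSF shape assumes webbpsf defaults.
    # Binary units are used to match the figures quoted above.
    import numpy as np

    n_psfs, ny, nx = 16, 404, 404
    grid_bytes = n_psfs * ny * nx * np.dtype(np.float64).itemsize
    print(f"~{grid_bytes / 2**20:.0f} MB per model, "
          f"~{12_000 * grid_bytes / 2**30:.0f} GB for 12,000 copies")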

@larrybradley
Member

@keflavich #1581 should fix your memory issues with GriddedPSFModel. Let me know if you still have issues. I can trim the fit_results dict if that's the case.

@keflavich
Contributor Author

Thanks. Past 15k already, so it looks like an improvement.

@keflavich
Contributor Author

Hm, still died, but got a lot further:

Fit source/group:  32%|███▏      | 52828/162563 [25:39<26:37:05,  1.15it/s]

Any ideas for further workarounds? Splitting up the image sounds like a possible, but very annoying, way to get around this. Increasing memory isn't really practical.
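
For reference, a hedged sketch of the tile-and-stitch workaround mentioned above (not a photutils feature); data, psf_model, and finder are placeholders, and sources fitted twice in the overlap regions would still need to be deduplicated:

    # Hedged tiling sketch: run PSFPhotometry per overlapping tile and stitch
    # the results; data, psf_model, and finder are placeholders.
    from astropy.table import vstack
    from photutils.psf import PSFPhotometry

    def tiled_photometry(data, psf_model, finder, tile=2048, pad=64):
        psfphot = PSFPhotometry(psf_model, fit_shape=(5, 5), finder=finder,
                                aperture_radius=5)
        tables = []
        ny, nx = data.shape
        for y0 in range(0, ny, tile):
            for x0 in range(0, nx, tile):
                ys = slice(max(y0 - pad, 0), min(y0 + tile + pad, ny))
                xs = slice(max(x0 - pad, 0), min(x0 + tile + pad, nx))
                tbl = psfphot(data[ys, xs])
                if tbl is None:      # assume no sources found in this tile
                    continue
                # shift fitted positions back to full-image coordinates
                tbl['x_fit'] += xs.start
                tbl['y_fit'] += ys.start
                tables.append(tbl)
        return vstack(tables)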

@keflavich
Contributor Author

@larrybradley I'd recommend reopening this one; it's not fully solved.

larrybradley reopened this Jul 13, 2023
@larrybradley
Member

Yes, I'm working on some improvements now.

@keflavich
Contributor Author

Thanks. I'll test 'em right away!

@larrybradley
Member

#1586 is another big reduction in memory for GriddedPSFModel. I have more ideas after that to further reduce memory, but I'll need to refactor a few things.

@keflavich
Contributor Author

OK, #1586 looks like it ran to completion, but then my code failed before I could check for sure because I was using get_residual_image instead of make_residual_image. #1558 has required significant revision to my production code.
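
For anyone hitting the same rename, a hedged example of the replacement call, assuming the photutils 1.9-era signature, with psfphot an existing PSFPhotometry instance and data the image:

    # old: psfphot.get_residual_image(...)  ->  new (assumed signature):
    residual = psfphot.make_residual_image(data, (11, 11))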

@keflavich
Contributor Author

I had 2 of 3 filters work, and pretty fast! One still failed:

Model image:  77%|███████▋  | 201979/262107 [1:16:28<48:35, 20.63it/s]

Notably, this is at a later stage, so maybe this is solvable by other means.

@keflavich
Contributor Author

OK, I thought they had completed, but it looks like all runs failed somewhere in the Model image stage, even when I gave more memory.

@keflavich
Contributor Author

Looking at the source code for make_model_image, I don't see any reason for it to run out of memory in that step. It looks like it only allocates small amounts of memory temporarily; there are no plausible locations for a memory leak in that code.

Here's a record of my failures:

$ tail -n 5 *301997[123]*
==> web-cat-F182M-mrgrep-dao3019972.log <==
Fit source/group: 100%|██████████| 591444/591444 [5:09:48<00:00, 31.82it/s]
2023-07-15T02:59:07.502784: Done with BASIC photometry.  len(result)=591444 dt=18699.997509002686
2023-07-15T02:59:07.703571: len(result) = 591444, len(coords) = 591444, type(result)=<class 'astropy.table.table.QTable'>
Model image:  68%|██████▊   | 403940/591444 [1:14:47<49:36, 63.00it/s]/tmp/slurmd/job3019972/slurm_script: line 4: 76591 Killed                  /blue/adamginsburg/adamginsburg/miniconda3/envs/python39/bin/python /blue/adamginsburg/adamginsburg/jwst/brick/analysis/crowdsource_catalogs_long.py --filternames=F182M --modules=merged-reproject --daophot --skip-crowdsource
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=3019972.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

==> web-cat-F187N-mrgrep-dao3019971.log <==
2023-07-15T02:15:42.773270: Done with ITERATIVE photometry. len(result2)=208262  dt=8322.197668790817
2023-07-15T02:15:43.011038: len(result2) = 208262, len(coords) = 177215
Model image: 100%|██████████| 208262/208262 [06:07<00:00, 566.57it/s]
/blue/adamginsburg/adamginsburg/jwst/brick/analysis/crowdsource_catalogs_long.py:117: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`). Consider using `matplotlib.pyplot.close()`.
  pl.figure(figsize=(12,12))

==> web-cat-F212N-mrgrep-dao3019973.log <==
2023-07-15T00:03:02.336868: Done with diagnostics for BASIC photometry.  dt=8288.43006491661
2023-07-15T00:03:02.338916: About to do ITERATIVE photometry....
Fit source/group: 100%|██████████| 227112/227112 [1:39:17<00:00, 38.12it/s]
Model image:  75%|███████▍  | 170227/227112 [27:14<08:46, 108.10it/s]/tmp/slurmd/job3019973/slurm_script: line 4: 84399 Killed                  /blue/adamginsburg/adamginsburg/miniconda3/envs/python39/bin/python /blue/adamginsburg/adamginsburg/jwst/brick/analysis/crowdsource_catalogs_long.py --filternames=F212N --modules=merged-reproject --daophot --skip-crowdsource
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=3019973.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.
login4.ufhpc /orange/adamginsburg/jwst/brick main$ tail -n 5 *30180[34][089]*
==> web-cat-F405N-mrgrep-dao3018039.log <==
2023-07-14T22:10:48.740898: Done with diagnostics for BASIC photometry.  dt=4295.998880624771
2023-07-14T22:10:48.743855: About to do ITERATIVE photometry....
Fit source/group: 100%|██████████| 161319/161319 [1:02:50<00:00, 42.78it/s]
Model image:  22%|██▏       | 35267/161319 [05:57<17:55, 117.16it/s]/tmp/slurmd/job3018039/slurm_script: line 4: 40615 Killed                  /blue/adamginsburg/adamginsburg/miniconda3/envs/python39/bin/python /blue/adamginsburg/adamginsburg/jwst/brick/analysis/crowdsource_catalogs_long.py --filternames=F405N --modules=merged-reproject --daophot --skip-crowdsource
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=3018039.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

==> web-cat-F410M-mrgrep-dao3018040.log <==
Fit source/group: 100%|██████████| 262107/262107 [1:25:50<00:00, 50.89it/s]
2023-07-14T22:26:15.041140: Done with BASIC photometry.  len(result)=262107 dt=5186.740335702896
2023-07-14T22:26:15.127437: len(result) = 262107, len(coords) = 262107, type(result)=<class 'astropy.table.table.QTable'>
Model image:  77%|███████▋  | 201988/262107 [29:20<10:40, 93.89it/s]/tmp/slurmd/job3018040/slurm_script: line 4: 64717 Killed                  /blue/adamginsburg/adamginsburg/miniconda3/envs/python39/bin/python /blue/adamginsburg/adamginsburg/jwst/brick/analysis/crowdsource_catalogs_long.py --filternames=F410M --modules=merged-reproject --daophot --skip-crowdsource
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=3018040.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

==> web-cat-F466N-mrgrep-dao3018038.log <==
2023-07-14T21:41:35.963349: Done with diagnostics for BASIC photometry.  dt=2544.2132999897003
2023-07-14T21:41:35.964734: About to do ITERATIVE photometry....
Fit source/group: 100%|██████████| 101505/101505 [31:38<00:00, 53.46it/s]
Model image:  96%|█████████▋| 97827/101505 [13:04<00:26, 138.45it/s]/tmp/slurmd/job3018038/slurm_script: line 4: 10632 Killed                  /blue/adamginsburg/adamginsburg/miniconda3/envs/python39/bin/python /blue/adamginsburg/adamginsburg/jwst/brick/analysis/crowdsource_catalogs_long.py --filternames=F466N --modules=merged-reproject --daophot --skip-crowdsource
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=3018038.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

For context, I'm first running a basic PSFPhotometry run, then subsequently an IterativePSFPhotometry run. So, all of the runs except F182M had a successful PSFPhotometry run, but then failed during IterativePSFPhotometry.

I was running IterativePSFPhotometry with:

    from astropy.modeling.fitting import LevMarLSQFitter
    from photutils.background import LocalBackground
    from photutils.psf import IterativePSFPhotometry

    # daofind_tuned, dao_psf_model, and fwhm_pix are defined earlier in the script
    phot_ = IterativePSFPhotometry(finder=daofind_tuned,
                                   localbkg_estimator=LocalBackground(5, 25),
                                   psf_model=dao_psf_model,
                                   fitter=LevMarLSQFitter(),
                                   maxiters=2,
                                   fit_shape=(5, 5),
                                   aperture_radius=2*fwhm_pix,
                                   progress_bar=True)

so maybe I can shrink the background area a bit and see if it completes.

@keflavich
Contributor Author

Ah, another data point: I was making the model image (residual image) with 11x11 patches, not 5x5.

@larrybradley
Member

I don't see how make_model_image could be causing memory issues either. The only additional memory it requires is essentially for the output image (plus small temporary cutouts for an index array). I think make_residual_image does require an additional temporary array, which I removed in #1604.

I also further reduced the memory footprint of PSFPhotometry with #1603, but that change should be minor. The models shouldn't be an issue after #1586 (200,000 models ~ 2.3 GB).

Are you using source grouping with very large groups? I'm wondering if that could be an issue. Large groups should be avoided because they require fitting a very large multi-dimensional parameter space (which can be slow, error prone, and probably memory intensive).

@keflavich
Contributor Author

No, I disabled the grouper, so it's not source grouping.

I'll see if this works better now, post #1604.
