Memory errors with refactored code #1580

Open
keflavich opened this issue Jul 11, 2023 · 24 comments · Fixed by #1581

@keflavich
Contributor

As noted in another thread, I'm consistently getting out-of-memory errors when running the new PSFPhotometry fitter.

My fitting runs have died at the following stages, as identified by the progress bar:

Fit source/group:   6%|▋         | 11347/177215 [05:24<1:34:59, 29.10it/s]
Fit source/group:   5%|▍         | 11405/228909 [05:57<30:55:50,  1.95it/s]
Fit source/group:   4%|▍         | 11486/262107 [07:02<26:22:39,  2.64it/s]
Fit source/group:  11%|█         | 11379/102664 [06:45<2:06:09, 12.06it/s]
Fit source/group:   2%|▏         | 11396/591444 [06:51<8:59:34, 17.92it/s]

These are pretty consistent endpoints.

I suspect the problem is that fit_info is being stored in memory. IIRC, fit_info includes at least one, and maybe several, copies of the data. Can we minimize fit_info before storing it? I think only param_cov is used downstream?

Note that I have 256 GB of memory allocated for these runs, which IMO is a very large amount to dedicate to photometry of a single JWST field-of-view.

@larrybradley
Member

I also suspect that the fit_info dictionary is the cause. It doesn't store a copy of the input data, but it does store the output from the fitters, which includes things like the fit residual, Jacobian, etc. In general these should be small arrays (usually 5x5 is all that is needed for fitting since that is where most of the flux lies; the size is determined by the fit_shape keyword), but I can see how that can add up when you have ~200k stars!

I'll want to keep at least the fit residuals and the return status message. I'll remove the rest (perhaps as an option, since I think your use case is probably on the extreme end). Some people may want all the fit_info details.
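
As a rough, hedged back-of-the-envelope (assuming float64 arrays, an 11x11 fit_shape, and three free parameters per star), the per-star residual and Jacobian arrays are individually tiny but add up across a large catalog:

    # Rough size estimate only: the residual (fvec) and Jacobian-like (fjac)
    # arrays the fitter keeps per star, assuming float64 and 3 free parameters.
    n_pix = 11 * 11        # pixels in the fit_shape cutout
    n_params = 3           # flux, x_0, y_0
    n_stars = 200_000
    per_star = (n_pix + n_pix * n_params) * 8   # bytes for fvec + fjac
    print(f"{per_star / 1e3:.1f} kB per star, "
          f"{n_stars * per_star / 1e9:.2f} GB for {n_stars:,} stars")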

@larrybradley
Member

Just curious -- what fit_shape are you using?

@keflavich
Contributor Author

11x11. If I switch to 5x5, I'd roughly expect to get to 4x more sources...

@keflavich
Contributor Author

...assuming only one footprint, of course, which is probably an underestimate

@keflavich
Contributor Author

Reducing fit_shape to (5, 5) had no effect, which surprises me.

@keflavich
Contributor Author

I'm trying with a hack, changing:

                fit_info = self.fitter.fit_info.copy()

to

                fit_info = {key: self.fitter.fit_info.get(key)
                            for key in
                            ('param_cov', 'fvec', 'fun', 'ierr', 'status')
                           }

@larrybradley
Member

larrybradley commented Jul 11, 2023

I did some testing, and I don't think the fit_info dict is the cause. I fit 15,000 stars (your failures were at <12,000 stars) with fit_shape = (11, 11), and the fit_results size is only 194 MB. The PSF phot object total is 199 MB. The peak memory during the fitting was 7.7 GB. This was using an IntegratedGaussianPRF model, and I did not use grouping.

My next suspect is the PSF model. Are you using a GriddedPSFModel with very large (internal) PSF arrays and/or a large number of them?
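
For anyone who wants to repeat this kind of measurement, here is a minimal sketch using the standard-library tracemalloc (not the exact test above); data, init_params, and psf_model are placeholders for your own image, initial source table, and PSF model:

    # Minimal memory-profiling sketch; data, init_params, and psf_model are
    # placeholders for your own inputs.
    import tracemalloc
    from photutils.psf import PSFPhotometry

    psfphot = PSFPhotometry(psf_model, fit_shape=(11, 11), aperture_radius=5)

    tracemalloc.start()
    phot_table = psfphot(data, init_params=init_params)
    current, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()

    print(f"current: {current / 1e9:.2f} GB, peak: {peak / 1e9:.2f} GB")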

@larrybradley
Member

Could you please send me your input PSF model?

@keflavich
Contributor Author

keflavich commented Jul 11, 2023

Yes, I'm using a webbpsf model. It can be reproduced with:

    import webbpsf
    obsdate = '2022-08-28'
    nrc = webbpsf.NIRCam()
    nrc.load_wss_opd_by_date(f'{obsdate}T00:00:00')
    nrc.filter = 'F405N'
    nrc.detector = 'NRCA5'
    grid = nrc.psf_grid(num_psfs=16, all_detectors=False, verbose=True, save=True)
    psf_model = grid

I think... I haven't tested this; in production, the obsdate and some other variables come from FITS headers.

EDIT: tested, this works now.

@larrybradley
Member

larrybradley commented Jul 11, 2023

Thanks. Your PSF model is ~20 MB. 12,000 of them is ~233 GB (just for the PSF models, not the data, results, etc.), so that seems to be the culprit. The code returns a copy of the fit models, but it's copying the entire model. For GriddedPSFModel that is unnecessary because the PSF grid is identical for each model. I can fix this.
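
The per-model size is consistent with the grid dimensions (a hedged estimate assuming webbpsf's default 101-pixel FOV with 4x oversampling, i.e. roughly 404x404 pixels per ePSF):

    # Hedged size estimate; the 404x404 per-ePSF shape assumes webbpsf defaults.
    # Binary units are used to match the figures quoted above.
    import numpy as np

    n_psfs, ny, nx = 16, 404, 404
    grid_bytes = n_psfs * ny * nx * np.dtype(np.float64).itemsize
    print(f"~{grid_bytes / 2**20:.0f} MB per model, "
          f"~{12_000 * grid_bytes / 2**30:.0f} GB for 12,000 copies")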

@larrybradley
Member

@keflavich #1581 should fix your memory issues with GriddedPSFModel. Let me know if you still have issues. I can trim the fit_results dict if that's the case.

@keflavich
Contributor Author

Thanks. Past 15k already, so it looks like an improvement.

@keflavich
Contributor Author

Hm, still died, but got a lot further:

Fit source/group:  32%|███▏      | 52828/162563 [25:39<26:37:05,  1.15it/s]

Any ideas for further workarounds? Splitting up the image sounds like a possible, but very annoying, way to get around this. Increasing memory isn't really practical.
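
For reference, a hedged sketch of the tile-and-stitch workaround mentioned above (not a photutils feature); data, psf_model, and finder are placeholders, and sources fitted twice in the overlap regions would still need to be deduplicated:

    # Hedged tiling sketch: run PSFPhotometry per overlapping tile and stitch
    # the results; data, psf_model, and finder are placeholders.
    from astropy.table import vstack
    from photutils.psf import PSFPhotometry

    def tiled_photometry(data, psf_model, finder, tile=2048, pad=64):
        psfphot = PSFPhotometry(psf_model, fit_shape=(5, 5), finder=finder,
                                aperture_radius=5)
        tables = []
        ny, nx = data.shape
        for y0 in range(0, ny, tile):
            for x0 in range(0, nx, tile):
                ys = slice(max(y0 - pad, 0), min(y0 + tile + pad, ny))
                xs = slice(max(x0 - pad, 0), min(x0 + tile + pad, nx))
                tbl = psfphot(data[ys, xs])
                if tbl is None:      # assume no sources found in this tile
                    continue
                # shift fitted positions back to full-image coordinates
                tbl['x_fit'] += xs.start
                tbl['y_fit'] += ys.start
                tables.append(tbl)
        return vstack(tables)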

@keflavich
Contributor Author

@larrybradley I'd recommend reopening this one; it's not fully solved.

larrybradley reopened this Jul 13, 2023
@larrybradley
Member

Yes, I'm working on some improvements now.

@keflavich
Contributor Author

Thanks. I'll test 'em right away!

@larrybradley
Member

#1586 is another big reduction in memory for GriddedPSFModel. I have more ideas after that to further reduce memory, but I'll need to refactor a few things.

@keflavich
Contributor Author

OK, #1586 looks like it ran to completion, but then my code failed before I could check for sure because I was using get_residual_image instead of make_residual_image. #1558 has required significant revision to my production code.
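
For anyone hitting the same rename, a hedged example of the replacement call, assuming the photutils 1.9-era signature, with psfphot an existing PSFPhotometry instance and data the image:

    # old: psfphot.get_residual_image(...)  ->  new (assumed signature):
    residual = psfphot.make_residual_image(data, (11, 11))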

@keflavich
Contributor Author

I had 2 of 3 filters work, and pretty fast! One still failed:

Model image:  77%|███████▋  | 201979/262107 [1:16:28<48:35, 20.63it/s]

Notably, this is at a later stage, so maybe this is solvable by other means.

@keflavich
Contributor Author

OK, I thought they had completed, but it looks like all runs failed somewhere in the Model image stage, even when I gave more memory.

@keflavich
Contributor Author

Looking at the source code for make_model_image, I don't see any reason for it to run out of memory in that step. It looks like it only allocates small amounts of memory temporarily; there are no plausible locations for a memory leak in that code.

Here's a record of my failures:

$ tail -n 5 *301997[123]*
==> web-cat-F182M-mrgrep-dao3019972.log <==
Fit source/group: 100%|██████████| 591444/591444 [5:09:48<00:00, 31.82it/s]
2023-07-15T02:59:07.502784: Done with BASIC photometry.  len(result)=591444 dt=18699.997509002686
2023-07-15T02:59:07.703571: len(result) = 591444, len(coords) = 591444, type(result)=<class 'astropy.table.table.QTable'>
Model image:  68%|██████▊   | 403940/591444 [1:14:47<49:36, 63.00it/s]/tmp/slurmd/job3019972/slurm_script: line 4: 76591 Killed                  /blue/adamginsburg/adamginsburg/miniconda3/envs/python39/bin/python /blue/adamginsburg/adamginsburg/jwst/brick/analysis/crowdsource_catalogs_long.py --filternames=F182M --modules=merged-reproject --daophot --skip-crowdsource
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=3019972.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

==> web-cat-F187N-mrgrep-dao3019971.log <==
2023-07-15T02:15:42.773270: Done with ITERATIVE photometry. len(result2)=208262  dt=8322.197668790817
2023-07-15T02:15:43.011038: len(result2) = 208262, len(coords) = 177215
Model image: 100%|██████████| 208262/208262 [06:07<00:00, 566.57it/s]
/blue/adamginsburg/adamginsburg/jwst/brick/analysis/crowdsource_catalogs_long.py:117: RuntimeWarning: More than 20 figures have been opened. Figures created through the pyplot interface (`matplotlib.pyplot.figure`) are retained until explicitly closed and may consume too much memory. (To control this warning, see the rcParam `figure.max_open_warning`). Consider using `matplotlib.pyplot.close()`.
  pl.figure(figsize=(12,12))

==> web-cat-F212N-mrgrep-dao3019973.log <==
2023-07-15T00:03:02.336868: Done with diagnostics for BASIC photometry.  dt=8288.43006491661
2023-07-15T00:03:02.338916: About to do ITERATIVE photometry....
Fit source/group: 100%|██████████| 227112/227112 [1:39:17<00:00, 38.12it/s]
Model image:  75%|███████▍  | 170227/227112 [27:14<08:46, 108.10it/s]/tmp/slurmd/job3019973/slurm_script: line 4: 84399 Killed                  /blue/adamginsburg/adamginsburg/miniconda3/envs/python39/bin/python /blue/adamginsburg/adamginsburg/jwst/brick/analysis/crowdsource_catalogs_long.py --filternames=F212N --modules=merged-reproject --daophot --skip-crowdsource
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=3019973.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.
login4.ufhpc /orange/adamginsburg/jwst/brick main$ tail -n 5 *30180[34][089]*
==> web-cat-F405N-mrgrep-dao3018039.log <==
2023-07-14T22:10:48.740898: Done with diagnostics for BASIC photometry.  dt=4295.998880624771
2023-07-14T22:10:48.743855: About to do ITERATIVE photometry....
Fit source/group: 100%|██████████| 161319/161319 [1:02:50<00:00, 42.78it/s]
Model image:  22%|██▏       | 35267/161319 [05:57<17:55, 117.16it/s]/tmp/slurmd/job3018039/slurm_script: line 4: 40615 Killed                  /blue/adamginsburg/adamginsburg/miniconda3/envs/python39/bin/python /blue/adamginsburg/adamginsburg/jwst/brick/analysis/crowdsource_catalogs_long.py --filternames=F405N --modules=merged-reproject --daophot --skip-crowdsource
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=3018039.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

==> web-cat-F410M-mrgrep-dao3018040.log <==
Fit source/group: 100%|██████████| 262107/262107 [1:25:50<00:00, 50.89it/s]
2023-07-14T22:26:15.041140: Done with BASIC photometry.  len(result)=262107 dt=5186.740335702896
2023-07-14T22:26:15.127437: len(result) = 262107, len(coords) = 262107, type(result)=<class 'astropy.table.table.QTable'>
Model image:  77%|███████▋  | 201988/262107 [29:20<10:40, 93.89it/s]/tmp/slurmd/job3018040/slurm_script: line 4: 64717 Killed                  /blue/adamginsburg/adamginsburg/miniconda3/envs/python39/bin/python /blue/adamginsburg/adamginsburg/jwst/brick/analysis/crowdsource_catalogs_long.py --filternames=F410M --modules=merged-reproject --daophot --skip-crowdsource
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=3018040.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

==> web-cat-F466N-mrgrep-dao3018038.log <==
2023-07-14T21:41:35.963349: Done with diagnostics for BASIC photometry.  dt=2544.2132999897003
2023-07-14T21:41:35.964734: About to do ITERATIVE photometry....
Fit source/group: 100%|██████████| 101505/101505 [31:38<00:00, 53.46it/s]
Model image:  96%|█████████▋| 97827/101505 [13:04<00:26, 138.45it/s]/tmp/slurmd/job3018038/slurm_script: line 4: 10632 Killed                  /blue/adamginsburg/adamginsburg/miniconda3/envs/python39/bin/python /blue/adamginsburg/adamginsburg/jwst/brick/analysis/crowdsource_catalogs_long.py --filternames=F466N --modules=merged-reproject --daophot --skip-crowdsource
slurmstepd: error: Detected 1 oom-kill event(s) in StepId=3018038.batch. Some of your processes may have been killed by the cgroup out-of-memory handler.

For context, I'm first running a basic PSFPhotometry run, then subsequently an IterativePSFPhotometry run. So, all of the runs except F182M had a successful PSFPhotometry run, but then failed during IterativePSFPhotometry.

I was running IterativePSFPhotometry with:

    from astropy.modeling.fitting import LevMarLSQFitter
    from photutils.background import LocalBackground
    from photutils.psf import IterativePSFPhotometry

    # daofind_tuned, dao_psf_model, and fwhm_pix are defined earlier in the script
    phot_ = IterativePSFPhotometry(finder=daofind_tuned,
                                   localbkg_estimator=LocalBackground(5, 25),
                                   psf_model=dao_psf_model,
                                   fitter=LevMarLSQFitter(),
                                   maxiters=2,
                                   fit_shape=(5, 5),
                                   aperture_radius=2*fwhm_pix,
                                   progress_bar=True)

so maybe I can shrink the background area a bit and see if it completes.

@keflavich
Contributor Author

Ah, another data point: I was making the model image (residual image) with 11x11 patches, not 5x5.

@larrybradley
Member

I don't see how make_model_image could be causing memory issues either. The only additional memory it requires is essentially for the output image (plus small temporary cutouts for an index array). I think make_residual_image does require an additional temporary array, which I removed in #1604.

I also further reduced the memory footprint of PSFPhotometry with #1603, but that change should be minor. The models shouldn't be an issue after #1586 (200,000 models ~ 2.3 GB).

Are you using source grouping with very large groups? I'm wondering if that could be an issue. Large groups should be avoided because they require fitting a very large multi-dimensional parameter space (which can be slow, error prone, and probably memory intensive).

@keflavich
Contributor Author

No, I disabled the grouper, so it's not source grouping.

I'll see if this works better now, post #1604.
