
BUG: Jax-based samplers crash at transformation stage #6744

Open · fonnesbeck opened this issue May 30, 2023 · 9 comments · May be fixed by #7116
fonnesbeck (Member) commented May 30, 2023:

Describe the issue:

The Jax-based samplers crash after sampling, following the "Transforming variables..." message on medium-to-large models (thousands of rows, hundreds of parameters). This occurs both on GPU and CPU systems, and using either the numpyro or blackjax samplers. The failure on GPU returns a backtrace that isolates the issue at the vmap in _postprocess_samples. On a CPU (MacBook Pro M1), the process is simply killed without any error messages. I have tried running the GPU model with the postprocessing_backend="cpu" argument for the numpyro sampler, but this does not seem to make a difference. Should it be using vmap when the postprocessing backend is CPU?
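
For reference, a minimal sketch of the invocation pattern described above (the model is a trivial placeholder, not the actual failing model; argument names assume the pymc.sampling.jax API as of PyMC 5.3):

import pymc as pm
import pymc.sampling.jax as pmjax

with pm.Model() as model:
    # Trivial placeholder; the failing models have thousands of rows
    # and hundreds of parameters.
    x = pm.Normal("x", 0.0, 1.0)

    # Crashes at "Transforming variables..." on large models, with or
    # without moving post-processing to the CPU.
    idata = pmjax.sample_numpyro_nuts(
        draws=1000,
        tune=1000,
        chains=2,
        postprocessing_backend="cpu",
    )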

Reproducible code example:

Will add an example when I can come up with one.

Error message:

CPU machine error:


Compilation time =  0:00:09.225151
Sampling...
Running chain 0: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2000/2000 [12:00:33<00:00, 21.62s/it]
Running chain 1: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2000/2000 [12:00:33<00:00, 21.62s/it]
Sampling time =  12:00:35.215191
Transforming variables...
Killed: 9
/Users/cfonnesbeck/mambaforge/envs/pymc/lib/python3.11/multiprocessing/resource_tracker.py:224: UserWarning: resource_tracker: There appear to be 1 leaked semaphore objects to clean up at shutdown
  warnings.warn('resource_tracker: There appear to be %d '

PyMC version information:

PyMC 5.3.0
PyTensor 2.11.1

Context for the issue:

The numpyro sampler is currently unusable for moderate-sized models due to this issue.

fonnesbeck added the bug label May 30, 2023
fonnesbeck (Member, Author):

Setting postprocessing_chunks to somewhat large values (~10) seems to prevent this, since it appears to be an issue with vmap.
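
For clarity, a hedged sketch of the workaround being described, assuming the postprocessing_chunks keyword as it existed in the numpyro sampler at the time (placeholder model, not the failing one):

import pymc as pm
import pymc.sampling.jax as pmjax

with pm.Model():
    x = pm.Normal("x", 0.0, 1.0)  # placeholder model
    # Chunking the post-sampling transforms appears to avoid the vmap
    # memory blow-up described above.
    idata = pmjax.sample_numpyro_nuts(postprocessing_chunks=10)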

ricardoV94 (Member):

I think this was solved by switching to scan as the default

fonnesbeck (Member, Author):

I'm still getting out-of-memory crashes after sampling, even when using v5.10. Is it still possible to set postprocessing_chunks? It seemed to work previously.

ricardoV94 (Member):

The options are now scan or vmap; scan is the default, which is more memory-conscious:

postprocessing_vectorize: Literal["vmap", "scan"] = "scan",
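
To illustrate the memory trade-off, a hedged sketch in plain JAX (not PyMC's actual _postprocess_samples implementation): vmap materializes the transform's intermediates for every draw at once, while lax.scan walks the draws one at a time.

import jax
import jax.numpy as jnp

def transform(draw):
    # Stand-in for the deterministic/transformed-variable computation
    # applied to a single posterior draw.
    return jnp.exp(draw)

draws = jnp.ones((4000, 100))  # (n_draws, n_params); placeholder shapes

# vmap: vectorized over the draw axis; intermediates are allocated for all
# draws at once, which is what can exhaust memory on large models.
vmapped = jax.vmap(transform)(draws)

# scan: one draw per step, so only a single draw's intermediates are live
# at a time, at the cost of a sequential loop.
_, scanned = jax.lax.scan(lambda carry, d: (carry, transform(d)), None, draws)

assert jnp.allclose(vmapped, scanned)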

fonnesbeck (Member, Author) commented Dec 12, 2023:

Yeah, I saw that. I still get crashes during post-processing on GPU for large models (even with postprocessing_backend="cpu").

fonnesbeck (Member, Author) commented Dec 12, 2023:

This looks like it might help, though it is not implemented in Jax yet. We should probably keep the option for using xmap in the interim.

ricardoV94 (Member) commented Dec 12, 2023:

We are already using scan by default, so I don't think it would help.

JasonTam (Contributor) commented Jan 22, 2024:

I'm running into the same OOM issue in post-processing with the default postprocessing_vectorize="scan".
Is postprocessing_chunks not something that can be brought back as an experimental, use-at-your-own-risk parameter?

ricardoV94 (Member):

IIRC postprocessing_chunks just uses scan under the hood anyway, so it shouldn't help. Can you check whether it actually helps in your case?

We need an example to investigate this issue, but if you see a difference, we can consider temporarily reverting while we figure it out.

JasonTam linked a pull request Jan 24, 2024 that will close this issue.