
Strange Interactions with Transposition and StoreToZarr #710

Open
ranchodeluxe opened this issue Mar 18, 2024 · 5 comments

Comments

@ranchodeluxe
Contributor

ranchodeluxe commented Mar 18, 2024

Versions:

pangeo-forge-runner==0.10.2
recipe.py
recipe versions

Problem:

Putting up the bat signal on this one 🦇 📡 because it's kept us confused for days. On both the LocalDirectRunner and Flink, we've noticed that this recipe, which transposes coordinates, will either hang/stall or fail without dumping any useful traceback about where it's failing.

Looking for ideas about the finicky nature of this beast if you have any 🙇

Unfortunately, the source data is in a protected S3 bucket 😞 and the recipe is written to leverage the implied AssumeRole behind the scenes, but there's a JupyterHub cluster you can be added to if you want to test it out.

@ranchodeluxe changed the title from "Strange Runner Interactions with Transposition and StoreToZarr" to "Strange Interactions with Transposition and StoreToZarr" on Mar 29, 2024
@ranchodeluxe
Contributor Author

ranchodeluxe commented Mar 29, 2024

Related to #709 and #715 and h5py/h5py#2019

Doing an issue dump of what we've learned, plus a related thread from the past with great detail ✨

MRE:

Multiple different jobs (beyond the one in this issue) seem to hang in Apache Beam. The first successful attempt to get out of the situation was to remove layers of interference: we turned off fsspec's "readahead" cache. All jobs that had hangs then got quite a bit further before hanging again, and in some cases (like this issue) that change may have surfaced useful stack traces that were previously being swallowed, although we still need to verify that. Eventually, however, there were still hangs.
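For concreteness, turning off readahead looks roughly like the sketch below; the exact hook in the recipe (e.g. open_kwargs on OpenURLWithFSSpec) is an assumption here, and the bucket/path is a placeholder.

```python
# Minimal sketch of disabling fsspec's default "readahead" cache when opening
# a remote file; cache_type="none" makes every read a direct ranged S3 GET.
import fsspec
import xarray as xr

fs = fsspec.filesystem("s3", anon=False)
with fs.open("s3://protected-bucket/path/to/file.nc", cache_type="none") as f:
    ds = xr.open_dataset(f, engine="h5netcdf")
    print(ds)
```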

Getting an MRE was hard, so instead we decided to let things hang and take the opportunity to inspect thread traces and dumps to see what we could learn. The environment we were working in didn't give us the privileges to install gdb, gcore, or strace, so we used py-spy instead.

Investigation:

  • we kicked off another job that always hangs on beam's LocalDirectRunner with a single process reading from s3fs

  • using ps ax -F --forest, I inspected the most nested process commands until I knew the final beam process had kicked off and was running (even though we set the runner to use one process, there are still OS forks of bash, pangeo-forge-runner, and in fact two beam processes to think about 😓)

  • we waited for memory and CPU usage to fall to guess when it was hung

  • I ran ps -L --pid <PID> on the most nested PID from above to get some thread ids that I wanted to match in the next step

  • Then, using py-spy (which is a great tool), I pointed it at the same PID and did a thread stack dump: py-spy dump --pid <PID>

  • The thread stack dump output is great b/c it shows all the idle threads for grpc and two idle threads trying to talk to S3. One is the fsspec event loop, where we can see xarray's CachingFileManager.__del__ and subsequent h5netcdf close calls happening. The other is a thread doing beam work that is trying to fetch byte ranges from S3 via xarray

  • The docstring for CachingFileManager mentions that GC events can trigger it via __del__. This got us thinking about how the open file handles from Discussion: Composition without Memory Bloat #709 could exacerbate a problem where GC wants to run more often

  • we forked xarray and naively added gc.disable() to the xarray/backends/file_manager.py module (see the sketch after this list) and the hangs stopped happening, while disabling gc in other spots didn't quite work

  • Then, through a series of related issue threads, we wound up on this comment, which smells about right
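For reference, the naive gc experiment described above boils down to something like this sketch; the exact placement inside xarray/backends/file_manager.py in the fork is an assumption, not copied from it:

```python
# Disable Python's cyclic garbage collector at import time of the file_manager
# module, so CachingFileManager.__del__ runs only via reference counting and is
# never triggered by a GC pass firing inside fsspec's event-loop thread.
# A debugging hack to test the theory, not a real fix.
import gc

gc.disable()
```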

@ranchodeluxe
Contributor Author

ranchodeluxe commented Apr 1, 2024

Next Steps and Ideas:

  • reduce the MRE even further now that we have a good theory:

    • however, we need to find which lib or interaction is causing the cycles that we think get GC'd. The older MREs from the thread above work fine b/c the circular references were removed

    • we know we have reference cycles from open fsspec file-like handlers, so confirm that with a good visual (see the sketch after this list)

  • workaround: move gc disabling as close as possible to the xarray operations (it was attempted before but didn't help, so let's try again)

  • workaround: confirm that runs using fsspec's non-async filesystems, such as local, work

  • workaround: use beam's Python SDK versions of the cloud provider storage APIs, which appear to be synchronous ✅

  • workaround: maybe resurrect the old sync fsspec implementation and see if we can use that, or build our own

  • workaround: add weak references in spots so GC isn't so active (once we find where the cycles are)
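For the "good visual" on reference cycles, here is a stdlib-only sketch of the kind of inspection we mean; the filter on module names is just a guess at which objects will be interesting:

```python
# Force unreachable objects to be kept in gc.garbage, then look for anything
# from fsspec/s3fs that is only reachable through a reference cycle.
import gc

gc.set_debug(gc.DEBUG_SAVEALL)
# ... run the part of the recipe that opens and reads files ...
gc.collect()

suspects = [
    obj for obj in gc.garbage
    if type(obj).__module__ and type(obj).__module__.startswith(("fsspec", "s3fs"))
]
for obj in suspects:
    # show what still points at each suspected file-like handle
    print(type(obj), [type(r).__name__ for r in gc.get_referrers(obj)])
```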

@rabernat
Contributor

rabernat commented Apr 1, 2024

Could try with copy_to_local=True in OpenWithXarray.
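If it helps, this is roughly where that flag slots into the pipeline (a sketch only; `pattern` and the StoreToZarr arguments are placeholders from a typical recipe, not the one in this issue):

```python
# Sketch: copy each source file to local disk before xarray opens it, so all
# reads go through a local file instead of ranged S3 requests via fsspec.
import apache_beam as beam
from pangeo_forge_recipes.transforms import OpenURLWithFSSpec, OpenWithXarray, StoreToZarr

transforms = (
    beam.Create(pattern.items())
    | OpenURLWithFSSpec()
    | OpenWithXarray(file_type=pattern.file_type, copy_to_local=True)
    | StoreToZarr(
        store_name="output.zarr",
        combine_dims=pattern.combine_dim_keys,
    )
)
```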

@ranchodeluxe
Contributor Author

ranchodeluxe commented Apr 7, 2024

With my new s3fs fork providing S3SyncFileSystem and S3SyncFile, I can get the StoreToPyramid workflows to run without hanging. I did a rough "literalist" translation of the async code, which isn't what we'll want in the end. A second pass will be needed to figure out better approaches for dealing with async patterns such as asyncio.gather(*[futures]) beyond just looping.
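Roughly, the "literalist" translation means replacing the async fan-out with a plain loop; the names below are illustrative placeholders, not the actual s3fs internals:

```python
# Async original (roughly):
#     results = await asyncio.gather(*[self._cat_file(p) for p in paths])
#
# Sync translation in the fork (roughly): one blocking S3 GET per path.
def cat_files_sync(fs, paths):
    results = []
    for path in paths:
        results.append(fs.cat_file(path))
    return results
```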

@ranchodeluxe
Contributor Author

Also, look into using this: https://www.hdfgroup.org/solutions/cloud-amazon-s3-storage-hdf5-connector/ instead of any repurposed synchronous tooling.
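For the record, that connector is the HDF5 ros3 virtual file driver, which h5py can use when the underlying HDF5 build includes it; whether our HDF5 builds have it, and the exact credential kwargs, are assumptions to verify:

```python
# Sketch: open an HDF5/netCDF4 file straight from S3 via HDF5's ros3 driver,
# bypassing fsspec/s3fs entirely (reads are synchronous ranged GETs inside HDF5).
import h5py

f = h5py.File(
    "https://example-bucket.s3.us-west-2.amazonaws.com/path/to/file.nc",
    driver="ros3",
    aws_region=b"us-west-2",
    secret_id=b"<AWS_ACCESS_KEY_ID>",
    secret_key=b"<AWS_SECRET_ACCESS_KEY>",
)
print(list(f.keys()))
```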
