List of Beam PTransforms to implement to recreate XarrayZarrRecipe #376

Closed · 4 of 7 tasks
rabernat opened this issue Jun 6, 2022 · 3 comments
@rabernat (Contributor) commented Jun 6, 2022

When the Beam refactor is complete (#256), the Pangeo Forge public API will look quite different. Broadly, we will export three main things:

  1. High-level recipe "builders" that work identically or very similarly to our existing Recipe implementation. The precise API and internal structure for these are not yet determined. Will they be Beam PCollections? Dataclasses? We should experiment with some different design patterns.
  2. Beam PTransforms that can be composed together to create recipes. Following the Beam PTransform style guide, these PTransforms should be lightweight wrappers around generic Python functions, which themselves will be exposed as...
  3. General-purpose plain functions.

A sketch of how 2 and 3 might look is already underway in #375.
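For illustration only, here is a minimal sketch of the wrapper pattern described in points 2 and 3 (the function name, signature, and transform label below are placeholders, not the actual #375 code):

```python
import apache_beam as beam
import fsspec


def open_with_fsspec(url: str, **open_kwargs):
    """The plain, Beam-free function (item 3): easy to test in isolation."""
    return fsspec.open(url, **open_kwargs)


class OpenWithFSSpec(beam.PTransform):
    """The thin PTransform wrapper (item 2) around the plain function."""

    def __init__(self, **open_kwargs):
        super().__init__()
        self.open_kwargs = open_kwargs

    def expand(self, pcoll):
        return pcoll | "Open with fsspec" >> beam.Map(open_with_fsspec, **self.open_kwargs)
```

Keeping all of the real logic in the plain function means it can be unit-tested and reused outside of Beam, while the PTransform stays a trivial `beam.Map` wrapper.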

Let's use this issue to brainstorm all of the PTransforms we imagine we will need to implement to reach feature parity with XarrayZarrRecipe (a rough sketch of how they might compose follows the list):

  • FilePatternSource - Right now we are just using beam.Create(pattern.items()) to start our pipelines. It might be better to follow the docs for Custom I/O connectors.
  • OpenWithFSSpec - Turns whatever comes out of the FilePattern into an fsspec OpenFile object. Implemented in Improved Beam Opener PTransforms #375.
  • OpenWithXarray - Turns an OpenFile or URL PCollection into an Xarray dataset PCollection (without loading data into memory if possible). Implemented in Improved Beam Opener PTransforms #375.
  • InferXarraySchema - Takes a collection of Xarray datasets and figures out a schema for the target dataset. Implemented in Schema aggregation #377.
  • PrepareZarrTarget - Takes that schema and uses it to initialize the Zarr target. This is a singleton operation (single-item PCollection). Can we make that explicit in Beam? Implemented in Initialize target #379.
  • RechunkForTarget - Takes the Xarray Dataset PCollection, plus a specification of the target chunks, and returns a new PCollection that is evenly aligned with the target chunks.
  • ToZarr - Here we could possibly just use xarray-beam ChunksToZarr. In progress in Zarr fragment writers #391.
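Putting those together, a recipe pipeline might compose roughly like the sketch below. This is a sketch only: the transform names come from the list above, but every signature, plus `pattern`, `target_url`, and `target_chunks`, is hypothetical.

```python
import apache_beam as beam

# Assumes `pattern`, `target_url`, `target_chunks`, and the transforms
# listed above are defined elsewhere.
with beam.Pipeline() as p:
    datasets = (
        p
        | beam.Create(pattern.items())  # placeholder until a FilePatternSource exists
        | OpenWithFSSpec()
        | OpenWithXarray()
    )
    schema = datasets | InferXarraySchema()          # aggregate to a single schema
    target = schema | PrepareZarrTarget(target_url)  # singleton: initialize the Zarr store
    (
        datasets
        | RechunkForTarget(target_chunks=target_chunks)
        | ToZarr(target=beam.pvalue.AsSingleton(target))  # e.g. via xarray-beam ChunksToZarr
    )
```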
@alxmrs (Contributor) commented Jun 17, 2022

Re: FilePatternSource: You may be able to follow some examples of I/O connectors from the Geobeam project: https://github.com/GoogleCloudPlatform/dataflow-geobeam/blob/main/geobeam/io.py#L27
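For reference, the Geobeam sources in that file follow Beam's FileBasedSource pattern; a bare-bones sketch of that pattern (illustrative only, not Geobeam's actual code, with a hypothetical class name) looks like:

```python
import apache_beam as beam
from apache_beam.io import Read
from apache_beam.io.filebasedsource import FileBasedSource


class URLSource(FileBasedSource):
    """Hypothetical source that emits one record per matched file."""

    def read_records(self, file_name, offset_range_tracker):
        # A real connector would open and parse the file here; this sketch
        # just emits the matched file name as a single record.
        if offset_range_tracker.try_claim(offset_range_tracker.start_position()):
            yield file_name


# Usage sketch: p | Read(URLSource("gs://bucket/prefix/*.nc", splittable=False))
```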

rabernat pinned this issue Jun 22, 2022
@alxmrs (Contributor) commented Jul 12, 2022

One implementation note, related to FilePatternSource and all Openers: I've recently discovered that Geobeam's use of FileBasedSources comes with a performance tradeoff. There are more modern capabilities in the Beam framework that supercharge performance, namely the SplittableDoFn (this blog post provides an excellent explanation of it).

A sister team of mine has recently had great success using SDFs to read GFS data, and I'm starting to look into them in my own projects (google/weather-tools#189).
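For anyone exploring that route, the core of a Python SplittableDoFn is a RestrictionProvider plus a DoFn whose process method takes a restriction tracker. A minimal sketch (hypothetical names, byte-range splitting assumed) might look like:

```python
import apache_beam as beam
from apache_beam.io.restriction_trackers import OffsetRange, OffsetRestrictionTracker

CHUNK = 64 * 1024 * 1024  # arbitrary 64 MiB work unit for this sketch


class ByteRangeProvider(beam.transforms.core.RestrictionProvider):
    """Describes how elements of the form (url, total_bytes) split into byte ranges."""

    def initial_restriction(self, element):
        _url, size = element
        return OffsetRange(0, size)

    def create_tracker(self, restriction):
        return OffsetRestrictionTracker(restriction)

    def restriction_size(self, element, restriction):
        return restriction.size()


class ReadByteRanges(beam.DoFn):
    """SplittableDoFn: the runner can split the byte ranges and process them in parallel."""

    def process(self, element, tracker=beam.DoFn.RestrictionParam(ByteRangeProvider())):
        url, _size = element
        position = tracker.current_restriction().start
        while tracker.try_claim(position):
            # ... read bytes [position, position + CHUNK) from `url` here ...
            yield (url, position)
            position += CHUNK
```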

@cisaacstern (Member) commented Aug 25, 2023

This work is complete! 🙌 And I've opened #581 + #582 to continue unresolved discussion threads from this issue. Closing.
