The future of pangeo CMIP6 in the cloud #31

Open
jbusecke opened this issue Jan 28, 2022 · 10 comments

@jbusecke
Collaborator

jbusecke commented Jan 28, 2022

I would like to start a high level discussion about the priorities and organization of the pangeo CMIP6 archive in the cloud.

I have (officially?) taken over this effort, and would first and foremost like to thank @naomi-henderson for her tireless work in getting this effort off the ground! It is amazing what is already possible with all this data.

Overall I would like to discuss:

  • The generation and organization of zarr stores and the user-facing catalog
  • How to provide more efficient/automated communication with the growing user base.

Data generation/organization
I think the central model of organization works well. We store ‘datasets’ (time-concatenated) as zarr stores in a cloud bucket, and use a CSV catalog keyed on the unique combination of “facets”. What we should discuss here is how to populate the cloud bucket. This is closely related to the question of how users can request additional datasets.
Previously this was managed using a request form and manual creation/upload of the datasets.
I hope we can eventually fully automate this, but I am aware that there might be a transition phase.
Ultimately I would like to be able to build a pangeo-forge recipe from a dataset_id string

"mip_era.activity_id.institution_id.source_id.experiment_id.variant_label.table_id.variable_id.grid_label.version"

(exact facets used pending an answer here), and have it build and upload the final zarr store entirely in the cloud (cc @cisaacstern). But since this is likely a more involved undertaking, what do folks think about intermediate solutions?
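
As a rough sketch of what that entry point could look like, the snippet below splits a dataset_id string into a facet dictionary and derives a store path from it. The facet names follow the template above (and are still pending the linked discussion), and the bucket layout is purely an assumption for illustration:

```python
# Sketch only: the facet names and the bucket layout below are assumptions.
FACETS = [
    "mip_era", "activity_id", "institution_id", "source_id", "experiment_id",
    "variant_label", "table_id", "variable_id", "grid_label", "version",
]

def parse_dataset_id(dataset_id):
    """Split a dataset_id string into a dictionary of facets."""
    parts = dataset_id.split(".")
    if len(parts) != len(FACETS):
        raise ValueError(f"expected {len(FACETS)} facets, got {len(parts)}: {dataset_id}")
    return dict(zip(FACETS, parts))

facets = parse_dataset_id(
    "CMIP6.CMIP.NOAA-GFDL.GFDL-CM4.historical.r1i1p1f1.Amon.tas.gr1.v20180701"
)
# The target zarr store path could then be derived directly from the facets:
store_path = "gs://cmip6/" + "/".join(facets.values()) + ".zarr"
```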

Related: As suggested here we could consider augmenting the existing global variables with other attributes that might make things easier in the long run.

Filtering retracted datasets
The basic idea (established by @naomi-henderson) is that we maintain a “raw” catalog with every store ever created, and then filter it to have a user-facing “main” catalog, which only includes valid datasets and only the newest versions.
For a more in depth discussion see #30.
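
For reference, that filtering step could look something like the minimal pandas sketch below. The column names ("zstore", "retracted", "version") are assumptions and may not match the actual catalog schema:

```python
import pandas as pd

# Sketch only: column names are assumptions, not necessarily those of the real catalogs.
raw = pd.read_csv("cmip6-raw.csv")

# Drop stores belonging to retracted datasets ...
valid = raw[~raw["retracted"]]

# ... then keep only the newest version of each unique facet combination.
facet_cols = [c for c in valid.columns if c not in ("zstore", "version", "retracted")]
main = valid.sort_values("version").groupby(facet_cols, as_index=False).last()

main.to_csv("cmip6-main.csv", index=False)
```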

User communication and derived datasets
I think it is very important that we establish better visibility of “what is going on” and some sort of way to inform users of new developments. I really want to minimize the interactions via email!
Some ideas I had:

  • Automated twitter bot for the newest datasets
  • An automatically updated “what's new” section in this repo's docs. Ideally this would also include a web interface with search for all the datasets that have already been uploaded/requested and their status (e.g. “has problems”, “in queue”). If we build this, it might actually be possible to index ALL ESGF datasets and then request them with the click of a button if they are not yet in the cloud (one can dream, right?).

Any other ideas from people here?

Another issue that I (and seemingly other users) care about a lot is derived datasets. I have been prototyping some ways to generate derived datasets, but am not quite sure what the best way is for a broader audience to contribute them. You can find more discussion .

cc @rabernat

@rabernat
Member

rabernat commented Jan 28, 2022

Julius, thanks so much for raising this important discussion. You are asking all the right questions.

Political Stuff

For me, the elephant in the room is the future of ESGF itself. I think our project has demonstrated the value of an ARCO copy of CMIP6 in the cloud. That demonstration has influenced how ESGF is thinking about their infrastructure, as evidenced by the engagement from ESGF core members in our ongoing working group. At the same time, it has made clear that, although we have plenty of ideas, we (as in you, me, and @cisaacstern) do not have the bandwidth ourselves to expand the scope of what we are already doing without additional resources (person power and/or funding).

In my ideal world, the job of providing ARCO CMIP6 data access in a cloud-native way would simply be taken over by ESGF. Their mandate is to provide access to the CMIP6 data, and they have a whole boatload of funding to do so. I would love for the ESGF infrastructure to be so fast, performant, and easy to use that the Zarr data copy becomes unnecessary. On the other hand, the existing Zarr data has many users now dependent on it, so we can't just delete it.

To decide how to proceed, we need answers from ESGF. Specifically:

  • Does the GFDL team led by @aradhakrishnanGFDL plan to continue to expand and maintain the NetCDF archive in S3? If so, that makes our life easier, because we can simply say: we will just mirror the NetCDF data with Zarr. Any new data requests should go through GFDL / ESGF.
  • Will the new ESGF 2.0 project, to be announced any day (cc @scollis), include any elements of cloud-native data access? If so, can those components eventually supersede the Zarr data? At the least, can they make it simpler for us to maintain the Zarr data?

As explained by Scott, ESGF 2.0 won't kick off until Congress passes a budget. So we are in a holding pattern on all of this.

Technical Stuff

Ultimately I would like to be able to build a pangeo-forge recipe from a dataset_id string

This is exactly what is demonstrated in this Pangeo Forge tutorial - https://pangeo-forge.readthedocs.io/en/latest/tutorials/xarray_zarr/cmip6-recipe.html - for the dataset GFDL-CM4.historical.r1i1p1f1.Amon.tas.gr1

This shows that it is indeed feasible to use Pangeo Forge to produce CMIP6 Zarrs in an automated way.
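
For reference, the core of that tutorial boils down to something like the sketch below (assuming the pangeo-forge-recipes API as of early 2022); the URL list would in practice come from an ESGF search for the instance_id and is hard-coded here only for illustration:

```python
from pangeo_forge_recipes.patterns import pattern_from_file_sequence
from pangeo_forge_recipes.recipes import XarrayZarrRecipe

# Placeholder URLs: in practice these come from an ESGF search for the instance_id.
input_urls = [
    "https://esgf-data.example/tas_Amon_GFDL-CM4_historical_r1i1p1f1_gr1_185001-194912.nc",
    "https://esgf-data.example/tas_Amon_GFDL-CM4_historical_r1i1p1f1_gr1_195001-201412.nc",
]

# Concatenate the files along time and write a single zarr store for the dataset.
pattern = pattern_from_file_sequence(input_urls, concat_dim="time")
recipe = XarrayZarrRecipe(pattern, target_chunks={"time": 600})
```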

Another really transformative development since we started this project has been the creation of kerchunk. With Kerchunk, we can have the best of both worlds: Zarr-style access to existing netCDF data in the cloud. Kerchunk also allows us to virtually concatenate multiple files. That would enable us to basically reproduce the existing Zarr convenience (fast access, single file for all timesteps) without duplicating a petabyte of data. So going forward, I feel strongly that we should be exploring this option. The main downside is that it only works with Python; we don't know exactly what languages people are using to access the Zarr data, but Julia folks certainly are.
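
To make the idea concrete, a minimal kerchunk sketch might look like the following (file URLs and storage options are placeholders, and the exact arguments may differ by kerchunk version):

```python
import fsspec
import xarray as xr
from kerchunk.hdf import SingleHdf5ToZarr
from kerchunk.combine import MultiZarrToZarr

# Placeholder file URLs and (anonymous) storage options.
urls = ["s3://some-bucket/tas_period1.nc", "s3://some-bucket/tas_period2.nc"]
storage_options = {"anon": True}

# Build a reference index for each netCDF file ...
single_refs = []
for url in urls:
    with fsspec.open(url, **storage_options) as f:
        single_refs.append(SingleHdf5ToZarr(f, url).translate())

# ... then virtually concatenate them along time into a single index.
combined = MultiZarrToZarr(
    single_refs, concat_dims=["time"],
    remote_protocol="s3", remote_options=storage_options,
).translate()

# The combined index can be opened like a zarr store, without copying any data.
mapper = fsspec.get_mapper(
    "reference://", fo=combined,
    remote_protocol="s3", remote_options=storage_options,
)
ds = xr.open_dataset(mapper, engine="zarr", backend_kwargs={"consolidated": False}, chunks={})
```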


Bottom line, I would love to hand off the problem of simply providing the CMIP6 data on the cloud to ESGF in some way. Then we could focus on adding value on top of that data, namely by generating the kerchunk indexes and then moving on to the exciting problem of derived datasets. But in the meantime, the most important priority is to remove retracted data. So thanks for working on that! 👏

@scollis

scollis commented Jan 28, 2022

Hey @rabernat, thanks for the mention. Youthful exuberance will be my undoing, as ESGF 2.0 is yet to be awarded, linked to the issue you mention. We have to be cautious about being too public with our planning (something that does, to the surprise of no one who knows me, irk me). Happy to talk offline. Let's just say the proposal not only seeks to link heavily to Pangeo but also to contribute to it where appropriate.

@durack1

durack1 commented Feb 3, 2022

Just adding the right handle to this thread: @aradhakrishnanGFDL, welcome aboard!

@jbusecke
Collaborator Author

jbusecke commented Feb 4, 2022

@rabernat and I have worked on a prototype script that would take a dataset_id/instance_id and return a corresponding cloud store.

I am extremely excited to see the first test run working.

During the CMIP/ESGF zoom today we did, however, see some things that need to be improved:

  • I believe we are currently only querying a single ESGF node here. It should be fairly simple to query all available nodes (with the most preferred first) and break the loop as soon as a query returns results; see the sketch after this list.
  • We did encounter an issue with grid files (that have no time dimension to concatenate). If we could fix this upstream (Ignore concat_dim if only one file is passed pangeo-forge/pangeo-forge-recipes#275), I think that would be the most elegant solution.
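
A rough sketch of that multi-node loop (the node URLs and query parameters below are assumptions for illustration, not necessarily the ones used in the prototype script):

```python
import requests

# Assumed list of ESGF index nodes, ordered by preference.
ESGF_NODES = [
    "https://esgf-node.llnl.gov/esg-search/search",
    "https://esgf-data.dkrz.de/esg-search/search",
    "https://esgf-index1.ceda.ac.uk/esg-search/search",
]

def search_esgf(instance_id):
    """Query each node in turn and stop at the first one that returns results."""
    params = {
        "type": "File",
        "format": "application/solr+json",
        "instance_id": instance_id,  # exact facet/parameter names are an assumption
        "limit": 500,
    }
    for node in ESGF_NODES:
        try:
            response = requests.get(node, params=params, timeout=30)
            response.raise_for_status()
        except requests.RequestException:
            continue  # node unreachable or erroring, try the next one
        docs = response.json()["response"]["docs"]
        if docs:
            return docs  # break out at the first node that returns results
    return []
```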

Please feel free to post any feedback/comments here.

@rabernat
Member

rabernat commented Feb 4, 2022

Great to see you here @durack1! You're welcome at our bi-weekly working group meetings any time! (Though I'm sure you already have enough meetings to keep you busy. 😉 )

@durack1

durack1 commented Feb 4, 2022

Thanks @rabernat! We have WIP representation with @matthew-mizielinski, so I think we have our bases covered, but I am also happy to be pinged (@durack1) if you want a secondary perspective chimed in. I am normally locked up with kiddy dropoff until ~8:15 PST, so I would have a hard time attending the early meetings anyway.

@jbusecke
Collaborator Author

jbusecke commented Jul 27, 2022

Hey folks, it has been a while.

I just wanted to give an update on our efforts.

@cisaacstern and I have been working hard on a single recipe (cmip6-feedstock) where we were quite successful in automating @naomi-henderson's efforts, so that the user only has to provide the instance_id and the recipe generator does the rest.

We have used several requests (pangeo-forge/cmip6-feedstock#10, pangeo-forge/cmip6-feedstock#5, pangeo-forge/cmip6-feedstock#3) to expand the recipe to several 10s of datasets.

After some initial successes we realized that encapsulating the entirety of CMIP6 into a single recipe will be extraordinarily difficult to maintain.

Together with @rabernat, we agreed that we need to split CMIP6 into several feedstocks and centralize the 'generator code' (which queries the ESGF API to get both the urls and dynamically generated keyword arguments for the recipe) in a single repository: https://github.com/jbusecke/pangeo-forge-esgf.
I have made some progress today in terms of performance and reliability, but the code probably needs more refactoring (and testing). Factoring this code out into a separate repo ensures that each of the split repositories benefits from improvements made in the generator code, and also that the recipe code in each feedstock remains relatively light.

But the question I would bring up for discussion is: how do we split up the feedstocks?
Splitting along one or several of the facets seems like a natural choice.

I believe that splitting the feedstocks along the model (source_id) is the most natural choice from a maintainer perspective, since preprocessing issues are more likely to be similar within the output of a given model. But from a user/requester standpoint this seems less than ideal. I believe one of the most common requests is for one or several variables across all available models, which in this case would require them to make some 25+ PRs to different feedstocks.

I propose a two stage organization:

  • For the intermediate term (ongoing prototyping focused on requests) I propose splitting along activity_id, see e.g. Add first split CMIP6 feedstock pangeo-forge/staged-recipes#162. This is currently convenient for fulfilling requests.
  • In the far term I think the above split along source_ids is feasible if we can build a 'PR bot' that is able to automatically dispatch PRs from a centralized 'request repo'. EDIT: And if we can handle this number of feedstocks nicely on the catalog end.

I would love to get some feedback from folks here on possible other aspects that I might have overlooked.

@jbusecke
Collaborator Author

In the far term I think the above split along source_ids is feasible if we can build a 'PR bot' that is able to automatically dispatch PRs from a centralized 'request repo'.

I have just migrated some logic to pangeo-forge-esgf that could form the base for such a bot.

I'd imagine the user would make a request like: **I want all historical data for variable xyz.**

  • This would be submitted to a central repo by adding something like "CMIP6.CMIP.*.*.historical.*.*.xyz.*.*" to a text file.
  • This text file would then be parsed using this logic, which creates a long 'main' list of instance_ids that users want to work with.
  • To actually process these, we would split them automatically by the source_id facet and then automatically make a PR to the corresponding feedstock. This hopefully requires no input from the user (except maybe if something goes wrong 🤪); a rough sketch of the grouping step is below.
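
A rough sketch of that grouping/dispatch step (purely illustrative; the instance_id list here stands in for the output of the pangeo-forge-esgf parsing logic linked above):

```python
from collections import defaultdict

def group_by_source_id(instance_ids):
    """Group instance_ids by their source_id facet, one group per target feedstock."""
    groups = defaultdict(list)
    for iid in instance_ids:
        # CMIP6 instance_id facets:
        # mip_era.activity_id.institution_id.source_id.experiment_id...
        source_id = iid.split(".")[3]
        groups[source_id].append(iid)
    return dict(groups)

# e.g. if the requested variable "xyz" were tas, the expanded request might contain:
iids = [
    "CMIP6.CMIP.NOAA-GFDL.GFDL-CM4.historical.r1i1p1f1.Amon.tas.gr1.v20180701",
    "CMIP6.CMIP.MPI-M.MPI-ESM1-2-HR.historical.r1i1p1f1.Amon.tas.gn.v20190710",
]
for source_id, members in group_by_source_id(iids).items():
    print(f"open a PR against the {source_id} feedstock with {len(members)} instance_ids")
```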

@cisaacstern
Member

Where do you envision this meta-request would be made? That is, from where is the bot's activity triggered?

@jbusecke
Collaborator Author

Excellent question. I was naively thinking that we could convert the cmip6-feedstock into a 'phantom' feedstock that only contains a text file with the requested ids and the logic to dispatch to the other feedstocks?

But this is something I am fairly inexperienced with, so maybe you have a better approach?
