
adding liveocean recipe #154

Closed
wants to merge 5 commits into from

Conversation

rsignell-usgs
Contributor

Closes #152

@pangeo-forge-bot

🎉 New recipe runs created for the following recipes at sha a32e84e224b31e5ace3d6d9d5236114b106f6ef6:

@cisaacstern
Member

Thanks for this contribution, @rsignell-usgs! I'll trigger a test of this recipe now.

@cisaacstern
Member

/run recipe-test recipe_run_id=987

@pangeo-forge-bot

It looks like your meta.yaml does not conform to the specification.

1 validation error for MetaYaml
provenance -> providers -> 0 -> description
  field required (type=value_error.missing)

Please correct your meta.yaml and commit the corrections to this PR.
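For readers hitting the same validation error: a sketch of the shape the validator expects, with field names taken from the error path above. The values are placeholders for illustration, not the actual contents of this PR's meta.yaml.

```yaml
provenance:
  providers:
    - name: "Example Provider Name"          # placeholder
      description: "Who produces this data"  # the required field that was missing
      roles:
        - producer                           # placeholder role
      url: "https://example.org"             # placeholder
```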

@pangeo-forge-bot

🎉 New recipe runs created for the following recipes at sha 6645efc0ea7252c98ea112dc9200befb9d787921:

@rsignell-usgs
Contributor Author

@cisaacstern , what is the next step? I see there is an error in the action here: https://github.com/pangeo-forge/staged-recipes/actions/runs/2689822509

@rabernat
Contributor

provenance -> providers -> 0 -> description
field required (type=value_error.missing)

FWIW, I find the requirement to provide a description of the provider to be a bit confusing and unnecessary.

@rsignell-usgs
Contributor Author

@rabernat, agreed! As a first-timer:

  • I found the "description" confusing.
  • I didn't know if I could add other fields like ORCID and github to the provider
  • I didn't know what to use for pangeo_notebook_version

@rabernat
Contributor

/run recipe-test recipe_run_id=989

@rabernat
Contributor

@cisaacstern - any idea what's happening with this one? This is our first reference recipe in PF, so I expect some of our assumptions will break.

@cisaacstern
Member

cisaacstern commented Jul 20, 2022

Yes, just checked. This is actually due to the fact that 0.8.3 is the latest version of recipes available on the cloud platform. This is a growing pains thing: obviously the latest recipes release should be automatically available on the cloud! But until we've worked out that automation, these are the kind of undocumented issues that arise. 🙃 ... I'll downgrade the recipes version for this recipe, and re-trigger the test. Edit: I don't think this recipe relies on any 0.9.0 features, so this should be fine.

@pangeo-forge-bot

🎉 New recipe runs created for the following recipes at sha 5a4367b198622f45411f74804262061cb4063baf:

@cisaacstern
Member

/run recipe-test recipe_run_id=993

@rabernat
Contributor

This recipe will not deposit a zarr dataset at all, but a kerchunk json file called reference.json. So we will have to deal with that at some point.

@pangeo-forge-bot

✨ A test of your recipe liveocean is now running on Pangeo Forge Cloud!

I'll notify you with a comment on this thread when this test is complete. (This could be a little while...)

In the meantime, you can follow the logs for this recipe run at https://pangeo-forge.org/dashboard/recipe-run/993

@pangeo-forge-bot

Pangeo Forge Cloud told me that our test of your recipe liveocean failed. But don't worry, I'm sure we can fix this!

To see what error caused the failure, please review the logs at https://pangeo-forge.org/dashboard/recipe-run/993

If you haven't yet tried pruning and running your recipe locally, I suggest trying that now.

Please report back on the results of your local testing in a new comment below, and a Pangeo Forge maintainer will help you with next steps!

@cisaacstern
Member

We can see in the logs link above that the error is

packages/pangeo_forge_recipes/recipes/reference_hdf_zarr.py", line 51, in finalize
    mzz = MultiZarrToZarr(
TypeError: __init__() got an unexpected keyword argument 'coo_dtypes'

This seems like a kerchunk version issue in the cloud environment, not a recipe issue... I'm investigating.
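This class of failure can be guarded against generically: before calling into a library whose signature changes across versions, drop keywords the installed version does not accept. A minimal stand-in sketch follows; the function name and its keywords are hypothetical stand-ins, not kerchunk's actual API.

```python
import inspect

def multizarr_standin(paths, remote_protocol=None):
    # Stand-in for an older MultiZarrToZarr-like callable that predates
    # the `coo_dtypes` keyword.
    return {"paths": paths, "remote_protocol": remote_protocol}

# The caller wants to pass a keyword the installed version may not support.
requested = {"paths": ["a.json"], "remote_protocol": "s3", "coo_dtypes": {"time": "float64"}}

# Filter the kwargs down to what the target signature actually accepts.
accepted = set(inspect.signature(multizarr_standin).parameters)
safe = {k: v for k, v in requested.items() if k in accepted}

result = multizarr_standin(**safe)  # no TypeError; unsupported kwarg dropped
```

Pinning the library version in the worker image, as done here, is the more robust fix; the guard above only degrades gracefully when the keyword is optional.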

@cisaacstern
Member

I believe I've fixed this issue by bumping the kerchunk version in the worker image. I'll re-run the test now.

@cisaacstern
Member

/run recipe-test recipe_run_id=993

@pangeo-forge-bot

✨ A test of your recipe liveocean is now running on Pangeo Forge Cloud!

I'll notify you with a comment on this thread when this test is complete. (This could be a little while...)

In the meantime, you can follow the logs for this recipe run at https://pangeo-forge.org/dashboard/recipe-run/993

@pangeo-forge-bot

Pangeo Forge Cloud told me that our test of your recipe liveocean failed. But don't worry, I'm sure we can fix this!

To see what error caused the failure, please review the logs at https://pangeo-forge.org/dashboard/recipe-run/993

If you haven't yet tried pruning and running your recipe locally, I suggest trying that now.

Please report back on the results of your local testing in a new comment below, and a Pangeo Forge maintainer will help you with next steps!

@cisaacstern cisaacstern added the dev (dev use only) directs registrar calls to staging api label Jul 20, 2022
@pangeo-forge-bot

🎉 New recipe runs created for the following recipes at sha 149521f3017c54cb4d4af6d4af1d9069bce3cf06:

Note: This PR is deployed to Pangeo Forge Cloud's dev backend, for which a full frontend website is not currently available. The links below therefore point to plain text information about the created recipe run(s).

@cisaacstern
Member

/run recipe-test recipe_run_id=66

@pangeo-forge-bot

✨ A test of your recipe liveocean is now running on Pangeo Forge Cloud!

I'll notify you with a comment on this thread when this test is complete. (This could be a little while...)

Note: This test is deployed to Pangeo Forge Cloud's dev backend, for which public logs are not yet available.

@pangeo-forge-bot

@cisaacstern
Member

This worked on Dataflow 🥳 . But as predicted by Ryan above, as our first reference recipe, @pangeo-forge-bot got confused regarding both:

  • the path it was stored in (terminates with .zarr); and
  • how to open it (assumed it was zarr, thus the error message in the last comment)

That being said, the dataset does exist, and is openable. The following files were created:

import s3fs
fs = s3fs.S3FileSystem(anon=True, client_kwargs=dict(endpoint_url="https://ncsa.osn.xsede.org"))
url = "s3://Pangeo/pangeo-forge-test/staging/recipe-run-66/pangeo-forge/staged-recipes/liveocean.zarr"
fs.ls(url)
['Pangeo/pangeo-forge-test/staging/recipe-run-66/pangeo-forge/staged-recipes/liveocean.zarr/reference.json',
 'Pangeo/pangeo-forge-test/staging/recipe-run-66/pangeo-forge/staged-recipes/liveocean.zarr/reference.yaml']

I wasn't sure how to open this directly over HTTP, so I first downloaded the reference.json:

!wget 'https://ncsa.osn.xsede.org/Pangeo/pangeo-forge-test/staging/recipe-run-66/pangeo-forge/staged-recipes/liveocean.zarr/reference.json'

Then opened it according to the method demonstrated in our tutorial here

import fsspec
import xarray as xr

m = fsspec.get_mapper(
    "reference://",
    fo="reference.json",
    target_protocol="file",
    remote_protocol="http",
    remote_options=dict(anon=True),
    skip_instance_cache=True,
)
ds = xr.open_dataset(
    m,
    engine='zarr',
    backend_kwargs={'consolidated': False},
    chunks={},
    decode_coords="all"
)
ds
Click to expand dataset repr
<xarray.Dataset>
Dimensions:         (ocean_time: 2, s_w: 31, eta_rho: 1302, xi_rho: 663,
                     tracer: 11, s_rho: 30, boundary: 4, eta_u: 1302,
                     xi_u: 662, eta_v: 1301, xi_v: 663, eta_psi: 1301,
                     xi_psi: 662)
Coordinates: (12/16)
    Cs_r            (ocean_time, s_rho) float64 dask.array<chunksize=(1, 30), meta=np.ndarray>
    Cs_w            (ocean_time, s_w) float64 dask.array<chunksize=(1, 31), meta=np.ndarray>
    h               (ocean_time, eta_rho, xi_rho) float64 dask.array<chunksize=(1, 651, 332), meta=np.ndarray>
    hc              (ocean_time) float64 dask.array<chunksize=(1,), meta=np.ndarray>
    lat_psi         (eta_psi, xi_psi) float64 dask.array<chunksize=(651, 331), meta=np.ndarray>
    lat_rho         (eta_rho, xi_rho) float64 dask.array<chunksize=(651, 332), meta=np.ndarray>
    ...              ...
    lon_u           (eta_u, xi_u) float64 dask.array<chunksize=(651, 331), meta=np.ndarray>
    lon_v           (eta_v, xi_v) float64 dask.array<chunksize=(651, 332), meta=np.ndarray>
  * ocean_time      (ocean_time) datetime64[ns] 2022-03-18 2022-03-18T01:00:00
  * s_rho           (s_rho) float64 -0.9833 -0.95 -0.9167 ... -0.05 -0.01667
  * s_w             (s_w) float64 -1.0 -0.9667 -0.9333 ... -0.06667 -0.03333 0.0
    zeta            (ocean_time, eta_rho, xi_rho) float32 dask.array<chunksize=(1, 1302, 663), meta=np.ndarray>
Dimensions without coordinates: eta_rho, xi_rho, tracer, boundary, eta_u, xi_u,
                                eta_v, xi_v, eta_psi, xi_psi
Data variables: (12/133)
    AKs             (ocean_time, s_w, eta_rho, xi_rho) float32 dask.array<chunksize=(1, 8, 434, 221), meta=np.ndarray>
    AKv             (ocean_time, s_w, eta_rho, xi_rho) float32 dask.array<chunksize=(1, 8, 434, 221), meta=np.ndarray>
    Akk_bak         (ocean_time) float64 dask.array<chunksize=(1,), meta=np.ndarray>
    Akp_bak         (ocean_time) float64 dask.array<chunksize=(1,), meta=np.ndarray>
    Akt_bak         (ocean_time, tracer) float64 dask.array<chunksize=(1, 11), meta=np.ndarray>
    Akv_bak         (ocean_time) float64 dask.array<chunksize=(1,), meta=np.ndarray>
    ...              ...
    zooFegest       (ocean_time) float64 dask.array<chunksize=(1,), meta=np.ndarray>
    zooI0           (ocean_time) float64 dask.array<chunksize=(1,), meta=np.ndarray>
    zooKs           (ocean_time) float64 dask.array<chunksize=(1,), meta=np.ndarray>
    zooMin          (ocean_time) float64 dask.array<chunksize=(1,), meta=np.ndarray>
    zooZeta         (ocean_time) float64 dask.array<chunksize=(1,), meta=np.ndarray>
    zooplankton     (ocean_time, s_rho, eta_rho, xi_rho) float32 dask.array<chunksize=(1, 8, 434, 221), meta=np.ndarray>
Attributes: (12/40)
    CPP_options:       U0KB, ADD_FSOBC, ADD_M2OBC, ANA_BPFLUX, ANA_BSFLUX, AN...
    Conventions:       CF-1.4, SGRID-0.3
    NLM_LBC:           \nEDGE:           WEST   SOUTH  EAST   NORTH  \nzeta: ...
    ana_file:          ROMS/Functionals/ana_btflux.h, ROMS/Functionals/ana_st...
    bio_file:          ROMS/Nonlinear/Biology/npzd2o_banas.h
    bpar_file:         /gscratch/macc/parker/LO_roms/cas6_v0_u0kb/f2022.03.18...
    ...                ...
    svn_rev:           824M
    svn_url:           https://www.myroms.org/svn/src/trunk
    tiling:            020x020
    title:             First LiveOcean input file
    type:              ROMS/TOMS history file
    var_info:          /gscratch/macc/parker/LiveOcean_roms/LO_ROMS/ROMS/Exte...

Completing pangeo-forge/pangeo-forge-recipes#268 would be useful for @pangeo-forge-bot to identify the dataset type, then on the backend we could do something like,

dataset_type = getattr(recipe, "dataset_type")
if dataset_type == "zarr":
    # ...
elif dataset_type == "reference":
    # ...
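To make the branch above concrete, here is a hypothetical, self-contained sketch of such a dispatch. The path suffixes mirror what this thread observed (a `.zarr` store vs. a `reference.json` file); the function name and logic are illustrative, not the actual backend code.

```python
# Illustrative only: choose a deposit path based on a recipe's dataset type,
# assuming recipes expose `dataset_type` per pangeo-forge/pangeo-forge-recipes#268.
def target_path(recipe_name: str, dataset_type: str) -> str:
    if dataset_type == "zarr":
        return f"{recipe_name}.zarr"            # a zarr store directory
    elif dataset_type == "reference":
        return f"{recipe_name}/reference.json"  # a kerchunk reference file
    raise ValueError(f"unknown dataset_type: {dataset_type!r}")

print(target_path("liveocean", "reference"))
```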

@rabernat or @rsignell-usgs, what's the most concise way to open

'https://ncsa.osn.xsede.org/Pangeo/pangeo-forge-test/staging/recipe-run-66/pangeo-forge/staged-recipes/liveocean.zarr/reference.json'

directly over http?
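[Editor's note: one plausible answer, sketched but untested against the live endpoint: fsspec's reference filesystem accepts a URL for `fo`, so the download step can be skipped by setting `target_protocol` to HTTPS. The block below only assembles the keyword dict; the names mirror the `fsspec.get_mapper` call earlier in the thread, and the endpoints follow the split worked out later in this conversation.]

```python
# Sketch: arguments for opening the reference file directly over HTTPS,
# without downloading reference.json first (untested against the live server).
ref_url = (
    "https://ncsa.osn.xsede.org/Pangeo/pangeo-forge-test/staging/"
    "recipe-run-66/pangeo-forge/staged-recipes/liveocean.zarr/reference.json"
)
storage_options = dict(
    fo=ref_url,               # fsspec accepts a URL here, not just a local path
    target_protocol="https",  # protocol for fetching reference.json itself
    remote_protocol="s3",     # protocol for the chunks the references point at
    remote_options=dict(
        anon=True,
        client_kwargs=dict(endpoint_url="https://mghp.osn.xsede.org"),
    ),
    skip_instance_cache=True,
)
# m = fsspec.get_mapper("reference://", **storage_options)
# ds = xr.open_dataset(m, engine="zarr", backend_kwargs={"consolidated": False})
```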

@rabernat
Contributor

rabernat commented Jul 21, 2022

This is excellent progress! 🎉 The reference dataset was successfully created and the reference files were deposited in OSN.

We just have to refactor the orchestration code to not assume that everything deposited will be Zarr. A class variable on each recipe class could be useful here (edit: duh that's exactly what pangeo-forge/pangeo-forge-recipes#268 is 🙃 )

what's the most concise way to open...directly over HTTP?

I'll leave this question to Rich. How do you want to interact with this data?

@rsignell-usgs
Contributor Author

rsignell-usgs commented Jul 22, 2022

Would it be appropriate to have pangeo-forge generate an intake catalog?
That would be the easiest way for users to interact!

sources:
  LiveOcean-Archive:
    driver: intake_xarray.xzarr.ZarrSource
    description: 'LiveOcean Forecast Archive'
    args:
      urlpath: "reference://"
      consolidated: false
      storage_options:
        fo: 'https://ncsa.osn.xsede.org/Pangeo/pangeo-forge-test/staging/recipe-run-66/pangeo-forge/staged-recipes/liveocean.zarr/reference.json'
        remote_options:
          anon: true
          client_kwargs: {'endpoint_url': 'https://mghp.osn.xsede.org'}
        remote_protocol: s3

@rabernat
Contributor

In fact it already has!

https://ncsa.osn.xsede.org/Pangeo/pangeo-forge-test/staging/recipe-run-66/pangeo-forge/staged-recipes/liveocean.zarr/reference.yaml

Which contains

sources:
  data:
    args:
      chunks: {}
      consolidated: false
      storage_options:
        fo: s3:///Pangeo/pangeo-forge-test/staging/recipe-run-66/pangeo-forge/staged-recipes/liveocean.zarr/reference.json
        remote_options:
          anon: true
          client_kwargs:
            endpoint_url: https://mghp.osn.xsede.org/
        remote_protocol: s3
        skip_instance_cache: true
        target_options: {}
        target_protocol: s3
      urlpath: reference://
    description: ''
    driver: intake_xarray.xzarr.ZarrSource

However, it doesn't seem to work

import intake
cat_url = "https://ncsa.osn.xsede.org/Pangeo/pangeo-forge-test/staging/recipe-run-66/pangeo-forge/staged-recipes/liveocean.zarr/reference.yaml"
cat = intake.open_catalog(cat_url)
ds = cat.data.to_dask()
NoSuchBucket: An error occurred (NoSuchBucket) when calling the ListObjectsV2 operation: The specified bucket does not exist

@martindurant - do you see any problem with the intake file?

@rsignell-usgs
Contributor Author

The intake catalog I supplied above works.

The catalog produced by pangeo-forge looks like it has a few problems:

  • extra slash in the URL to the JSON
  • even after converting 3 slashes to 2, the JSON URL doesn't seem to be public:

import fsspec
fs = fsspec.filesystem('s3', anon=True)
fs.ls('s3://Pangeo/pangeo-forge/')

returns "No Such Bucket"
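[Editor's note: the extra-slash problem is easy to see with a quick parse. `urlsplit` treats the text between `s3://` and the next slash as the bucket name, so a `s3:///...` URL yields an empty bucket, which is one route to a NoSuchBucket error. A minimal illustration:]

```python
from urllib.parse import urlsplit

prefix = "Pangeo/pangeo-forge-test/staging/recipe-run-66/pangeo-forge/staged-recipes"
bad = f"s3:///{prefix}/liveocean.zarr/reference.json"  # three slashes, as generated
good = f"s3://{prefix}/liveocean.zarr/reference.json"  # two slashes

print(repr(urlsplit(bad).netloc))   # '' -> empty bucket name
print(repr(urlsplit(good).netloc))  # 'Pangeo'
```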

@martindurant

Upper-case bucket names are pretty unusual, but they seem to be allowed. I also get NoSuchBucket with either "Pangeo" or "pangeo". On AWS S3, "pangeo" does exist, but needs credentials.

@rabernat
Copy link
Contributor

rabernat commented Jul 22, 2022

Remember that this data is on OSN, so you need the custom endpoints. The really complicated part is that there are actually two endpoints:

  • https://ncsa.osn.xsede.org/ - this is where the reference files live
  • https://mghp.osn.xsede.org/ - this is where the actual netcdf files live

@martindurant

I am supposing that the target_options (to read the json file) should be the same as the remote_options (to read the data); but I still get no-bucket:

s3 = fsspec.filesystem("s3", anon=True, client_kwargs={"endpoint_url": "https://mghp.osn.xsede.org/"})
s3.cat("Pangeo/pangeo-forge-test/staging/recipe-run-66/pangeo-forge/staged-recipes/liveocean.zarr/reference.json")

@rsignell-usgs
Contributor Author

rsignell-usgs commented Jul 22, 2022

This slight modification of the pangeo-forge catalog works. Just needed to flesh out the target_options to include anon=True and the endpoint_url:

sources:
  data:
    args:
      chunks: {}
      consolidated: false
      storage_options:
        fo: s3://Pangeo/pangeo-forge-test/staging/recipe-run-66/pangeo-forge/staged-recipes/liveocean.zarr/reference.json
        remote_options:
          anon: true
          client_kwargs:
            endpoint_url: https://mghp.osn.xsede.org/
        remote_protocol: s3
        skip_instance_cache: true
        target_options:
          anon: true
          client_kwargs:
            endpoint_url: https://ncsa.osn.xsede.org/
        target_protocol: s3
      urlpath: reference://
    description: ''
    driver: intake_xarray.xzarr.ZarrSource

@martindurant

Right, different endpoints :)

@cisaacstern
Member

To generate that automatically, I think we'll need a PR to pass target_options through to MultiZarrToZarr here?

@rsignell-usgs
Contributor Author

@martindurant, I could try this, but I have a feeling it would be smoother if you did the PR!

@rsignell-usgs
Contributor Author

rsignell-usgs commented Aug 1, 2022

@peterm790, want to take a stab at fixing this?

@rsignell-usgs
Contributor Author

@peterm790, just putting this back on your radar...

@peterm790

Hi yes, I have a branch set up with target_options added here. I just haven't worked out a way of setting up a test for this. If that is even needed?

@rsignell-usgs
Contributor Author

@peterm790 go ahead and submit a PR and the team will tell you what's needed!

@pangeo-forge-bot

🎉 New recipe runs created for the following recipes at sha 149521f3017c54cb4d4af6d4af1d9069bce3cf06:

Note: This PR is deployed to Pangeo Forge Cloud's dev backend, for which a full frontend website is not currently available. The links below therefore point to plain text information about the created recipe run(s).

@rsignell-usgs rsignell-usgs mentioned this pull request Aug 31, 2022
@cisaacstern
Member

cisaacstern commented Aug 31, 2022

@rsignell-usgs thanks so much for re-opening this. IIUC, this depends on pangeo-forge/pangeo-forge-recipes#399, which has not been released yet. So a couple last blockers (all on my side) before we can run this here:

  1. Release pangeo-forge-recipes. This is easy, can happen anytime.
  2. Our current @pangeo-forge-bot backend is difficult to update with new pangeo-forge-recipes releases. I'm pushing to deploy a new version which will be easier to update this week.
  3. Once the new backend is deployed, I'll need to update https://github.com/pangeo-data/pangeo-docker-images/tree/master/forge to get the latest pangeo-forge-recipes there.

Getting close here!


Successfully merging this pull request may close these issues.

Proposed Recipes for LiveOcean
6 participants