Auto Discovery of Calculable Values #30

Open
DocOtak opened this issue Feb 25, 2022 · 11 comments

@DocOtak
Owner

DocOtak commented Feb 25, 2022

I was thinking of a feature that would be nice to have (in the short or long term): being able to give a dataset ds to the gsw wrapped functions, so that gsw-xarray gets the necessary dataarrays from ds based on their standard_name. One could then write: gsw.sigma0(ds). This can lead to many problems (e.g. how to deal with datasets containing more than 1 array for 1 standard_name), but we can keep the discussion on how to solve these problems for later, if we decide to implement this feature.
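
As a rough illustration of the lookup described above, here is a minimal sketch that finds a variable by its standard_name and refuses to guess when several variables match. A plain dict of attribute dicts stands in for an xarray dataset, and the function and exception names are hypothetical, not part of gsw-xarray.

```python
class AmbiguousStandardName(ValueError):
    """Raised when several variables share the requested standard_name."""


def find_by_standard_name(attrs_by_var, standard_name):
    """Return the name of the single variable carrying `standard_name`.

    `attrs_by_var` maps variable names to attribute dicts; it is a
    stand-in for {name: ds[name].attrs for name in ds.data_vars}.
    """
    matches = [
        name
        for name, attrs in attrs_by_var.items()
        if attrs.get("standard_name") == standard_name
    ]
    if not matches:
        raise KeyError(f"no variable with standard_name={standard_name!r}")
    if len(matches) > 1:
        raise AmbiguousStandardName(
            f"{standard_name!r} matches several variables: {matches}"
        )
    return matches[0]


# Hypothetical dataset attributes (variable names are made up).
attrs = {
    "SA": {"standard_name": "sea_water_absolute_salinity"},
    "CT": {"standard_name": "sea_water_conservative_temperature"},
    "temp_raw": {"units": "degC"},
}
sa_name = find_by_standard_name(attrs, "sea_water_absolute_salinity")
```

Raising on ambiguity, rather than picking the first match, keeps the "more than 1 array for 1 standard_name" case visible to the user.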

Originally posted by @rcaneill in #1 (comment)

@DocOtak
Owner Author

DocOtak commented Feb 25, 2022

Totally agree this would be awesome, and should be attempted (iirc it was one of our original goals).

Initial thoughts:

  • It might be good to rely on the cf-xarray package for attribute discovery
  • Many of the inputs to gsw functions are the outputs of other functions. For example, in observational datasets, you'll never have CT or SA. So when I ask for rho, I probably want this to automatically calculate the needed CT and SA values from the in situ temperature, practical salinity, pressure, and lat/lon.

@rcaneill rcaneill added this to the Version 0.3.0 milestone Feb 26, 2022
@rcaneill
Collaborator

I think there are 2 slightly different features here:

  1. A simple one, where gsw.sigma0(ds) would take SA and CT from ds (but they need to exist)
  2. A more complex one, where gsw_xarray tries to find a path from the existing variables to what is needed (here SA and CT). This looks like the problem GNU make solves. We don't need such a complex tool, but we can borrow ideas from the way it works
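
To make the second feature concrete, here is a minimal sketch of a make-like planner. It applies every rule whose inputs are already known until the target becomes reachable, then prunes the steps the target does not depend on. The argument lists follow the GSW function signatures, but the rule table and the `plan` function are illustrative assumptions, not existing gsw-xarray code.

```python
# Each rule: output variable -> (function name, required inputs).
RULES = {
    "SA": ("SA_from_SP", ("SP", "p", "lon", "lat")),
    "CT": ("CT_from_t", ("SA", "t", "p")),
    "sigma0": ("sigma0", ("SA", "CT")),
    "rho": ("rho", ("SA", "CT", "p")),
}


def plan(available, target, rules=RULES):
    """Return the ordered function names needed to compute `target`."""
    known = set(available)
    fired = []  # outputs, in the order their rules became applicable
    progress = True
    while target not in known and progress:
        progress = False
        for output, (_, inputs) in rules.items():
            if output not in known and known.issuperset(inputs):
                known.add(output)
                fired.append(output)
                progress = True
    if target not in known:
        raise ValueError(f"cannot compute {target!r} from {sorted(available)}")

    # Prune: walk back from the target and keep only what it depends on.
    needed = {target}
    for output in reversed(fired):
        if output in needed:
            needed.update(rules[output][1])
    return [rules[output][0] for output in fired if output in needed]


# Typical observational inputs: practical salinity, in situ temperature, etc.
observed = {"SP", "t", "p", "lon", "lat"}
steps = plan(observed, "sigma0")
```

With these inputs the plan is SA_from_SP, then CT_from_t, then sigma0; the rho rule also becomes applicable along the way but is pruned because sigma0 does not depend on it.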

@rcaneill rcaneill modified the milestones: Version 0.3.0, Version 0.4.0 Mar 23, 2022
@rcaneill
Collaborator

When trying to implement this feature, we will have a problem: many inputs of the gsw functions don't have CF standard names (e.g. saturation_fraction, entropy, enthalpy, etc). I see 2 ways to handle this:

  1. Open a request to add them to the CF conventions (I have never done this so I don't know how it works)
  2. Use custom criteria with cf-xarray: https://cf-xarray.readthedocs.io/en/latest/custom-criteria.html

I think that these 2 options can be used together, as I guess that adding things to the CF standard names can take some time

@rcaneill rcaneill self-assigned this Jul 7, 2022
@rcaneill
Collaborator

rcaneill commented Jul 7, 2022

I am working on implementing this

@rcaneill
Collaborator

rcaneill commented Jul 7, 2022

About the API, the way I see it would be:
gsw_xarray.sigma0(ds), with ds a dataset containing (hopefully) all dataarrays necessary to compute sigma0, even with some extra steps. Does it make sense to do it this way?
It will raise a ValueError if the dataset does not contain enough information.

I guess we can also add an extra argument: gsw_xarray.sigma0(ds, inplace=True) (or False) to return a new dataset or to add the dataarrays into ds.

Question: if a user has h (enthalpy), z, lat, and SA in ds and wants to compute sigma0 (which is possible), should we save the necessary intermediate variables (e.g. CT) in ds, or should we only return the final result? I think that both options make sense: the 1st option is good if the user then asks for sigma1, sigma2, sigma3, because CT will already be computed. This could be controlled by another argument (e.g. intermediate_variables=True).

@rcaneill
Collaborator

rcaneill commented Jul 7, 2022

Here is my WIP notebook, with the algo I developed. It is not perfect, but it works well! (For now it is based on names in ds, not on standard names.)

https://gist.github.com/rcaneill/0aa8b9e72112d079c4919e462a4bb378

@rcaneill
Collaborator

rcaneill commented Jul 7, 2022

I tried the other option, i.e. starting from the variable we want and going backward through the graph, but in the end it was not working (easy to get trapped in cycles), and it needed around 10 times more lines of code. So I think that it is better to do it the way I do in the notebook.
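
For readers wondering where the cycles come from: GSW has conversions in both directions (SA from SP and SP from SA), so a naive backward search can bounce between the two forever. The sketch below shows one common guard, which tracks the targets currently being resolved. It is only an illustration of the backward approach under an assumed rule table, not the notebook's algorithm and not a recommendation over it.

```python
# Several rules can feed each other, which is what creates cycles:
# SA can come from SP, and SP can come from SA.
RULES = {
    "SA": [("SA_from_SP", ("SP", "p", "lon", "lat"))],
    "SP": [("SP_from_SA", ("SA", "p", "lon", "lat"))],
    "CT": [("CT_from_t", ("SA", "t", "p"))],
    "sigma0": [("sigma0", ("SA", "CT"))],
}


def resolve(target, available, rules=RULES, _active=None):
    """Return function names, in call order, that produce `target`.

    Returns None when no chain of rules reaches the available variables.
    `_active` holds the targets currently being resolved; meeting one of
    them again means the search has looped back on itself.
    """
    if target in available:
        return []
    active = set() if _active is None else _active
    if target in active:
        return None  # cycle detected: abandon this branch
    active.add(target)
    try:
        for func, inputs in rules.get(target, []):
            steps = []
            for name in inputs:
                sub = resolve(name, available, rules, active)
                if sub is None:
                    break  # this rule cannot be satisfied
                steps += [s for s in sub if s not in steps]
            else:
                return steps + [func]
        return None
    finally:
        active.discard(target)


observed = {"SP", "t", "p", "lon", "lat"}
steps = resolve("sigma0", observed)
# Neither SA nor SP is available here: the SA <-> SP loop is cut short.
unreachable = resolve("SA", {"p", "lon", "lat"})
```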

@dcherian

dcherian commented Jul 7, 2022

gsw_xarray.sigma0(ds, inplace=True) could easily just be ds.merge(gsw_xarray.sigma0(ds)), so you don't need to support the inplace keyword

@DocOtak
Owner Author

DocOtak commented Jul 14, 2022

That looks neat. I'll need to play with it a bit to understand the graph, but looks like a good starting place.

Some thoughts in no particular order:

  • ODV makes you specify your "key variables" when you are working with a data collection, basically you tell it which specific variable is the one it should use as, e.g. the practical salinity. My datafiles tend to have two channels of salinity, and often bottle salinity in addition to the CTD ones. We need some way of selecting the variables to use for calculations in the event of ambiguity/duplicates.
  • I wonder how well an implicit graph based on xarray accessors might work... if we made a gsw dataset accessor which let you lookup GSW properties as keys...
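
One possible shape for the "key variables" idea in the first bullet is an explicit mapping from standard name to variable name that takes priority over automatic discovery. The sketch below uses a plain dict of attribute dicts in place of an xarray dataset; `select_variable`, the `key_variables` argument, and the variable names are all hypothetical.

```python
def select_variable(attrs_by_var, standard_name, key_variables=None):
    """Pick the variable to use for `standard_name`.

    An explicit entry in `key_variables` wins; otherwise the match must
    be unique, and any ambiguity is reported instead of guessed.
    """
    key_variables = key_variables or {}
    if standard_name in key_variables:
        chosen = key_variables[standard_name]
        if chosen not in attrs_by_var:
            raise KeyError(f"key variable {chosen!r} is not in the dataset")
        return chosen
    matches = [
        name
        for name, attrs in attrs_by_var.items()
        if attrs.get("standard_name") == standard_name
    ]
    if len(matches) == 1:
        return matches[0]
    raise ValueError(
        f"{len(matches)} variables match {standard_name!r}: {matches}; "
        "pass key_variables to choose one"
    )


# Two CTD salinity channels plus bottle salinity (made-up variable names).
attrs = {
    "ctd_salinity_1": {"standard_name": "sea_water_practical_salinity"},
    "ctd_salinity_2": {"standard_name": "sea_water_practical_salinity"},
    "bottle_salinity": {"standard_name": "sea_water_practical_salinity"},
}
chosen = select_variable(
    attrs,
    "sea_water_practical_salinity",
    key_variables={"sea_water_practical_salinity": "ctd_salinity_1"},
)
```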

Mock example attempt:

import xarray as xr
from gsw_xarray import accessors  # or whatever is needed to register the xarray accessor under the gsw namespace

ds = xr.load_dataset("some_dataset.nc")
# Now we use the accessor to get GSW properties; things that need intermediate calculations should just call the accessor itself
SA = ds.gsw["SA"]  # uses PSAL or whatever it needs

rho = ds.gsw["rho"]  # internally calls ds.gsw["SA"] and ds.gsw["CT"]

@rcaneill
Collaborator

  • ODV makes you specify your "key variables" when you are working with a data collection, basically you tell it which specific variable is the one it should use as, e.g. the practical salinity. My datafiles tend to have two channels of salinity, and often bottle salinity in addition to the CTD ones. We need some way of selecting the variables to use for calculations in the event of ambiguity/duplicates.

Do you have any precise idea for this?

  • I wonder how well an implicit graph based on xarray accessors might work... if we made a gsw dataset accessor which let you lookup GSW properties as keys...

My guess is that this step would be quite straightforward once the underlying function is written (though I have never written an xarray accessor before, so it may not be that easy for me)

@rcaneill
Collaborator

rcaneill commented Jul 29, 2022

While working on this I realized that I first need to write the option to work with a dataset (e.g. gsw.sigma0(ds), where we take ds.SA and ds.CT based on standard name).
Because of TEOS-10/GSW-Python#97, it is not so easy for me to know whether I should store the detected variables in args or kwargs. Any thoughts on this?

You can answer this in PR #53
