
POC: Dataset with schema #124

Closed · wants to merge 1 commit
Conversation

@ravwojdyla (Collaborator) commented Aug 19, 2020

This is a POC, and I would like to get your feedback about the idea first, before finishing up tests, docs and coverage. This comes from #43 and follows up on the comment in #103. #43 is a lot more elaborate since it strives for statically typed feedback, whilst this is much more stripped down.

I have tried to:

  • not use an extra wrapper around the xarray Dataset (since we have discussed that we would like to avoid one)
  • strive for a consistent API whilst still being concise and relatively flexible

The core idea is (a sketch follows the list):

  • hold a "schema" in the attrs of the xarray Dataset
  • define specs for meaningful/useful arrays and validate each spec against the variables in the Dataset when the schema is specified
  • the schema is then essentially a data/array spec plus a pointer to a variable in the Dataset
  • a schema spec can point to many variables, and by default points to the default variable
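
To make this concrete, here is a minimal sketch of what a spec-plus-schema in attrs could look like. All names here (SCHEMA_KEY, the ArrayLikeSpec fields, with_spec) are illustrative assumptions, not the PR's actual code:

```python
# Minimal sketch only; names and fields are assumptions, not the PR's API.
from dataclasses import dataclass
from typing import Hashable

import xarray as xr

SCHEMA_KEY = "sgkit_schema"  # hypothetical attrs key holding the schema


@dataclass(frozen=True)
class ArrayLikeSpec:
    """Constraints that a variable must satisfy."""
    default_name: str
    kind: str  # numpy dtype kind, e.g. "f" for float, "i" for integer
    ndim: int


def validate(ds: xr.Dataset, spec: ArrayLikeSpec, name: Hashable) -> None:
    """Check that the named variable satisfies the spec."""
    var = ds[name]
    if var.dtype.kind != spec.kind or var.ndim != spec.ndim:
        raise ValueError(f"{name!r} does not satisfy spec {spec}")


def with_spec(ds: xr.Dataset, spec: ArrayLikeSpec, name: Hashable) -> xr.Dataset:
    """Validate and record the spec -> variable pointer in attrs (shallow copy)."""
    validate(ds, spec, name)
    new_ds = ds.copy(deep=False)
    new_ds.attrs[SCHEMA_KEY] = {**ds.attrs.get(SCHEMA_KEY, {}), spec.default_name: name}
    return new_ds
```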

Benefits:

  • A single place for the definition of useful/reserved variables and their constraints
  • A single way to declare that specific variable(s) have a specific meaning and spec
  • As the pipeline flows and results are merged into a single dataset, that dataset can be used by different functions even if it contains custom variable names (which would be declared once)

As a user (a usage sketch follows the list):

  • if you don't change any of the precomputed variable names, you don't need to do anything
  • if a function requires that you specify which variables to use for computation, you must do so via SgkitSchema.spec before calling the function
  • SgkitSchema.spec returns a new Dataset (shallow copy) with an updated schema in attrs
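
Building on the sketch above, the user-facing workflow might look like the following (hypothetical example; the dosage spec values are assumptions):

```python
import numpy as np
import xarray as xr

# Reuses ArrayLikeSpec / with_spec / SCHEMA_KEY from the sketch above.
dosage_spec = ArrayLikeSpec(default_name="dosage", kind="f", ndim=2)

# A user with a custom variable name declares it once, up front:
ds = xr.Dataset({"my_dosage": (("variants", "samples"), np.random.rand(10, 3))})
ds = with_spec(ds, dosage_spec, "my_dosage")  # shallow copy with updated schema

# Functions then resolve their inputs through the schema in attrs:
assert ds.attrs[SCHEMA_KEY]["dosage"] == "my_dosage"
```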

What is missing:

  • get your feedback, and depending on your feedback:
  • polish the API (e.g. make it easier to fetch a single named variable)
  • discuss and complete all the spec constraints
  • make it easier to merge Datasets together with their schemas
  • more documentation
  • more tests
  • update regenie (essentially the same as regression)

This POC removes all the required/optional variable names from function arguments; if those need to be specified, or are custom, the user needs to specify them via SgkitSchema.spec. Alternatively, we could keep those arguments where necessary (example) and call SgkitSchema.spec inside the functions (triggering validation etc).

One more point: we could make it unnecessary to declare default names in the schema: if a name is missing from the schema but requested, assume the default name, check the variable against the spec, and return the name.
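
A sketch of that fallback, reusing the hypothetical helpers from the sketch above:

```python
def resolve_name(ds: xr.Dataset, spec: ArrayLikeSpec) -> Hashable:
    """Resolve a spec to a variable name, falling back to the default name."""
    schema = ds.attrs.get(SCHEMA_KEY, {})
    name = schema.get(spec.default_name, spec.default_name)
    if name not in ds:
        raise KeyError(f"no variable found for spec {spec.default_name!r}")
    validate(ds, spec, name)  # check the variable against the spec
    return name
```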

mergify bot (Contributor) commented Aug 24, 2020

This PR has conflicts, @ravwojdyla please rebase and push updated version 🙏

mergify bot added the conflict label Aug 24, 2020
@tomwhite (Collaborator) commented:

Thanks for putting this together @ravwojdyla.

I think this is useful for doing internal type checking, and for documenting the xarray variables (what they mean, as well as their names, kinds, dimensions). Apart from the latter, I think this change should be largely invisible to the user. This is so that sgkit can be used as vanilla xarray for users, and also so that we can change things more easily in the future.

> if a function requires that you specify which variables to use for computation, you must do so via SgkitSchema.spec before calling the function

I don't think this is particularly usable, so I would not expose this method to users, and instead keep the variable name function arguments in the methods where they are immediately discoverable by users.

As a user, I'd like to be able to rename variables manually using xarray API calls, and then sgkit would do the necessary revalidation without me having to worry about it (or even know it has happened).
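
For illustration (assuming a schema stored in attrs as sketched earlier): Dataset.rename preserves dataset-level attrs, so a plain rename silently leaves the schema pointing at the old name, which is exactly the situation sgkit would need to detect and revalidate transparently:

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({"dosage": (("variants", "samples"), np.zeros((4, 2)))})
ds.attrs["sgkit_schema"] = {"dosage": "dosage"}  # hypothetical schema entry

renamed = ds.rename({"dosage": "my_dosage"})
# attrs survive the rename, but the schema pointer is now stale:
assert renamed.attrs["sgkit_schema"]["dosage"] == "dosage"
assert "dosage" not in renamed
```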

Regarding the variable catalogue and documentation, we could have a flat namespace so they all appeared in one place (although I'm not opposed to introducing some kind of logical grouping). Instead of sgkit.typing.SgkitSchema.dosage we might have sgkit.variables.dosage (again hiding implementation details), so you could easily browse all variables on the sgkit.variables page. It would be nice if each variable linked to all the methods it was used by - it certainly needs to be linked to from the method documentation.
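
A flat sgkit.variables module could be as simple as the following (assumed layout and spec values, reusing the hypothetical ArrayLikeSpec from above):

```python
# sgkit/variables.py - one flat module so the docs list every variable together
dosage = ArrayLikeSpec(default_name="dosage", kind="f", ndim=2)
call_genotype = ArrayLikeSpec(default_name="call_genotype", kind="i", ndim=3)
sample_id = ArrayLikeSpec(default_name="sample_id", kind="U", ndim=1)
```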

@ravwojdyla (Collaborator, Author) commented Aug 25, 2020

@tomwhite thanks for the review:

>> if a function requires that you specify which variables to use for computation, you must do so via SgkitSchema.spec before calling the function
>
> I don't think this is particularly usable, so I would not expose this method to users, and instead keep the variable name function arguments in the methods where they are immediately discoverable by users.

I agree SgkitSchema.spec is cumbersome, but it does make the API super consistent - both input and output are xr.Datasets. I see your point though. In that case we would need to all agree that all methods must expose all the variable names used in the implementation, which is not the case right now - I'm not against it, but it's something we need to agree on. Furthermore, if a user decides to use alternative names, they will likely need to pass those to every single sgkit method they use, which means they essentially need to maintain a mapping that the current PR version builds into the "schema" attribute. Again - not against it, but something we also need to agree we expect users to do. Alternatively, we could add some helper to define user mappings(?)

It's probably fair to observe that we would like vanilla xarray to work with sgkit functions and we don't want to use wrappers, yet pure xarray lacks the "context" those methods require (at least the variable names), which means we have at least 4 options (sketched in code after the list):

  1. provide the context as a function argument (each function exposes all variable names as argument(s))
  2. attach context to the xarray (this PR)
  3. have a global context which users can update with custom names
  4. mix of 1 + 3: functions expose all variable names, and if not provided, look at the global context
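
Rough sketches of options 1 and 3 for concreteness (all names are assumptions; option 4 would be option 1 with option 3's mapping as the fallback):

```python
import xarray as xr

# Option 1: each function exposes the variable names it uses as arguments.
def gwas_linear_regression(ds: xr.Dataset, *, dosage: str = "dosage") -> xr.Dataset:
    return ds[[dosage]]  # stand-in for the real computation


# Option 3: a global context that users update once with their custom names.
_GLOBAL_NAMES = {"dosage": "dosage"}


def set_variable_name(canonical: str, custom: str) -> None:
    _GLOBAL_NAMES[canonical] = custom


# Option 4: like option 1, but a dosage=None default would fall back to
# _GLOBAL_NAMES["dosage"] when the argument is not provided.
```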

Without a doubt there are more options out there, but what do you think about the ones above? There are obviously pluses/minuses to each of them. (fyi @eric-czech)

Edit: I'm quite fond of the idea of a wrapper/encapsulation (+ a public xarray field), but out of the options above, without diving into details: [2, 1, 4, 3].

@eric-czech (Collaborator) commented Aug 25, 2020

To some of those points:

> One more point: we could make it unnecessary to declare default names in the schema: if a name is missing from the schema but requested, assume the default name, check the variable against the spec, and return the name.

Given that both xr.merge and xr.concat would drop schema attributes by default, I think it would actually be quite common to lose all the attributes between sgkit function calls unless a user is willing to backtrack and rethink the merging. I can imagine this meaning that they'd need to redeclare custom fields multiple times (unless there was some global context for it as mentioned in later comments).
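
For illustration, this is how it plays out in xarray (the combine_attrs argument exists in newer xarray releases, and its default has changed across versions, so it is passed explicitly here):

```python
import numpy as np
import xarray as xr

a = xr.Dataset({"x": ("d", np.arange(3))}, attrs={"sgkit_schema": {"x": "x"}})
b = xr.Dataset({"y": ("d", np.arange(3))})

dropped = xr.merge([a, b], combine_attrs="drop")   # schema attribute is lost
kept = xr.merge([a, b], combine_attrs="override")  # attrs of the first object win
assert "sgkit_schema" not in dropped.attrs
assert "sgkit_schema" in kept.attrs
```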

> A single place for the definition of useful/reserved variables and their constraints

+1 to this for sure, I like the freedom we would have with something like ArrayLikeSpec.

> Instead of sgkit.typing.SgkitSchema.dosage we might have sgkit.variables.dosage

+1 to that since it aligns more with pandas.options

> mix of 1 + 3: functions expose all variable names, and if not provided, look at the global context

This reminds me of tf.name_scope. Personally, I like the idea of names being customizable in the function signatures, within a context manager, and globally (a sketch below). I can see that being a lot of work though. I haven't found it burdensome to rename things when necessary via Xarray, so I think this one would be excellent but is also something we can leave to users for a while. I don't know at what point it should become a higher priority.
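
A sketch of that tf.name_scope-style idea (entirely hypothetical, not sgkit's API): resolution order is explicit argument first, then the active context, then the global default:

```python
import contextlib

_name_stack = [{"dosage": "dosage"}]  # bottom entry holds the global defaults


@contextlib.contextmanager
def variable_names(**names):
    """Temporarily override variable names within a block."""
    _name_stack.append({**_name_stack[-1], **names})
    try:
        yield
    finally:
        _name_stack.pop()


def resolve(canonical, override=None):
    """Explicit argument wins, then the innermost active context."""
    return override if override is not None else _name_stack[-1][canonical]


with variable_names(dosage="my_dosage"):
    assert resolve("dosage") == "my_dosage"   # from the context manager
    assert resolve("dosage", "arg") == "arg"  # explicit argument wins
assert resolve("dosage") == "dosage"          # back to the global default
```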

> I agree SgkitSchema.spec is cumbersome, but it does make the API super consistent

I definitely agree that it would force us to make the functions more consistent, though I think it is an awkward construct when mixed with a bunch of Xarray/Pandas code. I'm +1 on finding some way to make that an internal detail.

@ravwojdyla (Collaborator, Author) commented:

@tomwhite @jeromekelleher @alimanfoo any opinions on this PR and the comments above? ^

@jeromekelleher (Collaborator) commented Sep 2, 2020

I like the idea, and it's an elegant way of stating and enforcing the shared common structure of datasets. I wonder if this is a bit premature though, or perhaps even opposed to the way that things are currently evolving. From my (hands-off, admittedly) perspective we seem to be moving towards an idiom of the dataset being a "vanilla" xarray dataset that every function adds to, and can have lots of variables and be sliced and diced in various different ways. For example, stats functions will all compute a result and these results will be included as variables in the output dataset. So, assuming we have a big library with lots of functions, then is the output variable of every function going to have to be registered in the central schema? I can see this being annoying for developers and something that we would have to explain repeatedly. But, maybe this is better than the slightly anarchic way things are going at the moment 🤷

My take would be to let the library evolve in the fairly loosely coupled way that it has been moving for another few months, and to pick this up again when things have been fleshed out a bit more.

@tomwhite (Collaborator) commented Sep 3, 2020

At the moment we have fixed variable names, which I don't think is a problem yet.

Of the four options, I think I like option 1 best (provide the context as a function argument) - at least to start with, then move to 4 (functions expose all variable names, and if not provided, look at the global context).

@ravwojdyla (Collaborator, Author) commented:

Closing this because this PR is based on an internal branch. At a weekly meeting we decided to adjust this PR to option 1 and, once that is done, explore whether we want option 4.

@ravwojdyla closed this Sep 3, 2020
This pull request was later referenced several times in ravwojdyla/sgkit (Sep 23 - Oct 2, 2020).
@ravwojdyla deleted the rav/schema_v0_2 branch Sep 28, 2020