PR 124: POC: Dataset with schema #197

Closed
7 tasks
quansight-bot opened this issue Aug 27, 2020 · 0 comments

Pull request metadata

Pull request description

This is a POC, and I would like to get your feedback on the idea before finishing up the tests, docs and coverage. It comes out of #43 and follows up on the comment in #103. #43 is a lot more elaborate, since it aims for statically typed feedback, whereas this is a lot more stripped down.

I have tried to:

  • avoid an extra wrapper around the xr Dataset (since we have discussed that we would like to avoid one)
  • strive for a consistent API while still being concise and relatively flexible

The core idea is:

  • hold a "schema" in the attrs of the xr Dataset
  • define specs for meaningful/useful arrays, and validate each spec against the variables in the Dataset when the schema is specified
  • a schema is then essentially a data/array spec plus a pointer to a variable in the Dataset
  • a schema spec can point to many variables, and by default points to the spec's default variable
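
To make the core idea concrete, here is a minimal sketch. It is not the POC's implementation: `ArraySpec`, `Var`, `MiniDataset`, `validate` and `attach_spec` are hypothetical names, `MiniDataset` is a stdlib stand-in for an xr Dataset (just named variables plus an `attrs` dict), and the variable names and constraints are made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, Tuple


@dataclass(frozen=True)
class ArraySpec:
    """Hypothetical array spec: a default variable name plus constraints."""
    default_name: str
    ndim: int
    kind: str  # dtype kind, e.g. "i" (integer) or "f" (float)


@dataclass
class Var:
    """Stand-in for a Dataset variable; only the metadata needed to validate."""
    shape: Tuple[int, ...]
    kind: str


@dataclass
class MiniDataset:
    """Stand-in for xr.Dataset: named variables plus a free-form attrs dict."""
    variables: Dict[str, Var]
    attrs: dict = field(default_factory=dict)


def validate(ds: MiniDataset, spec: ArraySpec, name: str) -> None:
    """Raise if variable `name` is missing or violates `spec`."""
    if name not in ds.variables:
        raise KeyError(f"variable {name!r} not found in dataset")
    var = ds.variables[name]
    if len(var.shape) != spec.ndim:
        raise ValueError(f"{name!r}: expected {spec.ndim} dims, got {len(var.shape)}")
    if var.kind != spec.kind:
        raise ValueError(f"{name!r}: expected kind {spec.kind!r}, got {var.kind!r}")


def attach_spec(ds: MiniDataset, spec: ArraySpec, *names: str) -> MiniDataset:
    """Validate, then record spec -> variable names under attrs["schema"]."""
    targets = names or (spec.default_name,)  # fall back to the default variable
    for name in targets:
        validate(ds, spec, name)
    schema = dict(ds.attrs.get("schema", {}))
    schema[spec] = tuple(targets)
    # New container objects; the Var objects themselves are shared.
    return MiniDataset(dict(ds.variables), {**ds.attrs, "schema": schema})


# Illustrative spec and data (names and constraints are assumptions).
GENOTYPE_CALL = ArraySpec("call_genotype", ndim=3, kind="i")

ds = MiniDataset({
    "call_genotype": Var((10, 4, 2), "i"),
    "my_calls": Var((10, 4, 2), "i"),
})
ds_default = attach_spec(ds, GENOTYPE_CALL)  # points to the default variable
ds_custom = attach_spec(ds, GENOTYPE_CALL, "call_genotype", "my_calls")  # many variables
```

The spec is a frozen dataclass so that it can be used as a dictionary key inside `attrs["schema"]`; whether the POC keys its schema this way is not stated here.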

Benefits:

  • A single place to define useful/reserved variables and their constraints
  • A single way to declare that specific variable(s) have a specific meaning and spec
  • As the pipeline flows and results are merged into a single dataset, that dataset can be used by different functions even if it contains custom variable names (which would be declared once)

As a user:

  • if you don't change any of the precomputed variable names, you don't need to do anything
  • if a function requires that you specify which variables to use for computation, you must do so via SgkitSchema.spec before calling the function
  • SgkitSchema.spec returns a new Dataset (shallow copied) with updated schema/attrs
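
The user-facing flow above could look roughly like the following sketch. The `spec` function here is a hypothetical stand-in for `SgkitSchema.spec` (its real signature is not given in this description), `MiniDataset` is a stdlib stand-in for an xr Dataset, and `mean_of_traits` is an invented downstream function that takes no variable-name arguments.

```python
class MiniDataset:
    """Stand-in for xr.Dataset: named variables plus an attrs dict."""

    def __init__(self, variables, attrs=None):
        self.variables = variables
        self.attrs = dict(attrs or {})


def spec(ds, spec_name, *variable_names):
    """Hypothetical stand-in for SgkitSchema.spec.

    Returns a NEW dataset whose attrs carry the updated schema. The copy is
    shallow: the variable payloads are shared with the input dataset.
    """
    schema = {**ds.attrs.get("schema", {}), spec_name: variable_names}
    return MiniDataset(dict(ds.variables), {**ds.attrs, "schema": schema})


def mean_of_traits(ds):
    """Invented downstream function: reads variable names from the schema."""
    names = ds.attrs["schema"]["traits"]
    return {name: sum(ds.variables[name]) / len(ds.variables[name]) for name in names}


payload = [1.0, 2.0, 3.0]
ds = MiniDataset({"my_trait": payload})

# Declare once which variable plays the "traits" role, then call the function.
ds2 = spec(ds, "traits", "my_trait")
result = mean_of_traits(ds2)
```

Because the copy is shallow, `ds` keeps its original attrs while `ds2` shares the same underlying payload, so declaring a spec does not duplicate data.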

What is missing:

  • get your feedback and, depending on that feedback:
  • polish the API (e.g. make it easier to fetch single-name variables)
  • discuss and complete all the spec constraints
  • make it easier to merge Datasets together with their schemas
  • more documentation
  • more tests
  • update regenie (essentially the same as regression)

This POC removes all the required/optional variable names from function arguments; if those need to be specified or are custom, the user has to specify them via SgkitSchema.spec. Alternatively, we could keep those arguments where necessary (example) and call SgkitSchema.spec inside the functions (triggering validation etc.).

One more point: we could make it unnecessary to declare default names in the schema. If a name is requested but missing from the schema, we would assume the default name, check that variable against the spec, and return the name.
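
A rough sketch of that fallback, under stated assumptions: `SPECS` and `resolve_names` are hypothetical, variables are reduced to their number of dimensions, and the only constraint checked is `ndim`. The names and values are illustrative only.

```python
from typing import Dict, Mapping, Tuple

# Hypothetical registry: spec name -> (default variable name, required ndim).
SPECS: Dict[str, Tuple[str, int]] = {
    "genotype_call": ("call_genotype", 3),
    "traits": ("sample_trait", 2),
}


def resolve_names(
    variables: Mapping[str, int],            # variable name -> ndim (stand-in)
    schema: Mapping[str, Tuple[str, ...]],   # spec name -> declared names
    spec_name: str,
) -> Tuple[str, ...]:
    """Return the variable names for `spec_name`, falling back to the default."""
    if spec_name in schema:
        return schema[spec_name]  # explicitly declared names take precedence
    default_name, required_ndim = SPECS[spec_name]
    if default_name not in variables:
        raise KeyError(f"{spec_name!r} not declared and default {default_name!r} is absent")
    if variables[default_name] != required_ndim:
        raise ValueError(f"{default_name!r} does not satisfy the {spec_name!r} spec")
    return (default_name,)


variables = {"call_genotype": 3, "my_trait": 2}
schema = {"traits": ("my_trait",)}  # only the custom name is declared

declared = resolve_names(variables, schema, "traits")
fallback = resolve_names(variables, schema, "genotype_call")
```

With this approach only custom names ever need declaring; the cost is that validation of default variables happens lazily, at the point a function requests them.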
