PR 124: POC: Dataset with schema #197

quansight-bot · 2020-08-27T13:49:49Z

Pull request metadata

url: https://github.com/pystatgen/sgkit/pull/124
state: open
milestone: No milestone assigned
created by: Rafal Wojdyla @ravwojdyla
labels:
- conflict
assigned:

Pull request description

This is a POC, and I would like to get your feedback on the idea before finishing up tests, docs, and coverage. It comes out of #43 and follows up on the comment in #103. #43 is a lot more elaborate, since it strives for statically typed feedback, whilst this is a lot more stripped down.

I have tried to:

- not use an extra wrapper around the xr Dataset (since we have discussed that we would like to avoid it)
- strive for a consistent API whilst still being concise and relatively flexible

The core idea is:

- hold a "schema" in the attrs of the xr Dataset
- define specs for meaningful/useful arrays, and validate each spec against the variables in the Dataset when the schema is declared (SgkitSchema.spec)
- a schema is then essentially a data/array spec plus a pointer to a variable in the Dataset
- a schema spec can point to many variables, and by default points to the default variable
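
The points above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `ArraySpec`, `CALL_GENOTYPE`, `declare`, the `"schema"` attrs key, and the dims/dtype constraints are all hypothetical names and choices made up for the example.

```python
from dataclasses import dataclass
from typing import Tuple

import numpy as np
import xarray as xr


@dataclass(frozen=True)
class ArraySpec:
    """Hypothetical array spec: a default variable name plus constraints."""

    default_name: str
    dims: Tuple[str, ...]
    kind: str  # NumPy dtype kind, e.g. "i" for signed integers


# Hypothetical spec for a genotype call array (constraints are assumptions).
CALL_GENOTYPE = ArraySpec("call_genotype", ("variants", "samples", "ploidy"), "i")


def declare(ds: xr.Dataset, spec: ArraySpec, *names: str) -> xr.Dataset:
    """Validate the named variables against `spec`, then record them in attrs."""
    names = names or (spec.default_name,)  # default to the default variable
    for name in names:
        if name not in ds.data_vars:
            raise ValueError(f"variable {name!r} is not in the dataset")
        var = ds[name]
        if var.dims != spec.dims or var.dtype.kind != spec.kind:
            raise ValueError(f"variable {name!r} does not satisfy {spec}")
    # The schema is a mapping: spec name -> list of variable names.
    schema = {**ds.attrs.get("schema", {}), spec.default_name: list(names)}
    return ds.assign_attrs(schema=schema)


# One spec pointing at a custom-named variable.
ds = xr.Dataset(
    {"my_gt": (("variants", "samples", "ploidy"), np.zeros((4, 3, 2), dtype="int8"))}
)
declared = declare(ds, CALL_GENOTYPE, "my_gt")
```

Because the schema lives in `attrs`, no wrapper type is needed: `declared` is still a plain `xr.Dataset`.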

Benefits:

- A single place to define useful/reserved variables and their constraints
- A single way to declare that specific variable(s) have a specific meaning and spec
- As the pipeline flows and results are merged into a single dataset, that dataset can be used by different functions even if it contains custom variable names (which would be declared once)

As a user:

- if you don't change any of the precomputed variable names, you don't need to do anything
- if a function requires you to specify which variables to use for the computation, you must do so via SgkitSchema.spec before calling the function
- SgkitSchema.spec returns a new Dataset (a shallow copy) with updated schema/attrs
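
From the user's side, the flow might look like the sketch below. It does not use the real SgkitSchema API; `with_names`, `count_calls`, and the `"schema"` attrs key are hypothetical stand-ins, and only the xarray calls are real.

```python
import numpy as np
import xarray as xr

SCHEMA_KEY = "schema"  # hypothetical attrs key holding the schema


def with_names(ds: xr.Dataset, spec_name: str, *names: str) -> xr.Dataset:
    """Return a new Dataset whose schema maps `spec_name` to `names`."""
    schema = {**ds.attrs.get(SCHEMA_KEY, {}), spec_name: list(names)}
    # assign_attrs returns a new object; the input Dataset is left unchanged.
    return ds.assign_attrs({SCHEMA_KEY: schema})


def count_calls(ds: xr.Dataset) -> int:
    """Hypothetical consumer: no variable-name arguments, it reads the schema."""
    (name,) = ds.attrs[SCHEMA_KEY]["call_genotype"]
    return int((ds[name].values >= 0).sum())


# A dataset with a custom variable name; -1 marks a missing call here.
gt = np.array([[[0, 1], [1, -1]]], dtype="int8")
raw = xr.Dataset({"my_gt": (("variants", "samples", "ploidy"), gt)})

# Declare the custom name once, then call functions without naming variables.
ds = with_names(raw, "call_genotype", "my_gt")
n_called = count_calls(ds)
```

The declaration is made once and travels with the dataset, so later functions in a pipeline can pick up `my_gt` without being told its name again.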

What is missing:

- get your feedback and, depending on it:
  - polish the API (e.g. make it easier to fetch single-name variables)
  - discuss and complete all the spec constraints
  - make it easier to merge Datasets together with their schemas
  - more documentation
  - more tests
  - update regenie (essentially the same as regression)

This POC removes all the required/optional variable names from function arguments; if those need to be specified or are custom, the user needs to specify them via SgkitSchema.spec. Alternatively, we could keep those arguments where necessary (example) and call SgkitSchema.spec inside the functions (triggering validation etc.).

One more point: we could make it redundant to declare default names in the schema. If a spec is requested but missing from the schema, assume the default name, check the variable against the spec, and return the name.
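
A rough sketch of that fallback, assuming a schema stored under a `"schema"` attrs key as a mapping from spec name to a list of variable names. The `SPECS` table, the `resolve` helper, and the dims constraint are hypothetical and only illustrate the lookup order.

```python
from typing import List

import numpy as np
import xarray as xr

# Hypothetical spec table: spec name -> (default variable name, required dims).
SPECS = {"call_dosage": ("call_dosage", ("variants", "samples"))}


def resolve(ds: xr.Dataset, spec_name: str) -> List[str]:
    """Return the variable names for `spec_name`, assuming the default if undeclared."""
    default_name, dims = SPECS[spec_name]
    # Use the declared names if present, otherwise fall back to the default name.
    names = ds.attrs.get("schema", {}).get(spec_name, [default_name])
    for name in names:
        if name not in ds.data_vars:
            raise KeyError(f"variable {name!r} is not in the dataset")
        # Check the variable against the spec before handing the name back.
        if ds[name].dims != dims:
            raise ValueError(f"variable {name!r} has dims {ds[name].dims}, expected {dims}")
    return list(names)


# Nothing declared: the default name is assumed and validated.
plain = xr.Dataset({"call_dosage": (("variants", "samples"), np.zeros((2, 3)))})
default_names = resolve(plain, "call_dosage")

# A custom name declared in the schema takes precedence over the default.
custom = xr.Dataset(
    {"my_dosage": (("variants", "samples"), np.zeros((2, 3)))},
    attrs={"schema": {"call_dosage": ["my_dosage"]}},
)
custom_names = resolve(custom, "call_dosage")
```

With this lookup order, a dataset that only uses default names never needs a schema entry, while validation still runs on every request.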

quansight-bot closed this as completed Aug 27, 2020
