
POC: Dataset with schema #124

Closed · wants to merge 1 commit
Conversation

@ravwojdyla (Collaborator) commented Aug 19, 2020

This is a POC, and I would like to get your feedback about the idea first, before finishing up tests, docs and coverage. This comes from #43 and follows up on the comment in #103. #43 is a lot more elaborate since it strives for statically typed feedback, whilst this is much more stripped down.

I have tried to:

  • not use an extra wrapper around the xarray Dataset (since we have discussed that we would like to avoid one)
  • strive for a consistent API whilst still being concise and relatively flexible

The core idea is (a sketch follows the list):

  • hold a "schema" in the attrs of the xarray Dataset
  • define specs for meaningful/useful arrays and validate each spec against the variables in the Dataset when the schema is specified
  • the schema is then essentially a data/array spec plus a pointer to a variable in the Dataset
  • a schema spec can point to many variables, and by default points to the default variable
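
To make this concrete, here is a minimal sketch of what a spec-plus-schema in attrs could look like. All names here (SCHEMA_KEY, the ArrayLikeSpec fields, with_spec) are illustrative assumptions, not the PR's actual code:

```python
# Minimal sketch only; names and fields are assumptions, not the PR's API.
from dataclasses import dataclass
from typing import Hashable

import xarray as xr

SCHEMA_KEY = "sgkit_schema"  # hypothetical attrs key holding the schema


@dataclass(frozen=True)
class ArrayLikeSpec:
    """Constraints that a variable must satisfy."""
    default_name: str
    kind: str  # numpy dtype kind, e.g. "f" for float, "i" for integer
    ndim: int


def validate(ds: xr.Dataset, spec: ArrayLikeSpec, name: Hashable) -> None:
    """Check that the named variable satisfies the spec."""
    var = ds[name]
    if var.dtype.kind != spec.kind or var.ndim != spec.ndim:
        raise ValueError(f"{name!r} does not satisfy spec {spec}")


def with_spec(ds: xr.Dataset, spec: ArrayLikeSpec, name: Hashable) -> xr.Dataset:
    """Validate and record the spec -> variable pointer in attrs (shallow copy)."""
    validate(ds, spec, name)
    new_ds = ds.copy(deep=False)
    new_ds.attrs[SCHEMA_KEY] = {**ds.attrs.get(SCHEMA_KEY, {}), spec.default_name: name}
    return new_ds
```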

Benefits:

  • A single place for the definition of useful/reserved variables and their constraints
  • A single way to declare that specific variable(s) have a specific meaning and spec
  • As the pipeline flows and results are merged into a single dataset, that dataset can be used by different functions even if it contains custom variable names (which would be declared once)

As a user (a usage sketch follows the list):

  • if you don't change any of the precomputed variable names, you don't need to do anything
  • if a function requires that you specify which variables to use for computation, you must do so via SgkitSchema.spec before calling the function
  • SgkitSchema.spec returns a new Dataset (shallow copy) with an updated schema in attrs
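
Building on the sketch above, the user-facing workflow might look like the following (hypothetical example; the dosage spec values are assumptions):

```python
import numpy as np
import xarray as xr

# Reuses ArrayLikeSpec / with_spec / SCHEMA_KEY from the sketch above.
dosage_spec = ArrayLikeSpec(default_name="dosage", kind="f", ndim=2)

# A user with a custom variable name declares it once, up front:
ds = xr.Dataset({"my_dosage": (("variants", "samples"), np.random.rand(10, 3))})
ds = with_spec(ds, dosage_spec, "my_dosage")  # shallow copy with updated schema

# Functions then resolve their inputs through the schema in attrs:
assert ds.attrs[SCHEMA_KEY]["dosage"] == "my_dosage"
```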

What is missing:

  • get your feedback, and depending on your feedback:
  • polish the API (e.g. make it easier to fetch a single named variable)
  • discuss and complete all the spec constraints
  • make it easier to merge Datasets together with their schemas
  • more documentation
  • more tests
  • update regenie (essentially the same as regression)

This POC removes all the required/optional variable names from function arguments; if those need to be specified, or are custom, the user needs to specify them via SgkitSchema.spec. Alternatively, we could keep those arguments where necessary (example) and call SgkitSchema.spec inside the functions (triggering validation etc).

One more point: we could make it unnecessary to declare default names in the schema: if a name is missing from the schema but requested, assume the default name, check the variable against the spec, and return the name.
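
A sketch of that fallback, reusing the hypothetical helpers from the sketch above:

```python
def resolve_name(ds: xr.Dataset, spec: ArrayLikeSpec) -> Hashable:
    """Resolve a spec to a variable name, falling back to the default name."""
    schema = ds.attrs.get(SCHEMA_KEY, {})
    name = schema.get(spec.default_name, spec.default_name)
    if name not in ds:
        raise KeyError(f"no variable found for spec {spec.default_name!r}")
    validate(ds, spec, name)  # check the variable against the spec
    return name
```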

mergify bot (Contributor) commented Aug 24, 2020

This PR has conflicts, @ravwojdyla please rebase and push updated version 🙏

mergify bot added the conflict label Aug 24, 2020
@tomwhite (Collaborator) commented:

Thanks for putting this together @ravwojdyla.

I think this is useful for doing internal type checking, and for documenting the xarray variables (what they mean, as well as their names, kinds, dimensions). Apart from the latter, I think this change should be largely invisible to the user. This is so that sgkit can be used as vanilla xarray for users, and also so that we can change things more easily in the future.

> if a function requires that you specify which variables to use for computation, you must do so via SgkitSchema.spec before calling the function

I don't think this is particularly usable, so I would not expose this method to users, and instead keep the variable name function arguments in the methods where they are immediately discoverable by users.

As a user, I'd like to be able to rename variables manually using xarray API calls, and then sgkit would do the necessary revalidation without me having to worry about it (or even know it has happened).
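
For illustration (assuming a schema stored in attrs as sketched earlier): Dataset.rename preserves dataset-level attrs, so a plain rename silently leaves the schema pointing at the old name, which is exactly the situation sgkit would need to detect and revalidate transparently:

```python
import numpy as np
import xarray as xr

ds = xr.Dataset({"dosage": (("variants", "samples"), np.zeros((4, 2)))})
ds.attrs["sgkit_schema"] = {"dosage": "dosage"}  # hypothetical schema entry

renamed = ds.rename({"dosage": "my_dosage"})
# attrs survive the rename, but the schema pointer is now stale:
assert renamed.attrs["sgkit_schema"]["dosage"] == "dosage"
assert "dosage" not in renamed
```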

Regarding the variable catalogue and documentation, we could have a flat namespace so they all appeared in one place (although I'm not opposed to introducing some kind of logical grouping). Instead of sgkit.typing.SgkitSchema.dosage we might have sgkit.variables.dosage (again hiding implementation details), so you could easily browse all variables on the sgkit.variables page. It would be nice if each variable linked to all the methods it was used by - it certainly needs to be linked to from the method documentation.
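
A flat sgkit.variables module could be as simple as the following (assumed layout and spec values, reusing the hypothetical ArrayLikeSpec from above):

```python
# sgkit/variables.py - one flat module so the docs list every variable together
dosage = ArrayLikeSpec(default_name="dosage", kind="f", ndim=2)
call_genotype = ArrayLikeSpec(default_name="call_genotype", kind="i", ndim=3)
sample_id = ArrayLikeSpec(default_name="sample_id", kind="U", ndim=1)
```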

@ravwojdyla (Collaborator, Author) commented Aug 25, 2020

@tomwhite thanks for the review:

>> if a function requires that you specify which variables to use for computation, you must do so via SgkitSchema.spec before calling the function
>
> I don't think this is particularly usable, so I would not expose this method to users, and instead keep the variable name function arguments in the methods where they are immediately discoverable by users.

I agree SgkitSchema.spec is cumbersome, but it does make the API super consistent - both input and output are xr.Datasets. I see your point though. In that case we would need to all agree that all methods must expose all the variable names used in the implementation, which is not the case right now - I'm not against it, but it's something we need to agree on. Furthermore, if a user decides to use alternative names, they will likely need to pass those to every single sgkit method they use, which means they essentially need to maintain a mapping that the current PR version builds into the "schema" attribute. Again - not against it, but something we also need to agree we expect users to do. Alternatively, we could add some helper to define user mappings(?)

It's probably fair to observe that we would like vanilla xarray to work with sgkit functions and we don't want to use wrappers, yet pure xarray lacks the "context" those methods require (at least the variable names), which means we have at least 4 options (sketched in code after the list):

  1. provide the context as a function argument (each function exposes all variable names as argument(s))
  2. attach context to the xarray (this PR)
  3. have a global context which users can update with custom names
  4. mix of 1 + 3: functions expose all variable names, and if not provided, look at the global context
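
Rough sketches of options 1 and 3 for concreteness (all names are assumptions; option 4 would be option 1 with option 3's mapping as the fallback):

```python
import xarray as xr

# Option 1: each function exposes the variable names it uses as arguments.
def gwas_linear_regression(ds: xr.Dataset, *, dosage: str = "dosage") -> xr.Dataset:
    return ds[[dosage]]  # stand-in for the real computation


# Option 3: a global context that users update once with their custom names.
_GLOBAL_NAMES = {"dosage": "dosage"}


def set_variable_name(canonical: str, custom: str) -> None:
    _GLOBAL_NAMES[canonical] = custom


# Option 4: like option 1, but a dosage=None default would fall back to
# _GLOBAL_NAMES["dosage"] when the argument is not provided.
```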

Without a doubt there are more options out there, but what do you think about the ones above? There are obviously pluses/minuses to each of them. (fyi @eric-czech)

Edit: I'm quite fond of the idea of a wrapper/encapsulation (+ a public xarray field), but out of the options above, without diving into details: [2, 1, 4, 3].

@eric-czech (Collaborator) commented Aug 25, 2020

To some of those points:

> One more point: we could make it unnecessary to declare default names in the schema: if a name is missing from the schema but requested, assume the default name, check the variable against the spec, and return the name.

Given that both xr.merge and xr.concat would drop schema attributes by default, I think it would actually be quite common to lose all the attributes between sgkit function calls unless a user is willing to backtrack and rethink the merging. I can imagine this meaning that they'd need to redeclare custom fields multiple times (unless there was some global context for it as mentioned in later comments).
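
For illustration, this is how it plays out in xarray (the combine_attrs argument exists in newer xarray releases, and its default has changed across versions, so it is passed explicitly here):

```python
import numpy as np
import xarray as xr

a = xr.Dataset({"x": ("d", np.arange(3))}, attrs={"sgkit_schema": {"x": "x"}})
b = xr.Dataset({"y": ("d", np.arange(3))})

dropped = xr.merge([a, b], combine_attrs="drop")   # schema attribute is lost
kept = xr.merge([a, b], combine_attrs="override")  # attrs of the first object win
assert "sgkit_schema" not in dropped.attrs
assert "sgkit_schema" in kept.attrs
```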

> A single place for the definition of useful/reserved variables and their constraints

+1 to this for sure, I like the freedom we would have with something like ArrayLikeSpec.

> Instead of sgkit.typing.SgkitSchema.dosage we might have sgkit.variables.dosage

+1 to that since it aligns more with pandas.options

> mix of 1 + 3: functions expose all variable names, and if not provided, look at the global context

This reminds me of tf.name_scope. Personally, I like the idea of names being customizable in the function signatures, within a context manager, and globally (a sketch below). I can see that being a lot of work though. I haven't found it burdensome to rename things when necessary via Xarray, so I think this one would be excellent but is also something we can leave to users for a while. I don't know at what point it should become a higher priority.
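
A sketch of that tf.name_scope-style idea (entirely hypothetical, not sgkit's API): resolution order is explicit argument first, then the active context, then the global default:

```python
import contextlib

_name_stack = [{"dosage": "dosage"}]  # bottom entry holds the global defaults


@contextlib.contextmanager
def variable_names(**names):
    """Temporarily override variable names within a block."""
    _name_stack.append({**_name_stack[-1], **names})
    try:
        yield
    finally:
        _name_stack.pop()


def resolve(canonical, override=None):
    """Explicit argument wins, then the innermost active context."""
    return override if override is not None else _name_stack[-1][canonical]


with variable_names(dosage="my_dosage"):
    assert resolve("dosage") == "my_dosage"   # from the context manager
    assert resolve("dosage", "arg") == "arg"  # explicit argument wins
assert resolve("dosage") == "dosage"          # back to the global default
```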

> I agree SgkitSchema.spec is cumbersome, but it does make the API super consistent

I definitely agree that it would force us to make the functions more consistent, though I think it is an awkward construct when mixed with a bunch of Xarray/Pandas code. I'm +1 on finding some way to make that an internal detail.

@ravwojdyla (Collaborator, Author) commented:

@tomwhite @jeromekelleher @alimanfoo any opinions on this PR and the comments above? ^

@jeromekelleher (Collaborator) commented Sep 2, 2020

I like the idea, and it's an elegant way of stating and enforcing the shared common structure of datasets. I wonder if this is a bit premature though, or perhaps even opposed to the way that things are currently evolving. From my (hands-off, admittedly) perspective we seem to be moving towards an idiom of the dataset being a "vanilla" xarray dataset that every function adds to, and can have lots of variables and be sliced and diced in various different ways. For example, stats functions will all compute a result and these results will be included as variables in the output dataset. So, assuming we have a big library with lots of functions, then is the output variable of every function going to have to be registered in the central schema? I can see this being annoying for developers and something that we would have to explain repeatedly. But, maybe this is better than the slightly anarchic way things are going at the moment 🤷

My take would be to let the library evolve in the fairly loosely coupled way that it has been moving for another few months, and to pick this up again when things have been fleshed out a bit more.

@tomwhite (Collaborator) commented Sep 3, 2020

At the moment we have fixed variable names, which I don't think is a problem yet.

Of the four options, I think I like option 1 best (provide the context as a function argument) - at least to start with, then move to 4 (functions expose all variable names, and if not provided, look at the global context).

@ravwojdyla (Collaborator, Author) commented:

Closing this because this PR is based on an internal branch. At a weekly meeting we decided to adjust this PR to option 1 and, once that is done, explore whether we want option 4.

@ravwojdyla closed this Sep 3, 2020
This pull request was later referenced several times in ravwojdyla/sgkit (Sep 23 - Oct 2, 2020).
@ravwojdyla deleted the rav/schema_v0_2 branch Sep 28, 2020