Explore typed/schema-based Dataset options #43

alimanfoo · 2020-07-16T11:30:56Z

Raising this issue to break out discussion of API design originating in #16.

alimanfoo · 2020-07-16T11:42:18Z

For reference, scikit-allel defined a number of array classes with methods, e.g., GenotypeArray class has methods like count_alleles(), and AlleleCountsArray class has methods like is_segregating(). This is implemented via encapsulation, e.g., GenotypeArray and AlleleCountsArray are wrappers around a numpy array.

We encountered some problems with this pattern when used with dask distributed. Also some usability challenges - explaining the concept of a wrapper class to folks from an analysis background can be challenging.

When exploring designs for the skallel v2 prototype, I went for a purely functional API, i.e., no classes. Instead there are functions that are effectively methods, in the sense that they expect a certain type of input, e.g., genotypes_count_alleles() and allele_counts_is_segregating().

That has it's own trade-offs though, particularly around achieving clear function naming.

alimanfoo · 2020-07-16T12:00:25Z

One question I have is how to handle the very simple use case where you want to subset a genotype call dataset to only segregating variants.

In scikit-allel you would do something like:

gt = GenotypeArray(...)
ac = gt.count_alleles()
gt_seg = gt[ac.is_segregating()]

What would we do in sgkit?

hammer · 2020-08-05T20:14:40Z

I believe @ravwojdyla is experimenting in the api_dataset branch for those that would like to follow along.

eric-czech · 2020-08-10T19:52:34Z

Here are a few notes on some topics/discussions I had related to applying schemas to a dataset:

Integrating type annotations into Xarray directly: Support static type analysis pydata/xarray#3967
A discussion on adding a similar TypedDict annotation for pandas: Type for heterogeneous dictionaries with string keys python/typing#28 (comment)
Third-party libraries:
- nptyping: Type annotations on dimensions and dtypes for numpy arrays (glow uses this)
typish: Accessing type annotations at runtime (nptyping uses this)
typeguard: Validating type hints at runtime
The actual numpy type hint tracking issue: Type hinting / annotation (PEP 484) for ndarray, dtype, and ufunc numpy/numpy#7370
- This started out as some work in https://github.com/numpy/numpy-stubs

eric-czech · 2020-08-10T21:43:08Z

BTW re: is_segregating (@alimanfoo)

I'm not sure if this would be more efficient than basing an is_segregating function on allele counts, but this would at least be one concise way to do it directly from genotypes:

# This looks for any calls that are different from a nan-aware
# mean (which would be the same w/ no segregation)
is_segregating = (
    ds
    .assign(cgo=ds.call_genotype, cgf=ds.call_genotype >= 0)
    .assign(cgm=lambda ds: ds.cgo.weighted(ds.cgf).mean(dim=('samples', 'ploidy')))
    .pipe(lambda ds: ds.cgf & (ds.cgo != ds.cgm))
    .any(dim=('samples', 'ploidy'))
)

Gist with example like scikit-allele.is_segregating: https://gist.github.com/eric-czech/41b3827cbae91a3694fc2bf2c6d87911

This uses the xarray.core.weighted.DataArrayWeighted subclass which I haven't seen before. Pretty neat.

ravwojdyla · 2020-08-13T16:31:52Z

Current POC branch has two parts:

Schema definition and validation
Generic container for xarray

Schema definition and validation:

DatasetType captures validation logic and is the root of the type category for datasets
Each dataset defines a set of constraints on the variables and types which get validates at the construction time
DatasetType holds the central validation of containers (xarray), each type just defines constraints and root validates them
Each method could provide a “schema” for both input and output, containers could have more data, but at least the input and output is validated
DatasetType serves as a great documentation via code, which I see as a big advantage

Generic container for xarray:

SgkitDataset as the xr.Dataset generic container
xr.Dataset is easily available as a public field, but we could also delegate some more useful methods to xr.Dataset if we would like to, but that part is not part of the current state
Type on the SgkitDataset allows for type safety at the method invocation, but at the end of the data in python runtime is king
Nice part about the generic nature of the SgkitDataset container is that it’s optional, and if someone doesn’t care about typing, you don’t need to specify it and just not care about it at all
Users would not call CTOR of the SgkitDataset, and instead use the type based factory method
Each computation method strictly defines input, and likely the first line of code in each computation would be to access the xrarray dataset, which is by design

Going forward I will submit a PR for only the 1st part, and we can decide about the 2nd part later. Does this sound ok with both of you @tomwhite @eric-czech? This obviously doesn't mean that we will merge the 1st PR, just that we can have a more concrete discussion about the implementation details etc.

eric-czech · 2020-08-14T14:34:38Z

Sounds good @ravwojdyla. That first part would include the schema definitions for each function result too right (i.e. your changes to count_alleles, gwas_linear_regression etc.)?

ravwojdyla · 2020-08-14T14:38:01Z

@eric-czech thanks - yep, that's correct. Right out of the way if there are any comments you have or @tomwhite to the current declaration, please let me know (example).

tomwhite · 2020-08-17T08:18:07Z

+1 I think schema definition and validation is the more valuable part here.

ravwojdyla · 2020-08-20T10:26:32Z

Renamed this issue to better capture the core of the problem/issue ^

hammer · 2021-10-08T12:23:06Z

An interesting discussion on the use of Pydantic with Xarray happening at openclimatefix/nowcasting_dataset#211, in case we ever revisit this topic in the future! On the Xarray side the relevant issue is pydata/xarray#1900.

hammer · 2021-12-07T16:27:16Z

https://github.com/carbonplan/xarray-schema looks relevant

alimanfoo mentioned this issue Jul 16, 2020

GWAS linear regression implementation #16

Merged

ravwojdyla added core operations Issues related to domain-specific functionality such as LD pruning, PCA, association testing, etc. enhancement New feature or request labels Jul 16, 2020

ravwojdyla self-assigned this Jul 20, 2020

hammer mentioned this issue Aug 5, 2020

[WIP] Docs describing the Genotype Call XArray #78

Closed

eric-czech mentioned this issue Aug 11, 2020

Append output variables from functions to input dataset #103

Closed

ravwojdyla mentioned this issue Aug 19, 2020

POC: Dataset with schema #124

Closed

7 tasks

ravwojdyla linked a pull request Aug 19, 2020 that will close this issue

POC: Dataset with schema #124

Closed

7 tasks

ravwojdyla removed a link to a pull request Aug 19, 2020

POC: Dataset with schema #124

Closed

7 tasks

ravwojdyla changed the title ~~Domain-specific classes and/or methods~~ Explore typed/schema-based Dataset options Aug 20, 2020

ravwojdyla mentioned this issue Sep 23, 2020

Dataset "schema" v0.3 #276

Merged

ravwojdyla closed this as completed Oct 2, 2020

hammer mentioned this issue May 27, 2021

Upstream failing due to Xarray dims ordering change #583

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Explore typed/schema-based Dataset options #43

Explore typed/schema-based Dataset options #43

alimanfoo commented Jul 16, 2020

alimanfoo commented Jul 16, 2020 •

edited

alimanfoo commented Jul 16, 2020

hammer commented Aug 5, 2020

eric-czech commented Aug 10, 2020

eric-czech commented Aug 10, 2020

ravwojdyla commented Aug 13, 2020 •

edited

eric-czech commented Aug 14, 2020

ravwojdyla commented Aug 14, 2020

tomwhite commented Aug 17, 2020

ravwojdyla commented Aug 20, 2020

hammer commented Oct 8, 2021 •

edited

hammer commented Dec 7, 2021

Explore typed/schema-based Dataset options #43

Explore typed/schema-based Dataset options #43

Comments

alimanfoo commented Jul 16, 2020

alimanfoo commented Jul 16, 2020 • edited

alimanfoo commented Jul 16, 2020

hammer commented Aug 5, 2020

eric-czech commented Aug 10, 2020

eric-czech commented Aug 10, 2020

ravwojdyla commented Aug 13, 2020 • edited

eric-czech commented Aug 14, 2020

ravwojdyla commented Aug 14, 2020

tomwhite commented Aug 17, 2020

ravwojdyla commented Aug 20, 2020

hammer commented Oct 8, 2021 • edited

hammer commented Dec 7, 2021

alimanfoo commented Jul 16, 2020 •

edited

ravwojdyla commented Aug 13, 2020 •

edited

hammer commented Oct 8, 2021 •

edited