Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Explore typed/schema-based Dataset options #43

Closed
alimanfoo opened this issue Jul 16, 2020 · 12 comments
Closed

Explore typed/schema-based Dataset options #43

alimanfoo opened this issue Jul 16, 2020 · 12 comments
Assignees
Labels
core operations Issues related to domain-specific functionality such as LD pruning, PCA, association testing, etc. enhancement New feature or request

Comments

@alimanfoo
Copy link
Collaborator

Raising this issue to break out discussion of API design originating in #16.

@alimanfoo
Copy link
Collaborator Author

alimanfoo commented Jul 16, 2020

For reference, scikit-allel defined a number of array classes with methods, e.g., GenotypeArray class has methods like count_alleles(), and AlleleCountsArray class has methods like is_segregating(). This is implemented via encapsulation, e.g., GenotypeArray and AlleleCountsArray are wrappers around a numpy array.

We encountered some problems with this pattern when used with dask distributed. Also some usability challenges - explaining the concept of a wrapper class to folks from an analysis background can be challenging.

When exploring designs for the skallel v2 prototype, I went for a purely functional API, i.e., no classes. Instead there are functions that are effectively methods, in the sense that they expect a certain type of input, e.g., genotypes_count_alleles() and allele_counts_is_segregating().

That has it's own trade-offs though, particularly around achieving clear function naming.

@alimanfoo
Copy link
Collaborator Author

One question I have is how to handle the very simple use case where you want to subset a genotype call dataset to only segregating variants.

In scikit-allel you would do something like:

gt = GenotypeArray(...)
ac = gt.count_alleles()
gt_seg = gt[ac.is_segregating()]

What would we do in sgkit?

@ravwojdyla ravwojdyla added core operations Issues related to domain-specific functionality such as LD pruning, PCA, association testing, etc. enhancement New feature or request labels Jul 16, 2020
@ravwojdyla ravwojdyla self-assigned this Jul 20, 2020
@hammer
Copy link
Contributor

hammer commented Aug 5, 2020

I believe @ravwojdyla is experimenting in the api_dataset branch for those that would like to follow along.

@eric-czech
Copy link
Collaborator

Here are a few notes on some topics/discussions I had related to applying schemas to a dataset:

@eric-czech
Copy link
Collaborator

BTW re: is_segregating (@alimanfoo)

I'm not sure if this would be more efficient than basing an is_segregating function on allele counts, but this would at least be one concise way to do it directly from genotypes:

# This looks for any calls that are different from a nan-aware
# mean (which would be the same w/ no segregation)
is_segregating = (
    ds
    .assign(cgo=ds.call_genotype, cgf=ds.call_genotype >= 0)
    .assign(cgm=lambda ds: ds.cgo.weighted(ds.cgf).mean(dim=('samples', 'ploidy')))
    .pipe(lambda ds: ds.cgf & (ds.cgo != ds.cgm))
    .any(dim=('samples', 'ploidy'))
)

Gist with example like scikit-allele.is_segregating: https://gist.github.com/eric-czech/41b3827cbae91a3694fc2bf2c6d87911

This uses the xarray.core.weighted.DataArrayWeighted subclass which I haven't seen before. Pretty neat.

@ravwojdyla
Copy link
Collaborator

ravwojdyla commented Aug 13, 2020

Current POC branch has two parts:

  • Schema definition and validation
  • Generic container for xarray

Schema definition and validation:

  • DatasetType captures validation logic and is the root of the type category for datasets
  • Each dataset defines a set of constraints on the variables and types which get validates at the construction time
  • DatasetType holds the central validation of containers (xarray), each type just defines constraints and root validates them
  • Each method could provide a “schema” for both input and output, containers could have more data, but at least the input and output is validated
  • DatasetType serves as a great documentation via code, which I see as a big advantage

Generic container for xarray:

  • SgkitDataset as the xr.Dataset generic container
  • xr.Dataset is easily available as a public field, but we could also delegate some more useful methods to xr.Dataset if we would like to, but that part is not part of the current state
  • Type on the SgkitDataset allows for type safety at the method invocation, but at the end of the data in python runtime is king
  • Nice part about the generic nature of the SgkitDataset container is that it’s optional, and if someone doesn’t care about typing, you don’t need to specify it and just not care about it at all
  • Users would not call CTOR of the SgkitDataset, and instead use the type based factory method
  • Each computation method strictly defines input, and likely the first line of code in each computation would be to access the xrarray dataset, which is by design

Going forward I will submit a PR for only the 1st part, and we can decide about the 2nd part later. Does this sound ok with both of you @tomwhite @eric-czech? This obviously doesn't mean that we will merge the 1st PR, just that we can have a more concrete discussion about the implementation details etc.

@eric-czech
Copy link
Collaborator

Sounds good @ravwojdyla. That first part would include the schema definitions for each function result too right (i.e. your changes to count_alleles, gwas_linear_regression etc.)?

@ravwojdyla
Copy link
Collaborator

@eric-czech thanks - yep, that's correct. Right out of the way if there are any comments you have or @tomwhite to the current declaration, please let me know (example).

@tomwhite
Copy link
Collaborator

+1 I think schema definition and validation is the more valuable part here.

@ravwojdyla ravwojdyla linked a pull request Aug 19, 2020 that will close this issue
7 tasks
@ravwojdyla ravwojdyla removed a link to a pull request Aug 19, 2020
7 tasks
@ravwojdyla ravwojdyla changed the title Domain-specific classes and/or methods Explore typed/schema-based Dataset options Aug 20, 2020
@ravwojdyla
Copy link
Collaborator

Renamed this issue to better capture the core of the problem/issue ^

@hammer
Copy link
Contributor

hammer commented Oct 8, 2021

An interesting discussion on the use of Pydantic with Xarray happening at openclimatefix/nowcasting_dataset#211, in case we ever revisit this topic in the future! On the Xarray side the relevant issue is pydata/xarray#1900.

@hammer
Copy link
Contributor

hammer commented Dec 7, 2021

https://github.com/carbonplan/xarray-schema looks relevant

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core operations Issues related to domain-specific functionality such as LD pruning, PCA, association testing, etc. enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

5 participants