Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dataset "schema" v0.3 #276

Merged
merged 10 commits into from Oct 2, 2020
Merged

Dataset "schema" v0.3 #276

merged 10 commits into from Oct 2, 2020

Conversation

ravwojdyla
Copy link
Collaborator

@ravwojdyla ravwojdyla commented Sep 23, 2020

Re: #43
Resurrect: #124 (please read it for context)

Depends on #274

@mergify
Copy link
Contributor

mergify bot commented Sep 24, 2020

This PR has conflicts, @ravwojdyla please rebase and push updated version 🙏

@mergify mergify bot added the conflict PR conflict label Sep 24, 2020
@mergify mergify bot removed the conflict PR conflict label Sep 24, 2020
"""Compute quality control variant statistics from genotype calls.

Parameters
----------
ds
Genotype call dataset such as from
`sgkit.create_genotype_call_dataset`.
call_genotype
Input variable name holding call_genotype.
As defined by `sgkit.variables.call_genotype`.
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably decide on how we want to do this linking, and what kind of information goes into method doc vs variable docs. I would actually suggest we do this in a separate PR, to keep review slim. But if people feel strong to do it here, fine as well.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI added links between methods and variables in the documentation.

"""TODO"""
genotype_counts = ArrayLikeSpec("genotype_counts", ndim=2, kind="i")
"""
Genotype counts. Must correspond to an (`N`, 3) array where `N` is equal
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI, the 1st sentence appears on the front page of the variable, the rest appears on the variable specific page.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment at the top of these definitions saying this?

ndim: Union[None, int, Set[int]] = None


call_genotype = ArrayLikeSpec("call_genotype", kind="i", ndim=3)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should decide if we want to list this in a single namespace, group these, and or what order do we want them documented. Also would be good to have bidirectional links between variables and methods.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 to making it possible to see what methods use or produce a variable.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My preference is to list variables alphabetically, since they don't map neatly to methods (so aren't easily grouped by method; also you can see which variables a method uses or produces by looking at the doc for the method). (BTW, I think we should probably break out methods by type: e.g. basic stats/GWAS/popgen - but that is a separate issue.)

Everything so far is in a top-level namespace, so we could continue that here (i.e. eliminate the variables prefix), but I'm not sure which way to go on that.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @tomwhite here

call_genotype_probability_mask = ArrayLikeSpec(
"call_genotype_probability_mask", kind="b", ndim=3
)
"""TODO"""
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are some TODO's still where the doc was missing at the origin.

sgkit/model.py Outdated
@@ -83,7 +79,9 @@ def create_genotype_call_dataset(
check_array_like(variant_id, kind={"U", "O"}, ndim=1)
data_vars["variant_id"] = ([DIM_VARIANT], variant_id)
attrs: Dict[Hashable, Any] = {"contigs": variant_contig_names}
return xr.Dataset(data_vars=data_vars, attrs=attrs)
return variables.validate(
xr.Dataset(data_vars=data_vars, attrs=attrs), *data_vars.keys()
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason not to have a validate overload that checks against all sgkit.variables by default (to avoid *data_vars.keys)?

@@ -140,6 +144,12 @@ def hardy_weinberg_test(
where `N` is equal to the number of variants and the 3 columns contain
heterozygous, homozygous reference, and homozygous alternate counts
(in that order) across all samples for a variant.
call_genotype
Input variable name holding call_genotype.
As defined by :data:`sgkit.variables.call_genotype`.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: This isn't a complete sentence which will be a little odd given how common "As defined by X." is. I would suggest "Defined by X." instead.

Copy link
Collaborator

@eric-czech eric-czech left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM other than a couple minor things. Thanks for cleaning up all those signatures! Much more consistent now.

Copy link
Collaborator

@tomwhite tomwhite left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great - thanks @ravwojdyla. I have a few suggestions (mostly minor), but this should be ready to merge soon.

One thing worth mentioning is that users writing their own methods do not have to use the validation if they don't want to (in fact, it's not really a part of the public API so it's not easy to use). I think that's fine - we want to encourage users to take advantage of xarray APIs as needed - and this work is to make sgkit's methods more internally consistent and well-documented.

"""TODO"""
genotype_counts = ArrayLikeSpec("genotype_counts", ndim=2, kind="i")
"""
Genotype counts. Must correspond to an (`N`, 3) array where `N` is equal
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a comment at the top of these definitions saying this?

@@ -113,17 +119,24 @@ def count_call_alleles(ds: Dataset, merge: bool = True) -> Dataset:
)
}
)
return conditional_merge_datasets(ds, new_ds, merge)
return variables.validate(
conditional_merge_datasets(ds, new_ds, merge), "call_allele_count"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should output variable names reference the spec too?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tomwhite this is outdated via 47eef29

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, but the question still remains where the dataset is created. A typo there won't cause validation to fail will it?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tomwhite a typo there would definitely cause a validation error (which is intentional/good). Whether we should reference variable or use strings is a good question, but more from a "dev practices" standpoint. I guess I lean towards referencing it, BUT it would mean that you need to for example say: variables.pc_relate_phi.default_name instead of "pc_relate_phi". We could shorten it to variables.pc_relate_phi, if we provide some magic methods (hash/str).

I don't really see much value of using the reference from the user perspective, since the validation works anyway. One thing that comes to me mind tho, is that maybe if we reference in all the methods, maybe there is some plugin that could generate the links for usage, so that we could link variables to methods 🤔

wdyt?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I would probably lean towards referencing the variable too.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for posterity, I've commented on the wrong comment section, copying over here:

@tomwhite so I just remembered something, this would be going against some early comments in #17, what do you think? also ping @eric-czech

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think using sgkit.variables for creating the output datasets sounds good. The closest thing to an edge case I can think of right now is a function that generates a mask or mean imputes one of several possible input variables. Perhaps validation could eventually be more dynamic for those cases. Otherwise, I can't imagine where using a constant would be awkward.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tomwhite @eric-czech ok, this PR is now using sgkit.variables pretty much everywhere we need a variable name (and this PR is ready for review). I did not like the overly verbose variables.<varname>.default_name, and making Spec hashable and acting like a string introduces complexity and will likely lead to issues in xr (see here), so I've opted for the solution you can see in 6271e89.

ndim: Union[None, int, Set[int]] = None


call_genotype = ArrayLikeSpec("call_genotype", kind="i", ndim=3)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My preference is to list variables alphabetically, since they don't map neatly to methods (so aren't easily grouped by method; also you can see which variables a method uses or produces by looking at the doc for the method). (BTW, I think we should probably break out methods by type: e.g. basic stats/GWAS/popgen - but that is a separate issue.)

Everything so far is in a top-level namespace, so we could continue that here (i.e. eliminate the variables prefix), but I'm not sure which way to go on that.

"""
variant_hwe_p_value = ArrayLikeSpec("variant_hwe_p_value", kind="f")
"""P values from HWE test for each variant as float in [0, 1]."""
variant_beta = ArrayLikeSpec("variant_beta")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should call this variant_hwe_beta? (Similarly for a few others.)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For this and the other "naming" comments below + all the TODOs in the docs, I suggest we open a separate issue and follow up in separate PR(s). One of the nice side effects of having all those variables in the same place is that we can see the inconsistencies etc. is that fine with you @tomwhite ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, fine by me

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tomwhite so I just remembered something, this would be going against some early comments in https://github.com/pystatgen/sgkit/issues/17, what do you think? also ping @eric-czech

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ravwojdyla do you mean the comment about variables.pc_relate_phi.default_name vs "pc_relate_phi"? OK, happy to leave as a string.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tomwhite right, wrong comment section, sorry, this should have been in https://github.com/pystatgen/sgkit/pull/276#discussion_r495782207

Let's see what @eric-czech thinks. Other than that, afaiu @tomwhite this is ready for another review.

"""Sample PCs (PCxS)."""
pc_relate_phi = ArrayLikeSpec("pc_relate_phi", ndim=2, kind="f")
"""PC Relate kinship coefficient matrix."""
base_prediction = ArrayLikeSpec("base_prediction", ndim=4, kind="f")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

regenie_base_prediction?

homozygous reference, and homozygous alternate counts (in that order)
across all samples for a variant.
"""
call_allele_count = ArrayLikeSpec("call_allele_count", ndim=3, kind="u")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should make genotype_counts/call_allele_count consistent. Perhaps in a follow-up PR though.

@@ -51,16 +52,20 @@ def count_alleles(g: ArrayLike, _: ArrayLike, out: ArrayLike) -> None:
out[a] += 1


def count_call_alleles(ds: Dataset, merge: bool = True) -> Dataset:
def count_call_alleles(
ds: Dataset, *, call_genotype: str = "call_genotype", merge: bool = True
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice change here to make variable name arguments (and merge) keyword only.

@mergify
Copy link
Contributor

mergify bot commented Sep 28, 2020

This PR has conflicts, @ravwojdyla please rebase and push updated version 🙏

@mergify mergify bot added the conflict PR conflict label Sep 28, 2020
@mergify mergify bot removed the conflict PR conflict label Sep 28, 2020
Copy link
Collaborator

@jeromekelleher jeromekelleher left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

ndim: Union[None, int, Set[int]] = None


call_genotype = ArrayLikeSpec("call_genotype", kind="i", ndim=3)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @tomwhite here

@mergify
Copy link
Contributor

mergify bot commented Sep 30, 2020

This PR has conflicts, @ravwojdyla please rebase and push updated version 🙏

@ravwojdyla
Copy link
Collaborator Author

FYI rebased to remove the conflict.

@mergify mergify bot removed the conflict PR conflict label Sep 30, 2020
@mergify
Copy link
Contributor

mergify bot commented Oct 1, 2020

This PR has conflicts, @ravwojdyla please rebase and push updated version 🙏

@mergify mergify bot added the conflict PR conflict label Oct 1, 2020
Copy link
Collaborator

@eric-czech eric-czech left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good!

@jeromekelleher jeromekelleher added the auto-merge Auto merge label for mergify test flight label Oct 1, 2020
@jeromekelleher
Copy link
Collaborator

I've marked this auto-merge @ravwojdyla, should merge once you've cleared up the conflicts.

@mergify mergify bot removed the conflict PR conflict label Oct 2, 2020
@mergify mergify bot merged commit a510e12 into sgkit-dev:master Oct 2, 2020
@ravwojdyla ravwojdyla mentioned this pull request Oct 2, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-merge Auto merge label for mergify test flight
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants