[WIP] Docs describing the Genotype Call XArray #78

jerowe · 2020-07-30T14:02:46Z

I thought it would be nice to have some docs that conceptually describe what the Genotype Call Xarray Dataset is all about.

The intended audience is a scientist who is familiar with variant data, with at least some programming ability, but does not assume they are familiar with numpy, Xarray in general, or the Genotype Call Dataset format in particular.

jerowe · 2020-07-30T15:18:43Z

@alimanfoo @hammer I was going to do an example with a bed file, but the bed is in sgkit_plink. Should I not bother then?

hammer · 2020-07-30T15:49:07Z

Hey @jerowe thanks for the contribution!

We're still formulating the processes and tools we will use for documentation (https://github.com/pystatgen/sgkit/issues/19). Once our documentation is in place, we will write up some contributing guidelines (https://github.com/pystatgen/sgkit/issues/8).

Until those guidelines are ready, I think it would be best to either leave this PR open or stash these notebooks in a personal repo until we're ready to figure out where to put them.

Also, we'd prefer that every PR has an issue open that it can reference. That's something we plan to document in our contributing guidelines, so no worries for not having an issue for this PR!

jerowe · 2020-08-02T05:54:57Z

@hammer if you don't mind I'd prefer to keep these here, even as a placeholder. The PR itself is marked as a draft. I'm working on this with Quansight with @daletovar and @dharhas, so I don't have a personal repo. ;-)

I have no strong feelings regarding doc standards. I was on a call with @alimanfoo and he said it was a bit up in the air, so just go with notebooks for now.

I can certainly open an issue for this. ;-)

tomwhite · 2020-08-03T10:08:45Z

@jerowe this is great! I think documenting the xarray representation is core, so it's probably best to use reST format so it gets included in the main documentation. (I've set up Sphinx in this PR: https://github.com/pystatgen/sgkit/pull/51, not yet merged.) I'm thinking that it would be a bit like http://xarray.pydata.org/en/stable/data-structures.html.

For a longer tutorial, a Jupyter Notebook might be more appropriate so users can run the whole thing, but that could be a separate PR.

jeromekelleher · 2020-08-03T11:19:01Z

We should definitely get this key information documented as soon as we can, but I agree with @tomwhite that it should be done in rst as part of the sphinx docs. Hopefully we can get #51 merged ASAP, so that we're not blocking on getting the docs fleshed out.

jeromekelleher · 2020-08-03T11:22:08Z

@jerowe - do you think it's reasonable to convert the fundamental information here into rst/sphinx rather than a notebook? It would be really excellent to get this nailed down in solid documentation.

hammer · 2020-08-03T14:23:34Z

getting the docs fleshed out.

@jeromekelleher I'd be curious to hear why you perceive this to be something we should do right now. While I think API docs are great to have as we implement new functionality, I'd prefer to hold off on user docs until we have stabilized the package structure and core data structures.

jeromekelleher · 2020-08-03T14:54:58Z

@hammer - so having gone through the notebooks now (it's sooo annoying that GitHub doesn't show the actual notebooks in the PR!), I see what you mean, and these are more example driven and user-focused that we'd want to commit to right now.

I was imagining more high-level documentation describing the data structure layout and design philosophy, which I think would be very helpful at this stage where we're trying to onboard quite a few people. I'm increasingly a fan of documentation-driven-development, and I think we should try to document the core data structures now. Do you see where I'm coming from?

hammer · 2020-08-03T15:11:23Z

I was imagining more high-level documentation describing the data structure layout and design philosophy, which I think would be very helpful at this stage where we're trying to onboard quite a few people. I'm increasingly a fan of documentation-driven-development, and I think we should try to document the core data structures now. Do you see where I'm coming from?

Yes! And in fact one reason for not using Hail is their unwillingness to document their file format, so I want to be clear that I don't imagine postponing this sort of documentation for too long.

I think this sort of high-level documentation describing the data structure layout and design philosophy is best done by core developers rather than new users, and I will file an issue to be sure we get that documentation written up soon.

I perceive the documentation in this PR to be targeted at new users and I think it's great to have this sort of documentation written by people relatively new to the project, and don't want to block it forever. I just don't think we're at a place for the project where we can support new users, and so we should wait to have this sort of documentation until we at least have a PyPI release, for example. Also, I can imagine user-targeted notebooks might best live in a separate repo, similar to the Xarray tutorial linked above.

jeromekelleher · 2020-08-03T15:18:29Z

OK, thanks @hammer, I think we're on the same page then.

@jerowe - how would you like to proceed here? I think it would be useful to have fresh eyes on the internal data structures, but maybe it's best to get some basics down on paper first as @hammer says. But I agree that we're not really ready for the type of notebooks that are in this PR, and we'd probably want these in different, more tutorial-focused repo.

jerowe · 2020-08-03T16:25:36Z

@jeromekelleher @hammer (and other interested parties) I am totally fine with changing it to RST. I've had success going from notebook -> markdown and I'm sure I can find a markdown -> RST converter, or I can just rewrite the docs in RST.

jerowe · 2020-08-03T16:26:24Z

End result - I will convert these to RST and ping all involved.

hammer · 2020-08-03T17:01:09Z

@jerowe as noted in https://github.com/pystatgen/sgkit/pull/78#issuecomment-668076872 I will open 2 issues, one to discuss documentation of data structures and design philosophy, and one to discuss a tutorial for new users. I think your documentation is best targeted at the latter issue.

To ensure your efforts are not in vain, you may want to wait until we achieve a resolution on those issues, as it's not obvious to me we want to merge this documentation into this repository right now, nor is it obvious that this documentation is better as reST.

hammer · 2020-08-03T17:29:46Z

Okay I've filed #87 and #88 so that we can continue these two discussions on issues rather than a PR.

jerowe · 2020-08-03T18:20:11Z

I'm fine with whatever the sgkit hive minds decision is. ;-) I'm not working on this until Wednesday, but I'll pop back in and check on issues / PRs and to keep myself informed.

jerowe · 2020-08-03T18:24:25Z

Oh an done other quick note.

Jupyter notebooks can be imported into sphinx docs through an extension.

tomwhite · 2020-08-03T19:11:47Z

The distinction @hammer made between #87 (core docs in reST) and #88 (tutorial format TBD) looks good to me.

…type-dataset-docs

jerowe · 2020-08-05T10:20:08Z

Sooooo I can't change files without making the CI freak out?

jerowe · 2020-08-05T12:13:25Z

I think this sort of high-level documentation describing the data structure layout and design philosophy is best done by core developers rather than new users, and I will file an issue to be sure we get that documentation written up soon.

I personally find this kind of attitude really chases people away from contributing. Generally speaking, if someone wants to contribute docs on another way of understanding something and gets told not to worry their little head until the people in charge make the decisions they will go through the process once or twice, but at some point, they simply move on. At that point, the opportunity is gone and the project has missed out on very often valuable insight from a third party that probably has expertise the software engineers lack.

In this instance am I considered a core developer or a new user? I am perfectly capable of understanding a fancy dict that is itself mostly a transposed VCF file. ;-)

alimanfoo · 2020-08-05T13:36:10Z

Hi @jerowe, apologies for the radio silence, I'm still catching up from vacation. FWIW I really appreciated you submitting this PR, we will absolutely need documentation like this. Also apologies for not doing more to connect you with the other developers yet and get everyone aligned. There have been some recent discussions about how to organise documentation and those are ongoing, so it would be good to include you in shaping that.

I think this sort of high-level documentation describing the data structure layout and design philosophy is best done by core developers rather than new users, and I will file an issue to be sure we get that documentation written up soon.

I personally find this kind of attitude really chases people away from contributing. Generally speaking, if someone wants to contribute docs on another way of understanding something and gets told not to worry their little head until the people in charge make the decisions they will go through the process once or twice, but at some point, they simply move on.

I think there may have been some crossed wires here, I can see how the comment could be read that way, but I don't think that was the intention. Please give us the benefit of the doubt if you can, I know everyone in the project is committed to working collectively and supportively as a team.

hammer · 2020-08-05T20:45:27Z

@jerowe I apologize for making your first PR to sgkit a poor experience. Upon a review of our open issues, it's clear that this PR was a direct response to @alimanfoo's https://github.com/pystatgen/sgkit/issues/42. We've only been working together in this repo since July and are still learning how to coordinate our work. Your input on how we can work more effectively is appreciated, and I want to be sure you feel included in that "we" moving forward.

I've invited you and the rest of the Quansight team to a Zoom call on Friday and I hope we can discuss the best path forward then, as clearly I've done a poor job communicating asynchronously here on GitHub.

I'd absolutely like you and the rest of the Quansight team to be "core developers" as soon as possible! My clumsy use of that term in https://github.com/pystatgen/sgkit/issues/87 was meant to refer to @alimanfoo and @eric-czech who wrote the code that inspired the creation of sgkit with scikit-allel and Eric's xarray prototype. My intent was to have an issue to ensure we capture their thinking directly. I volunteered to do that myself as I thought it would be a good chance for me to make sure I understand things; if you'd like to work on that issue together, that would be great too.

To seed the discussion for Friday, here are some concrete reasons why I think this documentation may be hard to maintain over the next few weeks. I do hope things will stabilize a bit by then and you will be a part of these discussions to help us get the project in shape for user-oriented documentation.

Minimal Numpy example

https://github.com/pystatgen/sgkit/issues/17 and https://github.com/pystatgen/sgkit/issues/81 are examples of how we're still figuring out how to name dimensions and variables.
https://github.com/pystatgen/sgkit/issues/43 is an example of how we are still feeling out whether we want a class-based API, which will directly impact how sgkit datasets are constructed.
https://github.com/pystatgen/sgkit/issues/65 is a discussion of whether we should have a single or multiple repos, which may change how users install and import sgkit.

Load from MalariaGEN Zarr

https://github.com/pystatgen/sgkit/pull/40 has not been merged, so we don't yet have a way to read these files in sgkit.
https://github.com/pystatgen/sgkit/issues/56 has some potential improvements that may follow soon after https://github.com/pystatgen/sgkit/pull/40 if they are deemed important by Oxford/Quansight.

Load from VCF

We do not yet have a VCF reader, and we are all a bit intimidated by what it will take to get one working well, so documentation on how to read data directly from a VCF file may be out of scope for a bit longer than the first two.

jerowe · 2020-08-06T05:43:48Z

Hey @hammer, don't worry. I don't take anything through text too seriously, as is a general best practice on the internet. ;-) I would like to see sgkit avoid some of the pitfalls of other projects I've worked on, where it's impossible to contribute anything so people simply go elsewhere if they have time to volunteer.

I personally prefer to document early with lots of "hey its still early disclaimers". In my experience it tends to never get done otherwise, and if you change variable names I can switch those around with sed anyways.

With that said I can certainly see the other side that it's an extra thing to maintain.

I don't actually mind either way, because I tend to write these things up as I work through a problem or understanding a concept.

I'll leave this here for a bit and go work on other things.

eric-czech · 2020-08-06T11:45:57Z

Hi @jerowe, I had another thought for you on design philosophy as it relates to these examples and documentation in general. This is a bit reductive but there are two opposing perspectives that have come up often in the API discussions:

Have domain-specific data structures with appropriate functions
Use one domain-agnostic data structure and make our "API" a bunch of functions

Perspective 1 is certainly more common (e.g. scikit-allele) and I think it lends well to these kinds of examples as well as documentation formalized around data structure models.

Perspective 2 would suggest that all genetic data structures are some manifestation of an Xarray dataset and that there is nothing particularly special about a combination of variables like the one implied in create_genotype_call_dataset -- that just happens to be more or less what we get from bgen/plink files. It would also suggest that rather than writing documentation about data structures like in this PR, we instead hang all the documentation on data readers and genetics methods. This is what Hail and Glow do. A global list of variables and interpretations could be a nice addition though, kind of like a VCF spec.

I've been a proponent of perspective 2 for a number of reasons and one of them is that when you look at the individual quantities (i.e data variables) required across a collection of statgen methods like this, there isn't a clear domain-specific data structure (or smallish set of data structures) that will satisfy most or all of them without being inflexible for others.

tl;dr There are two competing design philosophies that would lead to two nearly orthogonal documentation strategies. One of them is easier to learn, provides more concise access to fewer capabilities, and has greater alignment with docs/examples like this. The other is easier to maintain, facilitates more flexible workflows, and more closely resembles recent genetics toolkits for distributed systems -- at the cost of being less friendly to contributors. I've thought about trying to attempt some documentation myself a few times and I end up dropping it because we should figure out how to support all of the above first in addition to what @hammer mentioned. Separating core and user APIs is a promising path forward IMO (https://github.com/pystatgen/sgkit/issues/25#issuecomment-656093123).

jerowe · 2020-08-07T15:52:05Z

Should I close this out or leave it up for historical purposes? There's some good discussions in here, but I don't think the PR is useful ATM.

Thoughts?

jeromekelleher · 2020-08-07T17:11:04Z

I think we can probably close the PR for tidyness sake, thanks @jerowe, but it'd be good to keep the branch around so that we can view the content later (I'm assuming github doesn't keep contents of a PR available if you delete the branch?).

jerowe · 2020-08-09T07:24:25Z

Branch stays, and the notebooks are all on @alimanfoo 's datalab instance anyways.

initial Genotype Call Dataset from VCF jupyterhub notebook

da28adc

jerowe self-assigned this Jul 30, 2020

jerowe marked this pull request as draft July 30, 2020 14:03

adding in more examples

a25187e

hammer mentioned this pull request Aug 3, 2020

Provide tutorial(s) to introduce new users to sgkit #88

Closed

jerowe added the documentation Improvements or additions to documentation label Aug 4, 2020

jerowe added 4 commits August 5, 2020 10:50

let's see what the rst looks like

41ca483

updating from zarr notebook

5da2398

Merge branch 'master' of github.com:pystatgen/sgkit into feature/geno…

4666611

…type-dataset-docs

adding in the docs in ipynb format

6987698

be happy, CI

135ae69

jerowe requested a review from tomwhite August 5, 2020 11:02

hammer mentioned this pull request Aug 7, 2020

Document data structures and design philosophy #87

Closed

jerowe closed this Aug 9, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[WIP] Docs describing the Genotype Call XArray #78

[WIP] Docs describing the Genotype Call XArray #78

jerowe commented Jul 30, 2020

jerowe commented Jul 30, 2020

hammer commented Jul 30, 2020

jerowe commented Aug 2, 2020 •

edited

tomwhite commented Aug 3, 2020

jeromekelleher commented Aug 3, 2020

jeromekelleher commented Aug 3, 2020

hammer commented Aug 3, 2020

jeromekelleher commented Aug 3, 2020

hammer commented Aug 3, 2020 •

edited

jeromekelleher commented Aug 3, 2020

jerowe commented Aug 3, 2020

jerowe commented Aug 3, 2020

hammer commented Aug 3, 2020

hammer commented Aug 3, 2020 •

edited

jerowe commented Aug 3, 2020

jerowe commented Aug 3, 2020

tomwhite commented Aug 3, 2020

jerowe commented Aug 5, 2020

jerowe commented Aug 5, 2020

alimanfoo commented Aug 5, 2020

hammer commented Aug 5, 2020

jerowe commented Aug 6, 2020

eric-czech commented Aug 6, 2020 •

edited

jerowe commented Aug 7, 2020

jeromekelleher commented Aug 7, 2020

jerowe commented Aug 9, 2020

[WIP] Docs describing the Genotype Call XArray #78

[WIP] Docs describing the Genotype Call XArray #78

Conversation

jerowe commented Jul 30, 2020

jerowe commented Jul 30, 2020

hammer commented Jul 30, 2020

jerowe commented Aug 2, 2020 • edited

tomwhite commented Aug 3, 2020

jeromekelleher commented Aug 3, 2020

jeromekelleher commented Aug 3, 2020

hammer commented Aug 3, 2020

jeromekelleher commented Aug 3, 2020

hammer commented Aug 3, 2020 • edited

jeromekelleher commented Aug 3, 2020

jerowe commented Aug 3, 2020

jerowe commented Aug 3, 2020

hammer commented Aug 3, 2020

hammer commented Aug 3, 2020 • edited

jerowe commented Aug 3, 2020

jerowe commented Aug 3, 2020

tomwhite commented Aug 3, 2020

jerowe commented Aug 5, 2020

jerowe commented Aug 5, 2020

alimanfoo commented Aug 5, 2020

hammer commented Aug 5, 2020

Minimal Numpy example

Load from MalariaGEN Zarr

Load from VCF

jerowe commented Aug 6, 2020

eric-czech commented Aug 6, 2020 • edited

jerowe commented Aug 7, 2020

jeromekelleher commented Aug 7, 2020

jerowe commented Aug 9, 2020

jerowe commented Aug 2, 2020 •

edited

hammer commented Aug 3, 2020 •

edited

hammer commented Aug 3, 2020 •

edited

eric-czech commented Aug 6, 2020 •

edited