Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Docs describing the Genotype Call XArray #78

Closed
wants to merge 7 commits into from

Conversation

jerowe
Copy link

@jerowe jerowe commented Jul 30, 2020

I thought it would be nice to have some docs that conceptually describe what the Genotype Call Xarray Dataset is all about.

The intended audience is a scientist who is familiar with variant data, with at least some programming ability, but does not assume they are familiar with numpy, Xarray in general, or the Genotype Call Dataset format in particular.

@jerowe jerowe self-assigned this Jul 30, 2020
@jerowe jerowe marked this pull request as draft July 30, 2020 14:03
@jerowe
Copy link
Author

jerowe commented Jul 30, 2020

@alimanfoo @hammer I was going to do an example with a bed file, but the bed is in sgkit_plink. Should I not bother then?

@hammer
Copy link
Contributor

hammer commented Jul 30, 2020

Hey @jerowe thanks for the contribution!

We're still formulating the processes and tools we will use for documentation (https://github.com/pystatgen/sgkit/issues/19). Once our documentation is in place, we will write up some contributing guidelines (https://github.com/pystatgen/sgkit/issues/8).

Until those guidelines are ready, I think it would be best to either leave this PR open or stash these notebooks in a personal repo until we're ready to figure out where to put them.

Also, we'd prefer that every PR has an issue open that it can reference. That's something we plan to document in our contributing guidelines, so no worries for not having an issue for this PR!

@jerowe
Copy link
Author

jerowe commented Aug 2, 2020

@hammer if you don't mind I'd prefer to keep these here, even as a placeholder. The PR itself is marked as a draft. I'm working on this with Quansight with @daletovar and @dharhas, so I don't have a personal repo. ;-)

I have no strong feelings regarding doc standards. I was on a call with @alimanfoo and he said it was a bit up in the air, so just go with notebooks for now.

I can certainly open an issue for this. ;-)

@tomwhite
Copy link
Collaborator

tomwhite commented Aug 3, 2020

@jerowe this is great! I think documenting the xarray representation is core, so it's probably best to use reST format so it gets included in the main documentation. (I've set up Sphinx in this PR: https://github.com/pystatgen/sgkit/pull/51, not yet merged.) I'm thinking that it would be a bit like http://xarray.pydata.org/en/stable/data-structures.html.

For a longer tutorial, a Jupyter Notebook might be more appropriate so users can run the whole thing, but that could be a separate PR.

@jeromekelleher
Copy link
Collaborator

We should definitely get this key information documented as soon as we can, but I agree with @tomwhite that it should be done in rst as part of the sphinx docs. Hopefully we can get #51 merged ASAP, so that we're not blocking on getting the docs fleshed out.

@jeromekelleher
Copy link
Collaborator

@jerowe - do you think it's reasonable to convert the fundamental information here into rst/sphinx rather than a notebook? It would be really excellent to get this nailed down in solid documentation.

@hammer
Copy link
Contributor

hammer commented Aug 3, 2020

getting the docs fleshed out.

@jeromekelleher I'd be curious to hear why you perceive this to be something we should do right now. While I think API docs are great to have as we implement new functionality, I'd prefer to hold off on user docs until we have stabilized the package structure and core data structures.

@jeromekelleher
Copy link
Collaborator

@hammer - so having gone through the notebooks now (it's sooo annoying that GitHub doesn't show the actual notebooks in the PR!), I see what you mean, and these are more example driven and user-focused that we'd want to commit to right now.

I was imagining more high-level documentation describing the data structure layout and design philosophy, which I think would be very helpful at this stage where we're trying to onboard quite a few people. I'm increasingly a fan of documentation-driven-development, and I think we should try to document the core data structures now. Do you see where I'm coming from?

@hammer
Copy link
Contributor

hammer commented Aug 3, 2020

I was imagining more high-level documentation describing the data structure layout and design philosophy, which I think would be very helpful at this stage where we're trying to onboard quite a few people. I'm increasingly a fan of documentation-driven-development, and I think we should try to document the core data structures now. Do you see where I'm coming from?

Yes! And in fact one reason for not using Hail is their unwillingness to document their file format, so I want to be clear that I don't imagine postponing this sort of documentation for too long.

I think this sort of high-level documentation describing the data structure layout and design philosophy is best done by core developers rather than new users, and I will file an issue to be sure we get that documentation written up soon.

I perceive the documentation in this PR to be targeted at new users and I think it's great to have this sort of documentation written by people relatively new to the project, and don't want to block it forever. I just don't think we're at a place for the project where we can support new users, and so we should wait to have this sort of documentation until we at least have a PyPI release, for example. Also, I can imagine user-targeted notebooks might best live in a separate repo, similar to the Xarray tutorial linked above.

@jeromekelleher
Copy link
Collaborator

OK, thanks @hammer, I think we're on the same page then.

@jerowe - how would you like to proceed here? I think it would be useful to have fresh eyes on the internal data structures, but maybe it's best to get some basics down on paper first as @hammer says. But I agree that we're not really ready for the type of notebooks that are in this PR, and we'd probably want these in different, more tutorial-focused repo.

@jerowe
Copy link
Author

jerowe commented Aug 3, 2020

@jeromekelleher @hammer (and other interested parties) I am totally fine with changing it to RST. I've had success going from notebook -> markdown and I'm sure I can find a markdown -> RST converter, or I can just rewrite the docs in RST.

@jerowe
Copy link
Author

jerowe commented Aug 3, 2020

End result - I will convert these to RST and ping all involved.

@hammer
Copy link
Contributor

hammer commented Aug 3, 2020

@jerowe as noted in https://github.com/pystatgen/sgkit/pull/78#issuecomment-668076872 I will open 2 issues, one to discuss documentation of data structures and design philosophy, and one to discuss a tutorial for new users. I think your documentation is best targeted at the latter issue.

To ensure your efforts are not in vain, you may want to wait until we achieve a resolution on those issues, as it's not obvious to me we want to merge this documentation into this repository right now, nor is it obvious that this documentation is better as reST.

@hammer
Copy link
Contributor

hammer commented Aug 3, 2020

Okay I've filed #87 and #88 so that we can continue these two discussions on issues rather than a PR.

@jerowe
Copy link
Author

jerowe commented Aug 3, 2020

I'm fine with whatever the sgkit hive minds decision is. ;-) I'm not working on this until Wednesday, but I'll pop back in and check on issues / PRs and to keep myself informed.

@jerowe
Copy link
Author

jerowe commented Aug 3, 2020

Oh an done other quick note.

Jupyter notebooks can be imported into sphinx docs through an extension.

@tomwhite
Copy link
Collaborator

tomwhite commented Aug 3, 2020

The distinction @hammer made between #87 (core docs in reST) and #88 (tutorial format TBD) looks good to me.

@jerowe jerowe added the documentation Improvements or additions to documentation label Aug 4, 2020
@jerowe
Copy link
Author

jerowe commented Aug 5, 2020

Sooooo I can't change files without making the CI freak out?

@jerowe jerowe requested a review from tomwhite August 5, 2020 11:02
@jerowe
Copy link
Author

jerowe commented Aug 5, 2020

I think this sort of high-level documentation describing the data structure layout and design philosophy is best done by core developers rather than new users, and I will file an issue to be sure we get that documentation written up soon.

I personally find this kind of attitude really chases people away from contributing. Generally speaking, if someone wants to contribute docs on another way of understanding something and gets told not to worry their little head until the people in charge make the decisions they will go through the process once or twice, but at some point, they simply move on. At that point, the opportunity is gone and the project has missed out on very often valuable insight from a third party that probably has expertise the software engineers lack.

In this instance am I considered a core developer or a new user? I am perfectly capable of understanding a fancy dict that is itself mostly a transposed VCF file. ;-)

@alimanfoo
Copy link
Collaborator

Hi @jerowe, apologies for the radio silence, I'm still catching up from vacation. FWIW I really appreciated you submitting this PR, we will absolutely need documentation like this. Also apologies for not doing more to connect you with the other developers yet and get everyone aligned. There have been some recent discussions about how to organise documentation and those are ongoing, so it would be good to include you in shaping that.

I think this sort of high-level documentation describing the data structure layout and design philosophy is best done by core developers rather than new users, and I will file an issue to be sure we get that documentation written up soon.

I personally find this kind of attitude really chases people away from contributing. Generally speaking, if someone wants to contribute docs on another way of understanding something and gets told not to worry their little head until the people in charge make the decisions they will go through the process once or twice, but at some point, they simply move on.

I think there may have been some crossed wires here, I can see how the comment could be read that way, but I don't think that was the intention. Please give us the benefit of the doubt if you can, I know everyone in the project is committed to working collectively and supportively as a team.

@hammer
Copy link
Contributor

hammer commented Aug 5, 2020

@jerowe I apologize for making your first PR to sgkit a poor experience. Upon a review of our open issues, it's clear that this PR was a direct response to @alimanfoo's https://github.com/pystatgen/sgkit/issues/42. We've only been working together in this repo since July and are still learning how to coordinate our work. Your input on how we can work more effectively is appreciated, and I want to be sure you feel included in that "we" moving forward.

I've invited you and the rest of the Quansight team to a Zoom call on Friday and I hope we can discuss the best path forward then, as clearly I've done a poor job communicating asynchronously here on GitHub.

I'd absolutely like you and the rest of the Quansight team to be "core developers" as soon as possible! My clumsy use of that term in https://github.com/pystatgen/sgkit/issues/87 was meant to refer to @alimanfoo and @eric-czech who wrote the code that inspired the creation of sgkit with scikit-allel and Eric's xarray prototype. My intent was to have an issue to ensure we capture their thinking directly. I volunteered to do that myself as I thought it would be a good chance for me to make sure I understand things; if you'd like to work on that issue together, that would be great too.

To seed the discussion for Friday, here are some concrete reasons why I think this documentation may be hard to maintain over the next few weeks. I do hope things will stabilize a bit by then and you will be a part of these discussions to help us get the project in shape for user-oriented documentation.

Minimal Numpy example

Load from MalariaGEN Zarr

Load from VCF

  • We do not yet have a VCF reader, and we are all a bit intimidated by what it will take to get one working well, so documentation on how to read data directly from a VCF file may be out of scope for a bit longer than the first two.

@jerowe
Copy link
Author

jerowe commented Aug 6, 2020

Hey @hammer, don't worry. I don't take anything through text too seriously, as is a general best practice on the internet. ;-) I would like to see sgkit avoid some of the pitfalls of other projects I've worked on, where it's impossible to contribute anything so people simply go elsewhere if they have time to volunteer.

I personally prefer to document early with lots of "hey its still early disclaimers". In my experience it tends to never get done otherwise, and if you change variable names I can switch those around with sed anyways.

With that said I can certainly see the other side that it's an extra thing to maintain.

I don't actually mind either way, because I tend to write these things up as I work through a problem or understanding a concept.

I'll leave this here for a bit and go work on other things.

@eric-czech
Copy link
Collaborator

eric-czech commented Aug 6, 2020

Hi @jerowe, I had another thought for you on design philosophy as it relates to these examples and documentation in general. This is a bit reductive but there are two opposing perspectives that have come up often in the API discussions:

  1. Have domain-specific data structures with appropriate functions
  2. Use one domain-agnostic data structure and make our "API" a bunch of functions

Perspective 1 is certainly more common (e.g. scikit-allele) and I think it lends well to these kinds of examples as well as documentation formalized around data structure models.

Perspective 2 would suggest that all genetic data structures are some manifestation of an Xarray dataset and that there is nothing particularly special about a combination of variables like the one implied in create_genotype_call_dataset -- that just happens to be more or less what we get from bgen/plink files. It would also suggest that rather than writing documentation about data structures like in this PR, we instead hang all the documentation on data readers and genetics methods. This is what Hail and Glow do. A global list of variables and interpretations could be a nice addition though, kind of like a VCF spec.

I've been a proponent of perspective 2 for a number of reasons and one of them is that when you look at the individual quantities (i.e data variables) required across a collection of statgen methods like this, there isn't a clear domain-specific data structure (or smallish set of data structures) that will satisfy most or all of them without being inflexible for others.

tl;dr There are two competing design philosophies that would lead to two nearly orthogonal documentation strategies. One of them is easier to learn, provides more concise access to fewer capabilities, and has greater alignment with docs/examples like this. The other is easier to maintain, facilitates more flexible workflows, and more closely resembles recent genetics toolkits for distributed systems -- at the cost of being less friendly to contributors. I've thought about trying to attempt some documentation myself a few times and I end up dropping it because we should figure out how to support all of the above first in addition to what @hammer mentioned. Separating core and user APIs is a promising path forward IMO (https://github.com/pystatgen/sgkit/issues/25#issuecomment-656093123).

@jerowe
Copy link
Author

jerowe commented Aug 7, 2020

Should I close this out or leave it up for historical purposes? There's some good discussions in here, but I don't think the PR is useful ATM.

Thoughts?

@jeromekelleher
Copy link
Collaborator

I think we can probably close the PR for tidyness sake, thanks @jerowe, but it'd be good to keep the branch around so that we can view the content later (I'm assuming github doesn't keep contents of a PR available if you delete the branch?).

@jerowe
Copy link
Author

jerowe commented Aug 9, 2020

Branch stays, and the notebooks are all on @alimanfoo 's datalab instance anyways.

@jerowe jerowe closed this Aug 9, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

6 participants