Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update: TERRA Workflow Documentation #999

Draft
wants to merge 10 commits into
base: master
Choose a base branch
from
Draft

Update: TERRA Workflow Documentation #999

wants to merge 10 commits into from

Conversation

j23414
Copy link

@j23414 j23414 commented Sep 9, 2022

Description of proposed changes

Since occasional questions about our Terra workflows have been raised during office hours or via DM's, the goal is to update documentation of 3 WDL Terra workflows:

  1. ncov:master - run the basic ncov workflow
  2. ncov:wdl/genbank_ingest - pull a public dataset and send them through our preprocessing scripts.
  3. ncov:wdl/gisaid_ingest - pull a private dataset if a user has their own API key, account, and password. Mostly to make available our preprocessing scripts.

The workflows are separated so that only parameters specific to a particular usecase are shown in Terra.

Related issue(s)

Related to #https://github.com/nextstrain/private/issues/11

Testing

Since this PR should only change the rst files (documentation), testing mainly consist of ensuring that the documents build. Any problematic steps are being fixed if/when they arise with each onboarded user. Comments or suggestions are still welcome though.

Linking the changed pages for ease of proof-reading:

Or the live draft pages:

@j23414 j23414 force-pushed the wdl/docs branch 3 times, most recently from 5f805fe to c81a837 Compare September 12, 2022 17:08
Copy link
Member

@victorlin victorlin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't run through the steps, but added some suggestions for wording/styling.

docs/src/guides/run-ingest-on-terra.rst Outdated Show resolved Hide resolved
docs/src/guides/run-ingest-on-terra.rst Outdated Show resolved Hide resolved
docs/src/guides/run-ingest-on-terra.rst Outdated Show resolved Hide resolved
@@ -64,8 +70,36 @@ Connect your data files to the WDL workflow
|Nextstrain_WRKFLW| sequence_fasta | File | this.sequences |
+-----------------+------------------+-------+----------------------+

10. Click on the **OUTPUTS** tab
11. Connect your generated output back to the data table, but filling in values:
10. If creating a build with multiple sequence and metadata files, can upload a targz folder containing the files. Otherwise skip
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems like a really cool feature! What if you included an example here of what the tar archive's structure should be, so people have a sense of how to prepare this file prior to upload? The expression "targz folder" makes me think that I would need to place my files in a directory and then tar and gzip that directory. Is that correct?

We should also figure out how we want to consistently refer to compressed tar archives throughout the docs. I know what you mean by "targz folder" but we sometimes use "tarball" or "tar archive", etc.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the feedback! And yes, that is correct!

I'm thinking through how to add an example of the tarball structure. Originally, the tarball contextual sequences feature was added by this request.

I'll probably add it as a part 2... maybe something like:

Now that you've worked through the minimal case, here is an example of adding contextual sequences and a configfile in a tarball. The structure of the directory must be ...

I'll think through it a bit more

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we move:

Now that you've worked through the minimal case, here is an example of adding contextual sequences and a configfile in a tarball. The structure of the directory must be ...

As something to-be-done in a future PR?

Copy link
Member

@victorlin victorlin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still walking through this as noted on Slack. Sending in some small comments for now.

docs/src/guides/run-ingest-on-terra.rst Outdated Show resolved Hide resolved
docs/src/guides/run-ingest-on-terra.rst Outdated Show resolved Hide resolved
Document 3 WDL workflows to be run on Terra:

1. ncov/ncov - run the basic ncov workflow
2. ncov/genbank_ingest - pull a public dataset and send them through our preprocessing scripts.
3. ncov/gisaid_ingest - pull a private dataset if a user has their own API key, account, and password. Mostly to make available our preprocessing scripts.

The workflows are separated so that only parameters specific to a particular usecase are shown in Terra.
Apply suggestions from code review related to wording and spelling.

* styling fixes
* update docs to reflect observational differences

Co-authored-by: Victor Lin <13424970+victorlin@users.noreply.github.com>
Co-authored-by: Jover Lee <joverlee521@gmail.com>
Comment on lines +49 to +50
guides/run-genbank-ingest-on-terra
guides/run-gisaid-ingest-on-terra
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: have a parent page Ingest SARS-CoV-2 Data on Terra which contains the intro text that is currently duplicated. You can then link to the sub-pages using a .. toctree:: on that page, and the hierarchy should also be reflected in the main sidebar.

URLs might look like:

  • projects/ncov/en/latest/guides/ingest-data-on-terra
  • projects/ncov/en/latest/guides/ingest-data-on-terra/genbank
  • projects/ncov/en/latest/guides/ingest-data-on-terra/gisaid

Copy link
Member

@victorlin victorlin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding some comments from a new look at these docs, and another pass through the second section of the GenBank ingest page.

docs/src/guides/run-genbank-ingest-on-terra.rst Outdated Show resolved Hide resolved
docs/src/guides/run-genbank-ingest-on-terra.rst Outdated Show resolved Hide resolved
docs/src/guides/run-gisaid-ingest-on-terra.rst Outdated Show resolved Hide resolved
Comment on lines +30 to +31
Connect any workspace variables to the wdl ingest workflow
===========================================================
Copy link
Member

@victorlin victorlin Dec 27, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The steps in this section work, which is great! But, as someone who is unfamiliar with Terra, I don't really understand what I'm doing here. Some lingering questions:

  1. What does "root entity type" mean, and what is ncov_examples?
  2. Select the 1st row in the data table. The first column should have value blank. Selecting more rows will cause the workflow to run more than once.

    1. This works, but seems hacky. What does blank mean?
    2. Why are there other rows available if we aren't selecting them? Are those for other use cases?
  3. For the OUTPUTS tab:
    1. Why can't I just use the default values this.*?
    2. What's the significance of using workspace.*?
    3. What about the last_run variable? Should this be set for proper caching?
  4. Once submitted, there is a table with column "Data Entity" and the submission has the value "blank (ncov_examples)". What does this mean? I assume understanding (1) and (2) will help here, but I also wonder if the value can be more descriptive (e.g. default instead of blank).

4. Under **Step 2**:

1. Click **SELECT DATA**.
2. Select **Choose specific ncov_examples to process**.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This shows on Terra as "Choose specific ncov_exampless to process". It looks weird, can it be renamed so there is only one plural s?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, would ncov_example_set work?

That way, Terra will display "Choose specific ncov_example_sets to process". I'm open to other suggestions.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand what a "set" is in Terra, but I'm not sure we want to be using the _set suffix. I noticed that most data tables created in our Development workspace are accompanied by another table with the same name suffixed with _set. This other table might be auto-created when data is selected in the process of running a workflow.

I just created a data table victorlin_test_set and noticed that it generated an additional column victorlin_tests. This doesn't happen without the _set suffix.

There's a doc page on set tables which seems relevant.

docs/src/guides/run-genbank-ingest-on-terra.rst Outdated Show resolved Hide resolved
Comment on lines +16 to +17
Import the GenBank ingest wdl workflow from Dockstore
======================================================
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can the section titles be simplified to "Import the workflow" and "Run the workflow"?

Jennifer Chang and others added 8 commits December 27, 2022 12:36
Co-authored-by: Victor Lin <13424970+victorlin@users.noreply.github.com>
Keep link URLs inline rather than referenced separately at the bottom, unless it's used multiple times.
#999 (comment)
Add a note that this is a one-time "setup" per user. Users will want to run the workflow more than once, but it's not necessary to follow the initial steps more than once.
Preface Terra ingest instructions with a warning on run time and cost, so users know what to expect and can assess if they have the time/budget to follow these steps
The GISAID Ingest on Terra requires the user to have access to an API endpoint. Direct them to the GISAID webpage.
Provide an example Data Table tsv file to upload to Terra.
Copy link
Member

@victorlin victorlin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some comments from my first pass through the run-analysis-on-terra page.

2. Click on the radio button **Run workflow(s) with inputs defined by data table**.
3. Under **Step 1**:

1. Select root entity type as **ncov_examples** from the drop down menu.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the ncov_examples data table that is created in the run-analysis-on-terra guide? If so, this creates a chicken-egg problem – I'd imagine users would want to run ingest first, followed by the workflow that produces builds.

Suggestion: Put the data table creation in a new terra-workspace-setup page which would then be a pre-requisite for all the guides here.

Copy link
Author

@j23414 j23414 Dec 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd imagine users would want to run ingest first, followed by the workflow that produces builds.

We need to clarify this. @huddlej? I'm expecting users to first encounter "run-analysis-on-terra" similar to our SARS-CoV-2 Workflow documentation, where they upload example metadata and sequences files.

After which, the user learns how to pull and include their own context sequences (equivalent to run-XXX-ingest-on-terra:

But this is a good point, I will reword the beginning of this tutorial to explicitly recommend tutorial order. I think we (or maybe just me) have gotten confused because this tutorial was also trying to address a very specific need (where that individual group would need ingest, then ncov).


Connect your data files to the WDL workflow
===========================================

1. On the **DATA** tab, click on **+** next to the **TABLES** section to create a Data Table
#. Download the "sample_template.tsv" file
#. Create a tab delimited file similar to below:
2. Download the "sample_template.tsv" file
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
2. Download the "sample_template.tsv" file

This file is unused, and there is already a link to download another sample file that is more appropriate for this guide.

Copy link
Author

@j23414 j23414 Dec 30, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, agree. I only recently added the "download another sample file" in 7410bc5 but should have made the earlier instructions a note (or link to Terra docs).

Thanks for the feedback! I probably need to link and/or summarize these two Terra concepts:

Comment on lines +40 to 47
3. Create a tab delimited file similar to below :download:`example script <./ncov_examples.tsv>`:

::

entity:ncov_examples_id metadata sequences configfile_yaml
blank
example gs://COPY_PATH_HERE/example_metadata.tsv gs://COPY_PATH_HERE/example_datasets/example_sequences.fasta.gz
example_build gs://COPY_PATH_HERE/example-build.yaml
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: either provide just a download link, e.g.

Suggested change
3. Create a tab delimited file similar to below :download:`example script <./ncov_examples.tsv>`:
::
entity:ncov_examples_id metadata sequences configfile_yaml
blank
example gs://COPY_PATH_HERE/example_metadata.tsv gs://COPY_PATH_HERE/example_datasets/example_sequences.fasta.gz
example_build gs://COPY_PATH_HERE/example-build.yaml
3. Download :download:`ncov_examples.tsv <./ncov_examples.tsv>`.

or just the text block (there is a handy copy-to-clipboard button on the rendered docs page). If keeping the text block, make sure it's using tabs and not spaces.

+-----------------+-----------------------+--------+--------------------------------+

13. Click on the **OUTPUTS** tab
14. Connect your generated output back to the data table, but filling in values:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like these are the default values. Suggestion:

Suggested change
14. Connect your generated output back to the data table, but filling in values:
14. Click **Use defaults** next to **Attributes** (the last column).

and remove the markdown table below.


::

entity:ncov_examples_id metadata sequences configfile_yaml
blank
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand how this links the ingest workflow output (workspace data files?) to this workflow. Are the gs:// paths still necessary here?

Comment on lines 46 to 47
example gs://COPY_PATH_HERE/example_metadata.tsv gs://COPY_PATH_HERE/example_datasets/example_sequences.fasta.gz
example_build gs://COPY_PATH_HERE/example-build.yaml
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggestion: Add instructions on how to populate these paths. My assumptions:

  1. configfile_yaml: This should reference a workflow config file uploaded to workspace data files. The gs:// path can be copied from links on the Workspace Data Files view.
  2. metadata: This should reference a metadata TSV file. (How do you reference the ingest workflow result, is it something like workspace.genbank_metadata_tsv?)
  3. sequences: This should reference a sequences FASTA file.

@j23414 j23414 marked this pull request as draft April 14, 2023 19:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

None yet

4 participants