Update: TERRA Workflow Documentation #999

j23414 · 2022-09-09T00:04:08Z

Description of proposed changes

Because questions about our Terra workflows occasionally come up during office hours or via DMs, this PR updates the documentation for 3 WDL Terra workflows:

ncov:master - run the basic ncov workflow
ncov:wdl/genbank_ingest - pull a public dataset and send it through our preprocessing scripts.
ncov:wdl/gisaid_ingest - pull a private dataset if a user has their own API key, account, and password. Mostly to make our preprocessing scripts available.

The workflows are separated so that only parameters specific to a particular use case are shown in Terra.

Related issue(s)

Related to https://github.com/nextstrain/private/issues/11

Testing

Since this PR should only change the rst files (documentation), testing mainly consists of ensuring that the documents build. Problematic steps are fixed as they arise with each onboarded user. Comments and suggestions are still welcome.
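The "documents build" check could be scripted roughly as follows. This is a minimal sketch that assumes the docs are built with Sphinx from `docs/src`; the output directory and the choice of `-W` (treat warnings as errors) are assumptions for illustration, not the project's configured build, and `sphinx_build_command` is a hypothetical helper.

```python
import shlex


def sphinx_build_command(source_dir: str, output_dir: str) -> list[str]:
    """Return a sphinx-build invocation that fails on warnings.

    -b html selects the HTML builder; -W turns warnings into errors so
    that malformed rst or broken references stop the build.
    """
    return ["sphinx-build", "-b", "html", "-W", source_dir, output_dir]


if __name__ == "__main__":
    # Both paths are assumptions; adjust them to the repository layout.
    command = sphinx_build_command("docs/src", "docs/build/html")
    # Print the command rather than running it, so the sketch has no side effects.
    print(shlex.join(command))
```

Passing the returned list to `subprocess.run(command, check=True)` would run the build and raise if Sphinx reports a problem.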

Linking the changed pages for ease of proof-reading:

Or the live draft pages:

docs/src/guides/run-ingest-on-terra.rst

docs/src/guides/run-analysis-on-terra.rst

victorlin

I didn't run through the steps, but added some suggestions for wording/styling.

docs/src/guides/run-ingest-on-terra.rst

huddlej · 2022-09-12T21:46:34Z

docs/src/guides/run-analysis-on-terra.rst

@@ -64,8 +70,36 @@ Connect your data files to the WDL workflow
  |Nextstrain_WRKFLW|  sequence_fasta  | File  | this.sequences       |
  +-----------------+------------------+-------+----------------------+

-10. Click on the **OUTPUTS** tab
-11. Connect your generated output back to the data table, but filling in values:
+10. If creating a build with multiple sequence and metadata files, can upload a targz folder containing the files. Otherwise skip


This seems like a really cool feature! What if you included an example here of what the tar archive's structure should be, so people have a sense of how to prepare this file prior to upload? The expression "targz folder" makes me think that I would need to place my files in a directory and then tar and gzip that directory. Is that correct?

We should also figure out how we want to consistently refer to compressed tar archives throughout the docs. I know what you mean by "targz folder" but we sometimes use "tarball" or "tar archive", etc.

Thank you for the feedback! And yes, that is correct!
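For reference, placing files in a directory and then tarring and gzipping that directory could look like the sketch below. The directory and file names are hypothetical, and the layout the workflow actually expects is not specified in this thread, so treat the structure as a placeholder.

```python
import tarfile
import tempfile
from pathlib import Path


def make_tarball(source_dir: Path, archive_path: Path) -> list[str]:
    """Tar and gzip source_dir, then return the member names in the archive."""
    with tarfile.open(archive_path, "w:gz") as archive:
        # arcname keeps only the directory name, not the full local path.
        archive.add(source_dir, arcname=source_dir.name)
    with tarfile.open(archive_path, "r:gz") as archive:
        return archive.getnames()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical input files standing in for real sequences and metadata.
        build_dir = Path(tmp) / "my_build"
        build_dir.mkdir()
        (build_dir / "sequences.fasta").write_text(">sample1\nACGT\n")
        (build_dir / "metadata.tsv").write_text("strain\tdate\nsample1\t2022-01-01\n")

        for name in make_tarball(build_dir, Path(tmp) / "my_build.tar.gz"):
            print(name)
```

Listing the member names after writing is a cheap way to confirm what the archive contains before uploading it.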

I'm thinking through how to add an example of the tarball structure. Originally, the tarball contextual sequences feature was added by this request.

I'll probably add it as a part 2... maybe something like:

Now that you've worked through the minimal case, here is an example of adding contextual sequences and a configfile in a tarball. The structure of the directory must be ...

I'll think through it a bit more

Could we move:

Now that you've worked through the minimal case, here is an example of adding contextual sequences and a configfile in a tarball. The structure of the directory must be ...

As something to-be-done in a future PR?

docs/src/guides/run-analysis-on-terra.rst

docs/src/guides/run-ingest-on-terra.rst

victorlin

Still walking through this as noted on Slack. Sending in some small comments for now.

docs/src/guides/run-ingest-on-terra.rst

Document 3 WDL workflows to be run on Terra:

1. ncov/ncov - run the basic ncov workflow
2. ncov/genbank_ingest - pull a public dataset and send them through our preprocessing scripts.
3. ncov/gisaid_ingest - pull a private dataset if a user has their own API key, account, and password. Mostly to make available our preprocessing scripts.

The workflows are separated so that only parameters specific to a particular usecase are shown in Terra.

Apply suggestions from code review related to wording and spelling.

* styling fixes
* update docs to reflect observational differences

Co-authored-by: Victor Lin <13424970+victorlin@users.noreply.github.com>
Co-authored-by: Jover Lee <joverlee521@gmail.com>

victorlin · 2022-12-07T18:21:39Z

docs/src/index.rst

+   guides/run-genbank-ingest-on-terra   
+   guides/run-gisaid-ingest-on-terra  


Suggestion: have a parent page Ingest SARS-CoV-2 Data on Terra which contains the intro text that is currently duplicated. You can then link to the sub-pages using a .. toctree:: on that page, and the hierarchy should also be reflected in the main sidebar.

URLs might look like:

projects/ncov/en/latest/guides/ingest-data-on-terra

projects/ncov/en/latest/guides/ingest-data-on-terra/genbank

projects/ncov/en/latest/guides/ingest-data-on-terra/gisaid

victorlin

Adding some comments from a new look at these docs, and another pass through the second section of the GenBank ingest page.

docs/src/guides/run-genbank-ingest-on-terra.rst

docs/src/guides/run-gisaid-ingest-on-terra.rst

docs/src/guides/run-genbank-ingest-on-terra.rst

victorlin · 2022-12-27T18:15:45Z

docs/src/guides/run-genbank-ingest-on-terra.rst

+Connect any workspace variables to the wdl ingest workflow
+===========================================================


The steps in this section work, which is great! But, as someone who is unfamiliar with Terra, I don't really understand what I'm doing here. Some lingering questions:

1. What does "root entity type" mean, and what is ncov_examples?

2. Select the 1st row in the data table. The first column should have value blank. Selecting more rows will cause the workflow to run more than once.

   This works, but seems hacky. What does blank mean?

   Why are there other rows available if we aren't selecting them? Are those for other use cases?

3. For the OUTPUTS tab:

   - Why can't I just use the default values this.*?
   - What's the significance of using workspace.*?
   - What about the last_run variable? Should this be set for proper caching?

4. Once submitted, there is a table with column "Data Entity" and the submission has the value "blank (ncov_examples)". What does this mean? I assume understanding (1) and (2) will help here, but I also wonder if the value can be more descriptive (e.g. default instead of blank).

victorlin · 2022-12-27T18:17:25Z

docs/src/guides/run-genbank-ingest-on-terra.rst

+4. Under **Step 2**:
+
+  1. Click **SELECT DATA**.
+  2. Select **Choose specific ncov_examples to process**.


This shows on Terra as "Choose specific ncov_exampless to process". It looks weird, can it be renamed so there is only one plural s?

Yes, would ncov_example_set work?

That way, Terra will display "Choose specific ncov_example_sets to process". I'm open to other suggestions.
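The doubled "s" is consistent with Terra appending an "s" to the table name when it builds this label. That rule is inferred only from the two labels quoted above, not from Terra's documentation, and `selection_label` is a hypothetical helper used to illustrate it.

```python
def selection_label(table_name: str) -> str:
    """Mimic the observed label: the table name with a trailing "s" appended."""
    return f"Choose specific {table_name}s to process"


if __name__ == "__main__":
    # Current table name versus the proposed rename.
    for name in ("ncov_examples", "ncov_example_set"):
        print(selection_label(name))
```

Under this assumption, any table name that already ends in "s" will show a doubled plural.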

I don't understand what a "set" is in Terra, but I'm not sure we want to be using the _set suffix. I noticed that most data tables created in our Development workspace are accompanied by another table with the same name suffixed with _set. This other table might be auto-created when data is selected in the process of running a workflow.

I just created a data table victorlin_test_set and noticed that it generated an additional column victorlin_tests. This doesn't happen without the _set suffix.

There's a doc page on set tables which seems relevant.

docs/src/guides/run-genbank-ingest-on-terra.rst

victorlin · 2022-12-27T18:25:52Z

docs/src/guides/run-genbank-ingest-on-terra.rst

+Import the GenBank ingest wdl workflow from Dockstore
+======================================================


Can the section titles be simplified to "Import the workflow" and "Run the workflow"?

docs/src/guides/run-genbank-ingest-on-terra.rst

Co-authored-by: Victor Lin <13424970+victorlin@users.noreply.github.com>

Keep link URLs inline rather than referenced separately at the bottom, unless it's used multiple times. #999 (comment)

Add a note that this is a one-time "setup" per user. Users will want to run the workflow more than once, but it's not necessary to follow the initial steps more than once.

Preface Terra ingest instructions with a warning on run time and cost, so users know what to expect and can assess if they have the time/budget to follow these steps

The GISAID Ingest on Terra requires the user to have access to an API endpoint. Direct them to the GISAID webpage.

Provide an example Data Table tsv file to upload to Terra.

victorlin

Some comments from my first pass through the run-analysis-on-terra page.

victorlin · 2022-12-28T22:22:53Z

docs/src/guides/run-genbank-ingest-on-terra.rst

+2. Click on the radio button **Run workflow(s) with inputs defined by data table**.
+3. Under **Step 1**:
+
+  1. Select root entity type as **ncov_examples** from the drop down menu.


Is this the ncov_examples data table that is created in the run-analysis-on-terra guide? If so, this creates a chicken-egg problem – I'd imagine users would want to run ingest first, followed by the workflow that produces builds.

Suggestion: Put the data table creation in a new terra-workspace-setup page which would then be a pre-requisite for all the guides here.

I'd imagine users would want to run ingest first, followed by the workflow that produces builds.

We need to clarify this. @huddlej? I'm expecting users to first encounter "run-analysis-on-terra" similar to our SARS-CoV-2 Workflow documentation, where they upload example metadata and sequences files.

https://docs.nextstrain.org/projects/ncov/en/latest/tutorial/example-data.html

After which, the user learns how to pull and include their own context sequences (equivalent to run-XXX-ingest-on-terra):

https://docs.nextstrain.org/projects/ncov/en/latest/tutorial/genomic-surveillance.html

But this is a good point, I will reword the beginning of this tutorial to explicitly recommend tutorial order. I think we (or maybe just me) have gotten confused because this tutorial was also trying to address a very specific need (where that individual group would need ingest, then ncov).

victorlin · 2022-12-28T22:24:59Z

docs/src/guides/run-analysis-on-terra.rst


 Connect your data files to the WDL workflow
 ===========================================

 1. On the **DATA** tab, click on **+** next to the **TABLES** section to create a Data Table
-#. Download the "sample_template.tsv" file
-#. Create a tab delimited file similar to below:
+2. Download the "sample_template.tsv" file


Suggested change

-2. Download the "sample_template.tsv" file

This file is unused, and there is already a link to download another sample file that is more appropriate for this guide.

Yes, agree. I only recently added the "download another sample file" in 7410bc5 but should have made the earlier instructions a note (or link to Terra docs).

Thanks for the feedback! I probably need to link and/or summarize these two Terra concepts:

Data Tables - https://support.terra.bio/hc/en-us/articles/360025758392

Workspace Data - https://support.terra.bio/hc/en-us/articles/4417296435483-How-to-add-workspace-level-data-workspace-data-table-

victorlin · 2022-12-28T22:30:44Z

docs/src/guides/run-analysis-on-terra.rst

+3. Create a tab delimited file similar to below :download:`example script <./ncov_examples.tsv>`:

 ::

    entity:ncov_examples_id	metadata	sequences	configfile_yaml
+    blank   
    example	gs://COPY_PATH_HERE/example_metadata.tsv	gs://COPY_PATH_HERE/example_datasets/example_sequences.fasta.gz
    example_build		gs://COPY_PATH_HERE/example-build.yaml


Suggestion: either provide just a download link, e.g.

Suggested change

-3. Create a tab delimited file similar to below :download:`example script <./ncov_examples.tsv>`:
-
-::
-
-    entity:ncov_examples_id	metadata	sequences	configfile_yaml
-    blank
-    example	gs://COPY_PATH_HERE/example_metadata.tsv	gs://COPY_PATH_HERE/example_datasets/example_sequences.fasta.gz
-    example_build		gs://COPY_PATH_HERE/example-build.yaml
+3. Download :download:`ncov_examples.tsv <./ncov_examples.tsv>`.

or just the text block (there is a handy copy-to-clipboard button on the rendered docs page). If keeping the text block, make sure it's using tabs and not spaces.
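If the text block is kept, generating it programmatically is one way to guarantee tabs rather than spaces. The sketch below reuses the column names and placeholder paths from the diff. Placing example-build.yaml in the configfile_yaml column and padding empty cells with trailing tabs are assumptions about the intended layout, and `render_data_table` is a hypothetical helper.

```python
import csv
import io

HEADER = ["entity:ncov_examples_id", "metadata", "sequences", "configfile_yaml"]

# Empty strings mark cells that are intentionally left blank.
ROWS = [
    ["blank", "", "", ""],
    [
        "example",
        "gs://COPY_PATH_HERE/example_metadata.tsv",
        "gs://COPY_PATH_HERE/example_datasets/example_sequences.fasta.gz",
        "",
    ],
    # Assumption: the build config belongs in the configfile_yaml column.
    ["example_build", "", "", "gs://COPY_PATH_HERE/example-build.yaml"],
]


def render_data_table(header: list[str], rows: list[list[str]]) -> str:
    """Render rows as tab-separated text, rejecting rows of the wrong width."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    for row in rows:
        if len(row) != len(header):
            raise ValueError(f"expected {len(header)} columns, got {len(row)}: {row}")
        writer.writerow(row)
    return buffer.getvalue()


if __name__ == "__main__":
    print(render_data_table(HEADER, ROWS), end="")
```

Writing every row at the full header width keeps each value under the intended column even when earlier cells are empty.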

victorlin · 2022-12-28T22:48:35Z

docs/src/guides/run-analysis-on-terra.rst

+  +-----------------+-----------------------+--------+--------------------------------+
+
+13. Click on the **OUTPUTS** tab
+14. Connect your generated output back to the data table, but filling in values:


Looks like these are the default values. Suggestion:

Suggested change

-14. Connect your generated output back to the data table, but filling in values:
+14. Click **Use defaults** next to **Attributes** (the last column).

and remove the markdown table below.

victorlin · 2022-12-28T22:57:05Z

docs/src/guides/run-analysis-on-terra.rst


 ::

    entity:ncov_examples_id	metadata	sequences	configfile_yaml
+    blank   


I don't understand how this links the ingest workflow output (workspace data files?) to this workflow. Are the gs:// paths still necessary here?

victorlin · 2022-12-28T23:06:20Z

docs/src/guides/run-analysis-on-terra.rst

    example	gs://COPY_PATH_HERE/example_metadata.tsv	gs://COPY_PATH_HERE/example_datasets/example_sequences.fasta.gz
    example_build		gs://COPY_PATH_HERE/example-build.yaml


Suggestion: Add instructions on how to populate these paths. My assumptions:

configfile_yaml: This should reference a workflow config file uploaded to workspace data files. The gs:// path can be copied from links on the Workspace Data Files view.

metadata: This should reference a metadata TSV file. (How do you reference the ingest workflow result, is it something like workspace.genbank_metadata_tsv?)

sequences: This should reference a sequences FASTA file.
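One way to populate the paths, assuming the gs:// prefix is copied from the Workspace Data Files view as guessed above, is a plain substitution of the COPY_PATH_HERE placeholder. The bucket name in the demo is hypothetical, and `fill_bucket_path` is an illustrative helper rather than part of the workflow.

```python
PLACEHOLDER = "gs://COPY_PATH_HERE"


def fill_bucket_path(table_text: str, bucket_url: str) -> str:
    """Replace the placeholder prefix in a data table with a real bucket URL."""
    if not bucket_url.startswith("gs://"):
        raise ValueError("bucket_url must start with gs://")
    # Drop a trailing slash so the result does not contain a double slash.
    return table_text.replace(PLACEHOLDER, bucket_url.rstrip("/"))


if __name__ == "__main__":
    row = "example_build\t\t\tgs://COPY_PATH_HERE/example-build.yaml"
    # Hypothetical bucket; use the path copied from your own workspace.
    print(fill_bucket_path(row, "gs://my-workspace-bucket"))
```

This covers only uploaded files; how to reference ingest workflow outputs remains an open question in this thread.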

j23414 force-pushed the wdl/docs branch 3 times, most recently from 5f805fe to c81a837 Compare September 12, 2022 17:08

huddlej reviewed Sep 12, 2022

docs/src/guides/run-ingest-on-terra.rst

huddlej reviewed Sep 12, 2022

docs/src/guides/run-ingest-on-terra.rst

huddlej reviewed Sep 12, 2022

docs/src/guides/run-analysis-on-terra.rst

victorlin reviewed Sep 12, 2022

docs/src/guides/run-ingest-on-terra.rst

docs/src/guides/run-ingest-on-terra.rst

docs/src/guides/run-ingest-on-terra.rst

huddlej reviewed Sep 12, 2022

docs/src/guides/run-analysis-on-terra.rst

joverlee521 reviewed Sep 12, 2022

docs/src/guides/run-ingest-on-terra.rst

joverlee521 reviewed Sep 12, 2022

docs/src/guides/run-ingest-on-terra.rst

j23414 force-pushed the wdl/docs branch 3 times, most recently from 9e6e730 to 8ebd207 Compare September 12, 2022 22:41

This was referenced Sep 26, 2022

feat: WDL overhaul for Dockstore and Terra #1005

Open

feat: WDL Script Overhaul and Squash #1006

Merged

j23414 force-pushed the wdl/docs branch from 8ebd207 to d4e41f8 Compare September 28, 2022 20:24

victorlin reviewed Oct 27, 2022

docs/src/guides/run-ingest-on-terra.rst

docs/src/guides/run-ingest-on-terra.rst

j23414 mentioned this pull request Oct 27, 2022

fix: select_first was called with 1 empty values #1024

Merged

j23414 force-pushed the wdl/docs branch 5 times, most recently from 66193bf to 5cf5546 Compare November 3, 2022 22:10

j23414 force-pushed the wdl/docs branch from 5cf5546 to 721f4a0 Compare November 3, 2022 22:25

wdl docs: split run ingest into genbank and gisaid

fcb4ad6

j23414 force-pushed the wdl/docs branch from 8f3d59b to fcb4ad6 Compare December 6, 2022 21:03

victorlin reviewed Dec 7, 2022

victorlin reviewed Dec 27, 2022

Jennifer Chang and others added 8 commits December 27, 2022 12:36

Update docs/src/guides/run-genbank-ingest-on-terra.rst

9cc72df

Co-authored-by: Victor Lin <13424970+victorlin@users.noreply.github.com>

Keep link URLs inline

c91612e

Keep link URLs inline rather than referenced separately at the bottom, unless it's used multiple times. #999 (comment)

Add a note that this is a one-time "setup" per user

c4dd568

Add a note that this is a one-time "setup" per user. Users will want to run the workflow more than once, but it's not necessary to follow the initial steps more than once.

Add warning on runtime and cost

364f5c3

Preface Terra ingest instructions with a warning on run time and cost, so users know what to expect and can assess if they have the time/budget to follow these steps

Update Terra task names

f903b70

Move api instructions to a warning block

5d4fc05

The GISAID Ingest on Terra requires the user to have access to an API endpoint. Direct them to the GISAID webpage.

Provide an example data table file

7410bc5

Provide an example Data Table tsv file to upload to Terra.

formatting changes

9b572ac

victorlin reviewed Dec 28, 2022

j23414 marked this pull request as draft April 14, 2023 19:14
