New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Update: TERRA Workflow Documentation #999
base: master
Are you sure you want to change the base?
Conversation
5f805fe
to
c81a837
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't run through the steps, but added some suggestions for wording/styling.
@@ -64,8 +70,36 @@ Connect your data files to the WDL workflow | |||
|Nextstrain_WRKFLW| sequence_fasta | File | this.sequences | | |||
+-----------------+------------------+-------+----------------------+ | |||
|
|||
10. Click on the **OUTPUTS** tab | |||
11. Connect your generated output back to the data table, but filling in values: | |||
10. If creating a build with multiple sequence and metadata files, can upload a targz folder containing the files. Otherwise skip |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like a really cool feature! What if you included an example here of what the tar archive's structure should be, so people have a sense of how to prepare this file prior to upload? The expression "targz folder" makes me think that I would need to place my files in a directory and then tar and gzip that directory. Is that correct?
We should also figure out how we want to consistently refer to compressed tar archives throughout the docs. I know what you mean by "targz folder" but we sometimes use "tarball" or "tar archive", etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the feedback! And yes, that is correct!
I'm thinking through how to add an example of the tarball structure. Originally, the tarball contextual sequences feature was added by this request.
I'll probably add it as a part 2... maybe something like:
Now that you've worked through the minimal case, here is an example of adding contextual sequences and a configfile in a tarball. The structure of the directory must be ...
I'll think through it a bit more
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we move:
Now that you've worked through the minimal case, here is an example of adding contextual sequences and a configfile in a tarball. The structure of the directory must be ...
As something to-be-done in a future PR?
9e6e730
to
8ebd207
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still walking through this as noted on Slack. Sending in some small comments for now.
66193bf
to
5cf5546
Compare
Document 3 WDL workflows to be run on Terra: 1. ncov/ncov - run the basic ncov workflow 2. ncov/genbank_ingest - pull a public dataset and send them through our preprocessing scripts. 3. ncov/gisaid_ingest - pull a private dataset if a user has their own API key, account, and password. Mostly to make available our preprocessing scripts. The workflows are separated so that only parameters specific to a particular usecase are shown in Terra. Apply suggestions from code review related to wording and spelling. * styling fixes * update docs to reflect observational differences Co-authored-by: Victor Lin <13424970+victorlin@users.noreply.github.com> Co-authored-by: Jover Lee <joverlee521@gmail.com>
guides/run-genbank-ingest-on-terra | ||
guides/run-gisaid-ingest-on-terra |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion: have a parent page Ingest SARS-CoV-2 Data on Terra which contains the intro text that is currently duplicated. You can then link to the sub-pages using a .. toctree::
on that page, and the hierarchy should also be reflected in the main sidebar.
URLs might look like:
projects/ncov/en/latest/guides/ingest-data-on-terra
projects/ncov/en/latest/guides/ingest-data-on-terra/genbank
projects/ncov/en/latest/guides/ingest-data-on-terra/gisaid
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding some comments from a new look at these docs, and another pass through the second section of the GenBank ingest page.
Connect any workspace variables to the wdl ingest workflow | ||
=========================================================== |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The steps in this section work, which is great! But, as someone who is unfamiliar with Terra, I don't really understand what I'm doing here. Some lingering questions:
- What does "root entity type" mean, and what is
ncov_examples
? -
Select the 1st row in the data table. The first column should have value
blank
. Selecting more rows will cause the workflow to run more than once.- This works, but seems hacky. What does
blank
mean? - Why are there other rows available if we aren't selecting them? Are those for other use cases?
- This works, but seems hacky. What does
- For the OUTPUTS tab:
- Why can't I just use the default values
this.*
? - What's the significance of using
workspace.*
? - What about the
last_run
variable? Should this be set for proper caching?
- Why can't I just use the default values
- Once submitted, there is a table with column "Data Entity" and the submission has the value "blank (ncov_examples)". What does this mean? I assume understanding (1) and (2) will help here, but I also wonder if the value can be more descriptive (e.g.
default
instead ofblank
).
4. Under **Step 2**: | ||
|
||
1. Click **SELECT DATA**. | ||
2. Select **Choose specific ncov_examples to process**. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This shows on Terra as "Choose specific ncov_exampless to process". It looks weird, can it be renamed so there is only one plural s
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, would ncov_example_set
work?
That way, Terra will display "Choose specific ncov_example_sets to process". I'm open to other suggestions.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand what a "set" is in Terra, but I'm not sure we want to be using the _set
suffix. I noticed that most data tables created in our Development workspace are accompanied by another table with the same name suffixed with _set
. This other table might be auto-created when data is selected in the process of running a workflow.
I just created a data table victorlin_test_set
and noticed that it generated an additional column victorlin_tests
. This doesn't happen without the _set
suffix.
There's a doc page on set tables which seems relevant.
Import the GenBank ingest wdl workflow from Dockstore | ||
====================================================== |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can the section titles be simplified to "Import the workflow" and "Run the workflow"?
Co-authored-by: Victor Lin <13424970+victorlin@users.noreply.github.com>
Keep link URLs inline rather than referenced separately at the bottom, unless it's used multiple times. #999 (comment)
Add a note that this is a one-time "setup" per user. Users will want to run the workflow more than once, but it's not necessary to follow the initial steps more than once.
Preface Terra ingest instructions with a warning on run time and cost, so users know what to expect and can assess if they have the time/budget to follow these steps
The GISAID Ingest on Terra requires the user to have access to an API endpoint. Direct them to the GISAID webpage.
Provide an example Data Table tsv file to upload to Terra.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some comments from my first pass through the run-analysis-on-terra
page.
2. Click on the radio button **Run workflow(s) with inputs defined by data table**. | ||
3. Under **Step 1**: | ||
|
||
1. Select root entity type as **ncov_examples** from the drop down menu. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this the ncov_examples data table that is created in the run-analysis-on-terra
guide? If so, this creates a chicken-egg problem – I'd imagine users would want to run ingest first, followed by the workflow that produces builds.
Suggestion: Put the data table creation in a new terra-workspace-setup
page which would then be a pre-requisite for all the guides here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd imagine users would want to run ingest first, followed by the workflow that produces builds.
We need to clarify this. @huddlej? I'm expecting users to first encounter "run-analysis-on-terra" similar to our SARS-CoV-2 Workflow documentation, where they upload example metadata and sequences files.
After which, the user learns how to pull and include their own context sequences (equivalent to run-XXX-ingest-on-terra
:
But this is a good point, I will reword the beginning of this tutorial to explicitly recommend tutorial order. I think we (or maybe just me) have gotten confused because this tutorial was also trying to address a very specific need (where that individual group would need ingest
, then ncov
).
|
||
Connect your data files to the WDL workflow | ||
=========================================== | ||
|
||
1. On the **DATA** tab, click on **+** next to the **TABLES** section to create a Data Table | ||
#. Download the "sample_template.tsv" file | ||
#. Create a tab delimited file similar to below: | ||
2. Download the "sample_template.tsv" file |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
2. Download the "sample_template.tsv" file |
This file is unused, and there is already a link to download another sample file that is more appropriate for this guide.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, agree. I only recently added the "download another sample file" in 7410bc5 but should have made the earlier instructions a note (or link to Terra docs).
Thanks for the feedback! I probably need to link and/or summarize these two Terra concepts:
3. Create a tab delimited file similar to below :download:`example script <./ncov_examples.tsv>`: | ||
|
||
:: | ||
|
||
entity:ncov_examples_id metadata sequences configfile_yaml | ||
blank | ||
example gs://COPY_PATH_HERE/example_metadata.tsv gs://COPY_PATH_HERE/example_datasets/example_sequences.fasta.gz | ||
example_build gs://COPY_PATH_HERE/example-build.yaml |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion: either provide just a download link, e.g.
3. Create a tab delimited file similar to below :download:`example script <./ncov_examples.tsv>`: | |
:: | |
entity:ncov_examples_id metadata sequences configfile_yaml | |
blank | |
example gs://COPY_PATH_HERE/example_metadata.tsv gs://COPY_PATH_HERE/example_datasets/example_sequences.fasta.gz | |
example_build gs://COPY_PATH_HERE/example-build.yaml | |
3. Download :download:`ncov_examples.tsv <./ncov_examples.tsv>`. |
or just the text block (there is a handy copy-to-clipboard button on the rendered docs page). If keeping the text block, make sure it's using tabs and not spaces.
+-----------------+-----------------------+--------+--------------------------------+ | ||
|
||
13. Click on the **OUTPUTS** tab | ||
14. Connect your generated output back to the data table, but filling in values: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like these are the default values. Suggestion:
14. Connect your generated output back to the data table, but filling in values: | |
14. Click **Use defaults** next to **Attributes** (the last column). |
and remove the markdown table below.
|
||
:: | ||
|
||
entity:ncov_examples_id metadata sequences configfile_yaml | ||
blank |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand how this links the ingest workflow output (workspace data files?) to this workflow. Are the gs://
paths still necessary here?
example gs://COPY_PATH_HERE/example_metadata.tsv gs://COPY_PATH_HERE/example_datasets/example_sequences.fasta.gz | ||
example_build gs://COPY_PATH_HERE/example-build.yaml |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggestion: Add instructions on how to populate these paths. My assumptions:
configfile_yaml
: This should reference a workflow config file uploaded to workspace data files. Thegs://
path can be copied from links on the Workspace Data Files view.metadata
: This should reference a metadata TSV file. (How do you reference the ingest workflow result, is it something likeworkspace.genbank_metadata_tsv
?)sequences
: This should reference a sequences FASTA file.
Description of proposed changes
Since occasional questions about our Terra workflows have been raised during office hours or via DM's, the goal is to update documentation of 3 WDL Terra workflows:
The workflows are separated so that only parameters specific to a particular usecase are shown in Terra.
Related issue(s)
Related to #https://github.com/nextstrain/private/issues/11
Testing
Since this PR should only change the rst files (documentation), testing mainly consist of ensuring that the documents build. Any problematic steps are being fixed if/when they arise with each onboarded user. Comments or suggestions are still welcome though.
Linking the changed pages for ease of proof-reading:
Or the live draft pages: