Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New ingest one snakefile #10

Closed
wants to merge 31 commits into from
Closed

Conversation

j23414
Copy link
Contributor

@j23414 j23414 commented Jun 20, 2023

Description of proposed changes

Explore using Snakemake modules for ingesting data into the Nextstrain build system.
This PR would merge into the new_ingest branch and continue the dengue ingest work.

The running behavior is:

  • If data directory exists, proceed with build as normal.
  • If s3_src is pointing to a pre-cleaned dengue dataset, fetch dataset and proceed as normal.
  • Otherwise, fetch dataset using ingest, clean it, and proceed as normal.

In order to implement the above, we are using Snakemake modules¹
which allows us to define a prefix for the ingest input/output paths.
However, these do not get applied to the scripts called within shelL .
So we are using this workaround² to add a base dir to the scripts.

¹ https://snakemake.readthedocs.io/en/stable/snakefiles/modularization.html#modules
² https://stackoverflow.com/a/66890412

Testing

  • Checks pass

Can run a manual check with the following commands:

git clone https://github.com/nextstrain/dengue.git
cd dengue
git checkout new_ingest_one_snakefile 

# Runs both ingest and build
nextstrain build .

Although the following should still work too:

cd ingest
rm -rf data logs  # cleanup any previously generated files if there
nextstrain build .

j23414 and others added 30 commits June 6, 2023 05:46
Future commits will change this to work with Dengue data
If pathogen is not listed in nextclade_data, remove nextclade rules and
scripts until it is added.

https://github.com/nextstrain/nextclade_data/tree/release/data/datasets
If a script does not need to be modified for a pathogen ingest, pull
script during runtime instead of maintaining a potentially diverging
copy.

Use a permalink for each script to allow us to version the software we
use in this workflow without being affected by upstream changes until
we want to bump the version. This design adds more maintenance to this
workflow, but it also protects users against unexpected issues that are
outside of their control.
Discussed in nextstrain/ebola#6 (comment)

However, this commit gets reverted later based on discussions.
Pick curl instead of detecting curl/wget as discussed in:
nextstrain/ebola#6 (comment)
Dengue requires special handling because it has multiple serotypes.
Added dengue serotypes: all, denv1, denv2, denv3 and denv4

Co-authored-by: Jover Lee <joverlee521@gmail.com>
The dengue genome is approximately 10k or 11k. Therefore, we can filter
out any sequences that are less than 5k or greater than 15k.

A list of added GenBank filters is below:

* Pull sequences longer than 5k but less than 15k
* Only pull VRL (viral) datasets (no PAT or patents)
* Pull UpdateDate_dt entry to potentially only pull "recent data sets"
  in case the dataset gets too large

Co-authored-by: Jover Lee <joverlee521@gmail.com>
Since strain name may be in Isolate_s or Strain_s, we need to check both
columns for a reasonable strain name. Dengue virus types denv1 to 4 can
be derived if their NCBI taxon IDs are listed in ViralLineage_IDs.

* derive strain name from Strain_s if Isolate_s is blank
* derive denv1 to 4 depending on ViralLineage_IDs
* update help statement
* make --outfile required
* simplify reordering output columns
* nuanced viruslineage_ids processing
* when multiple paper urls, pick one
* 'strain' and 'strain_s' were populated by 'Isolate_s' and 'Strain_s'
pulled from genbank_url

The following was added after discussion with trs

Check for the non-"happy path" cases first and then return early (or
erroring early, as the case may be). This leaves the "happy path" (or
"expected path") as the remainder of the function.

* return early if publications is empty

Co-authored-by: Thomas Sibley <tom@zulutango.org>
Search for valid strain name in the following order: 'strain', 'strain_s',
'accession'. Move the order into configs instead of hardcoding it in the
post_process_metadata.py script.
Co-authored-by: Thomas Sibley <tom@zulutango.org>
Since some strains (or isolates) may be resequenced resulting in duplicate
strain names in the dengue dataset, index entries by GenBank Accession IDs.
Could not find genbank accession from GenBank or prior sequences.fasta.zst files.
Compromise by duplicating scripts from monkeypox until a generalized
pathogen repository exists or these scripts get enfolded into an augur
subcommands
Since fetch_from_genbank can query NCBI up to 5 times for each of the serotypes, try to limit concurrent queries to under 3. Using 2 to be cautious.

Following the format shown at:
nextstrain/ncov#1045
Since align may be running in 5 parallel jobs (all, denv1, denv2, denv3
denv4), reverted this rule to original code of using 1 thread. However,
added a threads parameter in the align rule so that this is easy to modify.
To simplify the workflow, instead of post processing metadata to clean up
strain names and set dengue serotype based on virus lineage ID after the
transform step, incorporate post processing directly into the transform step.
This step was moved above any manual annotations. This also simplified the
code so we were not having two code blocks determining the final metadata columns
which may have become inconsistent.
…n build system

The running behavior is:

* If `data` directory exists, proceed with build as normal.
* If `s3_src` is pointing to a pre-cleaned dengue dataset, fetch dataset and proceed as normal.
* Otherwise, fetch dataset using ingest, clean it, and proceed as normal.

In order to implement the above, we are using Snakemake modules¹
which allows us to define a `prefix` for the ingest input/output paths.
However, these do not get applied to the scripts called within `shelL` .
So we are using this workaround² to add a base dir to the scripts.

¹ https://snakemake.readthedocs.io/en/stable/snakefiles/modularization.html#modules
² https://stackoverflow.com/a/66890412

Co-authored-by: Jover Lee <joverlee521@gmail.com>
Changes copied from ncov-ingest repo:
nextstrain/ncov-ingest@3dbf473
Only use the default configfiles if no configs are provided. This allows
users to override the default configs with Snakemake commandline options.
Use the config to set the ingest basedir so that we're not using some
arbitrary variable.
j23414 added a commit to nextstrain/rsv that referenced this pull request Jun 23, 2023
This commit modifies the configurations and snakemake rules to treat geolocation-rules.tsv and annotations.tsv as input files instead of strings. This change addresses a warning message that occurs when using relative file paths starting with './'. The warning suggests omitting the './' to avoid redundancy and ensure consistent file matching in Snakemake.

```
Relative file path './source-data/geolocation-rules.tsv' starts with './'. This is redundant and strongly discouraged. It can also lead to inconsistent results of the file-matching approach used by Snakemake. You can simply omit the './' for relative file paths.
```

Furthermore, by treating source-data/annotations.tsv as a file, it provides hints to snakemake modules to add a prefix, such as "ingest/source-data/", if necessary. This modification aligns with the changes made in the following pull request: nextstrain/dengue#10

serotypes = ['all', 'denv1', 'denv2', 'denv3', 'denv4']

rule all:
input:
auspice_json = expand("auspice/dengue_{serotype}.json", serotype=serotypes)

module ingest_workflow:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't used modules myself, so I'm interested to know how you and @joverlee521 found them to work with - especially related to debugging / developing?

As a counterexample, I've been playing with an HBV build which uses our convention of separating out ingest data and code into ./ingest/, but still runs everything from a top-level ./Snakemake (which calls include: "ingest/ingest.smk").

The advantage of modules (as used in this context) is that they continue to allow ./ingest/ to function in a completely stand-alone way, which has obvious advantages. Perhaps using modules also enforces the separation of concerns between ingest + phylo in a desirable way? The cost is increased complexity. After seeing how complex the ncov snakemake workflow became over the years I'm trying to gauge whether the trade-offs here are worth it.

j23414 added a commit to nextstrain/rsv that referenced this pull request Jul 6, 2023
)

This commit modifies the configurations and snakemake rules to treat geolocation-rules.tsv and annotations.tsv as input files instead of strings. This change addresses a warning message that occurs when using relative file paths starting with './'. The warning suggests omitting the './' to avoid redundancy and ensure consistent file matching in Snakemake.

```
Relative file path './source-data/geolocation-rules.tsv' starts with './'. This is redundant and strongly discouraged. It can also lead to inconsistent results of the file-matching approach used by Snakemake. You can simply omit the './' for relative file paths.
```

Furthermore, by treating source-data/annotations.tsv as a file, it provides hints to snakemake modules to add a prefix, such as "ingest/source-data/", if necessary. This modification aligns with the changes made in the following pull request: nextstrain/dengue#10
@j23414 j23414 marked this pull request as draft August 15, 2023 18:52
@j23414 j23414 added the wontfix This will not be worked on label Aug 15, 2023
@j23414
Copy link
Contributor Author

j23414 commented May 23, 2024

The purpose of this was some early exploration of dafdd12 to support placing workflows in subdirectories.

Closing this, since we ended up going a different direction: https://github.com/nextstrain/cli/releases/tag/8.2.0.

@j23414 j23414 closed this May 23, 2024
@j23414 j23414 deleted the new_ingest_one_snakefile branch May 23, 2024 22:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
wontfix This will not be worked on
Projects
No open projects
Status: Backlog
Development

Successfully merging this pull request may close these issues.

None yet

3 participants