New ingest one snakefile #10

j23414 · 2023-06-20T15:24:13Z

Description of proposed changes

Explore using Snakemake modules for ingesting data into the Nextstrain build system.
This PR would merge into the new_ingest branch and continue the dengue ingest work.

The running behavior is:

If data directory exists, proceed with build as normal.
If s3_src is pointing to a pre-cleaned dengue dataset, fetch dataset and proceed as normal.
Otherwise, fetch dataset using ingest, clean it, and proceed as normal.

In order to implement the above, we are using Snakemake modules¹
which allows us to define a prefix for the ingest input/output paths.
However, these do not get applied to the scripts called within shelL .
So we are using this workaround² to add a base dir to the scripts.

¹ https://snakemake.readthedocs.io/en/stable/snakefiles/modularization.html#modules
² https://stackoverflow.com/a/66890412

Testing

Checks pass

Can run a manual check with the following commands:

git clone https://github.com/nextstrain/dengue.git
cd dengue
git checkout new_ingest_one_snakefile 

# Runs both ingest and build
nextstrain build .

Although the following should still work too:

cd ingest
rm -rf data logs  # cleanup any previously generated files if there
nextstrain build .

Future commits will change this to work with Dengue data

If pathogen is not listed in nextclade_data, remove nextclade rules and scripts until it is added. https://github.com/nextstrain/nextclade_data/tree/release/data/datasets

If a script does not need to be modified for a pathogen ingest, pull script during runtime instead of maintaining a potentially diverging copy. Use a permalink for each script to allow us to version the software we use in this workflow without being affected by upstream changes until we want to bump the version. This design adds more maintenance to this workflow, but it also protects users against unexpected issues that are outside of their control. Discussed in nextstrain/ebola#6 (comment) However, this commit gets reverted later based on discussions.

Pick curl instead of detecting curl/wget as discussed in: nextstrain/ebola#6 (comment)

Dengue requires special handling because it has multiple serotypes. Added dengue serotypes: all, denv1, denv2, denv3 and denv4 Co-authored-by: Jover Lee <joverlee521@gmail.com>

The dengue genome is approximately 10k or 11k. Therefore, we can filter out any sequences that are less than 5k or greater than 15k. A list of added GenBank filters is below: * Pull sequences longer than 5k but less than 15k * Only pull VRL (viral) datasets (no PAT or patents) * Pull UpdateDate_dt entry to potentially only pull "recent data sets" in case the dataset gets too large Co-authored-by: Jover Lee <joverlee521@gmail.com>

Since strain name may be in Isolate_s or Strain_s, we need to check both columns for a reasonable strain name. Dengue virus types denv1 to 4 can be derived if their NCBI taxon IDs are listed in ViralLineage_IDs. * derive strain name from Strain_s if Isolate_s is blank * derive denv1 to 4 depending on ViralLineage_IDs

* update help statement * make --outfile required * simplify reordering output columns * nuanced viruslineage_ids processing * when multiple paper urls, pick one * 'strain' and 'strain_s' were populated by 'Isolate_s' and 'Strain_s' pulled from genbank_url The following was added after discussion with trs Check for the non-"happy path" cases first and then return early (or erroring early, as the case may be). This leaves the "happy path" (or "expected path") as the remainder of the function. * return early if publications is empty Co-authored-by: Thomas Sibley <tom@zulutango.org>

Search for valid strain name in the following order: 'strain', 'strain_s', 'accession'. Move the order into configs instead of hardcoding it in the post_process_metadata.py script.

Co-authored-by: Thomas Sibley <tom@zulutango.org>

Since some strains (or isolates) may be resequenced resulting in duplicate strain names in the dengue dataset, index entries by GenBank Accession IDs.

Could not find genbank accession from GenBank or prior sequences.fasta.zst files.

Compromise by duplicating scripts from monkeypox until a generalized pathogen repository exists or these scripts get enfolded into an augur subcommands

Since fetch_from_genbank can query NCBI up to 5 times for each of the serotypes, try to limit concurrent queries to under 3. Using 2 to be cautious. Following the format shown at: nextstrain/ncov#1045

Since align may be running in 5 parallel jobs (all, denv1, denv2, denv3 denv4), reverted this rule to original code of using 1 thread. However, added a threads parameter in the align rule so that this is easy to modify.

To simplify the workflow, instead of post processing metadata to clean up strain names and set dengue serotype based on virus lineage ID after the transform step, incorporate post processing directly into the transform step. This step was moved above any manual annotations. This also simplified the code so we were not having two code blocks determining the final metadata columns which may have become inconsistent.

…n build system The running behavior is: * If `data` directory exists, proceed with build as normal. * If `s3_src` is pointing to a pre-cleaned dengue dataset, fetch dataset and proceed as normal. * Otherwise, fetch dataset using ingest, clean it, and proceed as normal. In order to implement the above, we are using Snakemake modules¹ which allows us to define a `prefix` for the ingest input/output paths. However, these do not get applied to the scripts called within `shelL` . So we are using this workaround² to add a base dir to the scripts. ¹ https://snakemake.readthedocs.io/en/stable/snakefiles/modularization.html#modules ² https://stackoverflow.com/a/66890412 Co-authored-by: Jover Lee <joverlee521@gmail.com>

Changes copied from ncov-ingest repo: nextstrain/ncov-ingest@3dbf473

Only use the default configfiles if no configs are provided. This allows users to override the default configs with Snakemake commandline options.

Use the config to set the ingest basedir so that we're not using some arbitrary variable.

This commit modifies the configurations and snakemake rules to treat geolocation-rules.tsv and annotations.tsv as input files instead of strings. This change addresses a warning message that occurs when using relative file paths starting with './'. The warning suggests omitting the './' to avoid redundancy and ensure consistent file matching in Snakemake. ``` Relative file path './source-data/geolocation-rules.tsv' starts with './'. This is redundant and strongly discouraged. It can also lead to inconsistent results of the file-matching approach used by Snakemake. You can simply omit the './' for relative file paths. ``` Furthermore, by treating source-data/annotations.tsv as a file, it provides hints to snakemake modules to add a prefix, such as "ingest/source-data/", if necessary. This modification aligns with the changes made in the following pull request: nextstrain/dengue#10

jameshadfield · 2023-06-26T21:48:59Z

Snakefile


 serotypes = ['all', 'denv1', 'denv2', 'denv3', 'denv4']

 rule all:
    input:
        auspice_json = expand("auspice/dengue_{serotype}.json", serotype=serotypes)

+module ingest_workflow:


I haven't used modules myself, so I'm interested to know how you and @joverlee521 found them to work with - especially related to debugging / developing?

As a counterexample, I've been playing with an HBV build which uses our convention of separating out ingest data and code into ./ingest/, but still runs everything from a top-level ./Snakemake (which calls include: "ingest/ingest.smk").

The advantage of modules (as used in this context) is that they continue to allow ./ingest/ to function in a completely stand-alone way, which has obvious advantages. Perhaps using modules also enforces the separation of concerns between ingest + phylo in a desirable way? The cost is increased complexity. After seeing how complex the ncov snakemake workflow became over the years I'm trying to gauge whether the trade-offs here are worth it.

) This commit modifies the configurations and snakemake rules to treat geolocation-rules.tsv and annotations.tsv as input files instead of strings. This change addresses a warning message that occurs when using relative file paths starting with './'. The warning suggests omitting the './' to avoid redundancy and ensure consistent file matching in Snakemake. ``` Relative file path './source-data/geolocation-rules.tsv' starts with './'. This is redundant and strongly discouraged. It can also lead to inconsistent results of the file-matching approach used by Snakemake. You can simply omit the './' for relative file paths. ``` Furthermore, by treating source-data/annotations.tsv as a file, it provides hints to snakemake modules to add a prefix, such as "ingest/source-data/", if necessary. This modification aligns with the changes made in the following pull request: nextstrain/dengue#10

j23414 · 2024-05-23T22:19:04Z

The purpose of this was some early exploration of dafdd12 to support placing workflows in subdirectories.

Closing this, since we ended up going a different direction: https://github.com/nextstrain/cli/releases/tag/8.2.0.

j23414 and others added 30 commits June 6, 2023 05:46

Ingest: Copy ingest from monkeypox repo

9be3cc9

Future commits will change this to work with Dengue data

Ingest: Ignore ingest cache directories

d91180e

Ingest: Remove Nextclade

0700db2

If pathogen is not listed in nextclade_data, remove nextclade rules and scripts until it is added. https://github.com/nextstrain/nextclade_data/tree/release/data/datasets

fix: Use curl for downloading files

af25286

Pick curl instead of detecting curl/wget as discussed in: nextstrain/ebola#6 (comment)

Ingest: Replace monkeypox text and parameters with dengue

1400d2f

Dengue-specific-ingest: Add dengue serotype wildcards

9ada261

Dengue requires special handling because it has multiple serotypes. Added dengue serotypes: all, denv1, denv2, denv3 and denv4 Co-authored-by: Jover Lee <joverlee521@gmail.com>

[ingest] Simplify finding strain name

938feaf

Search for valid strain name in the following order: 'strain', 'strain_s', 'accession'. Move the order into configs instead of hardcoding it in the post_process_metadata.py script.

zstd compress output files

f7bc33b

fix: makes the compress rule more generic

3fccb81

Co-authored-by: Thomas Sibley <tom@zulutango.org>

Build: Index by genbank accession instead of duplicate strain names

7ecc29a

Since some strains (or isolates) may be resequenced resulting in duplicate strain names in the dengue dataset, index entries by GenBank Accession IDs.

fix: remove entries where accession is not found

d6eadd0

Could not find genbank accession from GenBank or prior sequences.fasta.zst files.

Ingest: Compromise by duplicating scripts

6c21a39

Compromise by duplicating scripts from monkeypox until a generalized pathogen repository exists or these scripts get enfolded into an augur subcommands

Ingest: Replace monkeypox text and parameters with dengue in scripts

360b383

Ingest: Compromise by allowing redundant data pull by serotype

b724b96

[wip] attempt at limiting concurrent deploys

b8f3b8d

Since fetch_from_genbank can query NCBI up to 5 times for each of the serotypes, try to limit concurrent queries to under 3. Using 2 to be cautious. Following the format shown at: nextstrain/ncov#1045

Build: parameterize threads in align rule

755d287

Since align may be running in 5 parallel jobs (all, denv1, denv2, denv3 denv4), reverted this rule to original code of using 1 thread. However, added a threads parameter in the align rule so that this is easy to modify.

docs: Add documentation on running ingest

a78ac90

fix: wildcards paired with optional.yaml

4e28f33

cleanup some unused metadata columns

4823f72

mark temp intermediate files

247b2fd

fix: annotations is a file, not a param

d008d5a

refactor: parameterize data s3 source

77d1a07

ingest: add zstd support

8d3ef7d

Changes copied from ncov-ingest repo: nextstrain/ncov-ingest@3dbf473

Snakefile: allow configfile overrides

351b3ec

Only use the default configfiles if no configs are provided. This allows users to override the default configs with Snakemake commandline options.

Update workaround for ingest shell

dafdd12

Use the config to set the ingest basedir so that we're not using some arbitrary variable.

j23414 force-pushed the new_ingest_one_snakefile branch from bca3415 to dafdd12 Compare June 20, 2023 15:54

j23414 mentioned this pull request Jun 23, 2023

fix: Treat geolocation-rules.tsv and annotations.tsv as input files nextstrain/rsv#28

Merged

1 task

jameshadfield reviewed Jun 26, 2023

View reviewed changes

j23414 marked this pull request as draft August 15, 2023 18:52

j23414 added the wontfix This will not be worked on label Aug 15, 2023

j23414 force-pushed the new_ingest branch from a6df20f to 9dbccc4 Compare August 19, 2023 01:43

j23414 closed this May 23, 2024

j23414 deleted the new_ingest_one_snakefile branch May 23, 2024 22:20

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

New ingest one snakefile #10

New ingest one snakefile #10

j23414 commented Jun 20, 2023 •

edited

jameshadfield Jun 26, 2023

j23414 commented May 23, 2024

New ingest one snakefile #10

New ingest one snakefile #10

Conversation

j23414 commented Jun 20, 2023 • edited

Description of proposed changes

Testing

jameshadfield Jun 26, 2023

Choose a reason for hiding this comment

j23414 commented May 23, 2024

j23414 commented Jun 20, 2023 •

edited