Implement subsampling via a script #711

Open · jameshadfield wants to merge 3 commits into master

jameshadfield (Member) commented Aug 23, 2021

Description of proposed changes

This implements subsampling via a single script, rather than a series of snakemake commands. The behavior is intended to be the same, and the subsampling definitions (in builds.yaml) have not changed. This script was initially developed as augur subsample in nextstrain/augur#762; however, we have chosen to move development into this repo so we can also improve the proximity/priority calculations. The eventual aim is for this functionality to be part of augur.

Verbosity
The subsample script contains overly verbose print statements. Some of these should be removed before merge, but they may be helpful for review.

Subsampling syntax
We introduce a new rule, extract_subsampling_scheme, which takes the subsampling definition (in builds.yaml), applies wildcard expansion, and then transforms this into a syntax more in line with how augur filter is called. This rule replaces a number of functions formerly in the snakefiles. As an example:

```yaml
## builds.yaml
subsampling:
  custom-scheme:
    sampleName:
      exclude: "--exclude-where 'aus!=yes'"
      group_by: year month
      seq_per_group: 5

# output from rule extract_subsampling_scheme
sampleName:
  exclude-where:
  - aus!=yes
  group-by:
  - year
  - month
  sequences-per-group: 5
```

Going forward, I suggest slowly updating our nCoV subsampling definitions to the latter format. This syntax is more in line with the rest of the augur ecosystem and avoids the sticking point where some values need to specify the --argument and some don't (this has tripped people up). It is also easier, I think, to reason about YAML arrays than about strings which are then coerced into arrays.
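To make the transformation concrete, here is a minimal sketch of how an old-style sample definition (argument strings) could be normalised into the structured syntax shown above. This is illustrative only, not the actual rule implementation; the key names handled (`exclude`, `group_by`, `seq_per_group`) are assumptions based on the example above.

```python
import shlex

def normalise_sample(sample: dict) -> dict:
    """Convert an old-style subsampling sample definition into the
    structured syntax emitted by extract_subsampling_scheme (sketch)."""
    out = {}
    for key, value in sample.items():
        if key == "exclude":
            # "--exclude-where 'aus!=yes'" -> {"exclude-where": ["aus!=yes"]}
            tokens = shlex.split(value)
            assert tokens[0] == "--exclude-where"
            out["exclude-where"] = tokens[1:]
        elif key == "group_by":
            # "year month" -> ["year", "month"]
            out["group-by"] = value.split()
        elif key == "seq_per_group":
            out["sequences-per-group"] = int(value)
        else:
            out[key] = value
    return out

old = {
    "exclude": "--exclude-where 'aus!=yes'",
    "group_by": "year month",
    "seq_per_group": 5,
}
print(normalise_sample(old))
```

Note how `shlex.split` handles the quoting inside the old-style string, which is exactly the coercion step the new syntax makes unnecessary.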

DAG simplification

By essentially bringing the complexity into the subsample script (eventually to be augur subsample), we have a simpler snakemake DAG, with far fewer rules, which should be easier to port to other workflow languages. The DAG for subsampling now only contains two steps per build: extract_subsampling_scheme (see above) and subsample.

[Figure: DAG for Nextstrain Open (GenBank) builds. Top: current master; bottom: this PR]

Future improvements

  • Instead of using `augur filter`'s `run()` function, we could refactor it slightly to expose a function which returns the strain list for inclusion, as well as logging data etc. Currently the `run()` function writes data to disk which we immediately read back in.
  • Similarly, this work doesn't change the behavior of the priorities.py and get_distance_to_focal_set.py scripts. These can be refactored to return data rather than writing to disk.
  • Include / exclude files are defined by arguments to augur subsample, however they could be set within the scheme definition, which would allow per-sample differences. (Breaking change.)
  • Allow a log file to be written explaining why strains were excluded (or included!), leveraging the new logging functionality of augur filter.
  • Expand priorities schema definition so that sample definitions define ignore_seqs, which is currently hardcoded to be "Wuhan/Hu-1/2019". Similarly for proximity / priority config parameters. (Breaking change.)
  • For samples which can be computed in parallel (e.g. those which don’t need priorities), in principle we could refactor the code in augur filter to allow independent sets of (sets of) filters to be applied each time we iterate over a chunk of metadata. This should speed things up quite a bit, but would come at increased code complexity. I don’t think this is immediately necessary.
  • Proximity calculations can be slow for large datasets. We can speed this up by prefiltering based on certain settings: for instance, if we are computing a sample with a --min-date and proximity to another sample, we can do this in two stages so that the proximity calculations are run against a set pre-filtered via `--min-date`.
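The two-stage prefiltering idea in the last bullet can be sketched as follows. All names here are hypothetical stand-ins (the real proximity computation lives in get_distance_to_focal_set.py); the point is only that the cheap metadata filter runs first, so the expensive per-strain computation sees a much smaller candidate set.

```python
from datetime import date

def expensive_proximity(strain: str, focal_set: set) -> float:
    # Stand-in for the real (slow) genetic-distance computation.
    return 0.0 if strain in focal_set else 1.0

def subsample_two_stage(metadata: dict, min_date: date, focal_set: set, n: int) -> list:
    # Stage 1: cheap prefilter on metadata alone (the --min-date analogue).
    candidates = [s for s, d in metadata.items() if d >= min_date]
    # Stage 2: run the expensive proximity ranking only on survivors.
    ranked = sorted(candidates, key=lambda s: expensive_proximity(s, focal_set))
    return ranked[:n]

metadata = {"A": date(2021, 1, 1), "B": date(2020, 1, 1), "C": date(2021, 6, 1)}
print(subsample_two_stage(metadata, date(2021, 1, 1), {"C"}, 2))  # → ['C', 'A']
```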

Related issue(s)

Related to nextstrain/augur#635

Testing

Test builds triggered via GitHub. I will update this section when they complete. Update: the initial builds failed; the PR and the AWS info below have been updated.

```sh
## open
nextstrain build --aws-batch --attach 8008fccf-24de-4d64-86c8-d5e0a887cc54 .
```
https://nextstrain.org/staging/ncov/open/trial/15426a2dbfbb0739560c431c1db1da7d0f60184b/{global, ...}

AWS batch console link

Release checklist

This PR should not introduce any backwards incompatible changes

  • Update docs/change_log.md in this pull request to document these changes by the date they were added.

This is in preparation for a subsample script to call these
Python functions directly, as opposed to scripts called from the
command line. In time, this script will become `augur subsample`.

The work done by the functions is unchanged: they still don't
return any information; rather, they write it to disk. Future
work will change this so that information can flow back to the calling
program without needing to be written to disk.

This implements a new script to encapsulate the subsampling logic formerly encoded in snakemake rules. This is in preparation for moving this script to the augur repo, where it will become `augur subsample`. (We have chosen to develop it in the ncov repo for simplicity.)

The script currently uses the same approach as the former snakemake rules, but Python functions are called rather than scripts / augur commands. Briefly, the steps are:

1. A subsampling scheme is provided, parsed, validated, and turned into a simple graph indicating which samples rely on other samples having been computed (i.e. which are needed for priorities).
2. Each sample is computed by calling the `run()` function of `augur filter`.
3. If priorities are needed to compute a sample, they are calculated by calling functions from the two existing scripts (priorities.py and get_distance_to_focal_set.py).
4. The sets of sequences to include in each sample are combined, and the outputs written.
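Step 1 above can be sketched with the standard library's `graphlib`: samples which supply priorities for other samples must be computed first. The scheme structure below is an assumption modelled on the builds.yaml priorities syntax (`priorities: {type: proximity, focus: ...}`), not the script's actual internals.

```python
import graphlib  # Python 3.9+

# Hypothetical two-sample scheme: "contextual" is prioritised by
# proximity to "focal", so "focal" must be computed first.
scheme = {
    "focal": {"group-by": ["division"]},
    "contextual": {"priorities": {"type": "proximity", "focus": "focal"}},
}

sorter = graphlib.TopologicalSorter()
for name, definition in scheme.items():
    deps = []
    if "priorities" in definition:
        # This sample depends on the sample named in "focus".
        deps.append(definition["priorities"]["focus"])
    sorter.add(name, *deps)

order = list(sorter.static_order())
print(order)  # → ['focal', 'contextual']
```

`TopologicalSorter` also raises `CycleError` on circular priority definitions, which is one way the validation mentioned in step 1 could surface bad schemes.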