Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Allow a list of sequence exclusion files, instead of just one #977

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

sacundim
Copy link

The files.exclude parameter to the workflow only allows a single sequence exclusion file per build, which means that if I want to use a custom exclusion file in my custom build, I have to either:

  1. Forego all the excellent work that's gone into the defaults/exclude.txt;
  2. Manually combine that file and my own excludes, and merge upstream changes thereafter

Proposed feature: the workflow should allow the files.exclude parameter to be a list, in which case it'll just pass all of the listed files to augur filter --exclude. The configuration would look for example like this:

files:
  auspice_config: "puerto-rico_profiles/puerto-rico_open/puerto-rico_auspice_config.json"
  description: "puerto-rico_profiles/puerto-rico_open/puerto-rico_description.md"
  exclude:
    - "defaults/exclude.txt"
    - "puerto-rico_profiles/puerto-rico_open/exclude.txt"

I prototyped this in my own custom workflows and verified that it is indeed using both exclude files:

Some logs that verify that it worked:

augur filter \
            --metadata nextstrain-data/files/ncov/open/metadata.tsv.gz \
            --include defaults/include.txt \
            --exclude defaults/exclude.txt puerto-rico_profiles/puerto-rico_open/exclude.txt \
            --min-date 6M \
            --query 'country != '"'"'USA'"'"'' \
            --group-by region year month \
            --subsample-max-sequences 800 \
            --output-strains results/puerto-rico/sample-global_late.txt \
    2>&1 \| tee logs/subsample_puerto-rico_global_late.txt
 
Sampling at 2 per group.
5176893 strains were dropped during filtering
641 of these were dropped because they were in defaults/exclude.txt
2639423 of these were filtered out by the query: "(country == 'USA' & division != 'Puerto Rico')"
1867961 of these were dropped because they were earlier than 2022.03 or missing a date
614 of these were dropped because they were in puerto-rico_profiles/puerto-rico_open/exclude.txt
355 were dropped during grouping due to ambiguous month information
2 were dropped during grouping due to ambiguous year information
2 strains were added back because they were in defaults/include.txt
667899 of these were dropped because of subsampling criteria
656 strains passed all filters

…, but optionally a list of files, that will all be passed to `augur filter --exclude`.
Copy link
Member

@victorlin victorlin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This feature sounds reasonable to match the capabilities of augur filter, especially since a list of inclusion files is already allowed in the workflow:

include = config["files"]["include"],
exclude = _collect_exclusion_files,

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Status: In Review
Development

Successfully merging this pull request may close these issues.

None yet

2 participants