Generate subsampling config with a script #1102

Closed
wants to merge 10 commits into from

Conversation

victorlin
Member

@victorlin victorlin commented Mar 13, 2024

Description of proposed changes

Shift to having one subsampling configuration YAML per build.

Motivations:

  1. The current approach of copy/pasting is getting unwieldy for making additions/modifications to subsampling config.
  2. There have been recent efforts to design a generic subsampling tool + config schema. If we are to use such a tool in this workflow, the {region} templating must be lifted out of Snakemake so that a complete YAML config file can be given to the subsampling tool. Without a script, this would result in even more copy/pasting.
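As a rough illustration of the second motivation, the sketch below expands a `{region}` placeholder outside Snakemake so that each build gets one complete, untemplated config. The template keys, query strings, and function names are hypothetical and are not taken from the workflow's actual schema or from the script in this PR.

```python
# Hypothetical template: the keys and filter strings are illustrative only,
# not the workflow's actual subsampling schema.
TEMPLATE = {
    "focal": {
        "group_by": "division year month",
        "max_sequences": 4000,
        "query": "region == '{region}'",
    },
    "context": {
        "group_by": "country year month",
        "max_sequences": 1000,
        "query": "region != '{region}'",
    },
}


def expand(node, **values):
    """Recursively substitute placeholders such as {region} in every string."""
    if isinstance(node, dict):
        return {key: expand(value, **values) for key, value in node.items()}
    if isinstance(node, list):
        return [expand(item, **values) for item in node]
    if isinstance(node, str):
        return node.format(**values)
    # Numbers, booleans, and None are passed through unchanged.
    return node


def configs_per_build(template, regions):
    """Return one fully expanded config per build, keyed by a build name."""
    return {region.lower(): expand(template, region=region) for region in regions}


configs = configs_per_build(TEMPLATE, ["Africa", "Europe"])
```

Each value in `configs` could then be written to its own YAML file with whichever YAML library the workflow already uses; that step is left out here to keep the sketch dependency-free.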

Checklist

  • Start using augur subsample here?
  • Consider pulling out other duplicated config (builds, traits, frequencies)?
  • Test with trial builds
  • Changelog?
  • Release new version of workflow?

scratch

dropped:

@victorlin victorlin self-assigned this Mar 13, 2024
@trvrb
Member

trvrb commented Mar 21, 2024

I'm afraid I'm confused as to the push here. If you were to implement weighted subsampling in augur (per nextstrain/augur#1318 (comment)), then we could take the existing YAML files like https://github.com/nextstrain/ncov/blob/master/nextstrain_profiles/nextstrain-gisaid/builds.yaml#L672 and slim them down considerably. I find the pattern of script generated intermediate files confusing and only want them when really necessary. Again, I'm confused here as it would seem that any work to templatize this would want to wait until augur subsample exists (otherwise we're just spinning our wheels).

@trvrb
Member

trvrb commented Mar 21, 2024

Okay... maybe I'm catching up. The current YAML files generated by nextstrain_profiles/nextstrain-gisaid/generate-subsampling-config.py like nextstrain_profiles/nextstrain-gisaid/subsampling/global_6m.yaml are meant as just a refactor? And that these current YAML files would be replaced by the version that would be compatible with a future augur subsample? I have to admit, it's still seeming like work that doesn't need to be done yet, but I don't have a great sense for how you're planning to stage things.

A stepping stone towards generating these configs via a script.
These can no longer be used with the shift to using untemplated
subsampling configuration files.
This does not have any templates but does not make sense to keep around
after other geographically based defaults have been removed.
@victorlin victorlin force-pushed the victorlin/generate-subsampling-config branch from 11b4c64 to 37c68b8 Compare March 28, 2024 02:36
@victorlin
Member Author

The current YAML files … are meant as just a refactor? And that these current YAML files would be replaced by the version that would be compatible with a future augur subsample?

Yes, I'm planning to bridge the gap between the current setup and augur subsample by translating to individual subsampling config YAMLs, which will then be modified directly to show the difference in config schema.

I have to admit, it's still seeming like work that doesn't need to be done yet, but I don't have a great sense for how you're planning to stage things.

Fair point. Before seeing your comment (was offline on an airplane), I continued and expanded the script to cover all profiles/builds. By then, I also started thinking that the scoping of this PR isn't right.

I'm not sure where to take this PR. Maybe it should wait, and augur subsample should be added here directly. So far, it's been useful for exploring these ideas:

  1. Decoupling subsampling config from Snakemake config. This is necessary in order to test augur subsample on this workflow.
  2. Deduplicating subsampling config among profiles. Identical subsampling config can continue to exist separately across profiles such as gisaid/gisaid-21L/open, but now is a good opportunity to get rid of that duplication and define subsampling config in a place that's more shared, especially if we'll need to bulk update config schema for augur subsample.
  3. Dissecting subsampling logic. I made the script dynamically calculate sample sizes using weights, which is helpful in thinking about how to implement weighted subsampling even if we don't end up merging the script here.
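For the third point, one way to turn weights into integer sample sizes is largest-remainder apportionment. The sketch below is only an assumption about what such a calculation could look like; it is not the script from this PR, and the function name and example weights are hypothetical.

```python
def allocate(total, weights):
    """Split `total` sequences across groups in proportion to `weights`.

    Uses the largest-remainder method so the sizes are integers that sum
    exactly to `total`. Weights are assumed to be non-negative.
    """
    weight_sum = sum(weights.values())
    if weight_sum <= 0:
        raise ValueError("weights must sum to a positive number")

    # Exact (fractional) share for each group.
    exact = {name: total * weight / weight_sum for name, weight in weights.items()}

    # Start from the floor of each share.
    sizes = {name: int(share) for name, share in exact.items()}

    # Hand the leftover units to the groups with the largest fractional parts.
    leftover = total - sum(sizes.values())
    by_remainder = sorted(exact, key=lambda name: exact[name] - sizes[name], reverse=True)
    for name in by_remainder[:leftover]:
        sizes[name] += 1

    return sizes


# Hypothetical weights: the focal region gets three times the context's share.
sizes = allocate(4000, {"focal": 3, "context": 1})
```

Rounding each share independently can make the total drift by a few sequences; assigning the remainder explicitly keeps the total fixed, which matters if the overall maximum is a hard cap.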

@victorlin victorlin force-pushed the victorlin/generate-subsampling-config branch 2 times, most recently from 40d5dfe to fc84ca1 Compare April 11, 2024 00:32
There are a few changes that don't impact functionality.
This implementation has some issues, but it's a stepping stone in the
switch to use augur subsample.
@victorlin victorlin force-pushed the victorlin/generate-subsampling-config branch from a8f66bb to 3b60b82 Compare May 1, 2024 23:39
@victorlin
Member Author

Update: I've moved 3b60b82 over to #1106. Currently focusing on that instead of the larger subsampling changes in this PR.

Closing this for now.

@victorlin victorlin closed this May 23, 2024
@victorlin victorlin added the revisit sometime Useful to address but no bandwidth at the moment label May 23, 2024
Labels
revisit sometime Useful to address but no bandwidth at the moment
2 participants