Keep small number of BA.2 on the tree and not filtered by clock_filter #874

Open
danrlu opened this issue Feb 28, 2022 · 2 comments
Labels
enhancement New feature or request

Comments

@danrlu
Contributor

danrlu commented Feb 28, 2022

Context
On a general, all-lineage tree where we have only 1 focal BA.2 sample, which pulls in just a few contextual BA.2 samples, most BA.2 samples end up in the excluded_by_diagnostics.txt file. My best understanding is that they are removed by the clock_filter set here. Because there are not enough (<100) BA.2 samples, their offset falls back to the default of 2 (per here). Using the GenBank open data that Nextstrain prepares (thank you!), I calculated the deviation shown below, and it is near 25 for BA.2 (21L), so the combination of the default offset = 2 and clock_filter = 20 will nicely exclude many of them XD
image
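
To make the arithmetic above concrete, here is a minimal sketch. It assumes the filter drops a sample when the absolute difference between its clock deviation and its clade offset exceeds clock_filter. The function name and the exact form of the comparison are assumptions for illustration, not the workflow's actual code.

```python
def excluded_by_clock_filter(deviation, clade_offset, clock_filter):
    """Return True if a sample would be dropped under the assumed rule."""
    return abs(deviation - clade_offset) > clock_filter


# Values quoted in this issue: deviation near 25 for BA.2 (21L),
# default offset 2, clock_filter 20.
ba2_default = excluded_by_clock_filter(25, clade_offset=2, clock_filter=20)

# Hypothetical case: the offset is estimated from BA.2 itself (near 25).
ba2_estimated = excluded_by_clock_filter(25, clade_offset=25, clock_filter=20)

print(ba2_default, ba2_estimated)  # prints: True False
```

Under that assumption, the residual with the default offset is 23, which is above 20, so the sample is excluded. With a clade-appropriate offset the residual is close to 0 and the sample stays.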

Description
Have a way to more robustly keep lineages with a large number of real mutations in the tree.

Possible solution
For our build, I (lacking enough Python skills) changed the code to use a much smaller number of samples to calculate the offset. A more robust solution may be to feed in a fixed offset for each clade, calculated from a large dataset, rather than relying on the samples in each particular tree.
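
A rough sketch of the fixed-offset idea, assuming a lookup table keyed by clade name with a fallback to the default offset of 2. The table, its single entry (taken from the "near 25" figure above), and the function name are all hypothetical and only meant to show the shape of the approach.

```python
DEFAULT_OFFSET = 2.0  # default offset mentioned in this issue

# Hypothetical table: in practice the values would be computed once
# from a large dataset and shipped with the workflow.
FIXED_CLADE_OFFSETS = {
    "21L": 25.0,  # BA.2, "near 25" per the calculation above
}


def offset_for_clade(clade, fixed_offsets=FIXED_CLADE_OFFSETS,
                     default=DEFAULT_OFFSET):
    """Look up a precomputed offset; fall back to the default if missing."""
    return fixed_offsets.get(clade, default)


print(offset_for_clade("21L"))             # prints: 25.0
print(offset_for_clade("unlisted-clade"))  # prints: 2.0
```

With a table like this, the offset no longer depends on how many samples of a clade happen to be in a particular tree.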

Somewhat related to #852.

@corneliusroemer
Member

The enhancement request is easy to satisfy by factoring out the min_clade_member_count_threshold_for_offset of 100 as an optional parameter (with a default of 100 for backwards compatibility). If that number is made smaller, the clock filter becomes noisier for small clades, but that is sometimes (at the user's risk) better than throwing out all clade members because the clade doesn't reach 100.

Within Nextstrain we don't have this problem (usually) because new clades usually have more than 100 members in GISAID/GenBank right from the beginning. The 100 may thus not have been very carefully chosen.
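
A minimal sketch of the optional parameter described above, assuming the per-clade offset is estimated from the clock deviations of that clade's members and falls back to the default of 2 when there are too few. The function name, its signature, and the use of the median as the estimator are assumptions for illustration. The real script may compute the offset differently.

```python
from statistics import median

DEFAULT_OFFSET = 2.0  # default offset mentioned in this issue


def estimate_clade_offset(deviations,
                          min_clade_member_count_threshold_for_offset=100,
                          default_offset=DEFAULT_OFFSET):
    """Estimate a clade offset, or fall back when the clade is too small.

    The threshold defaults to 100 for backwards compatibility.
    """
    if len(deviations) < min_clade_member_count_threshold_for_offset:
        return default_offset
    # Assumption: the median stands in for whatever estimator is used.
    return median(deviations)


few_ba2 = [24.0, 25.0, 26.0]  # made-up deviations for a small clade

print(estimate_clade_offset(few_ba2))
print(estimate_clade_offset(few_ba2,
                            min_clade_member_count_threshold_for_offset=3))
```

The first call keeps the current behaviour and returns the default. The second lowers the threshold, so the offset comes from the three samples themselves, which is noisier but avoids discarding the whole clade.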

@danrlu
Contributor Author

danrlu commented Mar 3, 2022

Updated:
We're wrapping up this sprint and will leave this issue to the original assignee ;)

Original:
We're planning to tackle #852 this sprint. @corneliusroemer do you want us to bring this one along as well (to surface min_clade_member_count_threshold_for_offset as a parameter, per your suggestion)? Together they control the clock filter (is that right??)

Status: Prioritized