Restrict 6m builds to tips in the last 6m (post-tree-building) #934

Open: wants to merge 2 commits into base: master

jameshadfield
Member

@jameshadfield jameshadfield commented Apr 28, 2022

This introduces a script which restricts the 6m builds to include tips from the previous 6 months only. This is done after tree building, so that tree reconstruction / inference can use data from the entire pandemic. After tips are restricted we then partition the tips based on clades and visualise them as subtrees. If a clade splits after the cutoff (i.e. in the last 6 months) then it's not drawn as a separate subtree.
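
A minimal sketch of the tip-restriction step described above, not the script from this PR. It assumes an Auspice-style nested tree where each node stores its decimal date under `node_attrs.num_date.value`; the helper names (`prune_old_tips`, `tip_names`, `make_node`), the toy tree and the cutoff value are all hypothetical. Internal nodes left with a single child are kept as-is, and the recursion would need replacing with an explicit stack for very deep trees.

```python
def prune_old_tips(node, cutoff):
    """Return a pruned copy of `node` keeping only tips dated >= cutoff.

    Returns None when no tip under `node` survives the cutoff.
    """
    children = node.get("children", [])
    if not children:
        # Tip: keep it only if it has a date at or after the cutoff.
        date = node.get("node_attrs", {}).get("num_date", {}).get("value")
        return dict(node) if date is not None and date >= cutoff else None
    kept = [p for p in (prune_old_tips(c, cutoff) for c in children) if p is not None]
    if not kept:
        return None
    pruned = dict(node)  # shallow copy so the input tree is left untouched
    pruned["children"] = kept
    return pruned


def tip_names(node):
    """List tip names in traversal order."""
    children = node.get("children", [])
    if not children:
        return [node["name"]]
    return [name for c in children for name in tip_names(c)]


def make_node(name, num_date, children=None):
    """Hypothetical helper that builds a toy node for this example."""
    node = {"name": name, "node_attrs": {"num_date": {"value": num_date}}}
    if children:
        node["children"] = children
    return node


tree = make_node("root", 2020.0, [
    make_node("internal", 2020.5, [
        make_node("old_tip", 2021.0),
        make_node("new_tip_1", 2022.2),
    ]),
    make_node("new_tip_2", 2022.3),
])

CUTOFF = 2021.8  # hypothetical decimal-year cutoff
restricted = prune_old_tips(tree, CUTOFF)
print(tip_names(restricted))  # ['new_tip_1', 'new_tip_2']
```

Because pruning happens on the finished tree, the inference itself still sees every sample; only the exported view loses the older tips.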

I went through a number of iterations of this approach, and there are lots of judgement calls to be made. Very open to modifications / changes / comments 😄

Specifically, most of my test datasets had some recent tips of an old clade (e.g. 20A, 20C), and the coalescence of those tips was a long time ago. This pushed the root of at least one subtree quite far back. Even just a few Delta strains will typically push the coalescence back to around the start of 2021.

For instance, here's the global 6m build unrestricted (left), restricted to 6M (middle) and 3M (right). Notice the difference in minimum date due to tips from 20B not appearing in the last 3 months:

image

Here are the same restrictions but for the build from #933. Note how BA.4/5 aren't split into separate subtrees, as they branch off after the cutoff date. (I played around with different cutoffs here; the further back you push the cutoff, the more structure comes through.)

image

And for good measure, here's the h1n1pdm tree focused on the past 6 months (left) and restricted to the last 6 months (right) (colours are different because they aren't defined in the dataset, they're created by auspice):

image

Trial builds running now: https://github.com/nextstrain/ncov/actions/runs/2237163006

@emmahodcroft
Member

This is cool James! Just to be sure I'm understanding, I'm guessing the idea here is to allow a more focused view of the tips, without having a tree that 'drags' back further into the past than people may be interested in looking? In other words, it allows a 'shortening' of the X-axis time span? (Or is there something grander at play than I'm picking up - very possible!)

In the same vein, when you say things like "push the coalescence back" you mean just visually, right? Since you say they're the same tree-building, the actual coalescence stays the same (seems to, across the views, I think), but the visual 'stretching' depends on how deeply branches go (which is definitely impacted by where you set the cut off).

Would the proposal be to make this a toggle-able view of the 6m builds, or the 'only' view? I do think it's very cool, but am wondering if people may get the idea that SARS-CoV-2 has split into different... somethings. I'll let the imagination of the headlines run wild in everyone's own head... 🙃 On the other hand, someone's always willing to misinterpret, and we can't let that stop us trying to get clearer & better visuals!

@jameshadfield
Member Author

> Just to be sure I'm understanding, I'm guessing the idea here is to allow a more focused view of the tips, without having a tree that 'drags' back further into the past than people may be interested in looking?

Exactly! It's to focus on the within-clade relationships of what's circulating recently, rather than convey the evolutionary history of the different clades.

If we showed the entire connected history of those selected tips then (a) the root node would often be so far back that we don't gain that much horizontal space and (b) it won't work for recombinants. So I'm partitioning the data (via clade label) and showing the connected history of those partitions. Unfortunately, if you get a couple of recent "20A" samples (for instance), then that set of tips coalesces a long time ago, thus pushing back the x-axis min-date. There are ways to fix this, for instance specifying another time limit and cutting branches off if they go before this, but they're not ideal.
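
To make the "a couple of recent 20A samples push the min-date back" point concrete, here is a rough sketch that partitions tips by clade label and finds the root of each partition's connected history. It is not the PR's implementation: it assumes Auspice-style nodes with `node_attrs.clade_membership.value` and `node_attrs.num_date.value`, and the helper names, toy tree and dates are made up for illustration.

```python
from collections import defaultdict


def attr(node, key):
    """Read node_attrs[key]['value'], or None if missing."""
    return node.get("node_attrs", {}).get(key, {}).get("value")


def paths_by_clade(node, path=(), groups=None):
    """Map each tip's clade label to a list of root-to-tip paths."""
    groups = defaultdict(list) if groups is None else groups
    path = path + (node,)
    children = node.get("children", [])
    if not children:
        groups[attr(node, "clade_membership")].append(path)
    for child in children:
        paths_by_clade(child, path, groups)
    return groups


def mrca(paths):
    """Deepest node shared by every path (for a single tip, the tip itself)."""
    shared = None
    for nodes in zip(*paths):
        if any(n is not nodes[0] for n in nodes):
            break
        shared = nodes[0]
    return shared


def make_node(name, num_date, clade, children=None):
    """Hypothetical helper that builds a toy node for this example."""
    node = {"name": name, "node_attrs": {
        "num_date": {"value": num_date},
        "clade_membership": {"value": clade},
    }}
    if children:
        node["children"] = children
    return node


# Toy tree: two recent 20A tips that only meet at the root, plus a 21J clade.
tree = make_node("root", 2020.0, "20A", [
    make_node("a1", 2022.1, "20A"),
    make_node("mid", 2020.4, "20A", [
        make_node("a2", 2022.2, "20A"),
        make_node("delta_root", 2021.1, "21J", [
            make_node("d1", 2022.0, "21J"),
            make_node("d2", 2022.3, "21J"),
        ]),
    ]),
])

subtree_root_dates = {
    clade: attr(mrca(paths), "num_date")
    for clade, paths in paths_by_clade(tree).items()
}
print(subtree_root_dates)  # {'20A': 2020.0, '21J': 2021.1}
```

In this toy example the two recent 20A tips coalesce at the root (2020.0), so the 20A subtree alone sets the x-axis minimum, while the 21J subtree only reaches back to 2021.1.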

@jameshadfield
Member Author

jameshadfield commented Apr 28, 2022

Force-pushed & rerunning (GitHub Action) as our docker image's python doesn't like walruses!

Update: This failed at a downstream rule 🤦

  File "scripts/fix-colorings.py", line 29, in recurse
    value = node.get("node_attrs", {}).get(EPIKEY, {}).get("value", False)
AttributeError: 'list' object has no attribute 'get'

Update again: https://github.com/nextstrain/ncov/actions/runs/2243157880 triggered after 9029cea

@emmahodcroft
Member

Thanks James! That makes sense and I really like this idea.

One thought - though it may be beyond the scope of this PR - is whether this is something that would be desirable/possible to introduce on the Auspice end rather than the build end. Like another version of the date slider, where instead of a 'vertical' cutoff (all being along the same line) it was a 'tip' cutoff (like the above) where it then only shows back to the coalescent point of tips in the specified period. And instead of 'greying out' the branches prior to this, do away with them completely and rescale the X-axis. This would let users view or not view the deeper tree, or set their limit to 3m, 6m, etc. (of course, there could be a default, too!), rather than 'losing' the information entirely. It could also then be applied to a tree without additional running!

Base automatically changed from multiple-timespans to master April 30, 2022 18:33
@trvrb trvrb force-pushed the restrict-tree-via-date-cutoff branch from 9029cea to d981566 Compare April 30, 2022 23:13
See script for comments / methods, as well as added configuration
documentation.
Time-restricted builds (as added in the previous commit) may store trees
as an array and thus the `fix-colorings.py` script needed to be adjusted
accordingly.
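
For illustration only, one way a recursive walker can accept both shapes is to normalise the tree to a list of roots before recursing. This is a sketch and not the actual change made to `fix-colorings.py`; `as_roots`, `collect_values` and `example_key` are hypothetical names, and only the `node.get("node_attrs", {}).get(..., {}).get("value", False)` lookup is taken from the traceback above.

```python
def as_roots(tree):
    """Normalise `tree` to a list of root nodes (a single dict becomes a one-item list)."""
    return tree if isinstance(tree, list) else [tree]


def collect_values(tree, key):
    """Collect node_attrs[key]['value'] from every node of one or many subtrees."""
    values = []

    def recurse(node):
        value = node.get("node_attrs", {}).get(key, {}).get("value", False)
        if value is not False:
            values.append(value)
        for child in node.get("children", []):
            recurse(child)

    for root in as_roots(tree):
        recurse(root)
    return values


single = {
    "name": "root",
    "node_attrs": {"example_key": {"value": "a"}},
    "children": [{"name": "tip", "node_attrs": {"example_key": {"value": "b"}}}],
}
subtrees = [single, {"name": "other_root", "node_attrs": {}}]

print(collect_values(single, "example_key"))    # ['a', 'b']
print(collect_values(subtrees, "example_key"))  # ['a', 'b']
```

Calling `recurse` directly on a list is what raises `AttributeError: 'list' object has no attribute 'get'`; wrapping the entry point keeps the per-node logic unchanged.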
@trvrb trvrb force-pushed the restrict-tree-via-date-cutoff branch from d981566 to 0b38fa5 Compare April 30, 2022 23:41
@jameshadfield
Member Author

> Like another version of the date slider, where instead of a 'vertical' cutoff (all being along the same line) it was a 'tip' cutoff (like the above) where it then only shows back to the coalescent point of tips in the specified period.

For sure, we want something like this (for multiple reasons) and I've sketched out a few ideas but nothing seems quite right at the moment in terms of UI. Note that the coalescent is often further back than one might think (e.g. the coalescent for tips in the last 6 months of nCoV often goes back to early 2020, depending on the subsampling).

I'm interested in pushing this subtree idea because as we start to sample X* lineages I think this is the best way to present the data. I'd argue it's how omicron should be presented as well. In terms of focusing on temporal ranges however, we may be able to get there by auspice improvements and/or labelling nodes such that we can explode the tree appropriately within auspice.

trvrb added a commit that referenced this pull request Oct 13, 2022
Draft commit to prune outgroup as (near) final step in workflow borrowing from @jameshadfield's earlier work on #934

DO NOT MERGE
Status: In Review