Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incompatabilities between augur tree and augur ancestral #1360

Open
jameshadfield opened this issue Dec 15, 2023 · 0 comments
Open

Incompatabilities between augur tree and augur ancestral #1360

jameshadfield opened this issue Dec 15, 2023 · 0 comments
Labels
bug Something isn't working please take this issue Extra attention is needed priority: high To be resolved before other issues

Comments

@jameshadfield
Copy link
Member

jameshadfield commented Dec 15, 2023

Current Behavior

Augur tree will perform s/#/_/g replacement on strain names in a completely silent manner. This will leave the fasta alignment and the tree out-of-sync as the strain names no longer match, which will have downstream consequences. For instance, the following bugs arise when using the data in augur ancestral:

  1. If over 30% of terminal nodes have a name mismatch (i.e. over 30% of strain names had a # in them) then augur ancestral prints a warning and exits code 0. It's not at all obvious that the reason for this mismatch is ultimately augur tree. Additionally, because it exits code 0 (without producing the output JSON) other commands may run further obscuring the actual error.
  2. If fewer than 30% of tips had a # in their name then the commands will proceed; warnings will be printed but these are often lost in the sea of output that a pipeline produces. augur ancestral will (by default) infer the sequences on the missing tips. The end result is that we will end up missing mutations in a dataset or asserting that a genotype is present in the data when in fact the sequence data says otherwise. This is a much more serious bug than (1).

Here I've focused on augur ancestral but the underlying strain name manipulation will have consequences for any command which reads the tree + alignment.

Expected behavior

A series of augur commands should work together, we shouldn't cause data to become out-of-sync for the users.

Additionally, when augur ancestral encounters a problem it should error with a non-zero code. Update: a TreeTime error now causes a non-zero exit code via #1367, but this doesn't address the TreeTime warnings described below which don't cause an exit.

How to reproduce

Steps to reproduce the behaviour in case (1) above:

a. Create sequences.fasta

>seq#1
AAAAAAAACAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>seq#2
AAAAAAAACAAATAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>seq_3
AAAAAAAAAAAATAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

b. augur tree --alignment sequences.fasta --output tree.nwk
c. augur refine --keep-root --tree tree.nwk --output-tree tree.refine.nwk
d. augur ancestral --tree tree.refine.nwk --alignment sequences --output-node-data nt_muts.json
The output will look like

0.00    ***WARNING: TreeAnc._check_alignment_tree_gtr_consistency: NO SEQUENCE FOR
        LEAF: 'seq_1'

0.00    ***WARNING: TreeAnc._check_alignment_tree_gtr_consistency: NO SEQUENCE FOR
        LEAF: 'seq_2'
ERROR: TreeAnc._check_alignment_tree_gtr_consistency: At least 30\% terminal nodes cannot be assigned a sequence!
Are you sure the alignment belongs to the tree?

However the exit code is 0, and nt_muts.json is not produced.

To recreate case (2):

Repeat the above steps, but using sequences.fasta

>seq#1
AAAAAAAACAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>seq_2
AAAAAAAACAAATAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>seq_3
AAAAAAAAAAAATAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
>seq_4
AAAAAAAAAAAATAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

This time the augur ancestral will show the following warnings:

0.00    ***WARNING: TreeAnc._check_alignment_tree_gtr_consistency: NO SEQUENCE FOR
        LEAF: 'seq_1'

0.00    ***WARNING: TreeAnc: 1 nodes don't have a matching sequence in the
        alignment. POSSIBLE ERROR.
ancestral mutations written to nt_muts.json

Inspecting the produced nt_muts.json we see that a sequence for seq_1 has been inferred and it does not match the sequence of the input data seq#1:

seq#1 (input data)      AAAAAAAACAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
seq_1 (augur ancestral) AAAAAAAACAAATAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA

Possible solution

  • Augur tree exit upon discovery of # in strain names, and provide a simple command/script to modify the input data appropriately. (There may be other characters augur tree modifies too, but # is the one I've encountered.)

  • augur tree could modify the output tree file to reinstate the replaced characters

  • Augur ancestral should exit (with a non zero exit code) if any sequences are missing - otherwise how was the tree created with that strain in it?

Your environment: if running Nextstrain locally

augur 23.1.1

@jameshadfield jameshadfield added bug Something isn't working priority: high To be resolved before other issues please take this issue Extra attention is needed labels Dec 15, 2023
jameshadfield added a commit that referenced this issue Dec 20, 2023
Current behaviour is to exit code 0 ("success") if augur encounters
`TreeTimeError`. The initial implementation
(3256319) exited code 2 but this was
removed by 403da88

This bug was first observed (by me, at least) as part of
#1360
jameshadfield added a commit that referenced this issue Dec 20, 2023
Current behaviour is to exit code 0 ("success") if augur encounters
`TreeTimeError`. The initial implementation
(3256319) exited code 2 but this was
removed by 403da88

This bug was first observed (by me, at least) as part of
#1360
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working please take this issue Extra attention is needed priority: high To be resolved before other issues
Projects
None yet
Development

No branches or pull requests

1 participant