Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ancestral] VCF REF != reference base resulting in missing variation #1368

Open
jameshadfield opened this issue Dec 20, 2023 · 1 comment
Open
Assignees
Labels
bug Something isn't working

Comments

@jameshadfield
Copy link
Member

jameshadfield commented Dec 20, 2023

Current Behavior

This is a pretty fun bug but a bit hard to explain. Thankfully very easy to fix.

When using VCF inputs to augur ancestral we need a VCF file and a reference FASTA file. The expectation when we read this VCF file is that the REF allele matches the corresponding base(s) in the FASTA, but this is not asserted.

The most (I think) commonly used software which produces VCF files from large alignments is not aware of what the reference is, and thus given a position with segregating alleles {X,Y} it (presumably) chooses the most common as the REF allele. We can then end up in a situation for a given position where:

  • The reference FASTA has T
  • We have an alignment with strain1=A, strain2=T
  • The VCF file has REF=A, ALT=T, and genotypes of strain1=0, strain2=1
  • When we (TreeTime) read the VCF the only information we record is strain2 = T
  • When Augur ancestral reconstructs the mutations across the tree, it uses the FASTA reference at the root (T) + the SNP information (strain 2 has a mutation to T). We thus don't have any variation at this position and lose the A allele completely.

Possible solution

The simple solution, which I'll implement and tag this issue in, is for TreeTime's read_vcf to verify that the REF in the VCF is the same as the corresponding nucleotides in the FASTA.

However we also need a way to go from a large alignment to a VCF with reference bases matching a given FASTA. It shouldn't be hard to write a script to do this, but a bit of a pain.

Real world example

In the Cholera alignment I'm using to debug VCF-related things this occurs for 131 positions in the genome

@jameshadfield jameshadfield added the bug Something isn't working label Dec 20, 2023
@jameshadfield jameshadfield self-assigned this Dec 20, 2023
@jameshadfield jameshadfield changed the title [ancestral] VCF parsing error resulting in missing variation [ancestral] VCF REF != reference base resulting in missing variation Dec 20, 2023
@jameshadfield
Copy link
Member Author

This was somehow missed during #1355. This TreeTime branch includes a test highlighting this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

1 participant