You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
This is a pretty fun bug but a bit hard to explain. Thankfully very easy to fix.
When using VCF inputs to augur ancestral we need a VCF file and a reference FASTA file. The expectation when we read this VCF file is that the REF allele matches the corresponding base(s) in the FASTA, but this is not asserted.
The most (I think) commonly used software which produces VCF files from large alignments is not aware of what the reference is, and thus given a position with segregating alleles {X,Y} it (presumably) chooses the most common as the REF allele. We can then end up in a situation for a given position where:
The reference FASTA has T
We have an alignment with strain1=A, strain2=T
The VCF file has REF=A, ALT=T, and genotypes of strain1=0, strain2=1
When we (TreeTime) read the VCF the only information we record is strain2 = T
When Augur ancestral reconstructs the mutations across the tree, it uses the FASTA reference at the root (T) + the SNP information (strain 2 has a mutation to T). We thus don't have any variation at this position and lose the A allele completely.
Possible solution
The simple solution, which I'll implement and tag this issue in, is for TreeTime's read_vcf to verify that the REF in the VCF is the same as the corresponding nucleotides in the FASTA.
However we also need a way to go from a large alignment to a VCF with reference bases matching a given FASTA. It shouldn't be hard to write a script to do this, but a bit of a pain.
Real world example
In the Cholera alignment I'm using to debug VCF-related things this occurs for 131 positions in the genome
The text was updated successfully, but these errors were encountered:
jameshadfield
changed the title
[ancestral] VCF parsing error resulting in missing variation
[ancestral] VCF REF != reference base resulting in missing variation
Dec 20, 2023
Current Behavior
This is a pretty fun bug but a bit hard to explain. Thankfully very easy to fix.
When using VCF inputs to
augur ancestral
we need a VCF file and a reference FASTA file. The expectation when we read this VCF file is that the REF allele matches the corresponding base(s) in the FASTA, but this is not asserted.The most (I think) commonly used software which produces VCF files from large alignments is not aware of what the reference is, and thus given a position with segregating alleles
{X,Y}
it (presumably) chooses the most common as the REF allele. We can then end up in a situation for a given position where:T
strain1=A, strain2=T
REF=A, ALT=T
, and genotypes ofstrain1=0, strain2=1
strain2 = T
T
) + the SNP information (strain 2 has a mutation toT
). We thus don't have any variation at this position and lose theA
allele completely.Possible solution
The simple solution, which I'll implement and tag this issue in, is for TreeTime's
read_vcf
to verify that the REF in the VCF is the same as the corresponding nucleotides in the FASTA.However we also need a way to go from a large alignment to a VCF with reference bases matching a given FASTA. It shouldn't be hard to write a script to do this, but a bit of a pain.
Real world example
In the Cholera alignment I'm using to debug VCF-related things this occurs for 131 positions in the genome
The text was updated successfully, but these errors were encountered: