multi allelic vcf: vcfanno is not respecting VCF number format and is flipping scores #87

Open
RoanKanninga opened this issue Jun 5, 2018 · 7 comments

Comments

@RoanKanninga

This one is quite complex to explain, so I will start with an example.
This is in my header:
CADD,Number=1
CADD_SCALED,Number=A

Say I have a multiallelic variant:
1 208063100 rs5780411 G GA,T

I would expect CADD_SCALED to have two values and CADD only one.
This works when my annotation file with the CADD/CADD_SCALED scores contains this position only once. It goes wrong when the annotation file has multiple lines for the same position with different ALT alleles, which is the case for CADD because there is a score for each ALT allele.
Although the header says CADD,Number=1, the CADD INFO field now has two values (one per ALT allele), and the scores are flipped: the CADD score for ALT allele 1 has the value of ALT allele 2 and vice versa.

I have attached the input (input.vcf), the output (annotated.vcf), the config (conf.toml) and the annotation file (whole.vcf.gz + index):
vcfAnno.tar.gz
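
The ordering expected for a Number=A field can be sketched in a few lines of Python. This is only an illustration of the expectation described above, not vcfanno code: the function name `values_in_alt_order` is hypothetical, the row order of the annotation file is an assumption, and the scores are the CADD_SCALED values quoted later in this thread.

```python
from typing import Dict, List, Optional


def values_in_alt_order(alts: List[str], per_alt: Dict[str, float]) -> List[Optional[float]]:
    """Return one value per ALT allele, in the order the ALTs appear in the query record.

    An ALT without a matching annotation row gets None.
    """
    return [per_alt.get(alt) for alt in alts]


# Query record from the issue: 1 208063100 rs5780411 G GA,T
alts = ["GA", "T"]

# Assumed annotation rows, one per ALT allele, listed in the opposite order
# to show that the row order must not leak into the output.
annotation_rows = [("T", 0.6), ("GA", 24.0)]
per_alt = dict(annotation_rows)

cadd_scaled = values_in_alt_order(alts, per_alt)
print(cadd_scaled)  # [24.0, 0.6]
```

Keying the values by ALT allele rather than by the order of the annotation rows is what keeps the scores from being flipped.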

@brentp
Owner

brentp commented Jun 5, 2018

I see what you mean. It should use CADD,Number=A. I would simply change your whole.vcf.gz to use Number=A for raw. I would accept a PR to make vcfanno detect a case like this (though I'm not sure how, because the header is written before any variants are observed), but I do not intend to fix it myself, since this is an edge case that is easily avoided and/or fixed.

The other "fix" would be to simply set Number=A whenever op=self but I don't like that solution either. I think what's there now is a good trade-off with usability (getting a single number in most cases) and completeness. I am open to hearing other ideas.

@garrettjstevens
Contributor

What about printing a warning when writing multiple values to a field whose (previously written) header says Number=1? I ask mostly because of the other point raised in this issue, the "scores have been flipped" part. When the VCF with the annotations has CADD,Number=1, the variants get annotated as CADD_SCALED=24,0.6;CADD=-0.3,3. When you change it to CADD,Number=A, the CADD scores come out in the correct order: CADD_SCALED=24,0.6;CADD=3,-0.3 (-0.3 and 3 swap places). A warning would be nice, since scores that are not in the same order as the ALTs are probably not what anyone expects from Number=1.

@brentp
Owner

brentp commented Jun 5, 2018

Somehow I missed that the alleles were flipped. That is indeed a bug. I'm looking into this and the other issue raised by @RoanKanninga now.

@brentp
Owner

brentp commented Jun 5, 2018

After much messing about, this is going to have to be indicated as a WARNING. I thought I could magically adjust the order, but this changes the behavior in cases where Number=1 is actually what is desired. I'll push a fix shortly once I have the other issue resolved.

brentp added a commit that referenced this issue Jun 5, 2018
re #83 and #87

if there are already values in the query info field for a variant with
multiple alternates, incoming values will only overwrite existing values
if they are non-nil (or non-zero values of the type).

thanks @RoanKanninga for reporting and providing test-cases.

when Number=1 in the annotation file (and therefore the input file)
and there are multiple alternates in the input file, the values
can be out of order. This now issues a warning indicating the file
and the field in question and noting that it can be mitigated by
decomposing the input file.
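
The overwrite rule in the first paragraph of this commit message can be sketched roughly as follows. This is a Python illustration under stated assumptions, not the Go code from the commit: `merge_non_empty` is a hypothetical name, `None` stands in for nil, and the "non-zero values of the type" part of the rule is left out for brevity.

```python
from typing import List, Optional


def merge_non_empty(
    existing: List[Optional[float]], incoming: List[Optional[float]]
) -> List[Optional[float]]:
    """Overwrite a slot only when the incoming value is not None."""
    if len(existing) != len(incoming):
        raise ValueError("both lists need one slot per ALT allele")
    return [old if new is None else new for old, new in zip(existing, incoming)]


# Two ALT alleles, so two slots. Each annotation row fills only its own slot.
cadd: List[Optional[float]] = [None, None]
cadd = merge_non_empty(cadd, [3.0, None])   # row for the first ALT
cadd = merge_non_empty(cadd, [None, -0.3])  # row for the second ALT
print(cadd)  # [3.0, -0.3]
```

Without the `None` check, the second row would blank out the value written by the first.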
@RoanKanninga
Author

Hi Brent,
thanks for all the work.
My example is probably a bit confusing, since CADD has two different scores for the two ALT alleles.

What I really want is just one value for my Number=1 field.
My real case is this:
Header:
AN,Number=1

1 123456 . A C,G AN=24,24

The INFO field AN should always be Number=1, since it is the total number of alleles. But what I now see in my data is, for example, AN=24,24 instead of AN=24.
My annotation source file contains (like the CADD annotation file) one AN value for each ALT allele, like this:
1 123456 . A C AN=24
1 123456 . A G AN=24

So this is my real problem.
A downstream in-house tool we are using complains that 24,24 is not an INT (and it is correct, since it expects one value and not several). I can work around this by making the AN field Number=A, but that is suboptimal.

@garrettjstevens
Contributor

Can you use "first" instead of "self" in the ops field of the config file? That should grab only a single value instead of multiple.
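
The difference this suggestion relies on can be illustrated with a small Python sketch. "self" and "first" are the op names from this thread; the functions `op_self_like`, `op_first_like` and `info_entry` are hypothetical stand-ins that mimic the described behaviour and are not vcfanno's implementation.

```python
from typing import List, Union


def op_self_like(values: List[int]) -> List[int]:
    """Keep every value found, one per matching annotation row."""
    return list(values)


def op_first_like(values: List[int]) -> int:
    """Keep only the first value found."""
    if not values:
        raise ValueError("no annotation values to choose from")
    return values[0]


def info_entry(key: str, value: Union[int, List[int]]) -> str:
    """Format a key/value pair the way it appears in a VCF INFO column."""
    if isinstance(value, list):
        return key + "=" + ",".join(str(v) for v in value)
    return f"{key}={value}"


# Two annotation rows (ALT C and ALT G) each carry AN=24.
an_values = [24, 24]
print(info_entry("AN", op_self_like(an_values)))   # AN=24,24
print(info_entry("AN", op_first_like(an_values)))  # AN=24
```

Taking the first value only gives a sensible result when the field is identical across all rows for a site, as is the case for AN here.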

@brentp
Owner

brentp commented Aug 30, 2018

Some of this is addressed in the latest release.
