Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using vcfanno to annotate SV calls with GNOMAD SV frequencies #106

Open
naumenko-sa opened this issue Mar 18, 2019 · 10 comments
Open

Using vcfanno to annotate SV calls with GNOMAD SV frequencies #106

naumenko-sa opened this issue Mar 18, 2019 · 10 comments

Comments

@naumenko-sa
Copy link

Hello, Brent!

Thanks for the very useful vcfanno tool!

I'd like to start using it for annotation of structural variants (SV) in the same way how it works for small variant vcf files.

Recently GNOMAD released its SV frequencies, and I gave it a try:

  • minimal conf and lua files that you are using.
    crg.sv.vcfanno.conf:
[[annotation]]
file="gnomad_v2_sv.sites.vcf.gz"
fields=["popmax_af"]
names=["gnomad_sv_popmax_af"]
ops=["self"]

no lua

=============================================
vcfanno version 0.3.1 [built with go1.11]

see: https://github.com/brentp/vcfanno
=============================================
vcfanno.go:115: found 1 sources from 1 files
vcfanno.go:143: using 2 worker threads to decompress query file
api.go:804: WARNING: using op 'self' when with Number='1' for 'popmax_af' from 'gnomad_sv.vcf.gz' can result in out-of-order values when the query is multi-allelic
api.go:805:        : this is not an issue if the query has been decomposed.
bix.go:450: no end: gnomad_sv.vcf.gz 1 9999 N <DUP>
bix.go:450: no end: gnomad_sv.vcf.gz 1 14499 N <DUP>
bix.go:450: no end: gnomad_sv.vcf.gz 1 20649 N <DUP>
bix.go:450: no end: gnomad_sv.vcf.gz 1 20999 N <DEL>
bix.go:450: no end: gnomad_sv.vcf.gz 1 39999 N <DUP>
bix.go:450: no end: gnomad_sv.vcf.gz 1 40949 N <DEL>
bix.go:450: no end: gnomad_sv.vcf.gz 1 45999 N <CN=0>
bix.go:450: no end: gnomad_sv.vcf.gz 1 45999 N <CN=1>
bix.go:450: no end: gnomad_sv.vcf.gz 1 45999 N <CN=2>
bix.go:450: no end: gnomad_sv.vcf.gz 1 45999 N <CN=3>
bix.go:450: no end: gnomad_sv.vcf.gz 1 45999 N <CN=4>
bix.go:450: no end: gnomad_sv.vcf.gz 1 45999 N <CN=5>
bix.go:450: no end: gnomad_sv.vcf.gz 1 45999 N <CN=6>
bix.go:450: no end: gnomad_sv.vcf.gz 1 45999 N <CN=7>
bix.go:450: no end: gnomad_sv.vcf.gz 1 45999 N <CN=8>
bix.go:450: no end: gnomad_sv.vcf.gz 1 50940 N <DEL>
bix.go:450: no end: gnomad_sv.vcf.gz 1 62398 N <DEL>
bix.go:450: no end: gnomad_sv.vcf.gz 1 66349 N <DEL>
bix.go:450: no end: gnomad_sv.vcf.gz 1 93999 N <CN=0>
bix.go:450: no end: gnomad_sv.vcf.gz 1 93999 N <CN=1>
bix.go:450: no end: gnomad_sv.vcf.gz 1 93999 N <CN=2>
bix.go:450: no end: gnomad_sv.vcf.gz 1 93999 N <CN=3>
bix.go:450: no end: gnomad_sv.vcf.gz 1 93999 N <CN=4>
bix.go:450: no end: gnomad_sv.vcf.gz 1 93999 N <CN=5>
bix.go:450: no end: gnomad_sv.vcf.gz 1 93999 N <CN=6>
bix.go:450: no end: gnomad_sv.vcf.gz 1 118398 N <DEL>
bix.go:450: no end: gnomad_sv.vcf.gz 1 123349 N <CN=0>
bix.go:450: no end: gnomad_sv.vcf.gz 1 123349 N <CN=1>
bix.go:450: no end: gnomad_sv.vcf.gz 1 123349 N <CN=2>
bix.go:450: no end: gnomad_sv.vcf.gz 1 123349 N <CN=3>
bix.go:450: no end: gnomad_sv.vcf.gz 1 123349 N <CN=4>
bix.go:450: no end: gnomad_sv.vcf.gz 1 123349 N <CN=5>
bix.go:450: no end: gnomad_sv.vcf.gz 1 123349 N <CN=6>
bix.go:450: no end: gnomad_sv.vcf.gz 1 136999 N <DUP>
bix.go:450: no end: gnomad_sv.vcf.gz 1 141474 N <DUP>
bix.go:450: no end: gnomad_sv.vcf.gz 1 149999 N <DEL>
bix.go:450: no end: gnomad_sv.vcf.gz 1 151999 N <DUP>
bix.go:450: no end: gnomad_sv.vcf.gz 1 156999 N <DUP>
vcfanno.go:241: annotated 1 variants in 0.00 seconds (545.9 / second)

There is a result file: test.annotated.vcf.gz, but it does not contain the gnomad_sv_popmax_af annotation.

What would be the algorithm of SV matching in vcfanno? For small variants it is just chr:pos:ref:alt, but for SVs it is different: min 50% reciprocal overlap between features (END-START) or similar.

Sergey

brentp added a commit to brentp/bix that referenced this issue Mar 18, 2019
@brentp
Copy link
Owner

brentp commented Mar 18, 2019

I have pushed a fix for this. Would you try the attached binary?
vcfanno_linux64_fix.gz

@brentp
Copy link
Owner

brentp commented Mar 18, 2019

you'll need to set

fields=["POPMAX_AF"]

and then you'll probably want -permissive-overlap or it will only get exact matches.

@naumenko-sa
Copy link
Author

Thanks @brentp !

This binary works. However, even with -permissive-overlap option on, annotation happens only when POS in the query is the same as POS in the GNOMAD. When I tried to annotate an SV [158,000-166,000] with gnomad [157,000-166,000] it would not annotate.

Sergey

@brentp
Copy link
Owner

brentp commented Mar 18, 2019 via email

@naumenko-sa
Copy link
Author

naumenko-sa commented Mar 19, 2019

Thanks @brentp!

I've got it - when working with intervals, it is better to annotate a vcf file vs bed file rather than vs vcf.
Still I have some minor questions and suggestions, if you don't mind.

My new config

[[annotation]]
file="gnomad_sv.bed.gz"
columns=[39]
names=["gnomad_sv_popmax_af"]
ops=["self"]
  • query [157000,166000]. It picks up [152000,177400] event rather than [157000,166000], but it is ok, probably, gnomad should have done a better job of merging those overlapping intervals.
  • query [152000,177400] picks up [137000,155500], 3.5k overlap instead of [152000,177400] 25k overlap. Again, a problem with database having overlapping events.

Probably, before using gnomad_sv for annotation we have to 'normalize' gnomad_SV database like we did for small variants. However, SVs are large, and there are rare deletions overlapping with frequent duplications in the population. What is your take on this problem? Would you suggest to split gnomad bed file into several files according to types (DEL, DUP) and annotate separately?

A test of sensitivity of vcfanno:
I picked up a deletion in GNOMAD which did not overlap with anything:
1 50941 51053 gnomAD_v2_DEL_1_3

  • query [50941,51053] with "first" it gives GNOMAD_SV_POPMAX_AF=NA, with "self" it is "NA,0.008589". Why is that?
  • query [50941,51053] with changed type (alt field and SVTYPE) to DUP instead of DEL - matches. So vcfanno does not account for SV type (alt allele in the vcf), just for the interval, which is good, because different caller (in QRY and in the DB) may call the different events.
  • query [50942,51053] - matches
  • query [50998,51053] - overlap 56/112 - matches
  • query [50999,51053] - overlap 55/112 < 50% - matches
  • query [51052,51053] - overlap 2/112 < 1% - matches

Do you think that the last overlap is too permissive and something like % required reciprocal overlap could be added here in vcfanno like in bedtools? I'd happy to contribute here, just point me out to the right place. Another suggestion might be a flag to use an additional INFO field when matching (i.e. SVTYPE in vcf and a certain column in BED).

Or maybe it is possible to allow users to custom calculation on fields before annotation happens, i.e. to define what is matching with lua as it happens with post-annotation?

Thanks!
Sergey

@brentp
Copy link
Owner

brentp commented Mar 19, 2019

Hi, if I understand correctly, you are describing what appear to be bugs in vcfanno--variants that should be getting annotated, but are not. Is that correct? Are you sure both files are sorted and that the file you are annotating with is indexed correctly?
Can you share a small query and annotation file and conf where it does not annotate as expected?

@naumenko-sa
Copy link
Author

naumenko-sa commented Mar 21, 2019

Hi @brentp !

I see two issues. In both annotation happens.
One issue is incorrect annotation, another is too permissive annotation (better to have no annotation there).
To reproduce both issues you may use these files:

[[annotation]]
file="gnomad.bed.gz"
columns=[39]
names=["gnomad_sv_popmax_af"]
ops=["self"]

Command: vcfanno vcfanno_bed.conf test.vcf.gz | bgzip -c > test.annotated.vcf.gz

  1. Annotation is "NA,0.008589" instead of "0.008589".
    When I use 'first' the annotation is 'NA'
  2. Variant 1:51052-51053 in the vcf matches with 1:50941-51053 in the gnomad.bed file.
    It better should not be reported, as overlap is just 2pb out of 113.

Works the same way with and without -permissive-overlap, and vcfanno 0.3.1 and 0.3.2.

Sergey

@ajaarma
Copy link

ajaarma commented May 17, 2019

I guess the problem is related to finding reciprocal overlap and currently most of the existing methods dont deal with it. Has this been resolved yet in vcfanno? Eagerly waiting for it.

@snashraf
Copy link

I am having similar issue. Did someone has an update on this?

@brentp
Copy link
Owner

brentp commented Feb 20, 2020

hi, I won't have time to work on this. I think the machinery is in place inside vcfanno for the annotation which can be followed up with some post-processing to require a certain amount of overlap.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants