Tolerant VCF parsing #156

Open
drtconway opened this issue Aug 22, 2023 · 3 comments

@drtconway

G'day. Thanks for making a nice tool.

I'm trying to use vcfanno (0.3.5, linux binary) with a large combined VCF of gnomad v3.1. The combined bgzipped file is ~2TB, so obviously manipulating it is inconvenient at best.

vcfanno is aborting on two header lines. I don't know if they are standard in the gnomad downloads:

$ ./vcfanno_linux64 config.toml x.vcf.gz > y.vcf

=============================================
vcfanno version 0.3.5 [built with go1.19.3]

see: https://github.com/brentp/vcfanno
=============================================
vcfanno.go:116: found 6 sources from 1 files
vcfanno.go:146: using 2 worker threads to decompress bgzip file
api.go:796: header error in extra field: VEP version: v101. [line: 914]
header error in extra field: dbSNP version: b154. [line: 915]
$

The offending lines in the gnomad VCF are:

##VEP version: v101
##dbSNP version: b154

For reference, the config.toml I am using is:

[[annotation]]
file="/hpc/genomeref/hg38/annotation/gnomad/gnomad.genomes.v3.1.sites.combined.vcf.bgz"
# ID and FILTER are special fields that pull the ID and FILTER columns from the VCF
fields = [ "ID", "FILTER", "AC", "AN", "AF", "popmax" ]
ops    = [ "self", "self", "self", "self", "self", "self" ]
names  = [ "gnomad_ID", "gnomad_FILTER", "gnomad_AC", "gnomad_AN", "gnomad_AF", "gnomad_popmax" ]

[[postannotation]]
fields=["ANN"]
op="delete"

Any chance that the VCF parsing could be made a bit more tolerant for headers? It would be pretty painful to have to modify the GnomAD VCF.

Tom.

@brentp
Owner

brentp commented Aug 22, 2023

Hi, can you show me where to find this? I looked in this one: https://storage.googleapis.com/gcp-public-data--gnomad/release/3.1.2/vcf/genomes/gnomad.genomes.v3.1.2.sites.chrY.vcf.bgz
and don't see those values

@brentp
Owner

brentp commented Aug 22, 2023

The spec: https://samtools.github.io/hts-specs/VCFv4.2.pdf
says:

1.2 Meta-information lines
File meta-information is included after the ## string and must be key=value pairs.

So, I get that you want more lenient parsing, but do other parsers handle this? And is it something added at your institution, or does it come from the original gnomad files?
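
To make the quoted rule concrete, here is a minimal Python sketch that tests whether a `##` line has the key=value shape. The regular expression and the name `is_conformant_meta` are assumptions made for illustration; this is not how vcfanno itself validates headers.

```python
import re

# Assumed pattern: "##", a non-empty key containing no "=", then "=", then a non-empty value.
META_RE = re.compile(r"^##[^=\s][^=]*=.+$")

def is_conformant_meta(line: str) -> bool:
    """Return True if a '##' header line looks like a key=value pair."""
    return bool(META_RE.match(line.rstrip("\n")))

# The first line is a standard header; the other two are the lines reported in this issue.
for line in ["##fileformat=VCFv4.2", "##VEP version: v101", "##dbSNP version: b154"]:
    print(line, "->", is_conformant_meta(line))
```

Under this check, the two reported lines fail only because they contain no `=` after the `##` prefix.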

@drtconway
Author

Thanks for the fast response!

Yeah, I get that it's non-conformant. I like standards, and I think they are important, so I am sympathetic to the "your data is drunk. Come back when it's sober!" argument.

Especially when it comes to the metadata, I think there are two kinds of non-conformance. In some cases the non-conformance leads to a situation where the program can't figure out how to produce correct output. In other cases the problem is essentially cosmetic and is orthogonal to the production of correct output.

I am pretty sure the file is derived from the individual chromosome files by running VEP and concatenating them, but I don't know the precise provenance. I'm still trying to find out.

The Python and Rust libraries I use (and the C++ I've written) ignore non-conformant meta lines with a warning when reading, but scrupulously make sure they only emit conformant data.
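
The "ignore with a warning" behaviour described above could look roughly like the following Python sketch. The function name `tolerant_header` and the simple "contains an `=`" heuristic are assumptions for illustration; they are not taken from any of the libraries mentioned or from vcfanno.

```python
import sys
from typing import Iterable, Iterator, TextIO

def tolerant_header(lines: Iterable[str], warn: TextIO = sys.stderr) -> Iterator[str]:
    """Yield VCF lines, skipping '##' lines that are not key=value pairs.

    Heuristic (an assumption): a meta line is accepted if anything after
    the leading '##' contains an '='. Skipped lines are reported on `warn`.
    """
    for number, line in enumerate(lines, start=1):
        if line.startswith("##") and "=" not in line[2:]:
            print(
                f"warning: skipping non-conformant meta line {number}: {line.rstrip()}",
                file=warn,
            )
            continue
        # Conformant meta lines, the #CHROM line, and records pass through unchanged.
        yield line
```

Because the non-conformant lines are dropped on the way in, anything written back out contains only conformant meta lines, which matches the read-leniently, write-strictly approach described above.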
