Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Other misc. VCF test data queries. #608

Open
jkbonfield opened this issue Nov 11, 2021 · 16 comments
Open

Other misc. VCF test data queries. #608

jkbonfield opened this issue Nov 11, 2021 · 16 comments
Labels

Comments

@jkbonfield
Copy link
Contributor

jkbonfield commented Nov 11, 2021

Some of this is an issue for the data, but some also needs some major clarifications in the spec.

  • Is contig "<1>" just a typo here:
@ seq4-head2[4.3/passed]; awk '!/^#/ {print $1,$2}' complexfile_passed_000.vcf
1 10583
<1> 10611
1 30923
1 46402
...
  • Illegal? header lines in complexfile_passed_000.vcf (also passed_meta_pedigreedb.vcf)

The VCF spec carefully lists the syntax of specific line types. These are what VCF calls "structured lines", but it's less clear on non-structured data. The spec mentions structured lines, but I could find no definition for unstructured. Is the opposite of structured "meta-information"? If so the spec doesn't say much about them other than that they are key/value and must be "well formed", but sadly doesn't explain what that means. That needs fixing. I assume they must still follow the same quoting rules.

Which leads me on to meta-characters. Some fields are quoted, and others not. Eg from the spec:

##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">

Here Description is quoted, presumably because it may contain spaces, or more importantly commas. Does it need to be quoted? Could Description=Genotype be used? Could we say Type="Integer"? There is no grammar or even a regexp for how to parse headers, which is a major problem.

So, on to the specifics which tripped up bcftools:

##AnalysisTitleBrackets=<"FINRISK: Whole-exome sequencing of Dietary, life style, and genetic determinants of obesity and metabolic syndrome (DILGOM)">

This one is entirely quoted, but it has no = in there at all. Hence it also fails the requirement of being key=value.

##pedigreeDB=<ftp://user@host:8080/path/to/pedigreeDB?arg1=db1>

Now this line does have an equals sign, so technically matches the key=value requirement. It clearly wasn't intended to be one though, but why is it failing? I assume bcftools dislikes some characters in the key, eg : or ?. What characters are permitted in the keys? When would we need to quote it? Can we even quote it if we desired to? What characters must be quoted. Obviously comma, and maybe <>, but = too? I don't see why it would for a value, but clearly it would for the key and we'd probably not want to have differing quote requires for both.

Either way, I believe these two lines to be illegal. In addition to the spec needing much more clarification as to what is acceptable.

  • All failed/*.vcf have ##CauseOfFailure= lines with no key=value formatting.

To be clear, the specification states that "It contains meta-information lines (prefixed with ##)" and "File meta-information is included after the ## string and must be key=value pairs."

I know they're failures, but the failure shouldn't be the thing describing the failure, as that makes validating that we fail rather hard (we fail for other reasons). They should all be of the form ##CauseOfFailure=<Reason="text">.

@jkbonfield
Copy link
Contributor Author

I'd also argue that we need bcf-comaptible copies of these files. The lack of INFO and Contig lines means it's currently not possible to do round-trip tests. We should have both minimal and fully-specified variants for each file to permit implementers to validate their BCF code.

@jkbonfield
Copy link
Contributor Author

Looking at genuine VCFs, I see 1000 genomes data containing:

##reference=file:///gpfs/commons/datasets/old-nygc-resources/GRCh38_1000genomes/GRCh38_full_analysis_set_plus_decoy_hla.fa

So maybe we've been accepting for years VCF headers that aren't key=value. In that case the specification needs changing to permit them, instead of explicitly forbidding them as it does now.

@jmarshall
Copy link
Member

That last one's easy: ##reference=file:///gpfs…plus_decoy_hla.fa is an unstructured meta-information line. The VCF spec says very little about unstructured header lines (and it could probably stand to say more). That one has key reference and value file:///gpfs…plus_decoy_hla.fa, so it fits the rule of “File meta-information is included after the ## string and must be key=value pairs” just fine.

“Unstructured header lines” are the ones that aren't structured; and “structured” ones are those that have <> around the value. Structured header lines have all the additional rules about how the subvalue fields within the angle brackets behave (in particular, they need to be a list of key=value pairs too — and that coincidence, not mentioned in the text, isn't helping the comprehension here). To be sure, the spec doesn't really just come out and say that…

That's how I interpret this verbiage anyway. It could certainly be more clearly written and comprehensive. The VCF spec has historically been fluffy enough that it needs a non-adversarial reader happy to constructively fill in the blanks and interpret the vagueness in a way consistent with extant VCF files… (It would be nice to improve that obvs, and you've listed a number of good header-related candidates for more clarity…)

@jmarshall jmarshall added the vcf label Nov 11, 2021
@jkbonfield
Copy link
Contributor Author

jkbonfield commented Nov 11, 2021

Oh, so when it says key = value it means a single ##key=value - singular value, and not ##name=<(key=value)*>? I assumed as it said "pairs" (pural) it was the latter, rather than the former. How does the parser distinguish between a line where the bit after equals is a single value vs a line where that bit is a key-value list? Is that what the <> mean?

Ie does it have some nested syntax where ##key=value is a single key, but value can be < (key=value,)* >, permitting key-value pairs themselves to be values?

Frankly the entire introduction needs rewriting, preferably by someone who understands VCF (ie not me). Perhaps starting with what all the meta-characters actually mean. It's clear as mud right now. :(

Edit: also before posting this issue, I'd already searched for unstructured and non-structured. It doesn't exist. I don't know where the distinction between structured and unstructured lines exists in the spec.

@jkbonfield
Copy link
Contributor Author

jkbonfield commented Nov 11, 2021

So going back to the two entries I see that cause bcftools breakage, is it the case that any "<" means structured and absence of "<" means unstructured? If so I assume it's failing because it's attempting to apply structured logic to an unstructured entry, and the error here is that those VCF header lines should be rewritten to exclude the angle brackets.

Edit: Yes! I see All structured lines that have their value enclosed within ``<>'' require an ID which must be unique within their type. I'm guessing it's implicit that lack of <> means unstructured. What's the syntax checking on unstructured? Anything? Ie value is [^<][:isprint:]*$ ?

I wonder though how we got them in the first place. I thought all of these test files were a product of the VCF syntax checker, in which case at least one person has interpretted the spec differently. :)

@jmarshall
Copy link
Member

Yep. I assumed “pairs” (plural) was because it was talking about multiple meta-information lines, or specifically the set of all the meta-information lines in a VCF file. That sentence:—

File meta-information is included after the ## string and must be key=value pairs.

has existed unchanged in the spec since the 2013 G1K wiki spec page, which is before the formalism about structured lines was introduced.

It is agreed (by me) that this is fabulously muddily written.

@jkbonfield
Copy link
Contributor Author

jkbonfield commented Nov 11, 2021

Following on from the conference call, with thanks to John for identifying the relevant text:

In answer to the first point above, "<1>" appears to be deliberate. In the spec we see:

https://github.com/samtools/hts-specs/blob/master/VCFv4.3.tex#L296-L297

  CHROM --- chromosome: An identifier from the reference genome or an angle-bracketed ID String "<ID>" pointing
  to a contig in the assembly file (cf. the ##assembly line in the header).
  All entries for a specific CHROM must form a contiguous block within the VCF file.

The second sentence there however is broken by this example, as we go from "1" to "<1>" back to "1". All entries are sequential positions, with "<1>" being an alternate allele part of chr "1". I believe the example considers all CHROM being a contiguous block as it equates 1 and alternate-allele-of-1 to be the same thing, but in practice that's challenging to handle.

Regarding the second problem above, I notice this file also contains:

##PEDIGREE=<ID=Pedigree1,Original=Something>

The "other" pedigree syntax. ;-)

The spec lists that syntax as well as ##pedigreeDB=URL. Note the lack of angle brackets. I think the use of <URL> is an error in the test file.

@jmarshall
Copy link
Member

jmarshall commented Nov 12, 2021

##PEDIGREE is a related header for specifying pedigree databases inline in the VCF file. It's a bit better described in the spec, but see #381 and #413 et al which note that it may or may not actually be much used in the wild and needs either removing or further describing.

On ##pedigreeDB which points to a separate .ped file (presumably), the plot thickens…

The original G1K wiki VCF ur-specs and the 4.1 and 4.2 documents have two instances of ##pedigreeDB both of which say:

or a link to a database:

##pedigreeDB=<url>

[…]
Alternately, if data on the genomes is compiled in a database, a simple pointer can be provided:

##pedigreeDB=<url>

The 4.3 document is similar, but f247cee changed one of the instances:

[§1.4.9] or a link to a database:

##pedigreeDB=URL

[§5.4.11] Alternately, if data on the genomes is compiled in a database, a simple pointer can be provided:

##pedigreeDB=<url>

Reading the tea leaves, it seems that PR #88 (which contained that commit) was making the claim that these angle brackets were merely metasyntactic variable notation not literal characters like the # and = signs. And that it inadvertently neglected to update the other instance. (Yes, it's totally not awesome that the spec has duplicate sections defining this stuff!)

My feeling is that that claim is surely correct and the other instance should be similarly updated (or, since we're a typeset document, we could I dunno use italics or something 😄). And hence the test files are indeed erroneous.

However as usual it really comes down to how ##pedigreeDB is used in the wild, if at all. Possibly @reedacartwright has some clues on that one — #88 (comment) makes him one of the few people on record as having tried to use this stuff…

@reedacartwright
Copy link

I've never used ##pedigreeDB in my tools, only ##pedigree tags.

If I had to pick, I'd ditch the existing VCF/BCF header format and replace it with YAML front matter. Where VCF format would look something like:

---
# Yaml formatted header meta data
---
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT NA00001 NA00002 NA00003

@d-cameron
Copy link
Contributor

The second sentence there however is broken by this example, as we go from "1" to "<1>" back to "1". All entries are sequential positions, with "<1>" being an alternate allele part of chr "1". I believe the example considers all CHROM being a contiguous block as it equates 1 and alternate-allele-of-1 to be the same thing, but in practice that's challenging to handle.

1 and <1> are definitely different thus mixing them voilates VCF ordering. There's no CHROM equivalences in the specs. In practice alt contig coordinates won't match primary, nor will there only be one alt contig so the feature implied in that example would be more of a hinderance than a benefit.

@jmarshall
Copy link
Member

The VCF spec carefully lists the syntax of specific line types. These are what VCF calls "structured lines", but it's less clear on non-structured data. The spec mentions structured lines, but I could find no definition for unstructured. Is the opposite of structured "meta-information"?

PR #620 attempts to clarify this section, by rewording it as outlined in #608 (comment).

@jmarshall
Copy link
Member

##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">

Here Description is quoted, presumably because it may contain spaces, or more importantly commas. Does it need to be quoted?

§1.4.2 states that for ##INFO, Description must be quoted.

§1.4.4 is not explicit, but states that ##FORMAT is “otherwise defined precisely as the INFO field“. So I would conclude that Description here is indeed required to be quoted too.

IMHO §1.4.3 and §1.4.4 could stand to be a bit more explicit…

@zaeleus
Copy link

zaeleus commented Feb 8, 2023

Could we say Type="Integer"?

With field ordering now clarified as insignificant in VCF 4.4, this is an outstanding question that needs to be resolved. The consequence of unordered fields increases the complexity of parsing otherwise.

What's the syntax checking on unstructured? Anything? Ie value is [^<][:isprint:]*$ ?

I'd also like this confirmed. Unfortunately, the spec reuses the same symbols for unstructured and structured lines (##key=value and ##key=<key=value,key=value,key=value,...>), which implies for an unstructured line, value can also be quoted and is not part of the value, i.e, the value of ##comment="hts-specs" is hts-specs, not "hts-specs".

@pd3
Copy link
Member

pd3 commented Feb 10, 2023

The format is parsimonious, it should be Type=Integer, not Type="Integer". Quoting was introduced for Description because it can contain spaces, as @jmarshall pointed out. Since there are only few single word types, quoting Type wouldn't add anything.

Similarly, ##comment="something" does not add any information and there is no danger of ambiguity, so shouldn't be quoted.

@jkbonfield
Copy link
Contributor Author

jkbonfield commented Feb 10, 2023

The fact we're having this conversation though indicates the spec is ambiguous, or at the very least unclear. It should (but is not) be "obvious" to a naive reader. The obvious interpretation I'd take is that values may be quoted, and that presumably there is some quote escape mechanism because that's a logical corrolary of adopting quoting. (This interpretation however would be wrong.)

What's certainly not obvious from a naive not-reading-the-spec fashion is that Description must be quoted. Even reading the spec, I still don't know if it's true that eg Type must not be quoted. What happens if we do? The spec needs to be much more explicit on these things.

All I can do is read between the lines, which gives me the gut feeling there are known keys and unknown (optional) keys. Optional keys have no known types, and are listed as requiring quoting, even if numeric. This implies the parser doesn't determine type by content but by decoration. Known keys have hard specified types, and as determined already the parser cannot intuit a type by content. From this I would assume the parsers would hard fail if a controlled vocabulary matched "Integer" instead of Integer, and that know keys must not be quoted (unless explicitly stated otherwise, in which case they must always be quoted).

That's pure guesswork though. The spec needs to remove the need for such speculation.

@d-cameron
Copy link
Contributor

21 June meeting proposed clarification:

  • Unstructured: allow anything after the =
  • Structured (map of key=value pairs)
    • ID, Number, Type never quoted
    • Anything else optionally quoted (alternative: always quoted)
    • Disallow quotes anywhere else
    • Require quotes if value contains , or >
    • No percent encoding

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: To do (backlog)
Development

No branches or pull requests

6 participants