Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

vcf: Invalid FORMAT key in passed example #689

Open
zaeleus opened this issue Dec 14, 2022 · 4 comments
Open

vcf: Invalid FORMAT key in passed example #689

zaeleus opened this issue Dec 14, 2022 · 4 comments
Assignees
Labels

Comments

@zaeleus
Copy link

zaeleus commented Dec 14, 2022

The last record in test/vcf/4.3/passed/passed_body_format.vcf has a FORMAT key named G%3AS (percent-decoded to G:S), which is an invalid identifier. From The Variant Call Format Specification: VCFv4.3 and BCFv2.2 (2022-08-22) § 1.6.2 "Genotype fields":

colon-separated FORMAT keys matching the regular expression ^[A-Za-z_][0-9A-Za-z_.]*$

@d-cameron
Copy link
Contributor

d-cameron commented Dec 15, 2022

Agreed.

This touches on a two larger problems with percent encoding in VCF:

  • S1.2 doesn't explicitly state where percent-encoded values are allowed
  • % itself is not a reserved value

This results in a parsing ambiguity as an INFO field can have the form INFO_KEY=%25 and it is ambiguous as to whether the intended value is a single % or the literal string %25. IIRC, the intent was for, whereever percent encoding is supported, for colons, semicolons, equal, percent, common, CR, LF, and TAB, to be encoded as %3A, %3B, %3D, %25, %2C, %0D, %0A, %09 respectively, and for exactly these 9 strings to be decoded to their corresponding character. This allows

Missing from the specs are 1) an explicit list of where percent encoded can/cannot be used, 2) an explicit list of what does/does not require percent encoding, and 3) what to do other values that looks percent-encoded

IIRC, the intent was for
(1) - just the INFO and FORMAT field values that needed encoding (the specs only explicitly mention encoding in the 1.6.1.8INFO section).
(2) - encode/decode all 8 reserved values
(3) - treat everything else as literals (including % not followed by one of the 8 possible encoding) so as to maximises backward compatibility with 4.2. That is, VCF percent-encoding is the equivalent of running 8 string replaces when parsing/encoding.

@d-cameron
Copy link
Contributor

The other interpretation of the specs is that percent-encoding works on any of the 8 reserved characters anywhere in a VCF file. I'm less keen on this interpretation as it's really not needed elsewhere except if you want to use contig names with characters that are no reserved in SAM but are in VCF.

@jkbonfield jkbonfield added the vcf label Feb 2, 2023
@d-cameron d-cameron self-assigned this Apr 25, 2023
@d-cameron
Copy link
Contributor

Rereading the specs, s1.2 could do with a bit more clarification. Namely:

  • Percent encoding is ONLY used for reserved characters
  • Explicitly state which fields can/cannot be percent encoded. E.g. what about percent encoding of " in the headers? Should these be decoded or treated as literal percentages? What about reserved characters in CHROM, ID or FILTER?
  • A % that does not decode to reserved character should be treated as a literal % (so we don't break VCF with values like KEY=97% in them (which I've seen in the wild).
    • Should this be for only reserved character in that context or the 8 reserved characters listed (only really impacts colon and semi-colon).

d-cameron pushed a commit that referenced this issue Jun 20, 2023
@jmarshall
Copy link
Member

Previously @d-cameron wrote:

[…] except if you want to use contig names with characters that are no reserved in SAM but are in VCF.

Which characters do you have in mind that are reserved in one but not the other? As far as I am aware, the rules are aligned between SAM and VCF. The only potential difference I am aware of is #711, on which your opinion would be appreciated.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Status: To do (backlog)
Development

No branches or pull requests

4 participants