vcf: Contig name rules contradiction #711

Open
zaeleus opened this issue Mar 26, 2023 · 1 comment

zaeleus commented Mar 26, 2023

§ 1.4.7 "Contig field format" states

Contig names follow the same rules as the SAM format’s reference sequence names

But subsequently

The contig names must not use a reserved symbolic allele name.

This means that contig names do not follow the same rules as SAM. For example, "DEL" is a valid name in SAM but not in VCF.

Member

jmarshall commented Mar 27, 2023

The “must not use a reserved symbolic allele name” text was added in c50589b as part of PR #88, in response to @d-cameron's #89 (reformatted here):

[ambiguities/parsing issues in] contig names

  1. <DEL> is a valid contig name

    Recommended solution: Valid characters should use the SAM regex of "[!-)+-<>-~][!-~]*" but restricted the following additional characters "<>[]:"

  2. DEL is a valid ID String (which causes issues for SVs such as "A[<DEL>[" )

    Recommended solution: contigs names MUST NOT use a reserved symbolic alternate allele name

The recommended solution to (1) was applied (and later relaxed to allow colons). It says that angle-bracketed strings may not be used as contig names, and is reflected in the current SAM and VCF specs by the rather long RNAME regex — which forbids angle brackets anywhere in contig names.

The recommended solution to (2) was applied as the “must not use a reserved symbolic allele name” text you mention. IMHO it's not clear from the VCF text just what this is intended to forbid: a “reserved symbolic allele name” is surely an angle-bracketed thing, so this could be interpreted as forbidding <DEL> just like (1) does.

However the context of #89's (2) (as quoted above) suggests that this text really is intended to forbid DEL and the like, i.e., non-angle-bracketed strings that happen to have the same text as the string within the angle brackets of some reserved symbolic allele name.

That said, I don't understand what issue would be caused for breakend notation such as "A[<DEL>[" if an ordinary contig named DEL was also present in a VCF file!

So I think it would be helpful to revisit this and highlight just what issues, if any, would be caused for such SVs. Then we could either clarify or remove this latter text, as appropriate.
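
To make the two readings concrete, here is a rough Python sketch. It is an illustration under stated assumptions, not the spec's definition: the regex is the older SAM pattern as quoted from #89 (not the current RNAME regex), the extra forbidden characters are those proposed in (1) minus the colon (which was later allowed), and RESERVED_SYMBOLIC is an illustrative subset of reserved symbolic allele names. All function names are hypothetical.

```python
import re

# Older SAM reference-name pattern, as quoted from #89 (assumption: this is
# not the current, longer RNAME regex).
OLD_SAM_RNAME = re.compile(r"[!-)+-<>-~][!-~]*")

# Characters that #89 (1) proposed to forbid in addition; the colon is left
# out because that restriction was later relaxed.
EXTRA_FORBIDDEN = set("<>[]")

# Illustrative subset of reserved symbolic allele names (assumption).
RESERVED_SYMBOLIC = {"DEL", "INS", "DUP", "INV", "CNV"}


def matches_old_sam(name: str) -> bool:
    """True if the whole name matches the older SAM pattern."""
    return OLD_SAM_RNAME.fullmatch(name) is not None


def passes_rule_1(name: str) -> bool:
    """Solution (1): older SAM pattern minus the extra forbidden characters."""
    return matches_old_sam(name) and not (set(name) & EXTRA_FORBIDDEN)


def passes_rule_2_bracketed(name: str) -> bool:
    """Reading A of (2): only the angle-bracketed form, e.g. <DEL>, is forbidden."""
    is_bracketed = name.startswith("<") and name.endswith(">")
    return not (is_bracketed and name[1:-1] in RESERVED_SYMBOLIC)


def passes_rule_2_bare(name: str) -> bool:
    """Reading B of (2): the bare text, e.g. DEL, is forbidden."""
    return name not in RESERVED_SYMBOLIC


if __name__ == "__main__":
    for name in ("chr1", "DEL", "<DEL>"):
        print(
            name,
            matches_old_sam(name),
            passes_rule_1(name),
            passes_rule_2_bracketed(name),
            passes_rule_2_bare(name),
        )
```

Under reading A, rule (2) rejects nothing that rule (1) does not already reject, since angle brackets are forbidden anyway. Only reading B adds a new restriction, and that restriction is what makes VCF contig names diverge from SAM reference sequence names.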
