
add reference blocksize and checkpointing to VCF #435

Closed
wants to merge 3 commits

Conversation

yfarjoun
Contributor

This is the reference-block part of #420

@hts-specs-bot

Changed PDFs as of bd61dee: VCFv4.3 (diff).

@hts-specs-bot

Changed PDFs as of fd669b4: VCFv4.3 (diff).

@mlin
Member

mlin commented Jul 27, 2019

@yfarjoun thanks for making the split!
@lbergelson here's my recommendation in a more straightforward form:

In those infrequent cases where a reference block ends and no downstream reference block (or variant) abuts it, leave an explicit indication of "no data" on the next line, rather than having RBS subtly imply it.

The current proposal might write,

POS  FORMAT     Alice
100  GT:DP:RBS  0/0:30:250
200  GT:DP:RBS  .
300  GT:DP:RBS  .
400  GT:DP:RBS  .
500  GT:DP:RBS  .
600  GT:DP:RBS  .

indicating Alice is confidently 0/0 at positions 200 and 300, with no information at the remaining sites. I'd have us write instead,

POS  FORMAT  Alice
100  GT:DP   0/0:30
200  GT:DP   .
300  GT:DP   .
400  GT:DP   ./.:0
500  GT:DP   .
600  GT:DP   .

Reading this is simpler and less error-prone: to interpret an empty cell, look up to the last filled-in cell; no position check is needed. Usually, another reference block (or variant site) picks up where the last one left off, so these "no data" markers would be rare and wouldn't materially inflate the representation. Actually, I think it'd be smaller without the RBS values, but we needn't forbid them, of course.
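The "look up the column to the last filled-in cell" rule described here can be sketched in a few lines (Python; `resolve_sparse_column` is a hypothetical helper name, not part of any spec or tool):

```python
def resolve_sparse_column(cells):
    """Resolve '.' cells by carrying the last explicit entry forward.

    `cells` holds the per-position entries for one sample column, in
    positional order. A lone '.' means "same as the cell above"; an
    explicit entry such as './.:0' marks the point where the previous
    reference block stops applying.
    """
    resolved = []
    last = None
    for cell in cells:
        if cell == ".":
            # Look up the column to the last filled-in cell.
            resolved.append(last if last is not None else ".")
        else:
            resolved.append(cell)
            last = cell
    return resolved
```

Applied to the second table above, positions 200 and 300 resolve to `0/0:30` and positions 500 and 600 to `./.:0` (no data), with no RBS lookup involved.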


Here are two non-GVCF-merging use cases I think we shouldn't disenfranchise:

  1. Streaming through a sorted BAM file to genotype a given list of variant sites
  2. Converting an existing, non-sparse, multi-sample VCF to the sparse form.

RBS is less convenient for the first application because of the need to see well downstream of the current position being processed. The same is true of the second application, where the RBS information may not even be fully present in the input. I believe the suggestion above simplifies both of these cases, as well as the reader/decoder, without really hurting the GVCF merging use case.


In the case of GVCF merging, the RBS information could be useful for incremental/hierarchical merging, so I'm not saying it should be banished, merely not required in all spec-compliant implementations. As a nitpick, GVCF provides an END position for each reference block, so why switch to block size at this late stage?

@yfarjoun
Contributor Author

First, regarding the nit: END is an info field, not format, so this isn't a "change". In particular, RBS values will be much smaller numbers and will thus compress much better than END would, unless we define a special delta encoding for END; so instead I thought it would be better (space-wise) to use the delta directly.

Regarding your proposal, I don't understand how to interpret the '.' at position 600; can you elaborate? How do I know whether it's missing or a ref block?

@kevinjacobs-progenity

This PR looks like it is adding an interesting feature. However, nowhere is the term "reference block" clearly defined. RBS is defined, but only in terms of a reference block.

I'll speculate and try to puzzle this out from the text...

Downstream positions that are covered by the reference block should be missing (‘.’)

This raises a few questions since I'm not grasping the intent here:

  1. Why is it called a reference block if the spanning interval is to be treated as missing, as opposed to reference (with GT being a list of zeros to indicate ploidy)?
  2. So is the intent of FORMAT/RBS to be a mechanism to specify gVCF-like reference assertions in a multi-sample-friendly manner by providing a per-sample mechanism to specify a reference interval? Or am I digging in the wrong spot?

Missing genotypes (‘.’) that are not covered by a reference block are to be interpreted
as truly missing.

Apologies again for being dense, but what is the difference between "missing" and "truly missing"?

@mlin
Member

mlin commented Jul 27, 2019

@yfarjoun position 600 adopts the contents of the first nonempty cell found above it, which is the ./.:0 at position 400. By leaving this marker at position 400, saying "the previous reference block no longer applies", we've obviated the need for readers to look at RBS values, and therefore for writers to generate them if inconvenient. They can still be present as extra information from GVCF pipelines, of course. Does this make sense?

This is just how spVCF uses the double-quote marker (@kevinjacobs-progenity that link may help provide historical context, albeit slightly different as we're now reconciling parallel efforts). I like . instead of " to the extent it minimizes the format syntax delta we're introducing. (" is cute though as it reflects everyday usage to fill out a form sparsely)

@kevinjacobs-progenity

@mlin: Thanks for the background reading. I see the intent now. I also understand the aesthetic appeal of using '.' as a minimal indicator of sparse content that doesn't require adding any new encoding methods. I still think it is a bad choice, because '.' already means nocall and 0 means reference in VCF. Would it be a terrible compromise to require a true reference assertion with proper ploidy, i.e. 0, 0/0, ..., ., ./., ..., etc.? Uncompressed sparse files would be very slightly larger, but the overhead would be negligible when compressed. Moreover, many more existing tools will sensibly interpret such records, because they make a coherent assertion according to current VCF semantics. Thus, sparse encoding would be a semantic convention that doesn't require explicit support in the specification.

Here is what your example above would look like based on my suggestion:

POS  FORMAT  Alice
100  GT:DP  0/0:30
200  GT:DP  0/0
300  GT:DP  0/0
400  GT:DP  ./.:0
500  GT:DP  ./.
600  GT:DP  ./.

I don't have a strong opinion on whether the explicit RBS tag is sufficiently valuable to include. It should be easy enough to implement a coherent random-access scheme either way.
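The ploidy-aware encoding suggested above can be generated mechanically. A minimal sketch (Python; `missing_gt` and `ref_gt` are hypothetical helper names introduced only for illustration):

```python
def missing_gt(ploidy):
    """Ploidy-aware missing genotype: '.' for haploid, './.' for
    diploid, '././.' for triploid, and so on."""
    return "/".join(["."] * ploidy)

def ref_gt(ploidy):
    """Ploidy-aware reference assertion: '0' for haploid, '0/0' for
    diploid, etc."""
    return "/".join(["0"] * ploidy)
```

Under this suggestion, a writer continuing a diploid reference block would emit `ref_gt(2)` (`0/0`), and would emit `missing_gt(2)` (`./.`) where there is genuinely no data, so each record makes a coherent assertion on its own.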

@mlin
Member

mlin commented Aug 2, 2019

The compatibility opportunity of @kevinjacobs-progenity's idea is difficult to dispute!

Uncompressed sparse files would be very slightly larger, but the overhead would be negligible when compressed.

The compressed overhead would be slightly more than 'negligible', I think, because of the relatively increased entropy across the row (flipping between ./. and 0/0 GT). However, I do think the low-hanging fruit would remain picked.

@yfarjoun
Contributor Author

yfarjoun commented Aug 12, 2019

I'm not convinced.

  1. @mlin: I don't understand. I don't see a huge difference between . and ./. except that ./. seems to be trying to say something about ploidy; but in the presence of RBS, I think that can be overcome. Without RBS, you are left to the whims of the implementation, which may put . or ./. when there's no data. The RBS field forces an otherwise ambiguous call to mean something definite.
  2. I think that using 0/0 is a really bad idea, since it can be interpreted as a confident reference call, whereas the confidence of a reference block can be low or high. I would be concerned that naive readers of 0/0 would be unaware of the RBS and assume a confident reference call.
  3. A missing genotype field is independent of the actual values in the FORMAT field. Requiring a 0/0 means that there has to be a GT field, creating even more dependency between the records.

EDIT: Having thought about it more, I think that using 0/0 has enough upsides that it should be used. Thanks @kevinjacobs-progenity for the suggestion and patience.

EDIT of the EDIT: the 0/0 suggestion causes problems in the implementation, since it requires a GT FORMAT field, but it isn't clear whether we will end up having a GT or an LGT (Local GenoType), and thus this will create a mess in terms of dependencies between records.

@yfarjoun
Contributor Author

yfarjoun commented Aug 12, 2019

  • add discussion about end tag for this suggestion

@yfarjoun
Contributor Author

  • add an optional flag for the end of a reference block.
  • add option to have unknown RBS value

@yfarjoun
Contributor Author

yfarjoun commented Aug 12, 2019

  • Add paragraph describing a reference block.

@yfarjoun
Contributor Author

yfarjoun commented Aug 12, 2019

  • pseudo-code for parsing out the reference-block information

@mlin
Member

mlin commented Aug 12, 2019

@lbergelson pressed me on a good corner case of a haploid, GT-only VCF: then it's unclear how to write the end-of-reference-block marker. Maybe it needs a dedicated FORMAT field?

@mlin
Member

mlin commented Sep 8, 2019

To explicate the last idea that arose from the discussion with @lbergelson last month,

The 'reference block no longer applies here' marker could be a FORMAT flag, perhaps "EOR" for "end of run":

POS  FORMAT      Alice
100  GT:DP:EOR   0/0:30
200  GT:DP:EOR   .
300  GT:DP:EOR   .
400  GT:DP:EOR   .:.:1
500  GT:DP:EOR   .
600  GT:DP:EOR   .

The interpretation of a . entry is to look up the column to the previous non-missing entry, or if we see an EOR flag first, then we really have no data. The EOR markers would appear only rarely, because usually a variant record or downstream reference block picks up just where the last one leaves off.

And let me reiterate that I think it's a very good idea for GVCF pipelines to include the RBS or equivalent information; but with the above, I suggest it shouldn't otherwise be required to write or read it. The lively discussion on #437 highlights a bit of awkwardness in standardizing a form of combined GVCF, when the basic GVCF way of supplying reference identity isn't really standardized itself.
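The EOR interpretation sketched in the table can be written out explicitly. A minimal sketch (Python; `resolve_with_eor` is a hypothetical helper; it assumes EOR is the third FORMAT field, that trailing FORMAT fields may be omitted as usual in VCF, and uses None to mean "truly no data"):

```python
def resolve_with_eor(entries):
    """Interpret '.' cells under the proposed EOR (end-of-run) flag.

    Each entry is a GT:DP:EOR value string, e.g. '0/0:30' (EOR field
    omitted) or '.:.:1' (an explicit end-of-run marker). A lone '.'
    repeats the entry above it, unless an EOR marker was seen more
    recently, in which case there is truly no data (None).
    """
    resolved = []
    carried = None  # last entry that still applies; None = no data
    for entry in entries:
        if entry == ".":
            resolved.append(carried)
        else:
            fields = entry.split(":")
            if len(fields) >= 3 and fields[2] == "1":
                # EOR set: the previous reference block ends here.
                carried = None
                resolved.append(None)
            else:
                carried = entry
                resolved.append(entry)
    return resolved
```

For the table above, positions 200 and 300 resolve to `0/0:30`, while 400-600 resolve to "no data", without any reader ever consulting RBS.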

@yfarjoun
Contributor Author

yfarjoun commented Sep 9, 2019

Note: filtering out the EOR will make the reference "block" longer.

@d-cameron
Contributor

As a general principle, as a specification we should err on the side of data loss rather than data corruption. In this particular instance, EOR will result in data corruption when the VCF is filtered; thus, I strongly prefer the representation that encodes the start and end in the same record.

@mlin
Member

mlin commented Sep 10, 2019

To illustrate how RBS overfits GVCF pipelines, I'd like to focus on the use case of making a sparse version of an existing project VCF file, such as this one:

http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20190425_NYGC_GATK/CCDG_13607_B01_GRM_WGS_2019-02-19_chr22.recalibrated_variants.vcf.gz

Given this input, we do not know the GVCF reference blocks and it will be awkward to write RBS without implications that weren't present originally. That's because the project VCF does not, at face value, supply the reference coverage "in between" its sites.

What we do have to work with are repetitive runs of reference-homozygous calls down the column for each sample. I've approached our current juncture from the perspective of succinctly encoding these runs as they appear in the text of an existing multi-sample VCF file. I believe RBS approaches from a different perspective, merging a multitude of GVCF files into a new kind of "combined" GVCF. My proposed encoding is quite awkward when viewed from the latter perspective, and the reverse is also true.

(Note: the degree of repetition in the public file with "only" N=2.5K samples is much less than found in the N>100K datasets many of us work with behind the scenes. Nonetheless, we can find many examples of entries exactly equal to the one above, including all FORMAT values. Furthermore, we can ask which FORMAT value fluctuations are worth keeping at huge scale.)

The distinction is further illustrated by the action item stated above to "Add paragraph describing reference block," needed because that concept does not exist in the VCF spec (though the defined fields accommodate it). In this PR we'd essentially standardize (the main concept of) GVCF and a new kind of combined GVCF — with the curiosity that one de facto uses END and the other uses RBS.

I think my suggested perspective carries less baggage, as it's just based on repetition found plainly in the existing multi-sample VCF format, such as the above-linked file. But I'm prepared to agree to disagree on which perspective the group should move ahead with, so long as whoever calls the play has thoughtfully considered both. And if we're to proceed with standardizing GVCF concepts, then I'd think that would involve a wider interest group.


@d-cameron It's true that filtering the run-encoded form is trickier. It's not surprising that deleting stuff from a run-encoded format, without trying to fix up the encoding, leads to corruption. This would be a fine reason to strongly prefer a length encoding all other things being equal — but I hope the above clarifies that there are other key differences right now, especially the concept of "reference block." In fact, from my perspective we could look at ways to use a length encoding for "next L lines in this file" as opposed to "next D bp of reference genome."

@hts-specs-bot

Changed PDFs as of c1f092c: VCFv4.3 (diff).

@@ -410,8 +421,8 @@ \subsubsection{Genotype fields}
PL & G & Integer & Phred-scaled genotype likelihoods rounded to the closest integer \\
PQ & 1 & Integer & Phasing quality \\
PS & 1 & Integer & Phase set \\
RBS & 1 & Integer & Reference Block Size\\
Member


Suggested change
RBS & 1 & Integer & Reference Block Size\\
RBS & 1 & Integer & Reference block size\\

@@ -253,6 +253,17 @@ \subsubsection{Pedigree field format}
##pedigreeDB=URL
\end{verbatim}


\subsubsection{Reference block checkpoint}
Member


This should be placed after the final sentence (“See [PedigreeInDetail] for details”) of the preceding section.

Given the ability to interpret missing genotypes as either truly missing or as part of a reference block (see RBS in Genotype tags below), it can be useful to limit the genomic distance required to scan in order to find the "top" of the reference block.
To enable this, one may specify a Reference Block Checkpoint scheme:
\begin{verbatim}
##REFERENCE_BLOCK=<CHECKPOINT=1000>
Member


This is invalid as a structured meta-information line, because it does not have an ID=… subfield.

Are there plans to expand this in future with additional subfields within the <…>? If not, I'd suggest reworking this as a plain unstructured meta-information line, e.g.,

##REFERENCE_BLOCK_CHECKPOINT=1000

@d-cameron
Contributor

Notes for Future of VCF review meeting:

  • Does this do anything other than being a per-sample <*> encoding?
    • What (if anything) is stopping the auto-conversion from gVCF to RBS? Is RBS->gVCF lossless?
  • In gVCF, END is used for indexing to ensure we get all the relevant records for a given position (i.e. differentiate no-call from REF). We could do the same with RBS and have END implicitly calculated as POS+max(RBS)
    • We'd need to require no RBS record be <*> for the implicit END but that's ok if we can auto-convert.
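The implicit-END idea in the notes above, taken literally, is a one-liner. A sketch (Python; `implicit_end` is a hypothetical helper; whether the end should be POS+RBS or POS+RBS-1 depends on how RBS is ultimately defined, and the comment's formula POS+max(RBS) is followed here as written):

```python
def implicit_end(pos, rbs_values):
    """Implicit END for indexing, per the note above: POS + max(RBS).

    `rbs_values` are the per-sample RBS integers present on one record;
    a record with no RBS values covers at least its own position, so
    the maximum defaults to 0.
    """
    return pos + max(rbs_values, default=0)
```

An indexer could use this value in place of an explicit INFO/END when deciding which records overlap a query interval.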

Given the ability to interpret missing genotypes as either truly missing or as part of a reference block (see RBS in Genotype tags below), it can be useful to limit the genomic distance required to scan in order to find the "top" of the reference block.
To enable this, one may specify a Reference Block Checkpoint scheme:
\begin{verbatim}
##REFERENCE_BLOCK=<CHECKPOINT=1000>
Contributor


Suggested change
##REFERENCE_BLOCK=<CHECKPOINT=1000>
##REFERENCE_BLOCK_CHECKPOINT=1000

just reifying this as a one-click GitHub suggestion.


For example (with CHROM, ID, REF, ALT, QUAL, FILTER, INFO fields/columns removed for brevity \& clarity):

\#\#REFERENCE\_BLOCK=\textless CHECKPOINT=1000\textgreater\\
Contributor


Suggested change
\#\#REFERENCE\_BLOCK=\textless CHECKPOINT=1000\textgreater\\
\#\#REFERENCE\_BLOCK\_CHECKPOINT=1000\\

See above comment by jmarshall

To disambiguate a `.' between being truly missing and part of a reference block, one would therefore need to "look up" and find the previous RBS FORMAT value in that sample.
In addition, any non-missing value (including `.:.' or `./.') would effectively break a reference block, and should be treated as a violation of the specification if RBS is specified, or an implicit end of the block if RBS is unknown.
When reading the file from top to bottom, an implementation can simply remember what the RBS is for each sample; however, when using the index to ``seek'' to a particular point of the reference, one may need to seek to an unknown location in the file.
To assist in seeking, the \verb!##REFERENCE_BLOCK! header line may define the \verb!CHECKPOINT! multiple at which a reference block will be included for all samples. In the presence of a checkpoint value, an implementation can read back from the last checkpoint and on and be assured that it will find a reference block that overlaps the current position, if it exists.
Contributor


Suggested change
To assist in seeking, the \verb!##REFERENCE_BLOCK! header line may define the \verb!CHECKPOINT! multiple at which a reference block will be included for all samples. In the presence of a checkpoint value, an implementation can read back from the last checkpoint and on and be assured that it will find a reference block that overlaps the current position, if it exists.
To assist in seeking, the \verb!##REFERENCE_BLOCK_CHECKPOINT! header line defines a multiple at which a reference block will be included for all samples. In the presence of a checkpoint value, an implementation can read back from the last checkpoint and on and be assured that it will find a reference block that overlaps the current position, if it exists.
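The checkpoint arithmetic implied by this scheme is simple. A sketch (Python; `checkpoint_scan_start` is an illustrative name; assumes a checkpoint reference-block record is emitted for all samples at each multiple of the configured value):

```python
def checkpoint_scan_start(query_pos, checkpoint):
    """Where a reader should start scanning when seeking to query_pos.

    With a checkpoint multiple of e.g. 1000, a reference block record
    is guaranteed at each multiple of 1000, so a reader never needs to
    look back further than the nearest checkpoint at or below the
    query position to find any reference block overlapping it.
    """
    return (query_pos // checkpoint) * checkpoint
```

This bounds the look-back distance to at most one checkpoint interval, regardless of how long the enclosing reference block actually is.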

@d-cameron
Contributor

Proposing #758 instead of this. Please comment as to whether #758 fulfills all the requirements/use cases that this PR was designed to address.
