Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

VCF 4.5 Future of VCF scaling improvements #758

Merged
merged 14 commits into from Apr 20, 2024
Merged

VCF 4.5 Future of VCF scaling improvements #758

merged 14 commits into from Apr 20, 2024

Conversation

d-cameron
Copy link
Contributor

Please review and provide feedback.

Logic behind going with FORMAT END:

  • Significant overlap between add reference blocksize and checkpointing to VCF #435 RBS PR and existing <*> functionality
  • Allowing per-sample <*> via FORMAT END is a minimal change to the specs.
  • END used instead of RBS for naming convention consistency.
    • Happy to change to LEN if someone produces data showing a non-trival difference in file size
  • ##REFERENCE_BLOCK_CHECKPOINT adds a new form of header parsing. We want to keep it simple
  • INFO END is already used by VCF indexing tools to find overlapping SVs. The checkpoint use case of determining whether a position is missing or REF is handled by an indexed lookup on that position. No additional new complexity required - the index handles that complexity!
    • Implementations are free to use checkpoint records (i.e. split up large ref blocks) to ensure no reference block is greater than a given size or even ##REFERENCE_BLOCK_CHECKPOINT style spliting every POS % N bases. The spec gets to stay agnostic on these details.

Copy link

Changed PDFs as of 4ebc5aa: VCFv4.5.draft (diff).

@d-cameron
Copy link
Contributor Author

d-cameron commented Feb 27, 2024

I'd prefer three new numbers (all of which are distinct from the ploidy):

LOCAL-A. One element for each allele present in LA minus one (to exclude the reference). For example: LEC.
LOCAL-R. One element for each allele present in LA (including the reference). For example: LAD.
LOCAL-G. One element for each possible genotype.

Options:

  1. Leave as .
  2. Have R A G take mean the local allele count if LAA is not missing.
  3. Use lowercase r, a, g for local allele Number=a vs Number=A

@danking thoughts?

(1) is the most backwards compatible, (2) has an elegant simplicity to it, (3) makes local alleles explicit and allows custom fields to be defined.

I personally favour (2) as non-local to local is a lossless transition but the consensus for LAD et al seem to be that having explicitly different fields was better because causeing errors/crashes in non-local-aware VCF consumers was preferable to the possibility of (probably silent) data misinterpretation.

(3) has the advantage of having a generalising principle for all R A G fields that we could encode in the specs: if you have a R/A/G field, the local allele equivalent has an L prefix and a lowercase Number= field. We could then change this PR a standardisation of this convention for all spec-defined fields and a recommendation for non-spec fields to also follow this convention. It would enable tooling to translate arbitrary VCFs into local-allele encoded versions.

VCFv4.5.draft.tex Outdated Show resolved Hide resolved
@jkbonfield jkbonfield added the vcf label Feb 27, 2024
Co-authored-by: John Marshall <jmarshall@hey.com>
Copy link

Changed PDFs as of 099fe19: VCFv4.5.draft (diff).

github-actions bot pushed a commit that referenced this pull request Feb 27, 2024
@jkbonfield
Copy link
Contributor

jkbonfield commented Feb 28, 2024

Following on from discussions on the call yesterday, here is the Hail description of SVCR:

https://hail.is/docs/0.2/vds/index.html#the-scalable-variant-call-representation-svcr

Specifically they have this example

Before:    Ref=G Alt=T GT=0/1 AD=5,6 PL=102,0,150
After:     LA=0,2 LGT=0/1 LAD=5,6 LPL=102,0,150

They're using LA instead of LAA as Locale Allele vs Local Alternate Allele. The key difference being it includes REF in the list, and LGT 0/1 is 0th element of LA and 1st element of LA. I believe this is in keeping with the changes you made here vs #434, which I believe most are in agreement with (given the inherent impossibilities of storing zero-length arrays in BCF).

However, as discussed, we have tools already implementing LAA using #434 semantics, and so we need a clear separation so people that get LAA in a file know what the data means. Adopting Hail's "LA" name seems like the ideal approach.

(I know you're already in agreement Daniel, but this is here for others who weren't on the call and as a formal place to note this proposed edit.)

Pinging @pd3 for comment as you're one of the existing implementers and can far better understand the nuances than I.

@d-cameron
Copy link
Contributor Author

Adopting Hail's "LA" name seems like the ideal approach.

Yes, given that it's now 0-based and we need to explicitly include REF, LA is a more appropriate name than LAA.

VCFv4.5.draft.tex Outdated Show resolved Hide resolved
VCFv4.5.draft.tex Outdated Show resolved Hide resolved
VCFv4.5.draft.tex Outdated Show resolved Hide resolved
\subsubsection{Multi-sample REF-only blocks}
When handling VCFs with multiple samples, the length of the $<$*$>$ reference blocks can differ.
To account for this, a sample-specific END can be specified via the FORMAT END field.
If any FORMAT END value exists, the INFO END must be present and equal the largest FORMAT END value.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I suppose this ensures that a single sample VCF using ref blocks is a valid GVCF?

I'm having a bit of trouble imagining any other reason to want to know the max of the ENDs (though the min does seem perhaps useful: that's the next place some ref block will start).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After reviewing everything a second time, may I propose this instead?

If the VCF contains exactly one sample column, the INFO END field must be present and equal to the FORMAT END value at every locus. Otherwise (zero or two or more samples), the INFO END must be missing.

The INFO END, as a single value, will always be misinterpreted by tools written for GVCF. They will either treat some reference genotypes as missing (if INFO END is the min over all samples) or treat some missing or variant genotypes as reference genotypes (if INFO END is the max over all samples). Given this, I prefer that INFO END is required to be missing unless there is exactly one sample.

Comment on lines 1753 to 1760
\begin{flushleft}
\begin{tabular}{ l l l l l l l l }
POS & REF & ALT & INFO & FORMAT & SampleA & SampleB \\
4370 & G & $<$*$>$ & END=4416 & LGT:LAA:END & 0/0:0,1:4388 & 0/0:0,1:4416 \\
4389 & T & TC & . & LGT:LAA:END & 0/1:0,1:. & . \\
4390 & C & $<$*$>$ & END=4416 & LGT:LAA:END & 0/0:0,1:4416 & . \\
\end{tabular}
\end{flushleft}
Copy link
Contributor

@danking danking Mar 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should include all the rows from above to make absolutely clear that this just adds
another sample.

\begin{flushleft}
\begin{tabular}{ l l l l l l l l }
POS & REF & ALT & INFO & FORMAT & SampleA & SampleB \\
4370 & G & $<$*$>$ & END=4416 & LGT:LAA:END & 0/0:0,1:4383 & 0/0:0,1:4416 \\
4384 & C & $<$*$>$ & END=4388 & LGT:LAA:END & 0/0:0,1:4388 & . \\
4389 & T & TC & . & LGT:LAA:END & 0/1:0,1:. & . \\
4390 & C & $<$*$>$ & END=4390 & LGT:LAA:END & 0/0:0,1:4390 & . \\
4391 & C & $<$*$>$ & END=4395 & LGT:LAA:END & 0/0:0,1:4395 & . \\
4396 & C & $<$*$>$ & . & LGT:LAA:END & 0/0:0,1:. & . \\
4397 & C & $<$*$>$ & END=4416 & LGT:LAA:END & 0/0:0,1:4416 & . \\
\end{tabular}
\end{flushleft}

Three more potentially more controversial changes:

  1. Use a different END for SampleB to make completely clear that SampleB's END has nothing to do
    SampleA's END.

  2. Do not use LGT or LAA. Two reasons: (a) make the multi-sample ref-block changes most prominent and (b) clarify that ref-blocking and local alleles are orthogonal compression techniques [1].

  3. Preserve the star allele on line 4389 to make clear that removing it is not necessary.

\begin{flushleft}
\begin{tabular}{ l l l l l l l l }
POS & REF & ALT & INFO & FORMAT & SampleA & SampleB \\
4370 & G & $<$*$>$ & END=4500 & GT:END & 0/0:4383 & 0/0:4500 \\
4384 & C & $<$*$>$ & END=4388 & GT:END & 0/0:4388 & . \\
4389 & T & TC,$<$*$>$ & . & GT:END & 0/1:. & . \\
4390 & C & $<$*$>$ & END=4390 & GT:END & 0/0:4390 & . \\
4391 & C & $<$*$>$ & END=4395 & GT:END & 0/0:4395 & . \\
4396 & C & $<$*$>$ & . & GT:END & 0/0:. & . \\
4397 & C & $<$*$>$ & END=4416 & GT:END & 0/0:4416 & . \\
\end{tabular}
\end{flushleft}
Screenshot 2024-03-04 at 3 45 21 PM

[1] It's true that O(samples) scaling is only achieved with both local alleles and ref blocking;
however, they're independent ideas and, I think, easier to first understand in isolation.

@danking
Copy link
Contributor

danking commented Mar 4, 2024

Numbers

I'd prefer three new numbers (all of which are distinct from the ploidy):
LOCAL-A. One element for each allele present in LA minus one (to exclude the reference). For example: LEC.
LOCAL-R. One element for each allele present in LA (including the reference). For example: LAD.
LOCAL-G. One element for each possible genotype.

Options:

  1. Leave as .
  2. Have R A G take mean the local allele count if LAA is not missing.
  3. Use lowercase r, a, g for local allele Number=a vs Number=A

@danking thoughts?

(1) is the most backwards compatible, (2) has an elegant simplicity to it, (3) makes local alleles explicit and allows custom fields to be defined.

I personally favour (2) as non-local to local is a lossless transition but the consensus for LAD et al seem to be that having explicitly different fields was better because causeing errors/crashes in non-local-aware VCF consumers was preferable to the possibility of (probably silent) data misinterpretation.

(3) has the advantage of having a generalising principle for all R A G fields that we could encode in the specs: if you have a R/A/G field, the local allele equivalent has an L prefix and a lowercase Number= field. We could then change this PR a standardisation of this convention for all spec-defined fields and a recommendation for non-spec fields to also follow this convention. It would enable tooling to translate arbitrary VCFs into local-allele encoded versions.

(1) is fine but feels like a cop out for a standard. I oppose (2) for the reason you mentioned (I prefer errors/crashes to bad data). (3) feels somehow too clever, but I prefer it to (1) and (2). I prefer the verbose "LOCAL-A", etc. to (3), but I would grumpily implement (3) and then never think about it again.

LAA vs LA

Yes, given that it's now 0-based and we need to explicitly include REF, LA is a more appropriate name than LAA.

Agreed.

LEN vs END

I only have anecdotes from Tim Poterba, not hard data, but: we saw ~10% reduction in size of the Hail VDS when we switched from END to LEN. I assume all the benefit comes from feeding Zstd a lower entropy bytestream.

I'd leave LEN vs END as a possible future enhancement. A VCF with local alleles and ref blocks scales linearly in samples with or without LEN.

Etc.

No objections to your other choices. Checkpointing or an END-aware index is key for analysis but not for interchange. I view VCF as primarily an interchange format.


cc'ing @chrisvittal and @patrick-schultz as I'm no longer Broad/Hail affiliated.

FWIW, I think the SVCR paper is a little more detailed than the Hail docs currently. https://www.biorxiv.org/content/10.1101/2024.01.09.574205v1.full.pdf

VCFv4.5.draft.tex Outdated Show resolved Hide resolved
@d-cameron
Copy link
Contributor Author

Checkpointing or an END-aware index is key for analysis but not for interchange.

The reason for defining INFO END as I did was to avoid breaking VCF indexing. VCF indexing already uses INFO END to define the end of SV and gVCF records so reusing this means that checkpointing is unnecessary as the information needed for determining maximal record overlap is already contained within the index.

The INFO END, as a single value, will always be misinterpreted by tools written for GVCF.

If we're erring on the side of crashing instead of data corruption then deprecating INFO END for gVCF would be the way to go. Unfortunately, taking that approach will silently break VCF indexing (since it relies on INFO END). There's no good option here as keeping INFO END results in silent gVCF misinterpretation and dropping it results in silent record loss on indexed lookups.

I would grumpily implement (3) and then never think about it again.

How much value is there in being able to define caller-specific local allele fields? (3) generalises local alleles not only for scaling, but it allows lossless merging. What do implementation do with the GL when merging two VCF with a different set of ALT alleles? From what I can tell, the only viable current option is to drop it since the merging tool doesn't have the information/model that the caller has.

we saw ~10% reduction in size of the Hail VDS when we switched from END to LEN

That seems significant enough to prefer LEN. We've already made SVLEN preferable to END for v4.4 SVs so doing it for symbolic star alleles for 4.5 is consistent with that.

On a related note, what is everyone's thoughts about recommending tools writing VCFv4.5 use local allele notation for all VCFs? Useful? Unnecessary?

@danking
Copy link
Contributor

danking commented Mar 5, 2024

The reason for defining INFO END as I did was to avoid breaking VCF indexing. [...]

If we're erring on the side of crashing instead of data corruption then deprecating INFO END for gVCF would be the way to go. [..]

Mmm. Good points. Defining INFO END as the max seems fine then.

How much value is there in being able to define caller-specific local allele fields? (3) generalises local alleles not only for scaling, but it allows lossless merging. What do implementation do with the GL when merging two VCF with a different set of ALT alleles? From what I can tell, the only viable current option is to drop it since the merging tool doesn't have the information/model that the caller has.

I don't have experience with variant callers so I can't comment on the first question.

Maybe I'm misreading but I think any of the three options for encoding local allele indexed fields facilitates lossless merging.

I agree that (global allele indexed) GL, PL, and AD are hard to sensibly define when merging two VCFs. In particular, I'm unclear on when DP does not equal sum(AD) and the implications thereof when merging two VCF rows with global allele indexed fields.

we saw ~10% reduction in size of the Hail VDS when we switched from END to LEN

That seems significant enough to prefer LEN. We've already made SVLEN preferable to END for v4.4 SVs so doing it for symbolic star alleles for 4.5 is consistent with that.

Would we change the GVCF section as well? So we'd have INFO LEN, FORMAT LEN, and (no longer preferred? deprecated?) INFO END?

On a related note, what is everyone's thoughts about recommending tools writing VCFv4.5 use local allele notation for all VCFs? Useful? Unnecessary?

Soft against. It's more complex and not necessary for plenty of useful cases. For example, seems silly to local allele index fields in a VCF that has been fully split into biallelic variants.

@jkbonfield
Copy link
Contributor

jkbonfield commented Mar 5, 2024

I agree that (global allele indexed) GL, PL, and AD are hard to sensibly define when merging two VCFs. In particular, I'm unclear on when DP does not equal sum(AD) and the implications thereof when merging two VCF rows with global allele indexed fields.

I don't have enough experience either, particularly with multi-sample calling and gVCF, but just as some background I do understand why AD and DP can be very different as I recently fixed over counting of AD in bcftools. Imagine this scenario:

REF   GATCGATATATATGTCGTC                                                       
                                                                                
SEQs  GATCGATA                                                                  
      GATCGATATATA                                                              
      GATCGATATATATG                                                            
      GATCGATATATATGTC                                                          
      GATCGATATATATGTCGTC                                                       
       ATCGA--TATATGTCGTC                                                       
        TCGA--TATATGTCGTC                                                       
          GATATATATGTCGTC                                                       
              TATATGTCGTC                                                       
            ^                                                                   
            2bp deletion                                                        

The depth at that point is 8 sequences. It's an AT STR, with 6 of them spanning it so the copy number has been confirmed. Of those we have 4 with 4 copies and 2 with 3 copies, so AD 4,2 and DP 8. The missing two have been aligned against the reference and exhibit "reference bias" as they naively confirm REF but there is no evidence of that confirmation being true and they shouldn't be added to the REF AD score. (There is an option however to distribute the uncontributed alignments to the observered REF/ALT in proportion to their observed frequencies, in order to boost AD, but it's not enabled by default as it can bring other issues and potentially over confidence in results.)

How you marry this together when merging I've no idea. Sorry! :)

Edit: you could make an argument for DP=6, but that's misleading as we do have 8 alignments spanning this variant and if we were attempting to do depth analysis it's wrong to down-play DP this way as that may lead to other incorrect copy-number assessments.

@d-cameron
Copy link
Contributor Author

Maybe I'm misreading but I think any of the three options for encoding local allele indexed fields facilitates lossless merging.

A VCF with custom R/A/G FORMAT fields cannot be losslessly converted to local allele notation unless there is the ability to translate arbitrary R/A/G fields to an equivalent r/a/g version. (3) achieves this by generalisating the local allele convention to all fields - even those not reserved by the specs.

@d-cameron
Copy link
Contributor Author

d-cameron commented Mar 12, 2024

Would we change the GVCF section as well? So we'd have INFO LEN, FORMAT LEN, and (no longer preferred? deprecated?) INFO END?

FORMAT LEN and retain INFO END as a purely computed field to prevent breaking VCF indexing. We already have INFO SVLEN causing recalculation of INFO END so this simplifies the spec as INFO END becomes an entirely computed field that a validator can verify than INFO END=POS + max(FORMAT LEN & INFO SVLEN).

Similar to the SV changes in v4.4, just having INFO END would be deprecated but a 4.5 compliant implementation should infer the value of a single missing FORMAT LEN from INFO END.

@pd3
Copy link
Member

pd3 commented Apr 4, 2024

Hi, sorry I am a bit late to this thread, or maybe I was on it too early, depending on how you look at it

The ideas proposed in the original pull request (#434) in July 2019 were implemented in bcftools in May 2020 (samtools/bcftools@e645749). Currently there are big files out there that follow the original proposal, see for example UK Biobank which generated VCFs with 490,541 samples using DRAGEN. (Note that DRAGEN implementation has some bugs, please contact me directly if you are interested to learn more.)

I have several comments related to the current proposal:

  1. The addition of LGT argued in this comment Define Local Alleles in VCF to allow for sparser format #434 (comment) is not a good idea. As pointed out by Define Local Alleles in VCF to allow for sparser format #434 (comment), it doesn't save any space and just adds the indirection of parsing through LAA.

    The motivation given in that thread cites the need to subset samples and remove unused ALT alleles. This would lead to recoding of local alleles for all samples. However, that argument is not quite right:

    First, localized alleles were introduced to reduce space taken by the quadratic Number=G format fields, to a lesser degree by the Number=A,R tags. Removing a few extra unused alleles from the ALT column will never be an issue - why not just leave them, or recode only rarely after many alleles were removed??

    Second, and more importantly, the LAA indices refer to alleles in the ALT column. Therefore if any change is made to ALT, the LAA tag must be changed as well. Unless I grossly misunderstood the example, using LGT only helps with not having to recode LGT, but LAA itself needs to be recoded anyway. Timewise there will be no measurable benefit, the complexity of the operation will remain at O(N_SAMPLES_KEPT), and LGT adds no benefit.

    Third problem, and perhaps yet even more important, all existing programs relying on GT will stop functioning on such VCFs.

    I don't find the addition of LGT into the specification sufficiently justified.

  2. In the original draft, and in several existing implementations of the draft (bcftools, DRAGEN, maybe others), the LAA tag includes only alternate alleles and does not include the REF allele. This pull request proposes a new tag LA which includes the REF allele. This has already resulted in two competing implementations and their use in large datasets.

    I don't understand the motivation for introducing this split. I was told that it was argued offline that it had something to do with the missing allele, but I was not able to understand why. With LAA, a REF-only site can be encoded as e.g.

GT=0/0      LAA=.   LAD=20      LPL=0

Missing genotype can be encoded simply as

GT=./.      LAA=.   LAD=.      LPL=.

What is the benefit of introducing LA other than using more space and adding uninformative "0"?? The localized alleles are meant only to narrow the context of the alleles so that Number=G,R,A tags can be expressed in a more parsimonious way. They were not intended to make statements about sequencing or biological observations made for a sample.

@danking
Copy link
Contributor

danking commented Apr 4, 2024

Hi @pd3 ! I try to explain the decisions below but see also my final paragraph.

I agree with all your comments on 1 and, in fact, this was our perspective when we first built tools using local alleles and ref blocks. The issue is not of complexity or speed; it’s of practical user experience. IIRC (we should just ask them) gnomAD team had a mistake arise from neglecting to update a GT field when an allele was filtered. They reasonably asked: why is GT special? There’s less cognitive overhead for them as method developers if all the fields use local indices.

AFAIU, this change just allows users who prefer to work with LGT to do so. I suppose the practical concern is that people will start generating LGT containing VCFs so tools will need to support that?

IIRC, on your second issue, folks wanted to distinguish “only the reference allele is local” from “LAA is missing”. I suggested LA as a way to appease this concern.

All that said, the only thing I seek (and I think Hail team, with which I’m no longer affiliated, also seeks) is an interchange format that scales linearly in samples. Sparse columns (ref blocks) and sparse allele-indexed arrays (local alleles) achieve this. Whatever details make it into the standard we’ll just deal with on import/export.

@danking
Copy link
Contributor

danking commented Apr 4, 2024

Minor nit:

This pull request proposes a new tag LA which includes the REF allele. This has already resulted in two competing implementations and their use in large datasets.

Hail always used LA, so it’s not really this PR that created the problem.

@jkbonfield
Copy link
Contributor

Minor nit:

This pull request proposes a new tag LA which includes the REF allele. This has already resulted in two competing implementations and their use in large datasets.

Hail always used LA, so it’s not really this PR that created the problem.

Agreed. We were chatting about this a few days ago.

Basically there are already two competing implementations: LA and LAA. Both have existed for a considerable length of time. If I had to weigh up the usage I'd say it's Hail and it's users (Broad and AllOfUs being the biggest maybe) vs Illumina DRAGEN and Bcftools with BioBank being the biggest user there. There are many more users of each ecosystem too, including probably very large ones I'm not aware of. So making a decision purely on usage doesn't really help, and it needs to be evaluated on complexity and fitting to the data modelling.

My initial thought was that we needed LA because we can't have an empty (non .) LAA in BCF (as was used in an example in the original PR). We concluded that just broke things and we didn't like the disparity between unknown (.) vs absent. However Petr does have a valid point that dot is already used in other scenarios to mean not-called rather than simply unknown.

I do have confusion over LGT vs GT though. If we're using LAA instead of LA, then LGT doesn't make sense because a GT 0/1 would refer to what in LGT? ./0 (0 being 1st alt in LAA)? Would tools even accept that? It's not clear to me how it's meant to work with REF/ALT heterozygous calls if REF is absent. GT I assume is still going on the main REF and ALT columns (ie totally unchanged) so it's simply some slightly larger numbers. That's still a loss in compressability (more so than LAA to LA overhead with a good backend compression codec), however in the overall scheme of themes both are probably an irrelevance for sizing as it's the LPL which is making the huge difference. I also note that the 434 example had LAA:LGT:LAD:LPL& :0/0:30:0 which means no sense at all. An empty LAA means only REF is permitted, but then we have LGT referring to the first (unspecified) ALT allele. I assume the intention was LGT had an explicit 0 for REF, but then it's offset from LAA. Muddled thinking (theirs or mine?).

Comment on lines 487 to 490
LAA & . & Integer & Strictly increasing indices into REF and ALT, indicating which alternate alleles are relevant (local) for the current sample \\
LAD & . & Integer & Read depth for each of the local alternate alleles listed in LAA \\
LGT & . & String & Genotype against the local alleles \\
LPL & . & Integer & Phred-scaled genotype likelihoods rounded to the closest integer for genotypes that involve the local alternative alleles listed in LAA \\
Copy link
Contributor

@jkbonfield jkbonfield Apr 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are these all number .?

We had explicit dimensions before: A, R, G, 1 (GT), etc. We now have local equivalents for these, but it feels like there should be some better language to explain that rather than just doing a shrug.

Particularly LGT which surely is the same dimension as GT: Type=String, Number=1. (Although personally I think it was a mistake to shoehorn in multiple data types to the one GT field with an entirely new set of language syntax for one field only)

If we don't want to define them as terms in the grammar, then at least explicitly specify the dimensions within the text. LPL gives no clue at all as to how many values there are. I understand what's going on, but only because I already understand what's happening...

LAA is the strictly increasing index into REF and ALT, pointing out the alleles that are actually in-play for that sample.
0 indicates the REF allele and should always be included with the subsequent values being 1-based indexes into ALT.
LAD is the depth of the local alleles,
LPL is subset of the PL array that pertains to the alleles that are referred to by LAA,
Copy link
Contributor

@jkbonfield jkbonfield Apr 4, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly here. Eg in a classic world we may have:

GT:PL:AD        0/1:187,0,28:2,7

In local world, I'm struggling and the above explanations seem to be stating something wrong. With LAA 0 (one ALT), LPL would refer to both REF and ALT#0, not ALT#0 only. Hence "pertains to the alleles that are referred to by LAA" should be "pertains to the alleles that are referred to by LAA and REF".

I assume similarly true for LAD which claims to be "depth of the local alleles". Isn't it "depth of REF and the local alleles"?

Basically, the only way I can see this working is if all the local versions are REF + whatever is listed in LA and we drop LAA. So I would argue that perhaps it is the right thing to switch to LA instead as then all of the above statements start becoming true once more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The text of this proposal is incorrect. It says LAA but describes LA instead.

LAA is a list of increasing, 1-based indices into ALT alleles. The original pull request states this correctly #434.

Copy link
Contributor

@jkbonfield jkbonfield Apr 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah I see. Thanks. That explains my confusion over numbering then. I hadn't realised it had changed LAA this way. It's a bad idea. EIther we should keep LAA as it was or switch to LA, but using LA semantics in LAA will just break a whole bunch of existing software and make it extremely tricky to parse existing files.

@pd3
Copy link
Member

pd3 commented Apr 8, 2024

I do have confusion over LGT vs GT though. If we're using LAA instead of LA, then LGT doesn't make sense because a GT 0/1 would refer to what in LGT? ./0 (0 being 1st alt in LAA)?

The definition of LGT in the context of LAA is consistent. LAA lists locally relevant alternate alleles, their indexing is 1-based, the same as in GT. Heterozygous genotypes would be expressed as LAA=1 GT=0/1 LGT=0/1.

Note that despite defending its logic here, I am strongly against codifying LGT for the reasons I explained above (#758 (comment)).

@jkbonfield
Copy link
Contributor

I do have confusion over LGT vs GT though. If we're using LAA instead of LA, then LGT doesn't make sense because a GT 0/1 would refer to what in LGT? ./0 (0 being 1st alt in LAA)?

The definition of LGT in the context of LAA is consistent. LAA lists locally relevant alternate alleles, their indexing is 1-based, the same as in GT. Heterozygous genotypes would be expressed as LAA=1 GT=0/1 LGT=0/1.

Agreed on there being little benefit to LGT over GT. It's smaller, but the big win is LPL and everything else is icing on the cake.

Interesting different interpretation to existing GT being 1 based. I'd sort of mentally just pictured a zero based list of REF & ALTs all squished together. It amounts to the same thing, but breaks my mental model when we start mixing LA vs LAA together, and the "local" here means "ref+local" which isn't an obvious first thought. Obvious in hindsight clearly!

@a9ill
Copy link

a9ill commented Apr 11, 2024

Apologies for being late to join this discussion. I am developer at Illumina working on DRAGEN gVCF Genotyper. Incorporating a description of a local representation into the standard is important and the discussion here is worthwhile.

As has been pointed out (#758 (comment)), the proposed text seems to confuse LA (local alleles - including the reference) and LAA (local alternative alleles - excluding the reference) in both the definitions and examples.

My preference is for LAA as it is the most parsimonious, since the “0” in LA does not convey information (as pointed out #758 (comment)). LAA should only be empty for reference homozygous genotypes without other alleles present and should only be missing for missing genotypes. As these can be distinguished through examination of GT, I don't understand the utility of the ability to distinguish between them based on LA alone. Please correct my understanding if I have missed something.

It might be of interest that the recently released UK Biobank WGS data (https://www.ukbiobank.ac.uk/learn-more-about-uk-biobank/news/world-s-largest-genetic-project-opens-the-door-to-new-era-for-treatments-and-cures-uk-biobank-s-major-milestone) was joint called using DRAGEN and LAA was used to specify the local alleles.

For merging and splitting local representations it will be necessary to specify the expected number of elements in local array shaped fields. Using “.” will not achieve this. Using “R”, “A” and “G” and taking them as local if LAA/LA is present would preclude the mixing of local and non-local fields. I agree with (#758 (comment)) that using lower case seems too clever. My preference would be to use “LOCAL-A”, “LOCAL-R” and “LOCAL-G” or more succinctly “LA”, “LR” and “LG”.

@danking
Copy link
Contributor

danking commented Apr 11, 2024

Regarding the desire to distinguish “missing LAA” from “only the reference is present”, I did not personally raise this concern so I’m trying to interpret others’ concerns. That said, I think the use of two distinct sets of numbers (LA/LOCAL-A/a versus A, etc.) mitigates their concerns.

Consider a VCF which has some rows with local allele indexed items and some without. If there were no special numbers and no special field names, then it’s unclear when PL, for example, is local without deducing that from the alleles and the PL length. However, if PL is always G and LPL is always LG/LOCAL-G/g, then no deduction is needed and there’s no need to signal “this row uses local alleles” via a missing LAA.

@danking
Copy link
Contributor

danking commented Apr 11, 2024

@pd3, I’d like to make a final nudge on LGT, but after this I’ll be quiet about it.

I agree with your points one and two: LGT is not relevant to VCF size or analysis speed. I also agree with point three: a VCF with only LGT is meaningless to tools only aware of GTs.

My counter-point is about users—not engineering constraints. Mixing GT with LPL and LAD introduces cognitive overhead and, in our practical experience, leads to mistakes.

I argue in favor of allowing LGT. If two scientific groups want to exchange data in this format we should let them and let them deal with the ensuing incompatibilities with legacy tools.

Any group publishing ”final” data for an extremely wide audience (e.g. AoU, UKB) will likely choose to maximize compatibility with existing tools. Groups with other constraints can choose what’s best for them.

@pd3
Copy link
Member

pd3 commented Apr 12, 2024

@pd3, I’d like to make a final nudge on LGT, but after this I’ll be quiet about it.

I agree with your points one and two: LGT is not relevant to VCF size or analysis speed. I also agree with point three: a VCF with only LGT is meaningless to tools only aware of GTs.
My counter-point is about users—not engineering constraints. Mixing GT with LPL and LAD introduces cognitive overhead and, in our practical experience, leads to mistakes.

I argue in favor of allowing LGT. If two scientific groups want to exchange data in this format we should let them and let them deal with the ensuing incompatibilities with legacy tools.

Any group publishing ”final” data for an extremely wide audience (e.g. AoU, UKB) will likely choose to maximize compatibility with existing tools. Groups with other constraints can choose what’s best for them.

I don't find cognitive overhead a good argument. Few users look manually inside VCFs with hundreds of thousands of samples. Backward compatibility is a real concern, what matters is that programs continue to work with such files. With PLs there is no other way, sites with many indel alleles have no choice but to use LPL, and for most analyses it does not matter if programs skip such sites when they don't find PL, they are relatively rare. But there is no reason to confuse programs with GT/LGT.

Having said that, VCF format is open and nothing stops you from using arbitrary tags. But the use of LGT and LA should not be encouraged by including them in the specification, for reasons given above.

@pd3
Copy link
Member

pd3 commented Apr 12, 2024

For merging and splitting local representations it will be necessary to specify the expected number of elements in local array shaped fields. Using “.” will not achieve this. Using “R”, “A” and “G” and taking them as local if LAA/LA is present would preclude the mixing of local and non-local fields. I agree with (#758 (comment)) that using lower case seems too clever. My preference would be to use “LOCAL-A”, “LOCAL-R” and “LOCAL-G” or more succinctly “LA”, “LR” and “LG”.

I like and support this proposal.

@a9ill Unrelated to this, but for the lack of a better channel, there are two bugs in DRAGEN's implementation of local alleles (found in UKB VCFs)

  • LAA must be Type=Integer, not String
  • the proposal requires LAA to be ordered in strictly ascending order. There are cases where LAA=5,2; now it is not clear what order are the corresponding LAD and LPL values

@danking
Copy link
Contributor

danking commented Apr 12, 2024

I don't find cognitive overhead a good argument. Few users look manually inside VCFs with hundreds of thousands of samples. Backward compatibility is a real concern, what matters is that programs continue to work with such files.

I’m glad we’ve arrived at what seems a key difference in perspective. The gnomAD team & AoU team don’t look at textual VCFs but they indeed directly interact with the data model. They aren’t the majority.

Co-authored-by: Dan King <daniel.zidan.king@gmail.com>
Copy link

Changed PDFs as of 9b472b4: VCFv4.5.draft (diff).

github-actions bot pushed a commit that referenced this pull request Apr 20, 2024
Copy link

Changed PDFs as of d769847: VCFv4.5.draft (diff).

github-actions bot pushed a commit that referenced this pull request Apr 20, 2024
Copy link

Changed PDFs as of abf2531: VCFv4.5.draft (diff).

github-actions bot pushed a commit that referenced this pull request Apr 20, 2024
Copy link

Changed PDFs as of 26a9326: VCFv4.5.draft (diff).

github-actions bot pushed a commit that referenced this pull request Apr 20, 2024
@d-cameron d-cameron merged commit 69ed372 into master Apr 20, 2024
1 check passed
Copy link

Changed PDFs as of 3e73e10: VCFv4.5.draft (diff).

github-actions bot pushed a commit that referenced this pull request Apr 20, 2024
@d-cameron d-cameron deleted the vcf45_scaling branch April 20, 2024 10:29
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Development

Successfully merging this pull request may close these issues.

None yet

6 participants