Fix variant region for ins and dup on intron-exon boundary #709

b0d0nne11 · 2023-10-26T14:33:39Z

Fixes #655

Certain ins or dup variants that occur at the intron/exon boundaries have non-zero offsets but are still coding. This PR ensures that projections return accurate var_p representations for these variants.

Fixing ins variants are pretty straightforward because we know what to insert. Fixing dup variants is more involved because we need to calculate the reference sequence. This reference sequence must be derived from the var_g representation so that we get the correct bases from the intron rather than the previous or following exon. As a result there are some mappings that require AssemblyMapper. In these cases trying to perform the mapping with VariantMapper will throw an HGVSError. In addition, when computing the reference sequence we need to handle strand orientation and avoid normalization.

src/hgvs/utils/altseqbuilder.py

gostachowiak · 2023-11-15T19:04:56Z

@reece I am wondering who can help review this pull request? I believe the core pdot calculation logic is currently giving incorrect results, and we are under some pressure to fix it.

I suppose the two tasks are:

is there agreement that there currently is an issue that needs fixing?
if so, is this approach acceptable?

The basic problem is-- when material is inserted exactly at the intron/exon boundary, should the inserted material be treated as part of the coding region, or as part of the intron?

We believe it is more correct to treat it as part of the coding sequence, because the entire intron and splicing region is intact. So in principle, splicing can still occur using the original splice site.

The current behavior is inconsistent. Insertions at the exon-intron boundary are treated as coding (because the cdot nomenclature doesn't include any intronic positions). Insertions at the intron-exon boundary are treated as non-coding (because the cdot nomenclature does include intronic positions). But I can't think of any reason these situations would be different biologically.

This PR both (1) makes the behavior consistent (always treats the inserted material as coding), and (2) makes the behavior correct (if you're convinced by the argument above).

This PR also makes hgvs consistent with Mutalyzer. Examples:

NM_004380.3:c.3251-1_3251insA

Mutalyzer: NP_004371.2:p.(Ile1084Asnfs*3)
hgvs, current: null
hgvs, after this PR: NP_004371.2:p.(Ile1084AsnfsTer3)

NM_024529.4:c.132-2_132-1dup

Mutalyzer: NP_078805.3:p.(Thr45Glyfs*65)
hgvs, current: null
hgvs, after this PR: NP_078805.3:p.(Thr45GlyfsTer65)

reece · 2023-11-15T19:38:01Z

Hi @gostachowiak: thanks for pinging me. I'll take a look this week and comment here.

gostachowiak · 2023-11-29T20:49:36Z

@reece reminder about this PR. If it would be easier to have a meeting to review together, we would be happy to do that. We will need to consider forking soon (maintaining a separate private version) if fixing this isn't an option. Our users consider this to be a serious flaw.

reece · 2023-12-18T18:21:36Z

@cassiemk @gostachowiak @b0d0nne11 : I'm very sorry that I let this issue fall off my radar. I appreciate the thoughtfulness of the original issue (#655) and the discussion here, and the completeness of the implementation. I have one question, which I'll pick up in a review in the next few minutes.

And I agree with you that the implementation is inconsistent, or at least not well-defined. Thank you for addressing it.

reece

This PR is very nearly ready. Please see the comments in altseqbuilder.py.

reece · 2023-12-18T18:39:32Z

src/hgvs/utils/altseqbuilder.py

+        elif self._var_c.posedit.edit.type == "ins" and self._var_c.posedit.pos.start.offset == 0 and self._var_c.posedit.pos.end.offset == 1:
+            # ins at exon-intron boundary
+            result = self.EXON
+        elif self._var_c.posedit.edit.type == "dup" and self._var_c.posedit.pos.start.offset <= -1 and self._var_c.posedit.pos.end.offset >= -1:


What cases of boundary-spanning dups are intended in this conditional?

I believe that the intended case is a dup that spans an intron-exon boundary (i.e., on the 5' end of an exon). However, I believe that misses a set of these cases and may match unintended cases. I've summarized these here:

Please comment on the following:

self._var_c.posedit.pos.end.offset >= -1: Why not == 0, which is more specific? I believe that end is intended to be in the CDS.

Which of the cases do you intend to catch in the figure?

I'm happy to talk more about this interactively if useful.

@reece Thanks for the comment. The intention in this PR is to handle insertions that are directly at the boundary (not spanning the boundary). So self._var_c.posedit.pos.end.offset >= -1 should be updated to self._var_c.posedit.pos.end.offset = -1 (because a duplication in the left intron will insert material directly at the boundary if the end position is -1).

So in your image:

Not intended to be handled (fixed with the change above)

Not intended to be handled. And, no special logic is needed for the right end of the exon, because a duplication that inserts material at the boundary has no offsets, so is already considered exonic.

2, 3, 4. Good questions, we are thinking about these.

reece · 2023-12-18T18:41:07Z

src/hgvs/utils/altseqbuilder.py

@@ -326,7 +340,7 @@ def _setup_incorporate(self):
                    result = cds_start - 1
                else:  # cds/intron
                    if pos.offset <= 0:
-                        result = (cds_start - 1) + pos.base - 1
+                        result = (cds_start - 1) + pos.base + pos.offset - 1


Do you know if this line is covered by tests prior to this PR? It's suspicious to me that a bug fix of this magnitude doesn't have broader impacts.

I was a little surprised too but if I remove it the following tests fail:

def test_map_of_dup_intron_exon_boundary(self): hgvs_c = "NM_024529.4:c.132-2_132-1dup" hgvs_p = "NP_078805.3:p.(Thr45GlyfsTer65)" var_c = self.hp.parse_hgvs_variant(hgvs_c) var_p = self.am.c_to_p(var_c) > self.assertEqual(str(var_p), hgvs_p) E AssertionError: 'NP_078805.3:p.(Thr45ArgfsTer65)' != 'NP_078805.3:p.(Thr45GlyfsTer65)' E - NP_078805.3:p.(Thr45ArgfsTer65) E ? ^^^ E + NP_078805.3:p.(Thr45GlyfsTer65) E ? ^^^ tests/test_hgvs_assemblymapper.py:210: AssertionError def test_map_of_dup_intron_exon_boundary_rc(self): hgvs_c = "NM_004380.2:c.3251-1dup" hgvs_p = "NP_004371.2:p.(Ile1084SerfsTer3)" var_c = self.hp.parse_hgvs_variant(hgvs_c) var_p = self.am.c_to_p(var_c) > self.assertEqual(str(var_p), hgvs_p) E AssertionError: 'NP_004371.2:p.(Ile1084MetfsTer3)' != 'NP_004371.2:p.(Ile1084SerfsTer3)' E - NP_004371.2:p.(Ile1084MetfsTer3) E ? ^ ^ E + NP_004371.2:p.(Ile1084SerfsTer3) E ? ^ ^ tests/test_hgvs_assemblymapper.py:227: AssertionError def test_map_of_ins_intron_exon_boundary(self): hgvs_c = "NM_004380.2:c.3251-1_3251insA" var_c = self.hp.parse_hgvs_variant(hgvs_c) var_p = self.vm.c_to_p(var_c) > self.assertEqual(str(var_p), "NP_004371.2:p.(Ile1084AsnfsTer3)") E AssertionError: 'NP_004371.2:p.(Phe1085LeufsTer2)' != 'NP_004371.2:p.(Ile1084AsnfsTer3)' E - NP_004371.2:p.(Phe1085LeufsTer2) E ? ^^ ^^^^ ^ E + NP_004371.2:p.(Ile1084AsnfsTer3) E ? ^^ ^^^^ ^ tests/test_hgvs_variantmapper.py:125: AssertionError

@reece While working on my latest commit I found a way to pull the offset out of this calculation. My understanding in that we had to backup the start position for negative offsets by 1 so that the inserted section precedes the exon rather than ending up right after the first base. You're correct that using offsets here is weird since the offset positions aren't even present in the cdot sequence. Using offset worked but only because for this change only the -1 offset is critical.

With + pos.offset in line 329 in main, tests still pass. The only way that can be true, I think, is that pos.offset is always 0 for all tests. This might be a gap in tests. (Nothing for you to do here.)

reece · 2023-12-18T18:41:35Z

tests/data/sanity_cp.tsv

reece · 2023-12-18T18:41:53Z

tests/test_hgvs_assemblymapper.py

Thanks for the great tests!

b0d0nne11 · 2023-12-19T15:36:42Z

Thanks for the thorough feedback @reece. I've updated the boundary conditions and tests after reviewing it with @gostachowiak. I believe he'll have a longer update soon. I've also rebased on main to pull in the build fix.

b0d0nne11 · 2023-12-20T16:33:46Z

We discovered issues with 2 of the tests I added yesterday. I just pushed a fix for those.

gostachowiak · 2023-12-21T03:44:18Z

@reece

Thank you for your comments and questions. It helped us to identify a few issues and clarify our thinking. After making a few tweaks to the pull request, here is where we ended up:

This pull request only addresses insertions of bases directly at the intron-exon and exon-intron boundary. It does NOT address duplications of regions that span the boundary.
Scenarios. There are 6 ways to have an insertion at an intron-exon or exon-intron boundary.
1. insertion between -1 and 0
2. insertion between 0 and +1
3. duplication of a region that ends at -1
4. duplication of a region that starts at +1
5. duplication of a region that starts at the beginning of an exon
6. duplication of a region that ends at the end of an exon
Cases 1-4 correspond to the 4 checks in altseqbuilder.py, where it assigns these as "EXON"
- Before this PR, these variants would always be assigned as "INTRON" since by definition at least one position has an offset
Cases 5 & 6 aren't handled in the pull request because usually in these cases neither position has an offset, so these types of variants are already receiving a pdot value
Because this pull request only deals with insertions directly at the boundaries, none of the scenarios in the image you uploaded are covered
Here is an image showing types of insertions at boundaries where the numbering corresponds to the numbering above.
- The arrows for numbers 1 and 2 represent insertions
- The horizontal bars represent duplicated regions
- Green = covered by this pull request (i.e. assigned as EXON). Red = not covered
- In the current updated pull request, all scenarios corresponding to 1-4 are covered
- For scenarios 5 and 6, which weren't changed in this pull request, there are some large duplications that would still be considered intronic
  - We do not intend to address this, because (1) it seems quite difficult to turn the red lines green given what information we have at this point in the code, and (2) once you start duplicating entire introns and/or exons I think all bets are off anyway, and the output of the hgvs package is no longer a sure guide to what the actual effect will be.

reece

Thanks for the improvements. Please check my comments below. This is some thorny code and I appreciate your diving into it.

reece · 2024-01-02T02:34:27Z

src/hgvs/assemblymapper.py

+                    var_g = super(AssemblyMapper, self).c_to_g(var, alt_ac, alt_aln_method=self.alt_aln_method)
+                if var.type == 'n':
+                    var_g = super(AssemblyMapper, self).n_to_g(var, alt_ac, alt_aln_method=self.alt_aln_method)
+                if var.type == 't':


t is a shorthand in functions for var.type in "cnr", but var.type is not ever set to "t"

Removed that case.

reece · 2024-01-02T02:36:54Z

src/hgvs/assemblymapper.py

@@ -241,6 +243,26 @@ def _alt_ac_for_tx_ac(self, tx_ac):
        assert len(alt_acs) == 1, "Should have exactly one alignment at this point"
        return alt_acs[0]

+    def _replace_reference(self, var):
+        if (getattr(var.posedit.pos.start, 'offset', 0) != 0 or getattr(var.posedit.pos.end, 'offset', 0) != 0):


What's the value of this check here?

I think that you could obviate the getattr with structure like this:

def _replace_reference(self, var): if var.type in "cnr": # var.posedit.pos.start and .end will be BaseOffsetPositions, so no getattr needed # be on your merry way...

Removed the new references to getattr. This is cleaner. Thanks for the suggestion!

reece · 2024-01-02T02:38:21Z

src/hgvs/assemblymapper.py

@@ -241,6 +243,26 @@ def _alt_ac_for_tx_ac(self, tx_ac):
        assert len(alt_acs) == 1, "Should have exactly one alignment at this point"
        return alt_acs[0]

+    def _replace_reference(self, var):


It seems to me that we should have only one _replace_reference. It's currently in variantmapper. Is there a reason that one doesn't work / can't be extended?

Yes. In order to get the reference value for these variants we need to first map them back to the var_g form which requires an assembly mapper afaik. We don't actually call the *_to_g function from the AM because that includes additional normalization. Instead we use the AM to get the alt_ac and then call *_to_g from the VM with the alt_ac and alt_aln_method from the AM. In all other cases this method simply calls _replace_reference from the variant mapper instead. This is all so that we can get a portion of the reference sequence from the intronic section.

The AssemblyMapper has VariantMapper as a parent class. In that light, do we need to have a method of the same name? I think this is confusing, since this one does something different. Also, the AM has a normalize property. Could that get used to prevent what seem to be an undesired side-effects? I assume the intent is not to call _maybe_normalize?

Hi @andreasprlic,

I don't think I understand your first question. This _replace_reference method overrides the method of the same name from the parent VariantMapper class. It has the same purpose and type signature. The only difference is that I've added logic to handle c and n type variants with non-zero offsets.

I thought about disabling self.normalize temporarily in AssemblyMapper to avoid calling the _maybe_normalize method but I would also have to make sure the original value of self.normalize is replaced afterward, probably with a try...finally block. I thought it was simpler to call the underlying functions directly but I can rework this if you'd prefer.

Thanks!

src/hgvs/utils/altseqbuilder.py

reece · 2024-01-02T04:06:43Z

src/hgvs/variantmapper.py

@@ -483,8 +483,8 @@ def _replace_reference(self, var):
        else:
            pos = var.posedit.pos

-        seq_start = pos.start.base - 1
-        seq_end = pos.end.base
+        seq_start = pos.start.base + getattr(pos.start, 'offset', 0) - 1


How about something like:

if var.type in "cnr": # assume pos have offset else: # don't have offset

The subtext here is that I find getattr hard to read because it obfuscates the cases that we're trying to handle. I try mightily to avoid getattr with defaults.

Same as above.

reece · 2024-01-02T04:18:15Z

src/hgvs/utils/altseqbuilder.py

@@ -326,7 +340,7 @@ def _setup_incorporate(self):
                    result = cds_start - 1
                else:  # cds/intron
                    if pos.offset <= 0:
-                        result = (cds_start - 1) + pos.base - 1
+                        result = (cds_start - 1) + pos.base + pos.offset - 1


With + pos.offset in line 329 in main, tests still pass. The only way that can be true, I think, is that pos.offset is always 0 for all tests. This might be a gap in tests. (Nothing for you to do here.)

b0d0nne11 · 2024-01-02T19:49:30Z

@reece I made some changes based on your feedback but I'm still doing some more testing on my end. I'll comment here once it passes. Thanks!

b0d0nne11 · 2024-01-03T13:37:08Z

Hey @reece, my testing all checked out fine with the latest changes. Thanks again for your feedback.

reece

@gostachowiak @b0d0nne11 : I need your help to think this change through. What I think you're doing is filling out (missing) intronic sequence using a genomic reference, and then using that to make a prediction about p.

If that's right, then, in principle, the result depends on the intronic sequence. Since the intronic sequence will depend on which assembly is used, the exact intronic sequence that's fetched might vary. This is why hgvs is so stubborn about boundaries and missing sequence and warnings when using intronic positions.

Do I understand your proposed change correctly? If so, aren't we building in an implicit dependency on an arbitrary choice of assembly?

gostachowiak · 2024-03-05T12:32:44Z

@reece

Yes, I think you understand correctly-- the result depends on the intronic sequence.

I think the first step is to see if there's agreement on the problem we're trying to solve.
If I have variant:
c.132-2_132-1dup

This is inserting material after the splice site, in the coding region. (even though both positions in the nomenclature have offsets).
What bases have been inserted? It's unspecified. It just says "dup".

Currently an empty pdot is returned, which I believe is wrong. The inserted material is part of the coding region and will produce a change to the amino acid sequence.

If the current behavior is wrong, then I can think of a couple ways to fix it:

Instead of returning an empty pdot, return p.? or some sort of error saying that the pdot cannot be calculated because it depends on the intronic sequence.
Or, attempt to calculate the pdot.

To calculate the pdot, we need to know what the duplicated bases are. The only way we could come up with is to look back into the reference sequence.

Our typical use case is take something from a VCF file (chr/start/ref/alt) --> calculate cdot --> calculate pdot. We need this to work. In this situation it seems safe to go back to GRCh37 or GRCh38 to figure out what the duplicated bases actually are, since the VCF file said it explicitly. I think I may be naive about the "arbitrary choice of assembly" problem, possibly because we're just trying to do straightforward GRCh37/38 --> cdot --> pdot. In what situation would this procedure fail?

Tangentially, this appears to be a clear drawback of not listing out the duplicated bases in HGVS nomenclature. It makes c.132-2_132-1dup completely ambiguous, with no way to figure out what was actually inserted.

b0d0nne11 · 2024-03-05T14:29:32Z

Also, this might be semantics but we're not using an arbitrary assembly in the sense that we're picking GRCh37 or GRCh38 randomly. Rather, we're using the assembly tied to the AssemblyMapper class used to perform the mapping. This is why I had to add that thrown exception to VariantMapper in these cases since VM would give incorrect results that are inconsistent with your choice of AM.

Thinking about this now I might be able to avoid requiring AM if I change the function signature for VariantMapper.c_to_p to c_to_p(self, var_c, alt_ac, alt_aln_method=hgvs.global_config.mapping.alt_aln_method): but getting the alt_ac would still require an assembly AFAIK. This might even be cleaner from an interface perspective than throwing the exception but would probably break more code. Let me know if you want to me try.

github-actions · 2024-04-05T01:50:53Z

This PR is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

reece · 2024-04-15T16:58:07Z

@andreasprlic agreed to investigate here.

b0d0nne11 · 2024-05-24T15:50:41Z

Rebased on main

b0d0nne11 requested a review from reece as a code owner October 26, 2023 14:33

gostachowiak reviewed Nov 15, 2023

View reviewed changes

src/hgvs/utils/altseqbuilder.py Outdated Show resolved Hide resolved

b0d0nne11 force-pushed the 655-intron-exon-boundaries branch from 624f045 to ed861bd Compare December 6, 2023 16:56

reece reviewed Dec 18, 2023

View reviewed changes

b0d0nne11 force-pushed the 655-intron-exon-boundaries branch from c113fa2 to 43b08ec Compare December 19, 2023 15:35

b0d0nne11 mentioned this pull request Dec 19, 2023

c_to_p at intron/exon boundary where splice region is preserved #714

Open

reece requested changes Jan 2, 2024

View reviewed changes

b0d0nne11 force-pushed the 655-intron-exon-boundaries branch from 444bd03 to 22705e2 Compare January 23, 2024 19:27

b0d0nne11 requested a review from a team as a code owner January 23, 2024 19:27

b0d0nne11 force-pushed the 655-intron-exon-boundaries branch from 22705e2 to dc8d9e6 Compare February 8, 2024 14:25

reece reviewed Mar 5, 2024

View reviewed changes

github-actions bot added the stale Issue is stale and subject to automatic closing label Apr 5, 2024

jsstevenson removed the stale Issue is stale and subject to automatic closing label Apr 5, 2024

korikuzma requested a review from reece April 5, 2024 12:19

reece requested a review from andreasprlic April 15, 2024 16:58

b0d0nne11 force-pushed the 655-intron-exon-boundaries branch from dc8d9e6 to 33c28c9 Compare May 24, 2024 15:50

b0d0nne11 added 6 commits May 24, 2024 11:52

Fix variant region for ins and dup on intron-exon boundary

9e42163

Update boundary conditions

3b6ddb3

Fix tests

0cb01ed

Add tests for dups spanning boundary

050be76

Remove new getattr references

b4279c8

Update test cache

58a780e

b0d0nne11 force-pushed the 655-intron-exon-boundaries branch from 33c28c9 to 58a780e Compare May 24, 2024 15:52

Fix variant region for ins and dup on intron-exon boundary #709

Are you sure you want to change the base?

Fix variant region for ins and dup on intron-exon boundary #709

Conversation

b0d0nne11 commented Oct 26, 2023

gostachowiak commented Nov 15, 2023

reece commented Nov 15, 2023

gostachowiak commented Nov 29, 2023 • edited

reece commented Dec 18, 2023 • edited

reece left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

b0d0nne11 Dec 20, 2023 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

b0d0nne11 commented Dec 19, 2023

b0d0nne11 commented Dec 20, 2023

gostachowiak commented Dec 21, 2023 • edited

reece left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

andreasprlic Apr 22, 2024 • edited

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

b0d0nne11 commented Jan 2, 2024

b0d0nne11 commented Jan 3, 2024

reece left a comment • edited

Choose a reason for hiding this comment

gostachowiak commented Mar 5, 2024

b0d0nne11 commented Mar 5, 2024

github-actions bot commented Apr 5, 2024

reece commented Apr 15, 2024

b0d0nne11 commented May 24, 2024

gostachowiak commented Nov 29, 2023 •

edited

reece commented Dec 18, 2023 •

edited

b0d0nne11 Dec 20, 2023 •

edited

gostachowiak commented Dec 21, 2023 •

edited

andreasprlic Apr 22, 2024 •

edited

reece left a comment •

edited