Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix variant region for ins and dup on intron-exon boundary #709

Open
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

b0d0nne11
Copy link

Fixes #655

Certain ins or dup variants that occur at the intron/exon boundaries have non-zero offsets but are still coding. This PR ensures that projections return accurate var_p representations for these variants.

Fixing ins variants are pretty straightforward because we know what to insert. Fixing dup variants is more involved because we need to calculate the reference sequence. This reference sequence must be derived from the var_g representation so that we get the correct bases from the intron rather than the previous or following exon. As a result there are some mappings that require AssemblyMapper. In these cases trying to perform the mapping with VariantMapper will throw an HGVSError. In addition, when computing the reference sequence we need to handle strand orientation and avoid normalization.

@b0d0nne11 b0d0nne11 requested a review from reece as a code owner October 26, 2023 14:33
@gostachowiak
Copy link

@reece I am wondering who can help review this pull request? I believe the core pdot calculation logic is currently giving incorrect results, and we are under some pressure to fix it.

I suppose the two tasks are:

  1. is there agreement that there currently is an issue that needs fixing?
  2. if so, is this approach acceptable?

The basic problem is-- when material is inserted exactly at the intron/exon boundary, should the inserted material be treated as part of the coding region, or as part of the intron?

We believe it is more correct to treat it as part of the coding sequence, because the entire intron and splicing region is intact. So in principle, splicing can still occur using the original splice site.

The current behavior is inconsistent. Insertions at the exon-intron boundary are treated as coding (because the cdot nomenclature doesn't include any intronic positions). Insertions at the intron-exon boundary are treated as non-coding (because the cdot nomenclature does include intronic positions). But I can't think of any reason these situations would be different biologically.

This PR both (1) makes the behavior consistent (always treats the inserted material as coding), and (2) makes the behavior correct (if you're convinced by the argument above).

This PR also makes hgvs consistent with Mutalyzer. Examples:

  1. NM_004380.3:c.3251-1_3251insA
  • Mutalyzer: NP_004371.2:p.(Ile1084Asnfs*3)
  • hgvs, current: null
  • hgvs, after this PR: NP_004371.2:p.(Ile1084AsnfsTer3)
  1. NM_024529.4:c.132-2_132-1dup
  • Mutalyzer: NP_078805.3:p.(Thr45Glyfs*65)
  • hgvs, current: null
  • hgvs, after this PR: NP_078805.3:p.(Thr45GlyfsTer65)

@reece
Copy link
Member

reece commented Nov 15, 2023

Hi @gostachowiak: thanks for pinging me. I'll take a look this week and comment here.

@gostachowiak
Copy link

gostachowiak commented Nov 29, 2023

@reece reminder about this PR. If it would be easier to have a meeting to review together, we would be happy to do that. We will need to consider forking soon (maintaining a separate private version) if fixing this isn't an option. Our users consider this to be a serious flaw.

@reece
Copy link
Member

reece commented Dec 18, 2023

@cassiemk @gostachowiak @b0d0nne11 : I'm very sorry that I let this issue fall off my radar. I appreciate the thoughtfulness of the original issue (#655) and the discussion here, and the completeness of the implementation. I have one question, which I'll pick up in a review in the next few minutes.

And I agree with you that the implementation is inconsistent, or at least not well-defined. Thank you for addressing it.

Copy link
Member

@reece reece left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR is very nearly ready. Please see the comments in altseqbuilder.py.

elif self._var_c.posedit.edit.type == "ins" and self._var_c.posedit.pos.start.offset == 0 and self._var_c.posedit.pos.end.offset == 1:
# ins at exon-intron boundary
result = self.EXON
elif self._var_c.posedit.edit.type == "dup" and self._var_c.posedit.pos.start.offset <= -1 and self._var_c.posedit.pos.end.offset >= -1:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What cases of boundary-spanning dups are intended in this conditional?

I believe that the intended case is a dup that spans an intron-exon boundary (i.e., on the 5' end of an exon). However, I believe that misses a set of these cases and may match unintended cases. I've summarized these here:

image

Please comment on the following:

  • self._var_c.posedit.pos.end.offset >= -1: Why not == 0, which is more specific? I believe that end is intended to be in the CDS.
  • Which of the cases do you intend to catch in the figure?

I'm happy to talk more about this interactively if useful.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@reece Thanks for the comment. The intention in this PR is to handle insertions that are directly at the boundary (not spanning the boundary). So self._var_c.posedit.pos.end.offset >= -1 should be updated to self._var_c.posedit.pos.end.offset = -1 (because a duplication in the left intron will insert material directly at the boundary if the end position is -1).

So in your image:

  1. Not intended to be handled (fixed with the change above)
  2. Not intended to be handled. And, no special logic is needed for the right end of the exon, because a duplication that inserts material at the boundary has no offsets, so is already considered exonic.

2, 3, 4. Good questions, we are thinking about these.

@@ -326,7 +340,7 @@ def _setup_incorporate(self):
result = cds_start - 1
else: # cds/intron
if pos.offset <= 0:
result = (cds_start - 1) + pos.base - 1
result = (cds_start - 1) + pos.base + pos.offset - 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know if this line is covered by tests prior to this PR? It's suspicious to me that a bug fix of this magnitude doesn't have broader impacts.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was a little surprised too but if I remove it the following tests fail:

    def test_map_of_dup_intron_exon_boundary(self):
        hgvs_c = "NM_024529.4:c.132-2_132-1dup"
        hgvs_p = "NP_078805.3:p.(Thr45GlyfsTer65)"
    
        var_c = self.hp.parse_hgvs_variant(hgvs_c)
        var_p = self.am.c_to_p(var_c)
    
>       self.assertEqual(str(var_p), hgvs_p)
E       AssertionError: 'NP_078805.3:p.(Thr45ArgfsTer65)' != 'NP_078805.3:p.(Thr45GlyfsTer65)'
E       - NP_078805.3:p.(Thr45ArgfsTer65)
E       ?                     ^^^
E       + NP_078805.3:p.(Thr45GlyfsTer65)
E       ?                     ^^^

tests/test_hgvs_assemblymapper.py:210: AssertionError

    def test_map_of_dup_intron_exon_boundary_rc(self):
        hgvs_c = "NM_004380.2:c.3251-1dup"
        hgvs_p = "NP_004371.2:p.(Ile1084SerfsTer3)"
    
        var_c = self.hp.parse_hgvs_variant(hgvs_c)
        var_p = self.am.c_to_p(var_c)
    
>       self.assertEqual(str(var_p), hgvs_p)
E       AssertionError: 'NP_004371.2:p.(Ile1084MetfsTer3)' != 'NP_004371.2:p.(Ile1084SerfsTer3)'
E       - NP_004371.2:p.(Ile1084MetfsTer3)
E       ?                       ^ ^
E       + NP_004371.2:p.(Ile1084SerfsTer3)
E       ?                       ^ ^

tests/test_hgvs_assemblymapper.py:227: AssertionError

    def test_map_of_ins_intron_exon_boundary(self):
        hgvs_c = "NM_004380.2:c.3251-1_3251insA"
        var_c = self.hp.parse_hgvs_variant(hgvs_c)
        var_p = self.vm.c_to_p(var_c)
>       self.assertEqual(str(var_p), "NP_004371.2:p.(Ile1084AsnfsTer3)")
E       AssertionError: 'NP_004371.2:p.(Phe1085LeufsTer2)' != 'NP_004371.2:p.(Ile1084AsnfsTer3)'
E       - NP_004371.2:p.(Phe1085LeufsTer2)
E       ?                ^^    ^^^^     ^
E       + NP_004371.2:p.(Ile1084AsnfsTer3)
E       ?                ^^    ^^^^     ^

tests/test_hgvs_variantmapper.py:125: AssertionError

Copy link
Author

@b0d0nne11 b0d0nne11 Dec 20, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@reece While working on my latest commit I found a way to pull the offset out of this calculation. My understanding in that we had to backup the start position for negative offsets by 1 so that the inserted section precedes the exon rather than ending up right after the first base. You're correct that using offsets here is weird since the offset positions aren't even present in the cdot sequence. Using offset worked but only because for this change only the -1 offset is critical.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With + pos.offset in line 329 in main, tests still pass. The only way that can be true, I think, is that pos.offset is always 0 for all tests. This might be a gap in tests. (Nothing for you to do here.)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the great tests!

@b0d0nne11
Copy link
Author

Thanks for the thorough feedback @reece. I've updated the boundary conditions and tests after reviewing it with @gostachowiak. I believe he'll have a longer update soon. I've also rebased on main to pull in the build fix.

@b0d0nne11
Copy link
Author

We discovered issues with 2 of the tests I added yesterday. I just pushed a fix for those.

@gostachowiak
Copy link

gostachowiak commented Dec 21, 2023

@reece

Thank you for your comments and questions. It helped us to identify a few issues and clarify our thinking. After making a few tweaks to the pull request, here is where we ended up:

  • This pull request only addresses insertions of bases directly at the intron-exon and exon-intron boundary. It does NOT address duplications of regions that span the boundary.
  • Scenarios. There are 6 ways to have an insertion at an intron-exon or exon-intron boundary.
    1. insertion between -1 and 0
    2. insertion between 0 and +1
    3. duplication of a region that ends at -1
    4. duplication of a region that starts at +1
    5. duplication of a region that starts at the beginning of an exon
    6. duplication of a region that ends at the end of an exon
  • Cases 1-4 correspond to the 4 checks in altseqbuilder.py, where it assigns these as "EXON"
    • Before this PR, these variants would always be assigned as "INTRON" since by definition at least one position has an offset
  • Cases 5 & 6 aren't handled in the pull request because usually in these cases neither position has an offset, so these types of variants are already receiving a pdot value
  • Because this pull request only deals with insertions directly at the boundaries, none of the scenarios in the image you uploaded are covered
  • Here is an image showing types of insertions at boundaries where the numbering corresponds to the numbering above.
    • The arrows for numbers 1 and 2 represent insertions
    • The horizontal bars represent duplicated regions
    • Green = covered by this pull request (i.e. assigned as EXON). Red = not covered
    • In the current updated pull request, all scenarios corresponding to 1-4 are covered
    • For scenarios 5 and 6, which weren't changed in this pull request, there are some large duplications that would still be considered intronic
      • We do not intend to address this, because (1) it seems quite difficult to turn the red lines green given what information we have at this point in the code, and (2) once you start duplicating entire introns and/or exons I think all bets are off anyway, and the output of the hgvs package is no longer a sure guide to what the actual effect will be.
Screen Shot 2023-12-20 at 10 43 53 PM

Copy link
Member

@reece reece left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the improvements. Please check my comments below. This is some thorny code and I appreciate your diving into it.

var_g = super(AssemblyMapper, self).c_to_g(var, alt_ac, alt_aln_method=self.alt_aln_method)
if var.type == 'n':
var_g = super(AssemblyMapper, self).n_to_g(var, alt_ac, alt_aln_method=self.alt_aln_method)
if var.type == 't':
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

t is a shorthand in functions for var.type in "cnr", but var.type is not ever set to "t"

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed that case.

@@ -241,6 +243,26 @@ def _alt_ac_for_tx_ac(self, tx_ac):
assert len(alt_acs) == 1, "Should have exactly one alignment at this point"
return alt_acs[0]

def _replace_reference(self, var):
if (getattr(var.posedit.pos.start, 'offset', 0) != 0 or getattr(var.posedit.pos.end, 'offset', 0) != 0):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the value of this check here?

I think that you could obviate the getattr with structure like this:

def _replace_reference(self, var):
    if var.type in "cnr":
        # var.posedit.pos.start and .end will be BaseOffsetPositions, so no getattr needed
        # be on your merry way...

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed the new references to getattr. This is cleaner. Thanks for the suggestion!

@@ -241,6 +243,26 @@ def _alt_ac_for_tx_ac(self, tx_ac):
assert len(alt_acs) == 1, "Should have exactly one alignment at this point"
return alt_acs[0]

def _replace_reference(self, var):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to me that we should have only one _replace_reference. It's currently in variantmapper. Is there a reason that one doesn't work / can't be extended?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. In order to get the reference value for these variants we need to first map them back to the var_g form which requires an assembly mapper afaik. We don't actually call the *_to_g function from the AM because that includes additional normalization. Instead we use the AM to get the alt_ac and then call *_to_g from the VM with the alt_ac and alt_aln_method from the AM. In all other cases this method simply calls _replace_reference from the variant mapper instead. This is all so that we can get a portion of the reference sequence from the intronic section.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The AssemblyMapper has VariantMapper as a parent class. In that light, do we need to have a method of the same name? I think this is confusing, since this one does something different. Also, the AM has a normalize property. Could that get used to prevent what seem to be an undesired side-effects? I assume the intent is not to call _maybe_normalize?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @andreasprlic,

I don't think I understand your first question. This _replace_reference method overrides the method of the same name from the parent VariantMapper class. It has the same purpose and type signature. The only difference is that I've added logic to handle c and n type variants with non-zero offsets.

I thought about disabling self.normalize temporarily in AssemblyMapper to avoid calling the _maybe_normalize method but I would also have to make sure the original value of self.normalize is replaced afterward, probably with a try...finally block. I thought it was simpler to call the underlying functions directly but I can rework this if you'd prefer.

Thanks!

src/hgvs/utils/altseqbuilder.py Show resolved Hide resolved
@@ -483,8 +483,8 @@ def _replace_reference(self, var):
else:
pos = var.posedit.pos

seq_start = pos.start.base - 1
seq_end = pos.end.base
seq_start = pos.start.base + getattr(pos.start, 'offset', 0) - 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about something like:

if var.type in "cnr":
    # assume pos have offset
else:
    # don't have offset

The subtext here is that I find getattr hard to read because it obfuscates the cases that we're trying to handle. I try mightily to avoid getattr with defaults.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above.

@@ -326,7 +340,7 @@ def _setup_incorporate(self):
result = cds_start - 1
else: # cds/intron
if pos.offset <= 0:
result = (cds_start - 1) + pos.base - 1
result = (cds_start - 1) + pos.base + pos.offset - 1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With + pos.offset in line 329 in main, tests still pass. The only way that can be true, I think, is that pos.offset is always 0 for all tests. This might be a gap in tests. (Nothing for you to do here.)

@b0d0nne11
Copy link
Author

@reece I made some changes based on your feedback but I'm still doing some more testing on my end. I'll comment here once it passes. Thanks!

@b0d0nne11
Copy link
Author

Hey @reece, my testing all checked out fine with the latest changes. Thanks again for your feedback.

@b0d0nne11 b0d0nne11 requested a review from a team as a code owner January 23, 2024 19:27
Copy link
Member

@reece reece left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gostachowiak @b0d0nne11 : I need your help to think this change through. What I think you're doing is filling out (missing) intronic sequence using a genomic reference, and then using that to make a prediction about p.

If that's right, then, in principle, the result depends on the intronic sequence. Since the intronic sequence will depend on which assembly is used, the exact intronic sequence that's fetched might vary. This is why hgvs is so stubborn about boundaries and missing sequence and warnings when using intronic positions.

Do I understand your proposed change correctly? If so, aren't we building in an implicit dependency on an arbitrary choice of assembly?

@gostachowiak
Copy link

@reece

Yes, I think you understand correctly-- the result depends on the intronic sequence.

  1. I think the first step is to see if there's agreement on the problem we're trying to solve.
    If I have variant:
    c.132-2_132-1dup

This is inserting material after the splice site, in the coding region. (even though both positions in the nomenclature have offsets).
What bases have been inserted? It's unspecified. It just says "dup".

Currently an empty pdot is returned, which I believe is wrong. The inserted material is part of the coding region and will produce a change to the amino acid sequence.

  1. If the current behavior is wrong, then I can think of a couple ways to fix it:
  • Instead of returning an empty pdot, return p.? or some sort of error saying that the pdot cannot be calculated because it depends on the intronic sequence.
  • Or, attempt to calculate the pdot.

To calculate the pdot, we need to know what the duplicated bases are. The only way we could come up with is to look back into the reference sequence.

Our typical use case is take something from a VCF file (chr/start/ref/alt) --> calculate cdot --> calculate pdot. We need this to work. In this situation it seems safe to go back to GRCh37 or GRCh38 to figure out what the duplicated bases actually are, since the VCF file said it explicitly. I think I may be naive about the "arbitrary choice of assembly" problem, possibly because we're just trying to do straightforward GRCh37/38 --> cdot --> pdot. In what situation would this procedure fail?

Tangentially, this appears to be a clear drawback of not listing out the duplicated bases in HGVS nomenclature. It makes c.132-2_132-1dup completely ambiguous, with no way to figure out what was actually inserted.

@b0d0nne11
Copy link
Author

Also, this might be semantics but we're not using an arbitrary assembly in the sense that we're picking GRCh37 or GRCh38 randomly. Rather, we're using the assembly tied to the AssemblyMapper class used to perform the mapping. This is why I had to add that thrown exception to VariantMapper in these cases since VM would give incorrect results that are inconsistent with your choice of AM.

Thinking about this now I might be able to avoid requiring AM if I change the function signature for VariantMapper.c_to_p to c_to_p(self, var_c, alt_ac, alt_aln_method=hgvs.global_config.mapping.alt_aln_method): but getting the alt_ac would still require an assembly AFAIK. This might even be cleaner from an interface perspective than throwing the exception but would probably break more code. Let me know if you want to me try.

Copy link

github-actions bot commented Apr 5, 2024

This PR is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 7 days.

@github-actions github-actions bot added the stale Issue is stale and subject to automatic closing label Apr 5, 2024
@jsstevenson jsstevenson removed the stale Issue is stale and subject to automatic closing label Apr 5, 2024
@korikuzma korikuzma requested a review from reece April 5, 2024 12:19
@reece
Copy link
Member

reece commented Apr 15, 2024

@andreasprlic agreed to investigate here.

@b0d0nne11
Copy link
Author

Rebased on main

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Inconsistencies across intron/exon boundaries
5 participants