c_to_p for dups that start in transcript but end in UTR #715

Open
b0d0nne11 opened this issue Jan 4, 2024 · 8 comments · May be fixed by #716
Labels
bug Something isn't working

Comments

@b0d0nne11

b0d0nne11 commented Jan 4, 2024

Any duplications with an end position at or past the stop codon should be classified as 3'UTR regardless of start position. Currently mapping NM_153223.3:c.2959_*1dup yields NP_694955.2:p.(Met1?). We believe that should map to NP_694955.2:p.? because all other variants in the UTR map to p.?.

In [1]: var_c = parse('NM_153223.3:c.2959_*1dup')

In [2]: var_p = c_to_p(var_c)

In [3]: var_p
Out[3]: SequenceVariant(ac=NP_694955.2, type=p, posedit=Met1?, gene=None)

In [4]: str(var_p)
Out[4]: 'NP_694955.2:p.Met1?'

We expect this to result in NP_694955.2:p.? instead.
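
The proposed rule can be sketched without the hgvs package. The helper below is hypothetical: neither `ends_in_3p_utr` nor its regular expression exists in hgvs. It only inspects the text of a simple c. posedit and reports whether the interval ends at a `*N` (3' UTR) position. It does not handle an end position inside the stop codon itself, which would require knowing the CDS length.

```python
import re

# Hypothetical pattern for simple c. posedits such as "c.2959_*1dup" or
# "c.*1_*2insTAAT"; real HGVS grammar is much richer than this.
_POSEDIT = re.compile(
    r"^c\.(?P<start>[-*]?\d+)(?:_(?P<end>[-*]?\d+))?(?P<edit>dup|ins[ACGT]+)$"
)


def ends_in_3p_utr(c_posedit: str) -> bool:
    """Return True when the interval's end position is a 3' UTR (*N) position."""
    match = _POSEDIT.match(c_posedit)
    if match is None:
        raise ValueError(f"unsupported posedit: {c_posedit!r}")
    # A single-position variant has no explicit end, so fall back to the start.
    end = match.group("end") or match.group("start")
    return end.startswith("*")


for posedit in ("c.2959_*1dup", "c.567_*1insCCC", "c.12_13insA"):
    print(posedit, ends_in_3p_utr(posedit))
```

Under the rule proposed in this issue, a variant for which this check returns True would map to p.? rather than being translated.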

jsstevenson added the bug label Jan 4, 2024
b0d0nne11 linked a pull request Jan 4, 2024 that will close this issue
@b0d0nne11
Author

After discussing this internally, we think the same reasoning applies to insertions.

In [1]: var_c = parse('NM_004985.4:c.567_*1insCCC')

In [2]: var_p = c_to_p(var_c)

In [3]: str(var_p)
Out[3]: 'NP_004976.2:p.(Ter189Ter)'

We expect this to also return p.?. I'll extend my PR to handle these cases.

@reece
Member

reece commented Mar 5, 2024

I agree that the current responses for both examples are wrong. However, what the correct result should be is less clear to me.

Can you please elaborate on your rationale for p.? in these cases?

@gostachowiak

@reece
For the mutations affected by this pull request, the entire coding sequence is unchanged and the added material is within the 3' UTR.

c.39_*1insA

  • material is inserted after the last base of the stop codon

c.12_*1dup

  • material is inserted after the 1st base of the 3' UTR

Therefore, these are 3' UTR mutations. All other 3' UTR mutations get p.?, so these mutations should also get p.?
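
The argument above can be checked on a toy transcript. This is a minimal sketch with an invented sequence and hypothetical helper names (`insert_after`, `duplicate`, `cds_unchanged`), none of which come from hgvs. The toy CDS is 15 bases long, so the analogue of c.39_*1insA here is c.15_*1insA.

```python
# Toy transcript, not a real sequence: the CDS is c.1..c.15 with the stop
# codon TAA at c.13..c.15; the 3' UTR supplies c.*1..c.*4.
CDS = "ATGAAACCCGGGTAA"
UTR3 = "TCGA"
TRANSCRIPT = CDS + UTR3


def insert_after(offset: int, material: str) -> str:
    """Insert material after the 1-based transcript offset (c.*1 is len(CDS) + 1)."""
    return TRANSCRIPT[:offset] + material + TRANSCRIPT[offset:]


def duplicate(start: int, end: int) -> str:
    """Duplicate offsets start..end (1-based, inclusive); the copy lands after end."""
    return insert_after(end, TRANSCRIPT[start - 1:end])


def cds_unchanged(mutated: str) -> bool:
    """True when the first len(CDS) bases still spell the original CDS."""
    return mutated.startswith(CDS)


ins_after_stop = insert_after(15, "A")  # toy analogue of c.39_*1insA
dup_into_utr = duplicate(12, 16)        # c.12_*1dup on the toy transcript
ins_in_cds = insert_after(3, "A")       # c.3_4insA, a genuine coding change

print(cds_unchanged(ins_after_stop), cds_unchanged(dup_into_utr), cds_unchanged(ins_in_cds))
```

In this sketch the first two variants leave every CDS base, including the stop codon, in place, while the in-CDS insertion does not.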

@andreasprlic
Contributor

What is your source for the variant representation of NM_153223.3:c.2959_*1dup ? Did you call g_to_c previously?

If we try to represent the underlying genomic event that causes this variant using the left-shuffled insertion representation, I believe we end up with NC_000005.10:g.123346517_123346518insATTA. Performing g_to_c on this representation results in NM_153223.3:c.*1_*2insTAAT, and c_to_p then yields p.?. So this issue is also related to the ins->dup rule in the hgvs conventions.

To be honest, personally I am not a big fan of this hgvs-dup "prioritization" rule. In my opinion it modifies the underlying nature of the genomic event and drastically changes the coordinates. We would often be better off without the dup representation (for most small variants). Your variant is one of the examples why.

Btw, if I plug in right-shuffled coordinates for this variant I end up with p.(=). I am not sure which of the two hgvs_p is "better".

@gostachowiak

@andreasprlic
We are just attempting to follow the guidelines as they exist, which say that if you can represent something as a dup, it must be represented as a dup, and that nomenclature should be 3' shifted. The cdot nomenclature NM_153223.3:c.2959_*1dup is correct HGVS nomenclature according to those rules, and the pull request fixes a bug where the pdot is assigned incorrectly.

The key point for the examples in the pull request is that the material is inserted AFTER the stop codon, in the sense that the ribosome will make it all the way to the stop codon without encountering any mutation. Therefore, in the pull request these variants are identified as being in the 3' UTR region, and they end up with p.? like any other 3' UTR variant.

To answer your initial question, the cdot NM_153223.3:c.2959_*1dup comes from calling g_to_c on NC_000005.9:g.122682212_122682215dup, which is itself the left-shifted version of the correct gdot (NC_000005.9:g.122682216_122682219dup), because the transcript is negative strand.
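
The relationship between left-shifting on the genome and 3' shifting on a negative-strand transcript can be illustrated with a toy sequence. This is only a sketch: the sequence is invented and the helper names (`shift_left`, `shift_right`, `to_minus_strand`) are hypothetical, not hgvs functions. Positions are 0-based offsets meaning "insert before this index".

```python
def apply_insertion(seq: str, pos: int, ins: str) -> str:
    """Insert ins before the 0-based index pos."""
    return seq[:pos] + ins + seq[pos:]


def shift_left(seq: str, pos: int, ins: str) -> tuple[int, str]:
    """Move the insertion as far left as possible without changing the result."""
    while pos > 0 and seq[pos - 1] == ins[-1]:
        ins = ins[-1] + ins[:-1]  # rotate the inserted bases
        pos -= 1
    return pos, ins


def shift_right(seq: str, pos: int, ins: str) -> tuple[int, str]:
    """Move the insertion as far right as possible without changing the result."""
    while pos < len(seq) and seq[pos] == ins[0]:
        ins = ins[1:] + ins[0]
        pos += 1
    return pos, ins


def revcomp(seq: str) -> str:
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def to_minus_strand(seq: str, pos: int, ins: str) -> tuple[int, str]:
    """Express a plus-strand insertion on the reverse-complement strand."""
    return len(seq) - pos, revcomp(ins)


GENOME = "GCATTACG"  # toy plus-strand sequence containing one ATTA
left = shift_left(GENOME, 6, "ATTA")
right = shift_right(GENOME, 2, "ATTA")
print(left, to_minus_strand(GENOME, *left))
print(right, to_minus_strand(GENOME, *right))
```

In this toy case the two placements differ by four bases on the genome, and the left-most genomic placement has the larger offset on the reverse-complement strand, that is, the more 3' position on a negative-strand transcript.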

@andreasprlic
Contributor

@reece I feel this example demonstrates a problem with the hgvs recommendation to represent insertions as duplications where appropriate. The dup changes the underlying nature (coordinates) of the event, and as a consequence we have problems with the hgvs_p here. I believe you are involved in some of the discussions about the future of hgvs. Is the ins->dup recommendation something that could get more nuance? Perhaps insertions don't need to be changed to duplications at the chromosomal level, and the conversion would only be recommended at the protein level?

@gostachowiak

@andreasprlic @reece
I think we probably all agree that returning p.Met1? is completely wrong for NM_153223.3:c.2959_*1dup.

This pull request returns p.? instead, which is the same result returned for NM_153223.3:c.*1_*2insTAAT, which is what the cdot would be if the hgvs guidelines were changed to eliminate dups.

Based on that, can this pull request be merged, and future changes to hgvs guidelines be dealt with separately?

@gostachowiak

@andreasprlic @reece
By the way, the pull request also fixes non-duplication insertions just after the stop codon. The second unit test added is
NM_004985.4:c.567_*1insCCC --> p.?


5 participants