Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

possibly incorrect determination of the frame #190

Open
anoronh4 opened this issue Mar 17, 2023 · 5 comments
Open

possibly incorrect determination of the frame #190

anoronh4 opened this issue Mar 17, 2023 · 5 comments

Comments

@anoronh4
Copy link

anoronh4 commented Mar 17, 2023

We have a fusion called SCP2-NTRK1 and we are curious as to why the fusion is being called as out of frame. The nt sequence as reported by arriba is as follows:

ATAATTTAGGCATTGGAGGAGCTGTGGTTGTAACACTCTACAAGATGGGTTTTCCGGAAGCCGCCAG___TTCTTTTAGAACTCATCAAATTGAAGCTGTTCCAACCAGCTCTGCAAGTGATGGATTTAAGGCAAATCTTGTTTTTAAGGAGATTGAGAAGAAACTTGAAGAG___TCTCGCTCTGTCACCCAGGCTGGAGTGCAGTGGCGCGATCTTGGCTCACTGCAACCTCCACCTCCCGGGTTCAAGCGATTCTCGTGCCTCAGCCTCCCGAGTAGCTGGCATTACAGGCACGTGCCACCACACCCAG|GACGGAGAAACAAGTTTGGGATCAACC___GCCCGGCTGTGCTGGCTCCAGAGGATGGGCTGGCCATGTCCCTGCATTTCATGACATTGGGTGGCAGCTCCCTGTCCCCCACCGAGGGCAAAGGCTCTGGGCTCCAAGGCCACATCATCGAGAACCCACAATACTTCAGTGATGCCTGTGAGGGGCTATGCTGGG...CAAGGGCAGGGACGA...___GTGTTCACCACATCAAGCGC

The sequence between the breakpoint | the last ___ before it is actually intron sequence (site1 == intron and site2 == CDS). The aa sequence that arriba predicts is as follows:

NLGIGGAVVVTLYKMGFPEAASSFRTHQIEAVPTSSASDGFKANLVFKEIEKKLEEsrsvtqagvqwrdlgslqppppgfkrfsclslpsswhyrhvpphp|grrnkfginrpavlapedglamslhfmtlggsslsptegkgsglqghiienpqyfsdaceglcw

we are a bit confused because the protein sequence after the breakpoint looks to be in-frame. so i'm just wondering what about this causes arriba to say it is out-of-frame.

by the way, we are using 2.3.0

@anoronh4 anoronh4 changed the title interpretation of the frame possibly incorrect determination of the frame Mar 17, 2023
@suhrig
Copy link
Owner

suhrig commented Mar 17, 2023

What's the transcript that Arriba reports in column transcript_id2? Only if the protein sequence is in-frame with respect to this transcript, then Arriba will consider the fusion protein to be in-frame. Of course, this merely shifts the problem from incorrect reading frame prediction to incorrect transcript selection, but this would explain why Arriba considers the fusion to be a frameshift, even though it looks like an in-frame fusion, and this is where I would have to start to fix the problem.

@anoronh4
Copy link
Author

anoronh4 commented Mar 17, 2023

ENST00000530298 is the transcript_id2 which is a noted in Ensembl as a retained intron...which is weird because arriba also says that site2 is CDS. is this an instance of incorrect transcript selection, as you mentioned in your comment? how is a particular transcript prioritized?

@suhrig
Copy link
Owner

suhrig commented Mar 24, 2023

I have taken a close look at this case. The reason why Arriba chooses the intron retention transcript is because there is some intron retention going on. Namely, the bases GTGAGGGGCTATGCTGGGTCAAGGGCAGGGACGA are transcribed as part of the fusion, which are intronic bases.

Arriba chooses the transcript based on the expression level and splice patterns. Basically, it assembles a transcript sequence from the fusion-supporting reads. When there are multiple transcripts being expressed at the same time, it picks the one with the highest expression. Then, it tries to match this transcript sequence to the most closely matching annotated transcript ID. In this case, Arriba comes to the conclusion that it's the transcript ID with intron retention, because of a small stretch of intron being transcribed.

This is a bit of an ambiguous case, because there is both intronic transcription and splicing. The transcribed bases do not match any annotated transcript perfectly. And although it's not entirely wrong of Arriba to choose an intron retention transcript, it could have chosen one of the transcripts without intron retention, and it would be as good of a match.

I suspect that Arriba did not determine the highest expressed transcript correctly. This may be the real mistake here. Can you send me an IGV screenshot of the region chr1:156,875,483-156,876,233 (hg38 coordinates) or chr1:156,845,289-156,846,025 (hg19 coordinates)? I would like to see the coverage in this region to confirm that the above mentioned short intronic stretch really is the dominant transcript in this region.

@anoronh4
Copy link
Author

anoronh4 commented Mar 28, 2023

arriba.zip is this igv report ok for checking out the reads? from what i can see in the alignment, several of the split reads from the fusion are skipping the intronic regions. but i'm not practiced at looking at these.

@suhrig
Copy link
Owner

suhrig commented May 6, 2023

Apologies for the long delay and also for the fact that I still have not come up with a solution yet despite the long time. However, your provided data was very helpful. I can confirm that Arriba did indeed pick the wrong transcript and that this is the underlying cause of the wrong reading frame. I will need some time to play with this example and see if I can improve the transcript selection. My problem is that I have hardly any spare time for this at the moment due to many other obligations. I will get back to you as soon as I can.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants