really long introns create a problem with gffread #53

Open
jkreplak opened this issue Mar 24, 2022 · 4 comments

@jkreplak

Hi,
My FINDER run crashed during FindCDS at the CodAn step. CodAn threw a duplicate-key error:

ValueError: Duplicate key 'chr6:1-94332053.99_10_covsplit.0'

After looking into this error, I found around 650 transcripts with ultra-long introns in the GTF file combined_split_transcripts_with_bad_SJ_redundancy_removed.gtf:

chr1    FINDER  transcript      46339202        72807906        1000    +       .       gene_id "chr1.24374_0_covsplit"; transcript_id "chr1.24374_0_covsplit.0"; FPKM "90.359299"; TPM "675.727723"; cov "3011.853909"; 
chr1    FINDER  exon    46339202        46340469        1000    +       .       gene_id "chr1.24374_0_covsplit"; transcript_id "chr1.24374_0_covsplit.0"; FPKM "90.359299"; TPM "675.727723"; cov "3011.853909"; 
chr1    FINDER  exon    72807879        72807906        1000    +       .       gene_id "chr1.24374_0_covsplit"; transcript_id "chr1.24374_0_covsplit.0"; FPKM "90.359299"; TPM "675.727723"; cov "3011.853909"; 
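
For reference, the two exons above imply a single intron of 72807879 - 46340469 - 1 = 26,467,409 bp. A minimal sketch of how such transcripts could be listed, assuming a tab-separated GTF with `exon` lines carrying a `transcript_id` attribute; the function names and the 1 Mb default threshold are hypothetical choices, not part of FINDER:

```python
import re
from collections import defaultdict

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def longest_introns(gtf_lines):
    """Map each transcript_id to the length (bp) of its longest intron."""
    exons = defaultdict(list)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # Only exon records define intron boundaries.
        if len(fields) < 9 or fields[2] != "exon":
            continue
        match = TRANSCRIPT_ID.search(fields[8])
        if match:
            exons[match.group(1)].append((int(fields[3]), int(fields[4])))

    result = {}
    for tid, coords in exons.items():
        coords.sort()
        # Intron length = next exon start - previous exon end - 1 (1-based, inclusive).
        gaps = [nxt[0] - prev[1] - 1 for prev, nxt in zip(coords, coords[1:])]
        result[tid] = max(gaps, default=0)
    return result


def long_intron_transcripts(gtf_lines, max_intron=1_000_000):
    """Keep only transcripts whose longest intron exceeds max_intron (assumed threshold)."""
    return {tid: n for tid, n in longest_introns(gtf_lines).items() if n > max_intron}
```

Calling `long_intron_transcripts(open(path))` on the GTF file would return the transcript IDs to inspect, with the threshold adjusted to whatever is plausible for the genome.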

When I rerun gffread, I can see that it creates at least two FASTA entries for this transcript, which explains CodAn's duplicate-key error. From what I found online, gffread seems to have an intron size limit.
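
The duplicated records can be confirmed directly on the FASTA file. A minimal sketch, assuming the sequence ID is the first whitespace-delimited token of each header line; the function name is hypothetical:

```python
from collections import Counter


def duplicate_fasta_ids(fasta_lines):
    """Return {sequence_id: count} for every ID that appears more than once."""
    ids = Counter(
        line[1:].split()[0]
        for line in fasta_lines
        # Header lines start with '>'; skip headers that carry no ID at all.
        if line.startswith(">") and line[1:].strip()
    )
    return {seq_id: n for seq_id, n in ids.items() if n > 1}
```

Running `duplicate_fasta_ids(open(path))` on the transcript FASTA should list the IDs that appear more than once.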

How can I work around this to finish the pipeline?

Thanks,
Jonathan

sagnikbanerjee15 self-assigned this Mar 24, 2022
sagnikbanerjee15 added the help wanted label Mar 24, 2022
@sagnikbanerjee15
Owner

Hello @jkreplak,

Thank you so much for your interest in finder and thank you for reporting this issue. I have not encountered this issue before and I need to think of a solution. I will keep you posted.

Thank you.

@sagnikbanerjee15
Owner

sagnikbanerjee15 commented Mar 25, 2022 via email

@jkreplak
Author

Hello,
I was able to modify the findCDS command to use another tool (agat) to create the FASTA, and it ran to the end. There are a few really large introns (> 6 Mb) caused by bad UTR splicing in some isoforms. I'll remove them by hand.
I was just surprised by the final GTF format. Because the CDS does not include the stop codon and there are no stop codon lines for each transcript, tools (gffread, agat, jcvi...) call them 3' partial. That makes the annotation difficult to use in my usual workflow.
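
To list the affected transcripts, one option is to look for those that have `CDS` lines but no `stop_codon` line. A minimal sketch, assuming a tab-separated GTF in which the feature type is in column 3 and each line carries a `transcript_id` attribute; the function name is hypothetical:

```python
import re
from collections import defaultdict

TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]+)"')


def transcripts_without_stop_codon(gtf_lines):
    """Return sorted transcript IDs that have CDS features but no stop_codon feature."""
    features = defaultdict(set)
    for line in gtf_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue
        match = TRANSCRIPT_ID.search(fields[8])
        if match:
            # Record which feature types were seen for this transcript.
            features[match.group(1)].add(fields[2])
    return sorted(
        tid
        for tid, kinds in features.items()
        if "CDS" in kinds and "stop_codon" not in kinds
    )
```

This only checks for the presence of the feature lines. It does not verify whether the last CDS codon in the genome sequence is actually a stop codon.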

@sagnikbanerjee15
Owner

Hello @jkreplak,

Thank you for reporting this issue. I will create a program that you can use to filter out certain genes and transcripts. Could you please post a few examples of the output where stop codons are missing? Since the result is output in GTF mode, no stop codon will be annotated. Later versions of the software will have an option to request such annotations as well.

Thank you.
