Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Distant exon prediction #83

Open
Maxim-Karpov opened this issue May 7, 2024 · 2 comments
Open

Distant exon prediction #83

Maxim-Karpov opened this issue May 7, 2024 · 2 comments

Comments

@Maxim-Karpov
Copy link

Hello @sagnikbanerjee15,
During the CodAn step, the execution terminates due to the missing annotation.gtf file when findCDS function is launched. This is because CodAn process terminates due to duplicate entries found in the combined_split_transcripts_with_bad_SJ_redundancy_removed.fasta file. Upon closer inspection of the gtf file, it can be seen that these duplicate FASTA entries arise from transcripts with predicted exons that are very far apart, and are therefore split into two FASTA entries. For example:

1	FINDER	transcript	57285934	67077786	1000	-	.	gene_id "1.1714_0_covsplit"; transcript_id "1.1714_0_covsplit.0"; FPKM "0.140324"; TPM "1.128119"; cov "1.697479"; 

1	FINDER	exon	57285934	57285945	1000	-	.	gene_id "1.1714_0_covsplit"; transcript_id "1.1714_0_covsplit.0"; FPKM "0.140324"; TPM "1.128119"; cov "1.697479"; 

1	FINDER	exon	67069393	67069715	1000	-	.	gene_id "1.1714_0_covsplit"; transcript_id "1.1714_0_covsplit.0"; FPKM "0.140324"; TPM "1.128119"; cov "1.697479"; 

1	FINDER	exon	67077589	67077786	1000	-	.	gene_id "1.1714_0_covsplit"; transcript_id "1.1714_0_covsplit.0"; FPKM "0.140324"; TPM "1.128119"; cov "1.697479"; 

OR

1	FINDER	transcript	52209472	65689477	1000	-	.	gene_id "1.1561_1_covsplit"; transcript_id "1.1561_1_covsplit.0"; FPKM "0.272933"; TPM "2.573257"; cov "6.587509"; 

1	FINDER	exon	52209472	52209482	1000	-	.	gene_id "1.1561_1_covsplit"; transcript_id "1.1561_1_covsplit.0"; FPKM "0.272933"; TPM "2.573257"; cov "6.587509"; 

1	FINDER	exon	65688877	65689477	1000	-	.	gene_id "1.1561_1_covsplit"; transcript_id "1.1561_1_covsplit.0"; FPKM "0.272933"; TPM "2.573257"; cov "6.587509"; 

OR

1	FINDER	transcript	38671110	52554711	1000	-	.	gene_id "1.1108_1_covsplit"; transcript_id "1.1108_1_covsplit.1"; FPKM "0.241149"; TPM "2.245845"; cov "4.710058"; 

1	FINDER	exon	38671110	38671164	1000	-	.	gene_id "1.1108_1_covsplit"; transcript_id "1.1108_1_covsplit.1"; FPKM "0.241149"; TPM "2.245845"; cov "4.710058"; 

1	FINDER	exon	38672247	38672348	1000	-	.	gene_id "1.1108_1_covsplit"; transcript_id "1.1108_1_covsplit.1"; FPKM "0.241149"; TPM "2.245845"; cov "4.710058"; 

1	FINDER	exon	38673558	38673756	1000	-	.	gene_id "1.1108_1_covsplit"; transcript_id "1.1108_1_covsplit.1"; FPKM "0.241149"; TPM "2.245845"; cov "4.710058"; 

1	FINDER	exon	38674078	38674319	1000	-	.	gene_id "1.1108_1_covsplit"; transcript_id "1.1108_1_covsplit.1"; FPKM "0.241149"; TPM "2.245845"; cov "4.710058"; 

1	FINDER	exon	38675099	38675342	1000	-	.	gene_id "1.1108_1_covsplit"; transcript_id "1.1108_1_covsplit.1"; FPKM "0.241149"; TPM "2.245845"; cov "4.710058"; 

1	FINDER	exon	38675872	38676033	1000	-	.	gene_id "1.1108_1_covsplit"; transcript_id "1.1108_1_covsplit.1"; FPKM "0.241149"; TPM "2.245845"; cov "4.710058"; 

1	FINDER	exon	38677051	38677323	1000	-	.	gene_id "1.1108_1_covsplit"; transcript_id "1.1108_1_covsplit.1"; FPKM "0.241149"; TPM "2.245845"; cov "4.710058"; 

1	FINDER	exon	38677638	38677885	1000	-	.	gene_id "1.1108_1_covsplit"; transcript_id "1.1108_1_covsplit.1"; FPKM "0.241149"; TPM "2.245845"; cov "4.710058"; 

1	FINDER	exon	52554169	52554711	1000	-	.	gene_id "1.1108_1_covsplit"; transcript_id "1.1108_1_covsplit.1"; FPKM "0.241149"; TPM "2.245845"; cov "4.710058"; 

This seems to be an issue only with the FINDER-predicted transcripts and does not occur with the PsiCLASS transcripts. It is quite laborious for me to examine the issue further, so I was hoping that you could provide some explanations as to why/where this could be occurring and whether there could be a quick fix? Complete removal of these entries from the combined_split_transcripts_with_bad_SJ_redundancy_removed files does let the run complete, however, it is not ideal. I was thinking of simply splitting these entries into two separate transcripts using a custom script which should hopefully work, but does not tell much about the source of the issue.

@Maxim-Karpov
Copy link
Author

Maxim-Karpov commented May 7, 2024

Upon further research, it seems that, due to the long intron length, gffread (having an intrinsic intron length limit of 6Mb) splits these exons into two separate entries. The emergence of these large transcripts is likely due to the alignIntronMax parameter of the STAR aligner in the 3rd round of alignment being set to 100000000. I think the best thing to do is to concatenate the duplicate FASTA entries into one, even though this is unlikely to form a legitimate transcript due to such a large intron.

@Maxim-Karpov
Copy link
Author

Maxim-Karpov commented May 8, 2024

I wrote a script to concatenate the duplicate sequences inside the combined_split_transcripts_with_bad_SJ_redundancy_removed.fasta file. I was able to resolve the gffread issue by rebuilding the singularity container in sandbox mode and editing the predictCDS.py to include the running of this script inside the "../assemblies_psiclass_modified/combined/" directory right after the gffread step in the findCDS function. Restarting from checkpoint 5, the operation completed successfully. Requires seqkit to be installed into the singularity sandbox.

cat combined_split_transcripts_with_bad_SJ_redundancy_removed.fasta | grep ">" | sort | uniq -d > fasta.uniq

cat fasta.uniq | cut -c 2- > fasta.uniq.tmp

mv fasta.uniq.tmp fasta.uniq



touch duplicated.gtf

touch duplicated.fasta



cat fasta.uniq | while read line || [[ -n $line ]];

do

cat combined_split_transcripts_with_bad_SJ_redundancy_removed.gtf | grep $line >> duplicated.gtf

done



seqkit grep -n -f fasta.uniq combined_split_transcripts_with_bad_SJ_redundancy_removed.fasta > duplicated.fasta



cat fasta.uniq | while read line || [[ -n $line ]];

do

seqkit grep duplicated.fasta --pattern $line > ${line}.dupset.fasta

done



for FILE in *.dupset.fasta

do

seq_num=1

while IFS= read -r line; do

    if [[ $line == ">"* ]]; then

        filename="${FILE}_${seq_num}.fasta"       

        ((seq_num++))

        echo $line > "$filename"

    else

        echo $line >> "$filename"

    fi

done < "$FILE"



if [[ $seq_num -ge 4 ]]

then

seqkit concat ${FILE}_1.fasta ${FILE}_2.fasta ${FILE}_3.fasta >> concat.fasta

else

seqkit concat ${FILE}_1.fasta ${FILE}_2.fasta >> concat.fasta

fi

rm -r ${FILE}*

done



perl -ne 'if(/^>(\S+)/){$c=!$i{$1}}$c?print:chomp;$i{$_}=1 if @ARGV' fasta.uniq combined_split_transcripts_with_bad_SJ_redundancy_removed.fasta > combined_split_transcripts_with_bad_SJ_redundancy_removed.fasta.tmp

cat concat.fasta >> combined_split_transcripts_with_bad_SJ_redundancy_removed.fasta.tmp

mv combined_split_transcripts_with_bad_SJ_redundancy_removed.fasta.tmp combined_split_transcripts_with_bad_SJ_redundancy_removed.fasta



rm duplicated.fasta duplicated.gtf fasta.uniq concat.fasta

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant