Split big CDS to subCDS using gff #52

Open
ucabuk opened this issue Nov 8, 2022 · 1 comment


@ucabuk

ucabuk commented Nov 8, 2022

Hello Eli,

I want to split one large predicted protein into its exons according to the GFF file. I have four output files produced by MetaEuk: .fas, .codon.fas, .headersMap.tsv, and .gff.

In the GFF file, the CDS coordinates are relative to the assembled contig, so I could not find where each exon ends within the protein (.fas) output. Basically, this is what I want to do:

This protein contains more than one exon. I want to go from

UniRef50_A0A699GG08|k127_10391|-|2222|0|11|36114|60000|60000[60000]:59629[59629]:372....
MTNSTHFGYQTVAEEEKVHKVAEVFHSVAAKYDVMNDVMSAGLHRLWKTFTIAQAGIRPGFKVLDIAGGTGDLAKAFAKKAGPTGEVWLTDINESMLRVGRDRLLNNG......

to

>UniRef50_A0A699GG08|k127_10391_CDS0
MTNSTHFGYQTVAEEEKVHKV
>UniRef50_A0A699GG08|k127_10391_CDS1
AEEEKVHKVAEVFHSVAAKYDVM
>UniRef50_A0A699GG08|k127_10391_CDS2
YDVMNDVMSAGLHRLWKTFTIA
>UniRef50_A0A699GG08|k127_10391_CDS3
DLAKAFAKKAGPTGEVWLTDINESMLRVGRDRLLNNG
....

I could not find this information in the MetaEuk GFF file. Its coordinates are relative to the contig, so I can use them to split the sequence in the .codon.fas file, but not the protein in the .fas output.

k127_10391 MetaEuk CDS 59630 60001 186 - . ID=UniRef50_A0A699GG08;TCS_ID=UniRef50_A0A699GG08|k127_10391|-|36114_CDS_0;Parent=UniRef50_A0A699GG08|k127_10391|-|36114_exon_0
k127_10391 MetaEuk CDS 58374 59321 203 - . ID=UniRef50_A0A699GG08;TCS_ID=UniRef50_A0A699GG08|k127_10391|-|36114_CDS_1;Parent=UniRef50_A0A699GG08|k127_10391|-|36114_exon_1
k127_10391 MetaEuk CDS 56729 57589 462 - . ID=UniRef50_A0A699GG08;TCS_ID=UniRef50_A0A699GG08|k127_10391|-|36114_CDS_2;Parent=UniRef50_A0A699GG08|k127_10391|-|36114_exon_2
k127_10391 MetaEuk CDS 50451 50633 126 - . ID=UniRef50_A0A699GG08;TCS_ID=UniRef50_A0A699GG08|k127_10391|-|36114_CDS_3;Parent=UniRef50_A0A699GG08|k127_10391|-|36114_exon_3

Does MetaEuk provide any coordinate information for splitting a large coding sequence into its exons?

Thank you !

@elileka
Member

elileka commented Nov 9, 2022

Hi,

I am not 100% sure I have understood your need, so please correct me if I am wrong.
It seems like you wish to split each single FASTA record into multiple records, one for each exon.
If so, then indeed, MetaEuk does not provide this kind of output, but it should be possible to write a script that creates this FASTA from the original FASTA file*. Each exon is described in the FASTA header, separated by pipes from the other exons. The numbers given for each exon are its original coordinates on the contig. Please note the possible short overlap between exons; there is one between the first and second in your example. Also note that, unlike the coordinates reported in the MetaEuk header, the GFF coordinates are 1-based, as is standard for that format.
https://github.com/soedinglab/metaeuk#the-metaeuk-header

*I could assist with this, if needed
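
A minimal sketch of such a script, in Python, might look like the following. It is not MetaEuk functionality. It assumes that each exon field in the header has the form low[taken_low]:high[taken_high]:length[taken_length], as described in the README section linked above, and that the bracketed "taken" length divided by 3 is the number of amino acids that exon contributes to the joined protein. The function names, the output naming scheme (copied from the question), and the short header used in the demo are hypothetical and for illustration only.

```python
import re

# Assumed layout of one exon field in a MetaEuk header:
#   low[taken_low]:high[taken_high]:length[taken_length]
EXON_RE = re.compile(r"(\d+)\[(\d+)\]:(\d+)\[(\d+)\]:(\d+)\[(\d+)\]")


def split_protein_by_exons(header, protein):
    """Return (name, peptide) pairs, one per exon field found in the header."""
    fields = header.lstrip(">").strip().split("|")
    target, contig = fields[0], fields[1]
    # Keep only the pipe-separated fields that look like exon descriptions.
    exons = [m for m in (EXON_RE.fullmatch(f) for f in fields) if m]
    if not exons:
        raise ValueError("no exon fields found in header")
    records, pos = [], 0
    for i, m in enumerate(exons):
        # Assumption: the "taken" nucleotide length already excludes any
        # overlap with the previous exon, so the pieces do not overlap.
        n_aa = int(m.group(6)) // 3
        records.append((f"{target}|{contig}_CDS{i}", protein[pos:pos + n_aa]))
        pos += n_aa
    return records


def header_to_gff(coord_a, coord_b):
    """Convert a pair of 0-based header coordinates to 1-based GFF start/end."""
    return min(coord_a, coord_b) + 1, max(coord_a, coord_b) + 1


# Synthetic two-exon header and protein (made up, not real MetaEuk output).
demo_header = ">T1|c1|+|100|0|2|0|29|0[0]:11[11]:12[12]|18[21]:29[29]:12[9]"
records = split_protein_by_exons(demo_header, "MTNSAEE")
for name, peptide in records:
    print(f">{name}\n{peptide}")
```

The sketch does not check that the exon lengths add up to the length of the protein, so it would be worth comparing the two on real data before trusting the split. The header_to_gff helper reflects the 1-based offset mentioned above: the header coordinates 60000 and 59629 from the question map to the GFF interval 59630 to 60001.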
