Annotation of recent assembly genome using MetaEuk #58

Open
BenAawf opened this issue Feb 15, 2023 · 1 comment

Comments

@BenAawf

BenAawf commented Feb 15, 2023

Hello,
I would like some recommendations for using this wonderful tool. My goal is to annotate the protein-coding genes and then combine the MetaEuk gff3 annotation file with the BRAKER2 gff3 file using a combiner such as EVidenceModeler.
I understand that MetaEuk needs the assembly in fasta format. In my case, it is an avian species with a genome size of around 1 Gb, assembled into 100 scaffolds.
Based on my humble understanding of the homology-based approach, I searched NCBI to build a protein sequence database. I used all available Galliformes proteins (taxid 8976), which I consider closely related to my species. This is the command I used to create the target protein database: esearch -db protein -query "Galliformes [ORGN] AND refseq [filter]" | efetch -format fasta > Galliformes_proteins2.Refseq-425067.faa
I then ran MetaEuk (version 6.a5d39d9, installed via conda) for prediction in easy mode: metaeuk easy-predict $genome $proteinDB $prefixprediction_name $tempFolder

  • To assess annotation completeness, I ran BUSCO in protein mode on MetaEuk.fas against aves_odb10. BUSCO completeness was 97% for the genome and 90% for the MetaEuk annotation, which is much better than the BRAKER2 annotation.
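To put the genome and annotation scores side by side, the one-line BUSCO score notation can be parsed with a few lines of Python. This is only a sketch: the two score strings below are illustrative placeholders (only the 97% and 90% complete values come from the post; the S, D, F, M and n fields are made up), and `parse_busco` is a hypothetical helper, not part of BUSCO.

```python
import re

# One-line BUSCO score notation, e.g. "C:90.0%[S:89.0%,D:1.0%],F:4.0%,M:6.0%,n:8338"
BUSCO_LINE = re.compile(
    r"C:(?P<C>[\d.]+)%\[S:(?P<S>[\d.]+)%,D:(?P<D>[\d.]+)%\],"
    r"F:(?P<F>[\d.]+)%,M:(?P<M>[\d.]+)%,n:(?P<n>\d+)"
)


def parse_busco(line):
    """Return the BUSCO percentages (floats) and marker count n (int) as a dict."""
    match = BUSCO_LINE.search(line)
    if match is None:
        raise ValueError(f"not a BUSCO score line: {line!r}")
    scores = {key: float(val) for key, val in match.groupdict().items() if key != "n"}
    scores["n"] = int(match.group("n"))
    return scores


# Illustrative strings only: C values from the post, everything else is a placeholder.
genome = parse_busco("C:97.0%[S:96.0%,D:1.0%],F:1.0%,M:2.0%,n:8338")
proteins = parse_busco("C:90.0%[S:89.0%,D:1.0%],F:4.0%,M:6.0%,n:8338")

gap = genome["C"] - proteins["C"]
print(f"Complete BUSCOs, genome vs. annotation: {gap:.1f} percentage points apart")
```

The same helper can be pointed at the BRAKER2 summary to compare all three runs on equal terms.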

First of all, I'm not sure whether this is a good approach to take.

Second, regarding the MetaEuk output files:

  • Protein file output
    Some of the predicted protein sequences contain lower-case characters. Does that affect the BUSCO evaluation? I assume MetaEuk.fas shows this behavior because my genome is soft-masked, but I don't know whether it causes any issues for downstream analysis.
  • gff file output
    I know that the gff file contains only coding regions. Is it compatible with the gff3 input that EVidenceModeler expects?
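Both output-file questions can be probed before asking a downstream tool to cope with the files. The sketch below measures how much of each protein is lower-case (and upper-cases it, which does not change the residues), and tallies the feature types in column 3 of a GFF file. The list of feature types EVidenceModeler wants (gene, mRNA, exon, CDS) is an assumption to confirm against its documentation, and the toy records are hypothetical, not real MetaEuk output.

```python
from collections import Counter


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def lowercase_fraction(seq):
    """Fraction of alphabetic residues that are lower-case."""
    letters = [c for c in seq if c.isalpha()]
    if not letters:
        return 0.0
    return sum(c.islower() for c in letters) / len(letters)


def gff_feature_types(lines):
    """Count the feature types (column 3) of tab-separated, 9-column GFF lines."""
    counts = Counter()
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        cols = line.rstrip("\n").split("\t")
        if len(cols) == 9:
            counts[cols[2]] += 1
    return counts


# Hypothetical toy inputs, for illustration only.
toy_fasta = [">pred_1", "MKTAYIAKQR", "qislvkshfs", ">pred_2", "MSTNPKPQRK"]
toy_gff = [
    "##gff-version 3",
    "scaffold_1\texample\tgene\t100\t400\t.\t+\t.\tID=g1",
    "scaffold_1\texample\tmRNA\t100\t400\t.\t+\t.\tID=g1.m1;Parent=g1",
    "scaffold_1\texample\tCDS\t100\t400\t.\t+\t0\tID=g1.c1;Parent=g1.m1",
]

records = list(read_fasta(toy_fasta))
masked = {header: lowercase_fraction(seq) for header, seq in records}
upper = {header: seq.upper() for header, seq in records}

# Assumed requirement: check the EVidenceModeler documentation before relying on it.
EVM_EXPECTED = ("gene", "mRNA", "exon", "CDS")
counts = gff_feature_types(toy_gff)
missing = [ftype for ftype in EVM_EXPECTED if ftype not in counts]
print("lower-case fraction per protein:", masked)
print("feature types missing for EVM:", missing)
```

On real data, `open("MetaEuk.fas")` and the MetaEuk gff file can be passed in place of the toy lists, since both functions accept any iterable of lines.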

Third: for this genome, which OrthoDB dataset would you recommend using with MetaEuk? The protein DB I used might not be enough for a good annotation of the protein-coding genes. I was thinking of using the whole vertebrata_odb10.fasta, but I couldn't find a link to download it in fasta format, and I'm still trying to figure out the correct way.

I would appreciate your feedback.
Thank you.
ben

@elileka
Member

elileka commented Feb 15, 2023

Hi Ben,

We're very happy you find the tool useful. It will take me some time to address your questions so I apologize in advance. In the meantime, it might be useful for you to read this section, which details an easy way to download reference databases and filter them according to taxonomy.
For example, you could download UniRef90 and then filter it to contain only Vertebrata (taxid 7742).
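The linked section is not reproduced in this thread, so the following is only a sketch of what the download-and-filter step might look like. It assumes the MMseqs2 `databases` and `filtertaxseqdb` modules are the intended route and that the filtered database can then be given to MetaEuk as the target; the directory and database names are hypothetical. By default it only prints the commands.

```python
import shlex
import subprocess


def build_commands(db_name="UniRef90", taxid=7742, workdir="refdb"):
    """Build the two MMseqs2 calls: download a database, then keep one taxon."""
    full_db = f"{workdir}/{db_name}"                   # hypothetical output path
    filtered_db = f"{workdir}/{db_name}_taxid{taxid}"  # hypothetical output path
    tmp_dir = f"{workdir}/tmp"
    return [
        # mmseqs databases <name> <outDB> <tmpDir>
        ["mmseqs", "databases", db_name, full_db, tmp_dir],
        # mmseqs filtertaxseqdb <inDB> <outDB> --taxon-list <taxid>
        ["mmseqs", "filtertaxseqdb", full_db, filtered_db, "--taxon-list", str(taxid)],
    ]


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(shlex.join(cmd))
        if not dry_run:
            # Needs mmseqs on PATH and an existing workdir; UniRef90 is a large download.
            subprocess.run(cmd, check=True)


commands = build_commands()
run(commands)  # dry run: prints the commands without executing them
```

Passing `taxid=8976` instead would restrict the database to Galliformes, the group used for the NCBI download earlier in the thread.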

Best,
Eli
