How can I use metaeuk to annotation genome without reference #46

Open
Nana7m1 opened this issue Apr 30, 2022 · 4 comments

Comments

@Nana7m1

Nana7m1 commented Apr 30, 2022

Dear developer and other users,
As the title says, I want to use MetaEuk to annotate a genome without a reference, but I cannot find how to do this in the manual.

Best
Nana7m1

@elileka
Member

elileka commented May 1, 2022

Hello,

The way to do it is to download or construct a reference database to run against. What do you know about your genome? What taxonomic group is it? I could try to provide further advice based on your answer :)
Once you have the reference database at hand, you could use easy-predict to find similar genes in your input genome.
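
As a rough illustration of the easy-predict step, here is a minimal Python sketch that only assembles the command line. The argument order (contigs, reference, output prefix, temporary folder) follows the usage shown in the MetaEuk README; the file names and the helper `easy_predict_cmd` are hypothetical placeholders, so adapt them to your own data.

```python
import shlex


def easy_predict_cmd(contigs, reference, out_prefix, tmp_dir):
    """Build the argument list for `metaeuk easy-predict`.

    contigs    -- input contigs (FASTA file or database)
    reference  -- protein or protein-profile reference (FASTA file or database)
    out_prefix -- prefix for the prediction output files
    tmp_dir    -- scratch folder for intermediate files
    """
    return ["metaeuk", "easy-predict", contigs, reference, out_prefix, tmp_dir]


# Hypothetical file names, for illustration only.
cmd = easy_predict_cmd("contigs.fna", "reference_proteins.faa", "preds", "tmp")

# Print the command so it can be inspected before running it in a shell
# (or passing the list to subprocess.run).
print(shlex.join(cmd))
```

Building the command as a list rather than a single string avoids quoting problems if a path contains spaces.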

Best,
Eli

@tiantianlili

Hello, thank you for developing this software. I would like to follow up on this question. I obtained contigs longer than 1 kbp from metagenomic data of soil contaminated with heavy metals. I noticed that you recommend many MMseqs2 reference databases, some of which are nucleotide databases (https://github.com/soedinglab/MMseqs2/wiki#downloading-databases). May I ask which database would be most suitable for me (SILVA?)?

@elileka
Member

elileka commented Dec 23, 2023

Hi,

As a reference DB, MetaEuk takes either proteins or protein profiles. Therefore, the nucleotide DBs available through the databases command, including SILVA, are not relevant.

Choosing the right protein/protein profile DB depends on your scientific goal. Here are two ideas I have, based on the details you provided:

  • UniRef50 can be a good start to find homologs of proteins that were mostly not discovered through metagenomic experiments. This DB can be downloaded through the databases command, and it has taxonomic and other info that can be used to annotate your sample.
  • If you are mainly interested in discovering homologs of rare, environmental proteins and less in annotation, you can download one of these DBs. Specifically, SRC (soil) and BFD seem most suitable for your sample. However, note that (1) environmental DBs like these are generally not annotated and that (2) these DBs are large: 200-300 Gb, which means higher requirements (storage, runtime, etc.) so I would first test on smaller scales.
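
To make the download step concrete, here is a small Python sketch that assembles a databases command for UniRef50. It assumes the `mmseqs databases <name> <output DB> <tmp dir>` form described in the MMseqs2 wiki; whether your MetaEuk build exposes the same module under the `metaeuk` binary is an assumption you should check, which is why the binary name is a parameter. The helper `databases_cmd` and all paths are hypothetical.

```python
import shlex


def databases_cmd(name, out_db, tmp_dir, tool="mmseqs"):
    """Build the argument list for the `databases` download module.

    name    -- database name as listed by the tool (e.g. "UniRef50")
    out_db  -- path of the sequence database to create
    tmp_dir -- scratch folder for the download
    tool    -- binary to call; "mmseqs" by default (assumption: a MetaEuk
               build may offer the same module, verify before relying on it)
    """
    return [tool, "databases", name, out_db, tmp_dir]


# Hypothetical paths, for illustration only.
cmd = databases_cmd("UniRef50", "db/UniRef50", "tmp")

# Print the command for inspection; run it in a shell once the paths are right.
print(shlex.join(cmd))
```

The resulting database path (here `db/UniRef50`) would then serve as the reference argument to easy-predict. Given the size of the environmental DBs mentioned above, it may be worth trying the whole workflow on a small subset of contigs first.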

You can also have a look at BUSCO if you are interested in estimating the genomic completeness of specific organisms via single-copy marker genes of various phylogenetic groups. BUSCO uses MetaEuk internally.

Best,
Eli

@tiantianlili

Thank you very much for your detailed reply. I'll try the UniRef50 and SRC databases first, hopefully with good results.

Best
li tian
