Multiple round of metaeuk #44

Open
xiekunwhy opened this issue Apr 28, 2022 · 1 comment

@xiekunwhy

Hi,

I am annotating some large animal and plant genomes. For homology-based annotation, I want to use the proteins in OrthoDB as the homologous proteins, but there are too many protein sequences (5,000,000+ for vertebrates) and metaeuk is slow.

Could I split the whole protein database into tens or hundreds of pieces, run metaeuk on each piece separately, combine all the target sequences that appear in the metaeuk results, and then run metaeuk again with this combined set of target sequences to get the final results?
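
As a rough illustration of this two-round idea, here is a minimal Python sketch of the bookkeeping around the metaeuk runs (it does not call metaeuk itself). All function names are hypothetical. It assumes that the target accession is the first `|`-separated field of each header in the predictions FASTA, as the MetaEuk README describes for its output headers, and that the sequence ID in the protein FASTA is the first whitespace-delimited token of the header; both should be checked against the actual files.

```python
from typing import Iterable, Iterator, List, Set, Tuple

Record = Tuple[str, str]  # (header without '>', sequence)


def read_fasta(lines: Iterable[str]) -> Iterator[Record]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_round_robin(records: Iterable[Record], n_pieces: int) -> List[List[Record]]:
    """Distribute records over n_pieces lists, one record at a time."""
    if n_pieces < 1:
        raise ValueError("n_pieces must be at least 1")
    pieces: List[List[Record]] = [[] for _ in range(n_pieces)]
    for i, record in enumerate(records):
        pieces[i % n_pieces].append(record)
    return pieces


def hit_target_ids(prediction_headers: Iterable[str]) -> Set[str]:
    """Collect target accessions from first-round prediction headers.

    ASSUMPTION: the accession is the first '|'-separated field and
    does not itself contain a '|' character.
    """
    ids = set()
    for header in prediction_headers:
        header = header.strip().lstrip(">")
        if header:
            ids.add(header.split("|", 1)[0])
    return ids


def filter_by_id(records: Iterable[Record], keep: Set[str]) -> List[Record]:
    """Keep only the records whose ID (first header token) is in keep."""
    return [(h, s) for h, s in records if (h.split() or [""])[0] in keep]
```

The pieces from `split_round_robin` would each be written to a file and searched separately; the union of `hit_target_ids` over all first-round outputs then selects the reduced target set for the second round. E-values generally depend on the size of the searched database, so hits close to the threshold may differ between a per-piece run and a run against the full database.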

Best,
Kun

@elileka
Member

elileka commented Jul 11, 2022

Hi,

I am very sorry for the late reply. This issue somehow escaped me.

What you suggest sounds reasonable. Basically, it is a way to pre-filter the target database and retain only the sequences that have the potential to contribute something at a later stage. However, if the idea is too involved to implement, here are other things you could try:

  1. Divide your contigs into several input files and run each against the large target database.
  2. Cluster your target database and use only the representative sequences as a slimmer version of the target (or construct profiles from each cluster).
  3. Choose a different, smaller target database. You can find some options using the databases command.
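
For option 1, one possible way to divide the contigs is to balance the chunks by total sequence length rather than by contig count, so that no single run is dominated by a few very long contigs. The following is only a sketch under that assumption; the function name `balance_contigs` and the greedy longest-first strategy are illustrative choices, not something MetaEuk prescribes.

```python
import heapq
from typing import Dict, List


def balance_contigs(lengths: Dict[str, int], n_bins: int) -> List[List[str]]:
    """Assign contig names to n_bins groups with similar total length.

    Greedy heuristic: take contigs from longest to shortest and always
    add the next one to the currently lightest group.
    """
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1")
    # Heap entries are (total length so far, bin index).
    heap = [(0, i) for i in range(n_bins)]
    heapq.heapify(heap)
    bins: List[List[str]] = [[] for _ in range(n_bins)]
    # Sort by descending length; ties are broken by name for determinism.
    for name, length in sorted(lengths.items(), key=lambda kv: (-kv[1], kv[0])):
        load, idx = heapq.heappop(heap)
        bins[idx].append(name)
        heapq.heappush(heap, (load + length, idx))
    return bins
```

Each group of contig names would then be written to its own FASTA file and searched against the full target database in a separate run, and the per-chunk predictions concatenated afterwards.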
