Used EukRep and then MetaEuk but the output is mostly bacteria #38

Open
ereyred opened this issue Jan 27, 2022 · 7 comments

Comments


ereyred commented Jan 27, 2022

Hello! I ran EukRep to extract the eukaryotic sequences and then MetaEuk (static Linux AVX2 build, when I do "metaeuk version" it gives me "9818d1a5b155c28b3ef11bfa9b7c69073e669a70", idk what that means) and everything seems to work fine except when I run taxtocontig most of the output contigs are bacteria (and also some viruses).
I'm running with NR database, should I use another?
Any idea why mostly bacteria?
My contigs.fa file before running EukRep is >300 MB and after it's 2 MB so it looks like it has removed a lot of contigs as not eukaryotic.
Thanks!

Member

elileka commented Jan 31, 2022

Hi,

I am not the developer of EukRep but I have used it myself. As far as I know, they configured their tool to be lenient, i.e., to also classify as "euk" things that are possibly not euks, in order not to lose eukaryotic data. That means bacteria can pass their filter.
Are you sure your sample has eukaryotic data in it?
How many contigs do you have (in the 300M file)? Perhaps, you could skip the EukRep stage and see if the results make more sense?
Where is your sample from? In principle, NR is a very good place to start.
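
For the contig-count question, a minimal sketch in Python (not part of MetaEuk or EukRep; the function names and the file names in the usage note are only illustrative) that counts records and total bases in a FASTA file, so the input can be compared before and after EukRep:

```python
from typing import Iterable, Tuple


def fasta_stats(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (number of records, total sequence length) for FASTA lines."""
    n_records = 0
    n_bases = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            n_records += 1  # each header line starts a new contig
        else:
            n_bases += len(line)  # sequence lines may be wrapped
    return n_records, n_bases


def fasta_stats_from_path(path: str) -> Tuple[int, int]:
    """Convenience wrapper that reads the FASTA file at `path`."""
    with open(path) as handle:
        return fasta_stats(handle)
```

Calling `fasta_stats_from_path("contigs.fa")` on the original file and again on the EukRep output (file names assumed here) would show how many contigs, and how many bases, survived the filter.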


ucabuk commented May 31, 2022

Hello Eli,

I did not open a new issue as I have the same story. I noticed that you also used EukRep in your MetaEuk publication, with a 5 kbp length filter. I was wondering, how did you handle the false positives?

Did all contigs that passed that filter actually belong to eukaryotes? Or did you perform any filtering in addition to EukRep?

Thanks !
Ugur

Member

elileka commented May 31, 2022

Hi Ugur,

Thank you for taking the time to look up what we did in the manuscript.

In our manuscript we analyzed sequences that were known to contain eukaryotic sequence and since we didn't want to lose any euks, we continued with two groups of sequences after EukRep: those classified as "Euk" and those classified as "unknown". I think it is a safe enough assumption to say that if EukRep classifies something as prokaryote then it is just that. Later, we tried to assign taxonomic labels to the contigs we had.
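
The selection described above can be sketched as follows. This is only an illustration: the `labels` mapping from contig ID to class and the label strings are hypothetical, and this is not EukRep's actual output format.

```python
from typing import Dict, Iterable, List


def keep_non_prokaryotic(
    labels: Dict[str, str],
    keep: Iterable[str] = ("euk", "unknown"),
) -> List[str]:
    """Return contig IDs whose (hypothetical) class label is in `keep`.

    Contigs labelled as prokaryotic are dropped; "euk" and "unknown"
    contigs are both carried forward so no eukaryotic data is lost.
    """
    wanted = {label.lower() for label in keep}
    return [contig for contig, label in labels.items() if label.lower() in wanted]
```

The retained IDs would then be the contigs passed on to taxonomic assignment.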

We have since developed this utility further in MetaEuk (and its inner library, MMseqs2). This is the taxtocontig command the original poster referred to. From your comment, I am not sure whether you have tried it or only EukRep.

Generally, it wouldn't surprise me to find out that the majority of contigs are of prokaryotic origin even in samples that are enriched for eukaryotes. This is because there are many more prokaryotes in most environments I can think of, and I guess (though I am no expert on this) it can be hard to filter them out...


ucabuk commented Jun 1, 2022

Hi Eli,

Thank you very much for the detailed response! That is exactly what I am trying to do.

I used EukRep with a different length parameter (smaller than 5 kbp) because my case is a bit complex. Then I used MetaEuk for the prediction and, instead of taxtocontig, ran diamond blastp on the proteins predicted by MetaEuk to see their putative taxa.

I would have expected more eukaryotic taxa in the diamond result. I know that EukRep can produce some false positives, but the rate here was really high, and I know this is mostly a EukRep issue. Moreover, I observed that contigs classified as prokaryotic by EukRep might still contain eukaryotic sequence according to the MetaEuk predictions. Doesn't that mean EukRep cannot really separate them?
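
To put a number on "really high", one option is to tally the domain assigned to each predicted protein. The sketch below assumes a hypothetical two-column, tab-separated table (protein ID, domain label) that has already been derived from the search results; it does not parse diamond or taxtocontig output directly.

```python
from collections import Counter
from typing import Dict, Iterable


def domain_fractions(rows: Iterable[str]) -> Dict[str, float]:
    """Return the fraction of rows assigned to each domain label.

    Each row is assumed to be "<protein_id>\t<domain>"; blank lines,
    comment lines, and rows with fewer than two fields are skipped.
    """
    counts: Counter = Counter()
    for row in rows:
        row = row.rstrip("\n")
        if not row or row.startswith("#"):
            continue
        fields = row.split("\t")
        if len(fields) < 2:
            continue
        counts[fields[1]] += 1
    total = sum(counts.values())
    if total == 0:
        return {}
    return {domain: n / total for domain, n in counts.items()}
```

Running the same tally on the EukRep-filtered set and on the unfiltered set would make the two runs directly comparable.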

Another interesting thing is that when I run only MetaEuk, without EukRep, I get more matches to eukaryotes. However, I do not want to use MetaEuk directly, since Prodigal is still a good option for prokaryotes.

Okay, if I understood you correctly, you accepted EukRep's call when it classified contigs as eukaryotic and then tried to assign taxonomic labels to those contigs with taxtocontig, even though that set may still contain prokaryotes.
Thanks.
Ugur

Member

elileka commented Jun 2, 2022

Hi Ugur,

You raise some very reasonable questions but I think some of them could perhaps be referred to the developers of the other tools.
One thing you might want to keep in mind when using any homology-based taxonomic annotation is how broad the reference database is. Ideally, it should include as many euks as possible from the clades you expect to find.
If you are still into testing things, you could give taxtocontig a try. We also made it easy to download and use various reference databases.

Best,
Eli


ucabuk commented Jun 3, 2022

Hi Eli,

Thank you for your suggestion and hint. I will definitely give taxtocontig a try. I agree with you about the other tools. Nevertheless, I got really beneficial answers from you. Thanks !

Best,
Ugur


zreitz commented Jun 17, 2022

Have you tried using a different euk/prok classifier? This one came out recently:

https://www.microbiologyresearch.org/content/journal/mgen/10.1099/mgen.0.000823
https://github.com/LottePronk/whokaryote

You could try it instead of EukRep and see if you get better results. Keep in mind that very short contigs (<1000 bp) will likely be difficult for any classifier.
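
If short contigs are a concern, they can be dropped before classification. Here is a minimal sketch of such a length filter in plain Python; the helper names are made up for illustration, and the 1000 bp default simply mirrors the threshold mentioned above.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header: Optional[str] = None
    chunks: List[str] = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None and line:
            chunks.append(line)  # wrapped sequence lines are joined
    if header is not None:
        yield header, "".join(chunks)


def filter_by_length(
    records: Iterable[Tuple[str, str]], min_len: int = 1000
) -> Iterator[Tuple[str, str]]:
    """Keep only records whose sequence is at least `min_len` long."""
    for header, seq in records:
        if len(seq) >= min_len:
            yield header, seq
```

The surviving records can be written back out as `>header` followed by the sequence and passed to whichever classifier is being tested.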
