LINE search takes much longer compared to other steps #421

Open
foriin opened this issue Jan 22, 2024 · 3 comments

Comments

@foriin

foriin commented Jan 22, 2024

Hi Shujun,

This is not a bug report, but a question. When I run EDTA on the Drosophila genome, the LINE search takes an extraordinary amount of time. The Drosophila genome is populated mostly by LTRs, yet EDTA spends 5-10 times longer looking for LINEs than on the other steps. Is there a way to speed this step up? If it is purely RepeatMasker/RepeatModeler or BLAST, could it be run in parallel? I can't understand how running RepeatModeler on a 150 Mb genome with 16 cores in parallel could take 10 hours...

Cheers,
Artem

@oushujun
Owner

Hi Artem,

Unfortunately, this is the case. The LINE search is carried out by RepeatModeler, which is slow even on small genomes. Because RepeatModeler's search is based on copy number and multiple alignments, splitting the genome into small subsets risks losing families that are already low copy. You can run EDTA on an SSD, which will significantly speed up your RepeatModeler/RepeatMasker runs because they are I/O intensive.

Shujun
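The copy-number concern above can be illustrated with a toy sketch. It is not RepeatModeler's algorithm: the threshold `MIN_COPIES`, the copy positions, and the helper names are all hypothetical, chosen only to show how a family that clears a copy-number cutoff in the whole genome can fall below it in every chunk once the genome is split.

```python
# Toy illustration only: MIN_COPIES and the positions are made-up values,
# not parameters taken from RepeatModeler or EDTA.

def copies_per_chunk(copy_positions, genome_length, n_chunks):
    """Count how many copies of one family fall into each equal-sized chunk."""
    chunk_size = genome_length / n_chunks
    counts = [0] * n_chunks
    for pos in copy_positions:
        # Clamp so a position at the very end lands in the last chunk.
        idx = min(int(pos // chunk_size), n_chunks - 1)
        counts[idx] += 1
    return counts


def detectable(counts, min_copies):
    """A family is 'seen' if at least one chunk holds enough copies."""
    return any(c >= min_copies for c in counts)


MIN_COPIES = 3                  # hypothetical copy-number cutoff
GENOME_LENGTH = 150_000_000     # 150 Mb, as in the question above
positions = [5_000_000, 40_000_000, 90_000_000, 140_000_000]  # 4 copies

whole = copies_per_chunk(positions, GENOME_LENGTH, 1)
split = copies_per_chunk(positions, GENOME_LENGTH, 16)

print("whole genome:", whole, "detectable:", detectable(whole, MIN_COPIES))
print("16 chunks:   ", split, "detectable:", detectable(split, MIN_COPIES))
```

With these made-up numbers the family has four copies in the unsplit genome but at most one copy in any of the 16 chunks, so a per-chunk search with the same cutoff would miss it.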

@foriin
Author

foriin commented Jan 24, 2024

Thanks, Shujun,
The cluster I ran EDTA on has only SSDs, I think :) I see the problem now: parallelizing RepeatModeler (RM) would require the parallel jobs to communicate with each other. Could you please tell me which specific part of RM is responsible for the LINE search?

@oushujun
Owner

RM2 is described here: https://www.pnas.org/doi/10.1073/pnas.1921046117. Fig. 1 shows the workflow. Currently, the whole RM2 workflow is executed, and SINE/LINE elements are harvested from the final output of RM2. If a particular module could be separated out, or RM2 further accelerated, that would be great!

Shujun
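As a rough sketch of what "harvesting SINE/LINE elements from the final output" could look like, the snippet below filters a classified FASTA library by the class in each header. It is not EDTA's actual code. It assumes headers of the form `name#Class/Superfamily`, and the family names, sequences, and function names are invented for illustration.

```python
# Sketch under assumptions: headers look like ">name#Class/Superfamily".
# Family names and sequences below are invented for illustration.

def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, seq = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(seq)
            header, seq = line[1:], []
        elif header is not None and line:
            seq.append(line)
    if header is not None:
        yield header, "".join(seq)


def top_level_class(header):
    """Return the class before the '/', e.g. 'LINE' for 'x#LINE/L1'."""
    parts = header.split()
    if not parts or "#" not in parts[0]:
        return "Unknown"
    return parts[0].split("#", 1)[1].split("/")[0]


def harvest(records, wanted=("LINE", "SINE")):
    """Keep only records whose top-level class is in `wanted`."""
    for header, seq in records:
        if top_level_class(header) in wanted:
            yield header, seq


example = """>rnd-1_family-1#LINE/L1
ACGT
>rnd-1_family-2#LTR/Gypsy
GGCC
>rnd-1_family-3#SINE/tRNA
TTAA
>rnd-1_family-4#Unknown
CCGG
""".splitlines()

kept = list(harvest(read_fasta(example)))
for header, seq in kept:
    print(f">{header}\n{seq}")
```

Filtering at the end like this is cheap. The cost discussed in this thread comes from running the full workflow that produces the library, not from the harvesting step.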
