This is not a bug report, but a question. I've noticed that when I run EDTA on a Drosophila genome, the LINE search takes an extraordinary amount of time. The Drosophila genome is populated mostly by LTRs, yet EDTA takes 5-10 times longer to look for LINEs. Is there a way to speed up this step? If it is purely RepeatMasker/RepeatModeler or BLAST, could it be run in parallel? I can't understand how running RepeatModeler on a 150 Mb genome with 16 cores in parallel could take 10 hours...
Cheers,
Artem
Unfortunately, this is the case. The LINE search is carried out by RepeatModeler, which is slow even on small genomes. Because RepeatModeler's search is based on copy number and multiple alignments, splitting the genome into small subsets may lose families that are already low-copy. You can run EDTA on an SSD, which will significantly improve your RepeatModeler/RepeatMasker runs because they are I/O-intensive.
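To make the copy-number point concrete, here is a toy sketch (not RepeatModeler's actual algorithm). The family, its copy positions, and the 16-way split are made up for illustration; the only idea shown is that a family visible as 6 copies in the whole genome may appear as a single copy in every chunk once the genome is split.

```python
from collections import Counter

def copies_per_chunk(copy_starts, genome_len, n_chunks):
    """Count how many copies of one family fall in each equal-sized chunk."""
    chunk_len = -(-genome_len // n_chunks)  # ceiling division
    return Counter(start // chunk_len for start in copy_starts)

# Hypothetical low-copy family: 6 copies spread over a 150 Mb genome.
genome_len = 150_000_000
copy_starts = [5_000_000, 30_000_000, 55_000_000,
               80_000_000, 105_000_000, 130_000_000]

whole = copies_per_chunk(copy_starts, genome_len, 1)   # genome searched as one piece
split = copies_per_chunk(copy_starts, genome_len, 16)  # genome cut into 16 chunks

print(max(whole.values()))  # 6 copies seen together
print(max(split.values()))  # 1 copy at most in any single chunk
```

With only one copy per chunk, a method that relies on repeated occurrences and multiple alignment has nothing to align within that chunk, which is why naive splitting can drop such families.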
Thanks, Shujun,
The cluster I ran EDTA on has only SSDs, I think :) I see the problem now: we would need to parallelize RM, but the parallel jobs would have to communicate with each other. Could you please tell me which specific part of RM is responsible for the LINE search?
RM2 is described here: https://www.pnas.org/doi/10.1073/pnas.1921046117. Fig. 1 shows the workflow. Currently, the whole RM2 workflow is executed, and SINE/LINE elements are harvested from the final output of RM2. If a particular module could be separated, or RM2 further accelerated, that would be great!
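For anyone who wants to see what "harvesting from the final output" could look like, here is a minimal sketch. It is not EDTA's actual code. It assumes the consensus FASTA headers carry a classification after a `#`, in the form `>name#Class/Superfamily`; the function name `harvest_line_sine` and the example records are hypothetical.

```python
def harvest_line_sine(fasta_text, wanted=("LINE", "SINE")):
    """Return FASTA records whose header class (text after '#') is in `wanted`."""
    kept, keep = [], False
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            # Assumed header layout: ">name#Class/Superfamily optional text"
            fields = line[1:].split()
            name = fields[0] if fields else ""
            _, _, classification = name.partition("#")
            keep = classification.split("/")[0] in wanted
        if keep:
            kept.append(line)
    return "\n".join(kept)

# Made-up records for illustration only.
example = """>rnd-1_family-1#LINE/I
ACGTACGT
>rnd-1_family-2#LTR/Gypsy
TTGACCAA
>rnd-1_family-3#Unknown
GGGCCC
"""

print(harvest_line_sine(example))
```

Because the filter only runs on the final library, it does not shorten the run itself; it just shows that the LINE/SINE selection happens after the whole workflow has finished.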