How to run EDTA in large genomes (>10Gb)? #61

benedictcoombes · 2020-02-24T12:02:25Z

Hi Shujun,

Thank you for producing this tool, I've had nice preliminary results with it testing on smaller genomes.
I now want to use it for a very large plant genome. I know I can divide the run by LTR/TIR/Helitron and then combine the results later. Can I also split by chromosome and run each in parallel, to produce a raw library of each repeat type for each chromosome, and then combine them for the rest of the pipeline? Or will this cause any problems?

Many thanks,

Ben.

oushujun · 2020-02-24T17:46:24Z

Hi Ben,

You may, but this is not the recommended way because there will be some small sample bias in the filtering process, which uses whole-genome information to determine whether a candidate TE is true or false. You would also need a customized pipeline to combine the different libraries. You may want to read more in thread #38.

I've run EDTA on the wheat genome and it took about a month, given that everything ran smoothly. The trick was to use simple sequence names and guarantee exclusive resources (CPU and large memory). Anything that interrupts the jobs slows things down the most.

Best,
Shujun

benedictcoombes · 2020-02-24T18:08:49Z

Hi Shujun,

Thanks for getting back to me. I wondered if there would be bias from whole-genome comparisons but didn't know whether that took place before or after you resume the run from the raw LTR, TIR, and helitron runs.

It is wheat that I'll be running so good to hear that it worked for you. Do you happen to remember how many cpus you used to see a run duration of 1 month? And do you know roughly what the peak memory consumption was to give me an idea for how much to allocate? I will likely allocate 32 cpus and up to about 1Tb of memory.

Many thanks,

Ben.

oushujun · 2020-02-24T19:16:28Z

Hi Ben,

Splitting the genome will introduce bias in the raw LTR element scan; the other raw scans are fine. Bias will also be present in the filtering step.

This is the resource consumption for the wheat genome (IWGSC2.0, ~15Gb) EDTA (v1.6.4) run:

Step	CPU*	Mem (Gb)	Time (days)
LTR	36	59	8.13**
TIR	36	296	26.08***
Helitron	36	118	5.42
Post_raw	36	249	12.04

*Two 18-Core Intel Skylake 6140 (2.30 GHz)
**This is an underestimation because the LTR candidates were pre-generated using LTR_FINDER_parallel (<2hr) and LTRharvest (???days) and I don't have their records anymore.
***GenericRepeatFinder took about 19 days to generate the TIR candidates and TIR-Learner took about 7 days to process.

Hope these help.

Shujun

KristinaGagalova · 2020-04-30T14:58:10Z

Hi,
Is there any chance of modifying TIR-Learner to run on GPU? That would definitely speed up the process.

oushujun · 2020-04-30T17:39:57Z

@KristinaGagalova There are two big parts that consume the majority of the run time in TIR-Learner. One is finding the raw TIRs using GenericRepeatFinder, and the other is the filtering algorithm of TIR-Learner. Potentially both could be accelerated with GPUs, but unfortunately this is outside our skillset. I am open to collaborations and contributions if you or anyone else is interested.

oushujun · 2020-11-03T09:30:49Z

For large genomes, a single node may not be able to provide enough memory to execute TIR-Learner and RepeatMasker. For such cases (e.g. #129), you can:

1. split the genome into sufficiently large portions (e.g. 2GB per file)
2. run multiple independent jobs of EDTA
3. use ./EDTA/util/make_panTElib.pl to consolidate all libraries.
4. use the consolidated library to RepeatMask each split file and finish each EDTA job with --anno 1 --rmout split_file_[1:i].fa.out
5. concatenate and sort the resulting gff3 files together.
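
Step 1 above can be sketched in Python as follows. This is only an illustration, not part of EDTA: the function names, the `max_bases` default and the 60-column line wrap are assumptions, and the output names simply mirror the `split_file_[1:i].fa` pattern used in step 4. Sequences are kept whole (never cut), so feature coordinates in each part remain valid for the original genome. Each sequence is held in memory while it is written, which is usually acceptable for chromosome-scale records.

```python
from pathlib import Path


def read_fasta(path):
    """Yield (header, sequence) pairs from a FASTA file."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def split_fasta(genome, out_dir, max_bases=2_000_000_000):
    """Write whole sequences into part files holding at most ~max_bases each.

    A sequence longer than max_bases ends up alone in its own file.
    The file names (split_file_N.fa) are an assumption for illustration.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths, out, written = [], None, 0
    for header, seq in read_fasta(genome):
        # Start a new part when adding this sequence would exceed the limit.
        if out is None or (written > 0 and written + len(seq) > max_bases):
            if out is not None:
                out.close()
            path = out_dir / f"split_file_{len(paths) + 1}.fa"
            paths.append(path)
            out, written = open(path, "w"), 0
        out.write(f">{header}\n")
        for i in range(0, len(seq), 60):
            out.write(seq[i:i + 60] + "\n")
        written += len(seq)
    if out is not None:
        out.close()
    return paths
```

Calling `split_fasta("genome.fa", "parts")` would return the list of part files, each of which can then be given to an independent EDTA job.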

oushujun · 2021-04-19T08:55:55Z

There are some good discussions in #175 as well.

kataksk · 2021-08-22T03:58:59Z

Hi, thank you very much for your kind instructions.

I finally got the concatenated gff3 files from a large genome (>3Gbp), following your advice in the comment below.
#61 (comment)

Is there a way to convert the gff3 (or possibly the .out file) to the tbl format provided in RepeatMasker or in the EDTA final output?
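
For readers following the same route, the concatenate-and-sort step mentioned here can be sketched in Python as below. It is a minimal illustration and not an EDTA utility: the function name is hypothetical, comment and header lines from the inputs are dropped, and it assumes the genome was split by whole sequences so that coordinates need no offset correction.

```python
def merge_gff3(paths):
    """Merge several GFF3 files into one list of lines sorted by position.

    Features are ordered by seqid (plain string order, so "chr10" sorts
    before "chr2"), then start, then longer features first so that an
    enclosing feature precedes the features it contains.
    """
    records = []
    for path in paths:
        with open(path) as handle:
            for line in handle:
                line = line.rstrip("\n")
                if not line or line.startswith("#"):
                    continue  # skip blank lines, headers and comments
                fields = line.split("\t")
                if len(fields) != 9:
                    continue  # not a GFF3 feature line
                # GFF3 columns: 1 = seqid, 4 = start, 5 = end
                records.append((fields[0], int(fields[3]), -int(fields[4]), line))
    records.sort(key=lambda rec: rec[:3])
    return ["##gff-version 3"] + [rec[3] for rec in records]
```

The returned lines can be written to a single file with `"\n".join(...)`.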

oushujun · 2021-08-22T15:42:32Z

Yes, please check out the pinned issues. Shujun

ncnlll · 2022-10-13T16:03:30Z

Hi Shujun,
I'm trying to create a TE library for a large amphibian genome following the instructions mentioned above:
"1. split the genome into sufficient large portions (eg. 2GB per file)
2. run multiple independent jobs of EDTA
3. use ./EDTA/util/make_panTElib.pl to consolidate all libraries.
4. use the consolidated library to RepeatMask each split files and finish each EDTA jobs with --anno 1 --rmout split_file_[1:i].fa.out
5. concatenate and sort the resulting gff3 files together "

I followed steps 1 to 3 and obtained the consolidated TE library, but I don't need the gff3 file for the pipeline I will use. Can I therefore use this consolidated TE library as the final TE library to run RepeatMasker outside of EDTA on the entire genome? I.e., are steps 4 and 5 only needed to get the gff3 file, or do they also change the consolidated TE library?

Thank you
Lorena

oushujun · 2022-10-22T05:14:25Z

Yes you are right. Let me know how it goes. Best, Shujun

SC-Duan · 2023-05-29T17:19:44Z

Hi Shujun, grf-main uses only 4 threads in the TIR step, and my genome has 100k contigs. I have enough nodes and threads to run jobs in parallel, so could I run the TIR step on one contig with 4 threads at a time and then use make_panTElib.pl to consolidate all the libraries? Would that work? Thank you!

oushujun · 2023-06-27T21:34:45Z

Hello @SC-Duan,

Sorry for the delay. You may be experiencing the slowdown of TIR-Learner caused by a large number of contigs, which is a known issue (#308). You may filter out the small contigs to accelerate this step. Unless your genome is very big (i.e. >5 Gb), I suggest sticking with a single EDTA run.

Thanks,
Shujun
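
Filtering out small contigs before running EDTA can be sketched in Python as follows. This is an illustration only: the function name is hypothetical, and the thread does not state a length cutoff, so `min_len` is left for the user to choose.

```python
def filter_short_contigs(in_path, out_path, min_len):
    """Copy FASTA records with at least min_len bases to out_path.

    Returns (kept, dropped) record counts. min_len is a user choice;
    no particular cutoff is recommended in the discussion above.
    """
    kept = dropped = 0

    def flush(header, lines, out):
        # Write the buffered record if it is long enough.
        nonlocal kept, dropped
        if header is None:
            return
        if sum(len(piece) for piece in lines) >= min_len:
            out.write(header + "\n")
            out.write("\n".join(lines) + "\n")
            kept += 1
        else:
            dropped += 1

    with open(in_path) as src, open(out_path, "w") as out:
        header, lines = None, []
        for line in src:
            line = line.strip()
            if line.startswith(">"):
                flush(header, lines, out)
                header, lines = line, []
            elif line:
                lines.append(line)
        flush(header, lines, out)
    return kept, dropped
```

The returned counts give a quick check of how many contigs were removed before committing to a long run.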

MiaCLM · 2023-09-19T09:35:04Z

Dear Shujun,
Thank you for providing this amazing pipeline! I recently utilized EDTA to annotate TEs. However, I encountered some issues with the analysis speed and file size.

The input file I used is not a typical genome sequence as it contains numerous gap sequences, totaling around 5 million sequences. The file size is approximately 1.5 GB, and the analysis speed was extremely slow. Upon reviewing your response in #61, I noticed that the LTR analysis of the wheat genome took around 8 days to complete. In my case, I used 40 threads without a specific memory allocation, and the LTR analysis has already been running for about 19 days. Currently I can only see the harvest folder, and I suspect there might be an issue with this task.

Could these issues be related to the number of sequences or perhaps to an abundance of repeats within the sequences? To investigate further, I split off a small portion of the file for testing purposes. This smaller file had a size of 37 MB and contained a total of 11,017 sequences spanning approximately 38,691,979 bp in length. Surprisingly, this task took around 15 hours to complete. Is such a processing time considered normal?

I am also considering splitting the larger file (1.5 GB) into five parts where each part would have an approximate size of 300 MB. However, I am unsure if this approach will yield satisfactory results or if you have any alternative suggestions. I eagerly await your prompt response.

Best regards,
MiaCai

oushujun · 2023-09-19T13:10:02Z

Hi MiaCai, 1.5G is not big. The problem is that too many contigs slow down the file system drastically. Splitting into multiple pieces should work. Alternatively, you can manually connect contigs into longer sequences to speed things up. After obtaining the library, you can use it to annotate the original genome. Shujun

MiaCLM · 2023-09-20T15:34:08Z

Hi Shujun,

Thank you for your prompt response! I will split the file into smaller parts and merge the contigs. When I merge the contigs, can I simply add 5~10 Ns to prevent mis-joining elements? You mentioned adding 100 Ns, but that would increase the file size, which is not ideal for me. Also, I am specifically interested in retrotransposon annotation. Can I use LTR_Retriever directly to annotate them? If I run EDTA, which data should I use after completing the LTR part without finishing the entire task? Apologies for these questions, as I am new to bioinformatics. Installing the multiple programs needed for LTR_Retriever will be time-consuming, so I'm wondering whether using EDTA would allow me to obtain the LTR results quickly. Thanks a lot.

Best,
MiaCai

oushujun · 2023-09-21T14:12:23Z

100 Ns will be better because 5-10 Ns could be too short to distinguish different elements. You can run EDTA_raw.pl --type ltr to just run LTR_retriever.

Shujun
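
Joining contigs with 100-N spacers, as suggested above, can be sketched in Python as follows. It is a minimal sketch under stated assumptions: the function name, the `joined_N` naming and the `max_len` cap on each joined sequence are hypothetical and not taken from EDTA. The index it returns records where each original contig sits, in case positions ever need to be mapped back.

```python
def join_contigs(records, spacer_len=100, max_len=50_000_000, prefix="joined"):
    """Concatenate (name, seq) records into longer sequences split by N runs.

    Returns (joined, index). joined is a list of (name, sequence) pairs;
    index is a list of (joined_name, start, end, original_name) tuples with
    1-based inclusive coordinates. max_len is an illustrative cap only.
    """
    spacer = "N" * spacer_len
    joined, index = [], []
    parts, length = [], 0

    def flush():
        # Emit the sequences collected so far as one joined record.
        nonlocal parts, length
        if parts:
            joined.append((f"{prefix}_{len(joined) + 1}", spacer.join(parts)))
        parts, length = [], 0

    for name, seq in records:
        if parts and length + spacer_len + len(seq) > max_len:
            flush()
        # Offset of this contig inside the joined sequence being built.
        offset = length + spacer_len if parts else 0
        joined_name = f"{prefix}_{len(joined) + 1}"
        index.append((joined_name, offset + 1, offset + len(seq), name))
        parts.append(seq)
        length = offset + len(seq)
    flush()
    return joined, index
```

Because the resulting library is later used to annotate the original genome, the joined file only needs to serve the library-building run.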

qdu-beep · 2023-12-27T08:18:10Z

Hello Shujun, if I want to obtain a soft-masked genome sequence in which only the LTR regions are masked, should I follow the steps below? I would greatly appreciate your response. Thank you very much.
run EDTA_raw.pl
run make_masked.pl with the EDTA.anno/*EDTA.TEanno.out file
e.g.
perl ../util/make_masked.pl -genome genome.fa -minlen 80 -hardmask 0 -t 2 -rmout genome.fa.mod.EDTA.anno/genome.fa.mod.EDTA.TEanno.out
(#166 (comment))

oushujun · 2023-12-27T18:03:20Z

@qdu-beep You may just run LTR_retriever, then pass the .out file it generates to the make_masked.pl script.

qdu-beep · 2023-12-28T02:32:36Z

@oushujun You have resolved my confusion. Thank you very much for your reply, and please allow me to ask one more question.

Should I use EDTA_raw.pl with --type ltr to obtain the raw LTR library, then use EDTA.pl with --overwrite 0 to further annotate and filter based on the results of EDTA_raw.pl, and finally use make_masked.pl for soft masking?

I came across the following description on the website, but I'm not sure if my understanding is correct. Thank you for your assistance!
"Users may run EDTA_raw.pl for each TE type with --threads 1, then run EDTA.pl with multi threads and --overwrite 0" (https://github.com/oushujun/EDTA/releases)

oushujun added the question Further information is requested label Mar 9, 2020

oushujun mentioned this issue Mar 9, 2020

Identifying TIR uses only one CPU #55

Closed

oushujun mentioned this issue Jun 7, 2020

Expected Memory Usage #87

Closed

oushujun mentioned this issue Nov 3, 2020

Does the EDTA cannot process a genome containing too many sequences? #129

Closed

oushujun changed the title ~~Splitting by chromosome~~ How to run EDTA in large genomes (>10Gb)? Nov 3, 2020

oushujun pinned this issue Nov 3, 2020

oushujun closed this as completed Apr 19, 2021

oliviamr mentioned this issue Jun 19, 2021

with TIR/Sola2 TE_Sorter chokes? #178

Closed

qdu-beep mentioned this issue Jan 2, 2024

The use of EDTA_raw.pl and EDTA.pl #415

Closed
