How to run EDTA in large genomes (>10Gb)? #61

Closed
benedictcoombes opened this issue Feb 24, 2020 · 20 comments
Labels
question Further information is requested

Comments

@benedictcoombes

Hi Shujun,

Thank you for producing this tool, I've had nice preliminary results with it testing on smaller genomes.
I now want to use it for a very large plant genome. I know I can divide the run by LTR/TIR/Helitron and combine the results later. Can I also split by chromosome and run each in parallel, producing a raw library of each repeat type for each chromosome, and then combine them for the rest of the pipeline? Or will this cause any problems?

Many thanks,

Ben.

@oushujun
Owner

Hi Ben,

You may, but this is not recommended: the filtering step uses whole-genome information to decide whether a candidate TE is true or false, so splitting introduces some small-sample bias. You would also need a customized pipeline to combine the different libraries. You may want to read more in thread #38.

I've run EDTA on the wheat genome and it took about a month, given that everything ran smoothly. The trick was to use simple sequence names and guarantee exclusive resources (CPU and large memory). Anything that interrupts the jobs slows things down the most.

Best,
Shujun
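
The "simple sequence names" tip can be applied before the run with a small renaming step. The following is a minimal Python sketch, not part of EDTA: the function name `simplify_fasta_names` and the `chr` prefix are hypothetical, and what counts as "simple enough" for EDTA is not specified in this thread.

```python
import io

def simplify_fasta_names(src, dst, prefix="seq"):
    """Rewrite FASTA headers as prefix1, prefix2, ... and return {new: old}."""
    mapping = {}
    count = 0
    for line in src:
        if line.startswith(">"):
            count += 1
            new_name = f"{prefix}{count}"
            # Remember the original header so annotations can be mapped back.
            mapping[new_name] = line[1:].strip()
            dst.write(f">{new_name}\n")
        else:
            dst.write(line)
    return mapping

# Toy input; a real run would pass open file handles instead.
demo = io.StringIO(">chr1A IWGSC len=10\nACGT\nACGT\n>chrUn|scaffold_7 extra\nGGCC\n")
out = io.StringIO()
name_map = simplify_fasta_names(demo, out, prefix="chr")
print(out.getvalue(), end="")
print(name_map)
```

Keeping the returned mapping makes it possible to restore the original names in downstream annotation files.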

@benedictcoombes
Author

Hi Shujun,

Thanks for getting back to me. I wondered if there would be bias from whole-genome comparisons but didn't know whether that took place before or after you resume the run from the raw LTR, TIR, and helitron runs.

It is wheat that I'll be running, so it's good to hear that it worked for you. Do you happen to remember how many CPUs you used for the one-month run? And do you know roughly what the peak memory consumption was, to give me an idea of how much to allocate? I will likely allocate 32 CPUs and up to about 1 TB of memory.

Many thanks,

Ben.

@oushujun
Owner

oushujun commented Feb 24, 2020

Hi Ben,

Splitting the genome will introduce bias in the raw LTR element scan; the other raw scans are fine. Bias will also be present in the filtering step.

This is the resource consumption for the wheat genome (IWGSC2.0, ~15Gb) EDTA (v1.6.4) run:

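| Step     | CPU* | Mem (GB) | Time (days) |
|----------|------|----------|-------------|
| LTR      | 36   | 59       | 8.13**      |
| TIR      | 36   | 296      | 26.08***    |
| Helitron | 36   | 118      | 5.42        |
| Post_raw | 36   | 249      | 12.04       |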

*Two 18-Core Intel Skylake 6140 (2.30 GHz)
**This is an underestimate because the LTR candidates were pre-generated using LTR_FINDER_parallel (<2hr) and LTRharvest (???days), and I no longer have their records.
***GenericRepeatFinder took about 19 days to generate the TIR candidates and TIR-Learner took about 7 days to process.

Hope these help.

Shujun
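
As a quick sanity check, the figures in the table above can be combined to estimate peak memory and total run time. This is only a sketch in Python: the "parallel" figure assumes the three raw scans run concurrently on separate nodes before Post_raw starts, which this thread does not state, and the LTR time is flagged above as an underestimate.

```python
# Figures copied from the table above: (Mem column, Time column in days).
runs = {
    "LTR": (59, 8.13),
    "TIR": (296, 26.08),
    "Helitron": (118, 5.42),
    "Post_raw": (249, 12.04),
}
raw_steps = ["LTR", "TIR", "Helitron"]

# The largest single-step memory footprint sets the node size needed.
peak_mem = max(mem for mem, _ in runs.values())

# Running every step back to back on one node.
serial_days = sum(days for _, days in runs.values())

# Assumption: the raw scans run at the same time, then Post_raw follows.
parallel_days = max(runs[s][1] for s in raw_steps) + runs["Post_raw"][1]

print(f"peak memory: {peak_mem}")
print(f"serial total: {serial_days:.2f} days")
print(f"raw scans in parallel: {parallel_days:.2f} days")
```

Under these assumptions the TIR step dominates both memory and time, so it is the step to size the allocation around.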

@oushujun added the "question" label (Further information is requested) on Mar 9, 2020
@KristinaGagalova

Hi,
Is there a chance that TIR-Learner could be modified to run on GPUs? That would definitely speed up the process.

@oushujun
Owner

@KristinaGagalova Two big parts consume the majority of the run time in TIR-Learner. One is finding the raw TIRs using GenericRepeatFinder, and the other is the filtering algorithm of TIR-Learner. Potentially both could be accelerated with GPUs, but unfortunately this is outside our skill set. I am open to collaborations and contributions if you or anyone else is interested.

@oushujun
Owner

oushujun commented Nov 3, 2020

For large genomes, a single node may not provide enough memory to run TIR-Learner and RepeatMasker. In such cases (e.g., #129), you can:

  1. split the genome into sufficiently large portions (e.g., 2GB per file)
  2. run multiple independent jobs of EDTA
  3. use ./EDTA/util/make_panTElib.pl to consolidate all libraries.
  4. use the consolidated library to RepeatMask each split file and finish each EDTA job with --anno 1 --rmout split_file_[1:i].fa.out
  5. concatenate and sort the resulting gff3 files together.
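
Steps 1 and 5 above are plain file handling and can be sketched in Python. This is an illustrative sketch under stated assumptions, not an EDTA utility: `plan_split` and `merge_gff3` are hypothetical names, sequences are kept whole rather than cut (the thread does not say whether chromosomes may be broken), and the merge drops all comment lines and does a naive sort that ignores any parent/child grouping in the gff3.

```python
def plan_split(seq_lengths, max_bp):
    """Group whole sequences, in input order, into parts of at most max_bp.

    A sequence longer than max_bp gets a part of its own rather than being cut.
    """
    parts, current, current_bp = [], [], 0
    for name, length in seq_lengths:
        if current and current_bp + length > max_bp:
            parts.append(current)
            current, current_bp = [], 0
        current.append(name)
        current_bp += length
    if current:
        parts.append(current)
    return parts

def merge_gff3(texts):
    """Concatenate GFF3 texts and sort records by (seqid, start, end)."""
    records = []
    for text in texts:
        for line in text.splitlines():
            if not line or line.startswith("#"):
                continue  # comments and per-file headers are dropped
            fields = line.split("\t")
            records.append((fields[0], int(fields[3]), int(fields[4]), line))
    # seqid is sorted as plain text, so chr10 comes before chr2.
    records.sort(key=lambda r: (r[0], r[1], r[2]))
    return "\n".join(["##gff-version 3"] + [r[3] for r in records]) + "\n"

# Toy lengths; for a real genome max_bp would be around 2_000_000_000.
lengths = [("chr1", 900), ("chr2", 800), ("chr3", 700), ("chr4", 2500), ("chr5", 300)]
parts = plan_split(lengths, max_bp=2000)
print(parts)

part_a = "##gff-version 3\nchr2\tEDTA\tGypsy_LTR_retrotransposon\t500\t900\t.\t+\t.\tID=TE_b\n"
part_b = (
    "##gff-version 3\n"
    "chr1\tEDTA\thelitron\t300\t450\t.\t-\t.\tID=TE_c\n"
    "chr1\tEDTA\tCACTA_TIR_transposon\t10\t200\t.\t+\t.\tID=TE_a\n"
)
merged = merge_gff3([part_a, part_b])
print(merged, end="")
```

Because a FASTA file stores roughly one byte per base, a base-pair budget is a reasonable stand-in for the per-file size mentioned in step 1.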

@oushujun changed the title from "Splitting by chromosome" to "How to run EDTA in large genomes (>10Gb)?" on Nov 3, 2020
@oushujun pinned this issue on Nov 3, 2020
@oushujun
Owner

There are some good discussions in #175 as well.

@kataksk

kataksk commented Aug 22, 2021

Hi, thank you very much for your kind instructions.

Following your advice in the comment linked below, I finally obtained the concatenated gff3 files from a large genome (>3Gbp).
#61 (comment)

Is there a way to convert the gff3 (or possibly the .out file) to the tbl format produced by RepeatMasker, or to EDTA's final output format?

@oushujun
Owner

oushujun commented Aug 22, 2021 via email

@ncnlll

ncnlll commented Oct 13, 2022

Hi Shujun,
I'm trying to create a TE library for a large amphibian genome following the instructions mentioned above:
"1. split the genome into sufficiently large portions (e.g., 2GB per file)
2. run multiple independent jobs of EDTA
3. use ./EDTA/util/make_panTElib.pl to consolidate all libraries.
4. use the consolidated library to RepeatMask each split file and finish each EDTA job with --anno 1 --rmout split_file_[1:i].fa.out
5. concatenate and sort the resulting gff3 files together"

I followed steps 1 to 3 and obtained the consolidated TE library, but I don't need the gff3 file for the pipeline I will use. Can I therefore use this consolidated TE library as the final TE library to run RepeatMasker outside of EDTA on the entire genome? In other words, are steps 4 and 5 only needed to get the gff3 file, or do they also change the consolidated TE library?

Thank you
Lorena

@oushujun
Owner

oushujun commented Oct 22, 2022 via email

@SC-Duan

SC-Duan commented May 29, 2023

Hi Shujun, grf-main uses only 4 threads in the TIR step, and my genome has 100k contigs. Since I have enough nodes and threads to run jobs in parallel, could I run the TIR step on one contig at a time with 4 threads each, and then use make_panTElib.pl to consolidate all the libraries? Would that work? Thank you!

@oushujun
Owner

Hello @SC-Duan,

Sorry for the delay. You may be experiencing the slowdown of TIR-Learner caused by a large number of contigs, which is a known issue (#308). You may filter out the small contigs to accelerate this step. Unless your genome is very big (i.e., >5 Gb), I suggest sticking with a single EDTA run.

Thanks,
Shujun
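
Filtering out small contigs, as suggested above, can be done with a short script before running EDTA. This is a minimal Python sketch: `read_fasta` and `drop_short_contigs` are hypothetical helper names, and the thread gives no recommended length cutoff, so `min_len` below is a toy value chosen only for the demo.

```python
import io

def read_fasta(handle):
    """Yield (header, sequence) pairs from a FASTA file handle."""
    header, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

def drop_short_contigs(handle, min_len):
    """Keep only contigs whose sequence is at least min_len bases long."""
    return [(h, s) for h, s in read_fasta(handle) if len(s) >= min_len]

# Toy input: ctg1 is 12 bp, ctg2 is 3 bp, ctg3 is 8 bp.
demo = io.StringIO(">ctg1\nACGTACGT\nACGT\n>ctg2\nACG\n>ctg3\nGGGGCCCC\n")
kept = drop_short_contigs(demo, min_len=8)
print([name for name, _ in kept])
```

Reading through a generator keeps memory use proportional to one contig at a time until the final list is built, which matters for assemblies with very many contigs.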

@MiaCLM

MiaCLM commented Sep 19, 2023

Dear Shujun,
Thank you for providing this amazing pipeline! I recently utilized EDTA to annotate TEs. However, I encountered some issues with the analysis speed and file size.

The input file I used is not a typical genome assembly: it contains numerous gap sequences and totals around 5 million sequences. The file size is approximately 1.5 GB, and the analysis was extremely slow. Reviewing your response in #61, I noticed that the LTR analysis of the wheat genome took around 8 days to complete. In my case, with 40 threads and no explicit memory allocation, the LTR analysis has already been running for about 19 days. So far I can only see the harvest folder, and I suspect something may be wrong with the job.

Could these issues be related to the number of sequences, or perhaps to an abundance of repeats within the sequences? To investigate further, I split off a small portion of the file for testing. This smaller file was about 37 MB in size and contained 11,017 sequences spanning approximately 38,691,979 bp. Surprisingly, this task took around 15 hours to complete. Is such a processing time normal?

I am also considering splitting the larger file (1.5 GB) into five parts where each part would have an approximate size of 300 MB. However, I am unsure if this approach will yield satisfactory results or if you have any alternative suggestions. I eagerly await your prompt response.

Best regards,
MiaCai

@oushujun
Owner

oushujun commented Sep 19, 2023 via email

@MiaCLM

MiaCLM commented Sep 20, 2023

Hi Shujun,

Thank you for your prompt response! I will split the file into smaller parts and merge the contigs. When I merge the contigs, can I simply add 5-10 Ns to prevent mis-joining elements? You mentioned adding 100 Ns, but that would increase the file size, which is not ideal for me. Also, I am specifically interested in retrotransposon annotation. Can I use LTR_Retriever directly to annotate them? If I run EDTA, which data should I use after completing the LTR part without finishing the entire task? Apologies for these questions, as I am new to bioinformatics. Installing the multiple programs needed for LTR_Retriever will be time-consuming, so I'm wondering whether EDTA would let me obtain the LTR results quickly. Thanks a lot.

Best,
MiaCai

@oushujun
Owner

100N will be better because 5-10N could be too short to distinguish different elements. You can run EDTA_raw.pl -type ltr to just run LTR_retriever.

Shujun
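
Joining contigs with 100-N spacers can be sketched as follows. This is an assumption-laden Python illustration rather than an EDTA script: `join_contigs` is a hypothetical name, and the returned offsets (0-based, half-open) are an addition meant to help map annotations back to the original contigs.

```python
def join_contigs(contigs, spacer_len=100):
    """Concatenate contigs with runs of N between them.

    Returns the merged sequence and {name: (start, end)} offsets,
    0-based and half-open, for each contig inside the merged sequence.
    """
    spacer = "N" * spacer_len
    offsets = {}
    pieces = []
    pos = 0
    for i, (name, seq) in enumerate(contigs):
        if i > 0:
            # Spacers go only between contigs, never at the ends.
            pieces.append(spacer)
            pos += spacer_len
        offsets[name] = (pos, pos + len(seq))
        pieces.append(seq)
        pos += len(seq)
    return "".join(pieces), offsets

contigs = [("ctg1", "ACGTACGT"), ("ctg2", "GGCC"), ("ctg3", "TTTAAA")]
merged, offsets = join_contigs(contigs)
print(len(merged))
print(offsets)
```

Joining n contigs this way adds (n - 1) × 100 bases of N, so the size overhead grows with the number of contigs rather than with their lengths.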

@qdu-beep

> 100N will be better because 5-10N could be too short to distinguish different elements. You can run EDTA_raw.pl -type ltr to just run LTR_retriever.
>
> Shujun

Hello Shujun, if I want a soft-masked genome sequence with only the LTR regions masked, should I follow the steps below? I would greatly appreciate your response. Thank you very much.
  1. run EDTA_raw.pl
  2. run make_masked.pl with the EDTA.anno/*EDTA.TEanno.out file, e.g.:
perl ../util/make_masked.pl -genome genome.fa -minlen 80 -hardmask 0 -t 2 -rmout genome.fa.mod.EDTA.anno/genome.fa.mod.EDTA.TEanno.out
(#166 (comment))

@oushujun
Owner

@qdu-beep You may just run LTR_retriever, then use the .out file generated with the make_masked.pl script.
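
For readers unfamiliar with soft masking, the idea is to lowercase the annotated regions instead of replacing them with N. The Python sketch below only illustrates the concept and is not how make_masked.pl is implemented: `soft_mask` is a hypothetical name, intervals are taken as 1-based inclusive (as in RepeatMasker .out files), and treating the minimum length as a per-interval cutoff is an assumption.

```python
def soft_mask(seq, intervals, min_len=80):
    """Lowercase 1-based inclusive intervals that are at least min_len long."""
    chars = list(seq)
    for start, end in intervals:
        if end - start + 1 < min_len:
            continue  # too short to mask under the assumed cutoff
        # Convert to 0-based indices and stay within the sequence.
        for i in range(start - 1, min(end, len(chars))):
            chars[i] = chars[i].lower()
    return "".join(chars)

# Toy example with a small cutoff so the effect is visible.
masked = soft_mask("ACGTACGTACGT", [(2, 5), (9, 9)], min_len=3)
print(masked)  # AcgtaCGTACGT
```

Hard masking would differ only in writing N instead of the lowercase base.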

@qdu-beep

qdu-beep commented Dec 28, 2023

> @qdu-beep You may just run LTR_retriever, then use the .out file generated with the make_masked.pl script.

@oushujun You have resolved my confusion. Thank you very much for your reply, and please allow me to ask one more question.

Should I use EDTA_raw.pl with -type ltr to obtain the raw LTR library? And then, should I use EDTA.pl with --overwrite 0 to further annotate and filter based on the results of EDTA_raw.pl? Finally, I would use make_masked.pl for soft masking.
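
The sequence of commands being asked about can be written out as argument lists, which makes it easier to submit each raw scan as its own job. This Python sketch only builds the commands and does not run them; `edta_commands` is a hypothetical helper, the script paths and thread counts are placeholders, and the flag is spelled `--type` on the assumption that it follows the same double-dash style as `--overwrite` and `--threads`. Whether this is the recommended route for LTR-only masking is not confirmed in this thread.

```python
def edta_commands(genome, raw_types=("ltr", "tir", "helitron"),
                  raw_threads=1, final_threads=16):
    """Build argv lists for per-type raw scans and the final EDTA.pl run."""
    raw = [
        ["perl", "EDTA_raw.pl", "--genome", genome,
         "--type", te_type, "--threads", str(raw_threads)]
        for te_type in raw_types
    ]
    # --overwrite 0 asks EDTA.pl to reuse existing raw results.
    final = ["perl", "EDTA.pl", "--genome", genome,
             "--overwrite", "0", "--threads", str(final_threads)]
    return raw, final

raw_cmds, final_cmd = edta_commands("genome.fa", raw_types=("ltr",))
for cmd in raw_cmds + [final_cmd]:
    print(" ".join(cmd))
```

Each list can be passed to `subprocess.run` or written into a scheduler script once the paths are adjusted to the local EDTA installation.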

I came across the following description on the website, but I'm not sure if my understanding is correct. Thank you for your assistance!
"Users may run EDTA_raw.pl for each TE type with --threads 1, then run EDTA.pl with multi threads and --overwrite 0"(https://github.com/oushujun/EDTA/releases)

8 participants