
speed running #19

Open
koujiaodahan opened this issue Sep 18, 2020 · 14 comments
@koujiaodahan

Hi,
I'm running the software to assemble the human genome. It has been running for a day now and is still going, so how can I speed it up? Generally speaking, how much memory does each thread use? If I have sufficient memory, can I set a bigger thread count? My machine has 64 cores and 500 GB of memory. Here is my script:
~/backup_data/anaconda3/haslr/bin/haslr.py -t 8 -o ~/USER/lizhichao/Assembly/outdir/Assemblyoutput -g 3g -l ~/USER/lizhichao/Assembly/outdir/fastq/NA24385_ONT.fastq.gz -x nanopore -s ~/USER/lizhichao/Assembly/outdir/fastq/NA24385_T7.clean_1.fq.gz ~/USER/lizhichao/Assembly/outdir/fastq/NA24385_T7.clean_2.fq.gz &&
echo "haslr finished"

@jelber2

jelber2 commented Sep 18, 2020

change -t 8 to -t 64 perhaps
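That would look like this: the same invocation as in the first post with only -t raised, written to a new output directory (the directory name here is just a placeholder; keep in mind more threads generally also mean more memory use):

```shell
# Original command with -t raised from 8 to 64; output dir renamed so the
# old run is not overwritten (directory name is a placeholder).
~/backup_data/anaconda3/haslr/bin/haslr.py -t 64 \
    -o ~/USER/lizhichao/Assembly/outdir/Assemblyoutput_t64 \
    -g 3g \
    -l ~/USER/lizhichao/Assembly/outdir/fastq/NA24385_ONT.fastq.gz \
    -x nanopore \
    -s ~/USER/lizhichao/Assembly/outdir/fastq/NA24385_T7.clean_1.fq.gz \
       ~/USER/lizhichao/Assembly/outdir/fastq/NA24385_T7.clean_2.fq.gz \
  && echo "haslr finished"
```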

@koujiaodahan
Author

Thanks, I started a new run with 55 threads without killing the 8-thread process. How long do you think both runs will take?

@jelber2

jelber2 commented Sep 18, 2020

Did you change the output directory? I have no idea how long it might take; it depends on the coverage of the long and short reads.

@koujiaodahan
Author

Sure, I set a new output dir.

@koujiaodahan
Author

It has been running Minia for over 24 hours now; is that normal?

@jelber2

jelber2 commented Sep 22, 2020

Minia is very fast, but genome size and coverage influence its runtime, as do, probably, the choice of k-mer length and other similar settings.

@koujiaodahan
Author

So, are there any recommended parameters for running a human genome assembly?

@jelber2

jelber2 commented Sep 23, 2020

Are you trying out the assembler with someone else's data, or do you have a new human genome that you would like to assemble with your own data? I would have thought it would have finished by now (~5 days running). Again, you haven't specified the coverage of the Illumina or the Oxford Nanopore data you are using. You can also read the paper describing HASLR for more information on the program.

@koujiaodahan
Author

Sorry, I'm trying to assemble a human genome. The coverage of both short reads and long reads is 120x.

@jelber2

jelber2 commented Sep 23, 2020

I would recommend you try either GraphAligner (https://github.com/maickrau/GraphAligner) or Ratatosk (https://github.com/DecodeGenetics/Ratatosk) to error-correct your Nanopore reads with your Illumina reads, then assemble with Flye (https://github.com/fenderglass/Flye) using the --nano-corr option. Ratatosk even has a faster reference-based method for correcting the reads (I haven't used this method, so I don't know the details). For Flye, I really don't think you need 120x Nanopore coverage, especially if you can correct the reads. See here for running Ratatosk or here for running GraphAligner.
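A rough sketch of that correct-then-assemble route, with the file names from the first post as placeholders. The Ratatosk flags here are from memory, so verify them against each tool's --help before running:

```shell
# 1) Error-correct the Nanopore reads with the Illumina reads (Ratatosk).
#    -s takes short reads, -l long reads; flag names are assumptions.
Ratatosk correct -c 64 \
    -s NA24385_T7.clean_1.fq.gz -s NA24385_T7.clean_2.fq.gz \
    -l NA24385_ONT.fastq.gz \
    -o NA24385_ONT.corrected

# 2) Assemble the corrected long reads with Flye's --nano-corr mode.
flye --nano-corr NA24385_ONT.corrected.fastq \
    --genome-size 3g --threads 64 --out-dir flye_asm
```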

Edit: I guess you could use 120x Nanopore reads for a human assembly (https://github.com/fenderglass/Flye/blob/3ee5b3390a5f88c36d0869d0382c75aba3b1f5cc/README.md#flye-benchmarks), although those data come from CHM13 (a homozygous cell line). Also note the 4000 CPU hours (divide 4000 by the number of available cores to get approximately how many wall-clock hours the assembly would take).
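As a worked instance of that estimate, assuming the 64-core machine described earlier in the thread:

```shell
# 4000 CPU hours spread across 64 cores ~ wall-clock hours
echo $((4000 / 64))   # prints 62, i.e. roughly 2.5 days
```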

@koujiaodahan
Author

Thanks, jelber2.
So HASLR is not advised? Why?

@jelber2

jelber2 commented Sep 29, 2020

In my experience, HASLR will generate very good statistics (N50, etc.) for an assembly using raw long reads and accurate short reads, but the error rate (indels and substitutions) of the final assembly is similar to the error rate of the long reads, not the short reads. One can improve the error rate by correcting the long reads with the short reads and using the corrected long reads as input, but then the assembly statistics suffer. This is based on simulations, of course, and simulations are sometimes useful but can never fully capture the intricacies of real data.

@haghshenas
Collaborator

Hi @koujiaodahan and thanks for trying HASLR.
I'm surprised that Minia is taking so long to finish. In my experience, on short-read datasets from the human genome with about 40x coverage, it takes about 5 hours. Are you sure the Minia assembly was the step that took so long to finish?
If yes, one solution could be subsampling the short reads to about 40-50x coverage. You can use the fastutils command that comes with HASLR for that purpose. Assuming you have a paired-end dataset, you would do the following:

fastutils interleave -q sr_1.fastq sr_2.fastq | fastutils subsample -q -g 3g -d 40 > sr_40x.fastq
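If HASLR's -s option accepts a single interleaved FASTQ (worth confirming in the HASLR README), the run could then be restarted on the subsampled reads, e.g. (output directory name is a placeholder):

```shell
# Restart HASLR on the 40x subsample produced by fastutils above.
~/backup_data/anaconda3/haslr/bin/haslr.py -t 64 \
    -o ~/USER/lizhichao/Assembly/outdir/Assemblyoutput_40x \
    -g 3g -x nanopore \
    -l ~/USER/lizhichao/Assembly/outdir/fastq/NA24385_ONT.fastq.gz \
    -s sr_40x.fastq
```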

With regard to the error rate of the final assembly raised by @jelber2: if you eventually want to polish your assembly, our results show that polished HASLR assemblies are as accurate as polished assemblies from other tools.

@koujiaodahan
Author

Yeah, I agree that the coverage is too high, so I downsampled, and I got an error, which I reported in #20.
Also, I want to know: does your polishing method mean running wtdbg2.pl after running haslr?
