
speed running #19

Open
koujiaodahan opened this issue Sep 18, 2020 · 14 comments
@koujiaodahan

Hi,
I'm running the software to assemble the human genome. It has been running for a day now and is still going, so how can I speed it up? Generally speaking, how much memory does each thread use? If I have sufficient memory, can I set a bigger thread count? My machine has 64 cores and 500 GB of memory. Here is my script:
~/backup_data/anaconda3/haslr/bin/haslr.py -t 8 -o ~/USER/lizhichao/Assembly/outdir/Assemblyoutput -g 3g -l ~/USER/lizhichao/Assembly/outdir/fastq/NA24385_ONT.fastq.gz -x nanopore -s ~/USER/lizhichao/Assembly/outdir/fastq/NA24385_T7.clean_1.fq.gz ~/USER/lizhichao/Assembly/outdir/fastq/NA24385_T7.clean_2.fq.gz &&
echo "haslr finished"

@jelber2

jelber2 commented Sep 18, 2020

change -t 8 to -t 64 perhaps
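That would look like this: the same invocation as in the first post with only -t raised, written to a new output directory (the directory name here is just a placeholder; keep in mind more threads generally also mean more memory use):

```shell
# Original command with -t raised from 8 to 64; output dir renamed so the
# old run is not overwritten (directory name is a placeholder).
~/backup_data/anaconda3/haslr/bin/haslr.py -t 64 \
    -o ~/USER/lizhichao/Assembly/outdir/Assemblyoutput_t64 \
    -g 3g \
    -l ~/USER/lizhichao/Assembly/outdir/fastq/NA24385_ONT.fastq.gz \
    -x nanopore \
    -s ~/USER/lizhichao/Assembly/outdir/fastq/NA24385_T7.clean_1.fq.gz \
       ~/USER/lizhichao/Assembly/outdir/fastq/NA24385_T7.clean_2.fq.gz \
  && echo "haslr finished"
```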

@koujiaodahan
Author

Thanks, I started a new run with 55 threads without killing the 8-thread process. How long do you think both runs will take?

@jelber2

jelber2 commented Sep 18, 2020

Did you change the output directory? I have no idea how long it might take; it depends on the coverage of the long and short reads.

@koujiaodahan
Author

Sure, I set a new output dir.

@koujiaodahan
Author

It has been running Minia for over 24 hours now; is that normal?

@jelber2

jelber2 commented Sep 22, 2020

Minia is very fast, but genome size and coverage influence its runtime, as do, probably, the choice of k-mer length and other similar settings.

@koujiaodahan
Author

So, are there any recommended parameters for running a human genome assembly?

@jelber2

jelber2 commented Sep 23, 2020

Are you trying out the assembler with someone else's data, or do you have a new human genome that you would like to assemble with your own data? I would have thought it would have finished by now (~5 days running). Again, you haven't specified the coverage of the Illumina or the Oxford Nanopore data you are using. You can also read the paper describing HASLR for more information on the program.

@koujiaodahan
Author

Sorry, I'm trying to assemble a human genome. The coverage of both short reads and long reads is 120x.

@jelber2

jelber2 commented Sep 23, 2020

I would recommend you try either GraphAligner (https://github.com/maickrau/GraphAligner) or Ratatosk (https://github.com/DecodeGenetics/Ratatosk) to error-correct your Nanopore reads with your Illumina reads, then assemble with Flye (https://github.com/fenderglass/Flye) using the --nano-corr option. Ratatosk even has a faster reference-based method for correcting the reads (I haven't used this method, so I don't know the details). For Flye, I really don't think you need 120x Nanopore coverage, especially if you can correct the reads. See here for running Ratatosk or here for running GraphAligner.
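A rough sketch of that correct-then-assemble route, with the file names from the first post as placeholders. The Ratatosk flags here are from memory, so verify them against each tool's --help before running:

```shell
# 1) Error-correct the Nanopore reads with the Illumina reads (Ratatosk).
#    -s takes short reads, -l long reads; flag names are assumptions.
Ratatosk correct -c 64 \
    -s NA24385_T7.clean_1.fq.gz -s NA24385_T7.clean_2.fq.gz \
    -l NA24385_ONT.fastq.gz \
    -o NA24385_ONT.corrected

# 2) Assemble the corrected long reads with Flye's --nano-corr mode.
flye --nano-corr NA24385_ONT.corrected.fastq \
    --genome-size 3g --threads 64 --out-dir flye_asm
```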

Edit: I guess you could use 120x Nanopore reads for a human assembly (https://github.com/fenderglass/Flye/blob/3ee5b3390a5f88c36d0869d0382c75aba3b1f5cc/README.md#flye-benchmarks), although those data come from CHM13 (a homozygous cell line). Also note the 4000 CPU hours (divide 4000 by the number of available cores to get approximately how many wall-clock hours the assembly would take).
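As a worked instance of that estimate, assuming the 64-core machine described earlier in the thread:

```shell
# 4000 CPU hours spread across 64 cores ~ wall-clock hours
echo $((4000 / 64))   # prints 62, i.e. roughly 2.5 days
```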

@koujiaodahan
Author

Thanks, jelber2.
So HASLR is not advised? Why?

@jelber2

jelber2 commented Sep 29, 2020

In my experience, HASLR will generate very good statistics (N50, etc.) for an assembly using raw long reads and accurate short reads, but the error rate (indels and substitutions) of the final assembly is similar to the error rate of the long reads, not the short reads. One can improve the error rate by correcting the long reads with the short reads and using the corrected long reads as input, but then the assembly statistics suffer. This is based on simulations, of course, and simulations are sometimes useful but can never fully capture the intricacies of real data.

@haghshenas
Collaborator

Hi @koujiaodahan and thanks for trying HASLR.
I'm surprised that Minia is taking so long to finish. In my experience, on short-read datasets from the human genome with about 40x coverage, it takes about 5 hours. Are you sure the Minia assembly was the step that took so long to finish?
If yes, one solution could be subsampling the short reads to about 40-50x coverage. You can use the fastutils command that comes with HASLR for that purpose. Assuming you have a paired-end dataset, you would do the following:

fastutils interleave -q sr_1.fastq sr_2.fastq | fastutils subsample -q -g 3g -d 40 > sr_40x.fastq
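If HASLR's -s option accepts a single interleaved FASTQ (worth confirming in the HASLR README), the run could then be restarted on the subsampled reads, e.g. (output directory name is a placeholder):

```shell
# Restart HASLR on the 40x subsample produced by fastutils above.
~/backup_data/anaconda3/haslr/bin/haslr.py -t 64 \
    -o ~/USER/lizhichao/Assembly/outdir/Assemblyoutput_40x \
    -g 3g -x nanopore \
    -l ~/USER/lizhichao/Assembly/outdir/fastq/NA24385_ONT.fastq.gz \
    -s sr_40x.fastq
```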

With regard to the error rate of the final assembly raised by @jelber2: if you eventually want to polish your assembly, our results show that polished HASLR assemblies are as accurate as polished assemblies from other tools.

@koujiaodahan
Author

Yeah, I agree that the coverage is too high, so I downsampled, and I got an error, which I reported in #20.
Also, I want to know: does your polishing method mean running wtdbg2.pl after running haslr?
