Long runtime - choosing correct parameters #229

Open
ctxchris opened this issue Feb 14, 2021 · 5 comments

@ctxchris

Hi,

I'm using wtdbg2 v2.5 on a 3 Gb genome with about 60x PacBio Sequel CLR data and chose -x preset3 for large genomes. K-mer counting finished in 90 minutes with 60 threads. The overlap stage has now been running for eight days and has still not finished, which seems long compared to the runtimes others report for similar genome sizes and data. Would using -x sq speed things up, or do you have recommendations on which parameter set to use?

Thanks,
Chris

@ruanjue
Owner

ruanjue commented Feb 14, 2021

Have a look at the "Quatiles" section of the log, which looks like this:

Quatiles:
   10%   20%   30%   40%   50%   60%   70%   80%   90%   95%
    16    21    29    72   268   972  3329 11697 42939 65535

If you find that the kmers are highly repetitive, please set -K 2000 to speed up the alignment.

Otherwise, please paste the log message.

Jue
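
For anyone who wants to pull that table out of a long log automatically, here is a minimal Python sketch. It assumes the block has exactly the layout shown above (a `Quatiles:` line, then a row of percentiles, then a row of depths). The function name `parse_quatiles` is made up for illustration and is not part of wtdbg2, and the thread does not define a cutoff for "highly repetitive", so the sketch only extracts the numbers for comparison.

```python
def parse_quatiles(log_text):
    """Return {percentile: kmer depth} from the first 'Quatiles:' block in the text.

    Assumes the three-line layout shown in this thread; returns {} if not found.
    """
    lines = log_text.splitlines()
    for i, line in enumerate(lines):
        if line.strip() == "Quatiles:" and i + 2 < len(lines):
            percents = [int(p.rstrip("%")) for p in lines[i + 1].split()]
            depths = [int(d) for d in lines[i + 2].split()]
            return dict(zip(percents, depths))
    return {}


# The example block quoted in the comment above.
example = """Quatiles:
   10%   20%   30%   40%   50%   60%   70%   80%   90%   95%
    16    21    29    72   268   972  3329 11697 42939 65535
"""

quatiles = parse_quatiles(example)
for percent, depth in quatiles.items():
    print(f"{percent:>3}%  {depth}")
```

Running the same function on your own log makes it easy to compare the upper percentiles (90% and 95%) side by side with the example.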

@ctxchris
Author

The kmer distribution looks ok to me.

[Fri Feb  5 16:21:16 2021] loading reads

[Fri Feb  5 16:40:16 2021] Done, 6180880 reads (>=2000 bp), 146787898735 bp, 570312591 bins
** PROC_STAT(0) **: real 1139.711 sec, user 1011.110 sec, sys 308.470 sec, maxrss 43128776.0 kB, maxvsize 49290412.0 kB
[Fri Feb  5 16:40:16 2021] Set --edge-cov to 3
KEY PARAMETERS: -k 0 -p 19 -K 1000.049988 -A -S 2.000000 -s 0.050000 -g 3000000000 -X 50.000000 -e 3 -L 2000
[Fri Feb  5 16:40:16 2021] generating nodes, 60 threads
[Fri Feb  5 16:40:16 2021] indexing bins[(0,570312591)/570312591] (146000023296/146000023296 bp), 60 threads
[Fri Feb  5 16:40:17 2021] - scanning kmers (K0P19S2.00) from 570312591 bins
********************** Kmer Frequency **********************
                                                                                                    
                   |||||||||||                                                                      
                 ||||||||||||||||                                                                   
               ||||||||||||||||||||                                                                 
              ||||||||||||||||||||||||                                                              
           |||||||||||||||||||||||||||||                                                            
     ||||||||||||||||||||||||||||||||||||||                                                         
     |||||||||||||||||||||||||||||||||||||||||                                                      
    |||||||||||||||||||||||||||||||||||||||||||||                                                   
    ||||||||||||||||||||||||||||||||||||||||||||||||||                                              
   |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||                        ||||||||||||||
   |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
   |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
  ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
  ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
  ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
  ||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
 |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
**********************     1 - 201    **********************
Quatiles:
   10%   20%   30%   40%   50%   60%   70%   80%   90%   95%
    52   101   179   253   331   433   597   997  3791 16039
** PROC_STAT(0) **: real 2187.698 sec, user 51935.190 sec, sys 1418.630 sec, maxrss 52014648.0 kB, maxvsize 62463332.0 kB
[Fri Feb  5 16:57:44 2021] - high frequency kmer depth is set to 16534
[Fri Feb  5 16:57:45 2021] - Total kmers = 386006344
[Fri Feb  5 16:57:45 2021] - average kmer depth = 112
[Fri Feb  5 16:57:45 2021] - 3606933 low frequency kmers (<2)
[Fri Feb  5 16:57:45 2021] - 50726 high frequency kmers (>16534)
[Fri Feb  5 16:57:45 2021] - indexing 382348685 kmers, 42871080592 instances (at most)
[Fri Feb  5 17:29:47 2021] - indexed  382348685 kmers, 42867244619 instances
[Fri Feb  5 17:29:53 2021] - masked 655153 bins as closed
[Fri Feb  5 17:29:53 2021] - sorting
** PROC_STAT(0) **: real 4191.975 sec, user 152004.400 sec, sys 5049.500 sec, maxrss 303213028.0 kB, maxvsize 313438404.0 kB
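
A few figures in this log are easier to read after a little arithmetic. The sketch below is only an illustration: it uses lines copied from the log above, the helper name `summarize_log` is hypothetical, and the regular expressions assume the exact wording of these messages. It reports the high frequency threshold, the share of distinct kmers above it, and, from the last `PROC_STAT` line, the elapsed time, the average number of busy cores (user time divided by real time), and peak memory.

```python
import re

# Lines copied verbatim from the log above.
LOG_EXCERPT = """\
[Fri Feb  5 16:57:44 2021] - high frequency kmer depth is set to 16534
[Fri Feb  5 16:57:45 2021] - Total kmers = 386006344
[Fri Feb  5 16:57:45 2021] - 50726 high frequency kmers (>16534)
** PROC_STAT(0) **: real 4191.975 sec, user 152004.400 sec, sys 5049.500 sec, maxrss 303213028.0 kB, maxvsize 313438404.0 kB
"""


def summarize_log(log_text):
    """Extract a few headline numbers from a wtdbg2 log excerpt."""
    threshold = int(re.search(r"high frequency kmer depth is set to (\d+)", log_text).group(1))
    total = int(re.search(r"Total kmers = (\d+)", log_text).group(1))
    n_high = int(re.search(r"(\d+) high frequency kmers", log_text).group(1))

    # Use the last PROC_STAT line, since the 'real' value grows through the run.
    stats = re.findall(
        r"real ([\d.]+) sec, user ([\d.]+) sec, sys [\d.]+ sec, maxrss ([\d.]+) kB",
        log_text,
    )
    real, user, maxrss_kb = (float(x) for x in stats[-1])

    return {
        "threshold": threshold,
        "high_fraction": n_high / total,
        "wall_hours": real / 3600,
        "avg_busy_cores": user / real,
        "peak_rss_gib": maxrss_kb / 1024**2,
    }


summary = summarize_log(LOG_EXCERPT)
print(f"high frequency threshold: {summary['threshold']}")
print(f"kmers above threshold:    {summary['high_fraction']:.4%}")
print(f"elapsed:                  {summary['wall_hours']:.2f} h")
print(f"average busy cores:       {summary['avg_busy_cores']:.1f}")
print(f"peak RSS:                 {summary['peak_rss_gib']:.0f} GiB")
```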

@ruanjue
Owner

ruanjue commented Feb 16, 2021

[Fri Feb  5 16:57:44 2021] - high frequency kmer depth is set to 16534

Try setting -K 2000 to speed it up.

@shri1984

shri1984 commented Mar 2, 2021

Hi,
I am assembling a large genome. I tried -K 2000 after seeing the high frequency kmer depth set to about 16K. I had 6 cells of sq CLR data, and my run went extremely fast. I wonder whether setting -K 2000 affects the quality of the final assembly in terms of total length or number of contigs?

@ruanjue
Owner

ruanjue commented Mar 3, 2021

In this case, if too many high frequency kmers are discarded, it may lead to fragmented and truncated contigs.
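
One way to check for that effect is to compare contig count, total length, and N50 between a run with the default -K and a run with -K 2000. The following is a rough Python sketch, not a wtdbg2 tool: the function names are made up for illustration, and it assumes each assembly is available as a plain (uncompressed) FASTA file of contigs.

```python
def read_contig_lengths(fasta_path):
    """Return the length of every sequence in a plain-text FASTA file."""
    lengths = []
    current = None
    with open(fasta_path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if current is not None:
                    lengths.append(current)
                current = 0
            elif current is not None:
                current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def assembly_stats(lengths):
    """Return contig count, total length, and N50 for a list of contig lengths."""
    total = sum(lengths)
    running = 0
    n50 = 0
    # N50: length of the contig at which the running sum of the
    # longest-first contigs reaches half of the total assembly length.
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            n50 = length
            break
    return {"contigs": len(lengths), "total_bp": total, "n50": n50}


# Toy lengths for demonstration; for real data, pass the result of
# read_contig_lengths() on each assembly's contig FASTA instead.
print(assembly_stats([10, 8, 6, 4, 2]))
```

A clearly higher contig count, lower N50, or shorter total length in the -K 2000 run would be consistent with the fragmentation described above.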
