
RawHash on R10.4 human genome #5

Open
hasindu2008 opened this issue May 11, 2024 · 4 comments

Labels
question (Further information is requested)

Comments

@hasindu2008

Hello

I wanted to try out RawHash on some human genome data. I am still at the first step of generating the index.
For R9.4, I could create the index using the following command without any issue.

bin/rawhash2 -d ref.ind -p extern/kmer_models/legacy/legacy_r9.4_180mv_450bps_6mer/template_median68pA.model -t 32 /genome/hg38noAlt.fa

However, for R10.4, I get an error:

bin/rawhash2 -d ref.ind -p extern/kmer_models/dna_r10.4.1_e8.2_400bps/9mer_levels_v1.txt -t 32 /genome/hg38noAlt.fa

Error: cannot convert '
' to float
free(): invalid next size (normal)
Aborted (core dumped)

Any idea what I am doing wrong?

@hasindu2008
Author

hasindu2008 commented May 11, 2024

Until I get RawHash working on the human genome for R10 data, I thought of starting with a small genome with R9 data first.
I am basically trying to evaluate the signal simulator squigulator with RawHash mapping (and vice versa).

squigulator test/nCoV-2019.reference.fasta -o nCoV-2019.blow5 -n 200000 -t32  -c out.paf --paf-ref
 rawhash2 -p extern/kmer_models/legacy/legacy_r9.4_180mv_450bps_6mer/template_median68pA.model -t 32 -x fast nCoV-2019.reference.fasta nCoV-2019.blow5  > a.paf

I am observing that rawhash2 is using very little of the CPU despite being given 32 threads. Is that expected, or is it because not every part is multi-threaded?

@canfirtina
Member

Hi @hasindu2008,

For your first message:

You need to specify --r10 when using R10 datasets:

bin/rawhash2 --r10 -d ref.ind -p extern/kmer_models/dna_r10.4.1_e8.2_400bps/9mer_levels_v1.txt -t 32 /genome/hg38noAlt.fa

Please note that RawHash2 should provide relatively good accuracy for R10.4, while for R10.4.1 the accuracy is currently not good. Please see the corresponding discussion in our recent preprint (https://arxiv.org/pdf/2309.05771v4, Section 3.5).

I will now update the main README to make it clearer how to use R10 datasets. Thanks for the feedback.

For your second message:

This is expected because of the pipelining strategy in RawHash2. Similar to minimap2, several steps are pipelined: 1) partially reading from file (as determined by -K), 2) mapping, and 3) writing. Although the IO step (either reading or writing) can be processed independently using one additional thread (if multiple threads are available with -t) to overlap the IO latency with mapping, the IO part can still take much longer than mapping. This means the CPU mostly uses a single thread over the overall execution time because it is bottlenecked by IO. This usually happens when the mapping step is very cheap (e.g., when you are working with bacterial genomes).

To avoid this, I would suggest increasing the minibatch size (-K). This can increase peak memory usage, but it enables better utilization of multiple threads; see the example below.
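For example, something like the following (the exact -K value is illustrative; -K accepts unit suffixes such as G):

rawhash2 -p extern/kmer_models/legacy/legacy_r9.4_180mv_450bps_6mer/template_median68pA.model -t 32 -x fast -K2G nCoV-2019.reference.fasta nCoV-2019.blow5 > a.paf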

@canfirtina added the question label on May 12, 2024
@hasindu2008
Author

Hi @canfirtina

Thanks for the prompt answer. I will stick to R9.4.1 for the moment then.

For the second question, I just tried with a huge batch size of 10G. However, the CPU utilisation is only 161%, whereas ideally it should have been close to 3200%.

        Command being timed: "bin/rawhash2 -p extern/kmer_models/legacy/legacy_r9.4_180mv_450bps_6mer/template_median68pA.model -t 32 -x fast /home/hasindu/hasindu2008.git/squigulator/test/nCoV-2019.reference.fasta /home/hasindu/hasindu2008.git/squigulator/nCoV-2019.blow5 -K10G"
        User time (seconds): 322.80
        System time (seconds): 40.66
        Percent of CPU this job got: 161%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 3:44.48

I do not think that this is an I/O bottleneck, as I am not using FAST5 but BLOW5. Here is what happens when I run the same BLOW5 through my benchmark programme, which reads the slow5 and converts it to pico-amperes (which I believe is the same as what you would be doing in rawhash), using the same number of threads (32):

Time for disc reading 5.289125
Time for getting samples (disc+depress+parse) 17.905813
        User time (seconds): 518.51
        System time (seconds): 26.69
        Percent of CPU this job got: 2459%
        Elapsed (wall clock) time (h:mm:ss or m:ss): 0:22.16

Only 5.3 s has been spent on actual disk I/O.
Including decompression and parsing on top of disk I/O, this becomes 17.9 seconds. The whole programme, which does all of the above plus the conversion to pA, takes 22 seconds with 2459% CPU usage.

Also, when reading through the manuscript you shared, I found this table a bit odd:

[image: file format benchmark table from the manuscript]

especially that the performance of BLOW5 is considerably slower than it should be for a case like this.
Could you link me to the two datasets you used here for E. coli and Yeast?

@hasindu2008
Author

hasindu2008 commented May 12, 2024

So, I ran the same slow5 benchmark above with a single thread and it takes 3 min 48.81 s, which is close to the RawHash time. This made me think that you are probably still using BLOW5 in a single-threaded fashion. Going through the code, I can see it is indeed single-threaded:

if ((ret = slow5_get_next(&rec, sp)) < 0) {

But it seems like for POD5, you are reading in batches:

if(pod5_get_read_batch(&batch, fp->pp, fp->cur_read) != POD5_OK){

which could be multi-threaded internally in the POD5 library.

I suggest at least using the slow5_get_next_batch function in slow5, which uses the easy multi-threaded API in SLOW5. An approach that would be even more efficient is to use the low-level slow5 functions: slow5_get_next_bytes to read raw bytes from the disk at the I/O stage, and then slow5_decode with multiple threads to do the decompression and parsing at the processing step (for example, just before event detection). A pthread example is here and an OpenMP example is here.
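To make the suggestion concrete, here is a minimal sketch of what the easy multi-threaded API could look like, based on slow5lib's slow5_mt.h interface (the exact function and struct names should be checked against the slow5lib documentation, the thread count and batch size are arbitrary, and this API requires slow5lib to be built with slow5_mt=1 as mentioned below; the pA conversion uses the usual (raw + offset) * range / digitisation formula):

#include <slow5/slow5.h>
#include <slow5/slow5_mt.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    slow5_file_t *sp = slow5_open("reads.blow5", "r"); // placeholder input file
    if (sp == NULL) { fprintf(stderr, "Error opening file\n"); exit(EXIT_FAILURE); }

    slow5_mt_t *mt = slow5_init_mt(32, sp);        // thread pool with 32 threads
    slow5_batch_t *batch = slow5_init_batch(4096); // reusable batch of up to 4096 records

    int ret = 4096;
    while (ret == 4096) {
        // read, decompress, and parse up to 4096 records using the 32 threads
        ret = slow5_get_next_batch(mt, batch, 4096);
        for (int i = 0; i < ret; i++) {
            slow5_rec_t *rec = batch->slow5_rec[i];
            for (uint64_t j = 0; j < rec->len_raw_signal; j++) {
                // convert the raw signal to pico-amperes
                double pA = (rec->raw_signal[j] + rec->offset) * rec->range / rec->digitisation;
                (void)pA; // event detection / mapping would consume this
            }
        }
    }

    slow5_free_batch(batch);
    slow5_free_mt(mt);
    slow5_close(sp);
    return 0;
}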

The other thing you should consider is the type of compression used in the file formats. FAST5 supports zlib and vbz (zstd+streamvbyte) as compression methods; the default used to be zlib, but ONT at one point made vbz the default.
BLOW5 also has zlib and zstd+streamvbyte, and the default is zlib. POD5 only has vbz (actually a SIMD-accelerated version of vbz). zlib is much more CPU-intensive than zstd. Unless you match the compression type across these formats, the comparison is not going to be sound. For instance, when I run my slow5 I/O benchmark using a single thread, given below is the time for disk reading + decompression + parsing for zlib vs vbz (zstd). You can use the -c zstd option in slow5tools during conversion (or use view to convert a zlib BLOW5 to a zstd BLOW5) to generate a zstd-based BLOW5; see the example command after the numbers below.

zlib:
Time for disc reading 4.579812
Time for getting samples (disc+depress+parse) 142.601073

zstd:
Time for disc reading 4.750786
Time for getting samples (disc+depress+parse) 55.587349
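For the conversion, something like the following should work (file names are placeholders; check slow5tools view --help for the exact flags):

slow5tools view reads.blow5 --to blow5 -c zstd -o reads_zstd.blow5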

You can add slow5_mt=1 and zstd=1 to enable multithreading as well as zstd in slow5lib, i.e., on this line:

make -C ${SLOW5_DIR}; \
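so that it becomes something like (assuming the slow5lib Makefile picks these variables up as described above):

make -C ${SLOW5_DIR} slow5_mt=1 zstd=1; \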
