RawHash on R10.4 human genome #5
Until I get rawHash working on human genome for R10 data, I thought of starting with a small genome with R9 data first.
I am observing that rawhash2 uses very little of the CPU despite being given 32 threads. Is that expected, or is it because not every part is multi-threaded?
Hi @hasindu2008, For your first message: You need to specify
Please note that RawHash2 should provide relatively good accuracy for R10.4 while for R10.4.1, the accuracy is not good currently. Please see the corresponding discussion in our recent preprint about this (https://arxiv.org/pdf/2309.05771v4 Section 3.5). I will now update the main README to make it more clear when using R10 datasets. Thanks for the feedback.
For your second message: This is expected because of the pipelining strategy in RawHash2. Similar to minimap2, several steps are pipelined: 1) partially reading from file (as determined by To avoid this, I would suggest increasing the minibatch size (
Hi @canfirtina, thanks for the prompt answer. I will stick to R9.4.1 for the moment then. For the second question, I just tried with a huge batch size of 10G. However, the CPU utilisation is only 161%, which should ideally have been close to 3200%.
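To put the 161% figure in context, here is a rough back-of-the-envelope sketch. It assumes a simple Amdahl-style model in which a fixed fraction of the single-threaded work cannot be parallelised and threading adds no overhead; that model and the function names are illustrative assumptions, not anything taken from RawHash2 itself.

```python
def utilisation(serial_fraction: float, threads: int) -> float:
    """CPU utilisation (%) under an assumed Amdahl-style model.

    Single-threaded runtime is normalised to 1; `serial_fraction` of it
    runs on one thread and the rest is spread evenly over `threads`.
    """
    wall = serial_fraction + (1.0 - serial_fraction) / threads
    return 100.0 / wall


def implied_serial_fraction(utilisation_pct: float, threads: int) -> float:
    """Invert the model: which serial fraction would give this utilisation?"""
    wall = 100.0 / utilisation_pct
    return (wall - 1.0 / threads) / (1.0 - 1.0 / threads)


threads = 32
ideal = utilisation(0.0, threads)            # fully parallel work
s = implied_serial_fraction(161.0, threads)  # observed utilisation

print(f"ideal utilisation with {threads} threads: {ideal:.0f}%")
print(f"serial fraction implied by 161%: {s:.2f}")
```

Under these assumptions, an observed 161% with 32 threads would correspond to roughly 60% of the work running on a single thread, which is consistent with a serial stage (such as file reading) dominating the pipeline.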
I do not think that this is an I/O bottleneck, as I am using a BLOW5 rather than a FAST5. Here is the same BLOW5 run through my benchmark programme, which reads the slow5 and converts it to pico-ampere (which I believe is the same as what you would be doing in rawhash), using the same number of threads (32):
Only 5.3 s has been spent on actual disk I/O. Also, when reading through the manuscript you shared, I saw this table and I find it a bit odd.
So, I ran the same slow5 benchmark as above with a single thread and it takes 3 min 48.81 sec, which is close to the time for RawHash. This made me think that you are probably still reading BLOW5 in a single-threaded fashion. Going through the code, I can see it is indeed single-threaded (Line 488 in ad5bf88).
But it seems that for POD5, you are reading them as batches (Line 460 in ad5bf88).
I suggest at least using this slow5_get_next_batch function in slow5. This method uses the easy multi-threaded API in SLOW5. The one that would be even more efficient is the use of low-level slow5 functions such as the use of
The other thing you should consider is the type of compression used in file formats. FAST5 supports zlib and vbz (zstd+streamvbyte) as compression methods. The default used to be zlib, until ONT at one point made vbz the default.
You can add Line 185 in ad5bf88
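The batched-reading idea suggested above can be illustrated with a generic sketch: records are fetched sequentially in batches while still compressed, and each batch is then decompressed by a pool of worker threads. This is not the slow5lib API; `make_records`, `read_batches`, and `decode_all` are hypothetical names, and zlib-compressed byte strings stand in for compressed raw-signal records.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def make_records(n, samples_per_record=1000):
    """Build n hypothetical compressed records (stand-ins for raw signals)."""
    return [zlib.compress(bytes([i % 256]) * samples_per_record) for i in range(n)]


def read_batches(records, batch_size):
    """Sequential stage: yield lists of up to batch_size compressed records."""
    it = iter(records)
    while True:
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield batch


def decode_all(records, batch_size=64, threads=4):
    """Parallel stage: decompress each batch with a pool of worker threads."""
    decoded = []
    with ThreadPoolExecutor(max_workers=threads) as pool:
        for batch in read_batches(records, batch_size):
            # pool.map returns results in input order, so read order is kept.
            decoded.extend(pool.map(zlib.decompress, batch))
    return decoded


records = make_records(200)
signals = decode_all(records)
print(len(signals), "records decoded")
```

The point of the design is that only the cheap step (fetching compressed bytes) stays serial, while the expensive decompression is spread over the threads. With a real reader, the batch size would be tuned so that each worker has enough records to stay busy.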
Hello
I wanted to try out RawHash on some human genome data. I am still at the first step of generating the index.
For R9.4, I could create the index using the following command without any issue.
However, for R10.4, I am getting an error.
Any idea what I am doing wrong?