Multi-threading improvements #144

Open
bricoletc opened this issue Apr 11, 2019 · 4 comments
@bricoletc (Member)
Following Quasimapping with multiple threads odd behaviour (#127), here is the speedup obtained for quasimap with different numbers of threads:

[Figure: Fold_speedup — quasimap fold speedup vs. number of threads]

The black line is the theoretical best speedup (= number of threads).
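For clarity, the fold speedup at t threads is presumably the usual ratio (this definition is not stated explicitly above):

$$\text{speedup}(t) = \frac{T_{\text{wall}}(1)}{T_{\text{wall}}(t)},$$

so with perfect scaling $T_{\text{wall}}(t) = T_{\text{wall}}(1)/t$ and the theoretical best is $\text{speedup}(t) = t$, the black line.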

Dataset: 250,000 reads (so 500,000 with the reverse complements) from /nfs/leia/research/iqbal/bletcher/Pf_benchmark/all_reads/_original_/PG0496-C.bam, mapped to the pf3k prg built from /nfs/research1/zi/bletcher/Pf_benchmark/ref_data
(yoda cluster)

Command:

bsub -R "select[mem>60000] rusage[mem=60000]" -M60000 -J threads_4 -n 4 \
  -o /nfs/leia/research/iqbal/bletcher/Pf_benchmark/tests/threads/logs/t4_k8.o \
  -e /nfs/leia/research/iqbal/bletcher/Pf_benchmark/tests/threads/logs/t4_k8.e \
  singularity exec /nfs/leia/research/iqbal/bletcher/Singularity_Images/8b46a86_gramtools.img \
  gramtools quasimap \
    --gram-dir /nfs/leia/research/iqbal/bletcher/Pf_benchmark/tests/gram_k8 \
    --run-dir /nfs/leia/research/iqbal/bletcher/Pf_benchmark/tests/threads/run_t4 \
    --reads /nfs/leia/research/iqbal/bletcher/Pf_benchmark/tests/subsetted_reads/PG0496-C.trim.fq.1.1MSubset.gz \
    --max-threads 4
Commit 8b46a86

@bricoletc (Member, Author)
@leoisl has noticed one probable bug and one possible improvement:

1. [Bug] At https://github.com/iqbal-lab-org/gramtools/blob/master/libgramtools/src/quasimap/coverage/grouped_allele_counts.cpp#L52:
   when an allele combination has not been seen before, an entry is created in the site_coverage map, assigning counts to allele 'equivalence classes' for a given variant site. If two different threads (processing two different read mappings to the graph) both try to do this at the same time, the program can crash.

Pre-allocating all entries with a count of 0 does not seem reasonable, since the possible equivalence classes of a variant site form the power set of its alleles (2^n entries for n alleles).

2. [Improvement] Scheduling. Right now each thread gets just one read at a time. We could instead give each thread a chunk of reads (e.g. 1,000), which could be more efficient than constantly handing single reads to threads. A sketch covering both points follows this list.
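To make the two points concrete, here is a minimal OpenMP sketch in C++. This is not the gramtools code: the map type, hash and function names are made up for illustration (the real code in grouped_allele_counts.cpp uses its own structures and gram::sequence_hash). It shows the shared-map update that currently races, guarded by a critical section, plus chunked dynamic scheduling instead of one read per scheduling decision:

```cpp
// Sketch only: a shared map from allele 'equivalence classes' to counts,
// updated by many threads. Names (AlleleIds, GroupCounts, count_groups) are
// hypothetical. Compile with -fopenmp.
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

using AlleleIds = std::vector<std::uint32_t>;

// Hypothetical hash for vector keys.
struct VectorHash {
    std::size_t operator()(AlleleIds const& ids) const {
        std::size_t h = 0;
        for (auto id : ids)
            h ^= std::hash<std::uint32_t>{}(id) + 0x9e3779b9 + (h << 6) + (h >> 2);
        return h;
    }
};

using GroupCounts = std::unordered_map<AlleleIds, std::uint64_t, VectorHash>;

void count_groups(std::vector<AlleleIds> const& per_read_groups, GroupCounts& coverage) {
    // Point 2: schedule(dynamic, 1000) hands each thread a chunk of ~1000 reads
    // per scheduling decision, instead of a single read.
    #pragma omp parallel for schedule(dynamic, 1000)
    for (long i = 0; i < static_cast<long>(per_read_groups.size()); ++i) {
        // Point 1: operator[] can insert and rehash, so unsynchronised concurrent
        // calls are a data race; the critical section serialises the update.
        #pragma omp critical
        {
            coverage[per_read_groups[i]] += 1;
        }
    }
}
```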

@bricoletc (Member, Author)
In response to this:

@leoisl has noticed one probable bug and one possible improvement:

1. [Bug] At https://github.com/iqbal-lab-org/gramtools/blob/master/libgramtools/src/quasimap/coverage/grouped_allele_counts.cpp#L52:
   when an allele combination has not been seen before, an entry is created in the site_coverage map, assigning counts to allele 'equivalence classes' for a given variant site. If two different threads (processing two different read mappings to the graph) both try to do this at the same time, the program can crash.

I have gotten two segfaults from ~100 separate instances of multi-threaded quasimapping to the Plasmodium prg.

Here is the gdb stack trace from the core dump of one of them (obtained from inside the singularity container /nfs/leia/research/iqbal/bletcher/Pf_benchmark/full_Cross/04ff58e_prod.img on yoda):

gdb /usr/local/lib/python3.6/dist-packages/gramtools/bin/gram core.30852

Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00005647921b3ffe in std::_Hashtable<std::vector<unsigned int, std::allocator<unsigned int> >, std::pair<std::vector<unsigned int, std::allocator<unsigned int> > const, unsigned long>, std::allocator<std::pair<std::vector<unsigned int, std::allocator<unsigned int> > const, unsigned long> >, std::__detail::_Select1st, std::equal_to<std::vector<unsigned int, std::allocator<unsigned int> > >, gram::sequence_hash<std::vector<unsigned int, std::allocator<unsigned int> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_find_before_node (__code=3864292196, __k=..., __n=2, this=0x2b797278fcd8)
    at /usr/include/c++/7/bits/hashtable.h:1548
1548	      for (__node_type* __p = static_cast<__node_type*>(__prev_p->_M_nxt);;

The core is also at /nfs/leia/research/iqbal/bletcher/Pf_benchmark/full_Cross/

And then:

(gdb) where
#0  0x00005647921b3ffe in std::_Hashtable<std::vector<unsigned int, std::allocator<unsigned int> >, std::pair<std::vector<unsigned int, std::allocator<unsigned int> > const, unsigned long>, std::allocator<std::pair<std::vector<unsigned int, std::allocator<unsigned int> > const, unsigned long> >, std::__detail::_Select1st, std::equal_to<std::vector<unsigned int, std::allocator<unsigned int> > >, gram::sequence_hash<std::vector<unsigned int, std::allocator<unsigned int> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_find_before_node (__code=3864292196, __k=std::vector of length 1, capacity 1 = {...}, __n=2, this=0x2b797278fcd8)
    at /usr/include/c++/7/bits/hashtable.h:1548
#1  std::_Hashtable<std::vector<unsigned int, std::allocator<unsigned int> >, std::pair<std::vector<unsigned int, std::allocator<unsigned int> > const, unsigned long>, std::allocator<std::pair<std::vector<unsigned int, std::allocator<unsigned int> > const, unsigned long> >, std::__detail::_Select1st, std::equal_to<std::vector<unsigned int, std::allocator<unsigned int> > >, gram::sequence_hash<std::vector<unsigned int, std::allocator<unsigned int> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_find_node (__c=3864292196, __key=std::vector of length 1, capacity 1 = {...}, __bkt=2, this=0x2b797278fcd8)
    at /usr/include/c++/7/bits/hashtable.h:642
#2  std::__detail::_Map_base<std::vector<unsigned int, std::allocator<unsigned int> >, std::pair<std::vector<unsigned int, std::allocator<unsigned int> > const, unsigned long>, std::allocator<std::pair<std::vector<unsigned int, std::allocator<unsigned int> > const, unsigned long> >, std::__detail::_Select1st, std::equal_to<std::vector<unsigned int, std::allocator<unsigned int> > >, gram::sequence_hash<std::vector<unsigned int, std::allocator<unsigned int> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true>, true>::operator[] (this=0x2b797278fcd8, __k=std::vector of length 1, capacity 1 = {...})
    at /usr/include/c++/7/bits/hashtable_policy.h:721
#3  0x00005647921b3488 in std::unordered_map<std::vector<unsigned int, std::allocator<unsigned int> >, unsigned long, gram::sequence_hash<std::vector<unsigned int, std::allocator<unsigned int> > >, std::equal_to<std::vector<unsigned int, std::allocator<unsigned int> > >, std::allocator<std::pair<std::vector<unsigned int, std::allocator<unsigned int> > const, unsigned long> > >::operator[] (__k=std::vector of length 1, capacity 1 = {...}, this=<optimized out>) at /usr/include/c++/7/bits/unordered_map.h:973
#4  gram::coverage::record::grouped_allele_counts (coverage=..., search_states=std::__cxx11::list = {...})
    at /tmp/pip-44dd3srs-build/libgramtools/src/quasimap/coverage/grouped_allele_counts.cpp:54
#5  0x00005647921a6be6 in gram::coverage::record::search_states (coverage=..., search_states=..., read_length=@0x2b7973395d08: 100, prg_info=..., random_seed=<optimized out>)
    at /tmp/pip-44dd3srs-build/libgramtools/src/quasimap/coverage/common.cpp:212
#6  0x000056479219bfa4 in gram::quasimap_read (read=std::vector of length 100, capacity 100 = {...}, coverage=..., kmer_index=std::unordered_map with 3376442 elements = {...}, 
    prg_info=..., parameters=...) at /tmp/pip-44dd3srs-build/libgramtools/src/quasimap/quasimap.cpp:211
#7  0x000056479219c34d in handle_reads_buffer () at /tmp/pip-44dd3srs-build/libgramtools/src/quasimap/quasimap.cpp:181
#8  0x00002b792f4f995e in ?? () from /usr/lib/x86_64-linux-gnu/libgomp.so.1
#9  0x00002b792f9316db in start_thread (arg=0x2b7973396700) at pthread_create.c:463
#10 0x00002b792fc6a88f in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:95

From the call to ::operator[] inside gram::coverage::record::grouped_allele_counts, this looks exactly like the bug @leoisl found.

@bricoletc pinned this issue May 8, 2019
@bricoletc (Member, Author)
Here is what happens under bricoletc@e66865e: the instruction #pragma omp atomic is changed to #pragma omp critical in libgramtools/src/quasimap/coverage/grouped_allele_counts.cpp, so that threads can no longer update grouped coverage simultaneously.

[Figure: Per_Sample_CPU_Time_bars — per-sample CPU time before and after the change]

Some samples take much more time (803, PG0459-C); this could be followed up.
Sample PG0463-C is where a segfault occurred prior to the change, hence its low CPU run time.

[Figure: Time_diff — change in CPU time per sample after switching to #pragma omp critical]

There is an overall increase of ~5,000 CPU seconds, which amounts to roughly one extra hour of run time per quasimapping, against an overall run time of ~15 hours. I have not seen any segfaults yet, so this might be a good solution.
We could do better by locking on each site's coverage structure; right now any coverage write is locked across all sites. A sketch of that follows below.
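For reference, per-site locking could look roughly like the sketch below. This is not the gramtools implementation; the structure and names (SiteCoverage, record_group, the site index) are assumed for illustration. Only the site being written is serialised, so threads writing coverage for different sites no longer contend on a single global critical section:

```cpp
// Sketch only: one grouped-counts map and one mutex per variant site.
#include <cstddef>
#include <cstdint>
#include <map>
#include <mutex>
#include <vector>

using AlleleIds = std::vector<std::uint32_t>;

struct SiteCoverage {
    std::map<AlleleIds, std::uint64_t> group_counts;  // allele 'equivalence class' -> count
    std::mutex lock;                                   // guards only this site's map
};

// std::mutex is not movable, so size the per-site vector once up front, e.g.
//   std::vector<SiteCoverage> sites(number_of_variant_sites);
void record_group(std::vector<SiteCoverage>& sites, std::size_t site_index,
                  AlleleIds const& group) {
    SiteCoverage& site = sites[site_index];
    std::lock_guard<std::mutex> guard(site.lock);  // serialise writes to this site only
    ++site.group_counts[group];
}
```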

@leoisl commented May 10, 2019

Hello,

wow, pretty nice and detailed benchmarking! I guess this should be addressed in the future, but for now the important thing is that your experiments aren't blocked and the slowdown is small, so the tool is runnable. We should discuss quasimapping and how to improve mapping speed, together with this and possibly other things like scheduling policies, Bloom filter filtering, etc.
