Parallel tdb_cons_add() and tdb_cons_finalize() #126

Open
tuulos opened this issue Mar 11, 2017 · 0 comments
tuulos commented Mar 11, 2017

Currently it is hard to utilize multiple cores or SSDs during TrailDB creation. The tdb_cons functions can be expensive with large amounts of data, so it would be beneficial to utilize CPU and IO resources more efficiently.

Here is an idea for how this could be achieved:

Parallel tdb_cons_add()

  • K worker threads, each with a bounded queue for incoming events.
  • tdb_cons_add() shards by uuid over the K threads, shuffling events to the queues.
  • Each thread has its own write buffer, which is flushed to disk independently.
  • Lexicons need to be shared, and a naive implementation could cause massive lock contention. Instead:
    • For small low-entropy fields, maintain a copy of the lexicon in each thread. If an entry is found in the local lexicon, nothing needs to be locked; a missing key forces all copies to be synchronized.
    • For large high-entropy fields, shard the lexicon into M shards. The shards are shared but locked individually, so most lookups can proceed without any contention.

Parallel tdb_cons_finalize()

  • This is profiling output from an expensive tdb_cons_finalize() call:
PROF: encoder/store_lexicons took 166505ms
PROF: encoder/store_uuids took 254ms
PROF: encoder/store_version took 0ms
PROF: trail/groupby_uuid took 839064ms
PROF: trail/info took 1ms
PROF: trail/collect_unigrams took 902892ms
PROF: encode_model/find_candidates took 992ms
PROF: encode_model/choose_grams took 829878ms
PROF: trail/gram_freqs took 830871ms
PROF: huffman/sort_symbols took 5014ms
PROF: huffman/huffman_code took 22ms
PROF: huffman/make_codemap took 2ms
PROF: trail/huff_create_codemap took 5058ms
PROF: trail/encode_trails took 1762364ms
PROF: trail/store_codebook took 3ms
PROF: encoder/encode took 4345248ms

A good candidate for parallelization is encode_trails, which can be sharded by cookie quite easily, since each trail is encoded independently. The other functions may require more thought.
