Currently it is hard to utilize multiple cores or SSDs during TrailDB creation. The tdb_cons functions can be expensive with a lot of data, so it would be beneficial to utilize CPU and IO resources more efficiently.
Here is an idea for how this could be achieved:
Parallel tdb_cons_add()
Run K worker threads, each with a bounded queue of incoming events.
tdb_cons_add() shards events by uuid over the K threads and pushes them to the corresponding queues.
Each thread maintains its own write buffer, which it flushes to disk independently.
Lexicons need to be shared, and a naive implementation could cause massive lock contention. Instead:
For small, low-entropy fields, maintain a copy of the lexicon in each thread. If an entry is found in the local lexicon, no locking is needed; a missing key forces all copies to be synchronized.
For large, high-entropy fields, shard the lexicon into M shards. The shards are shared but locked individually, so most lookups can succeed without contention.
Parallel tdb_cons_finalize()
This is a profiling output from an expensive tdb_cons_finalize() call:
PROF: encoder/store_lexicons took 166505ms
PROF: encoder/store_uuids took 254ms
PROF: encoder/store_version took 0ms
PROF: trail/groupby_uuid took 839064ms
PROF: trail/info took 1ms
PROF: trail/collect_unigrams took 902892ms
PROF: encode_model/find_candidates took 992ms
PROF: encode_model/choose_grams took 829878ms
PROF: trail/gram_freqs took 830871ms
PROF: huffman/sort_symbols took 5014ms
PROF: huffman/huffman_code took 22ms
PROF: huffman/make_codemap took 2ms
PROF: trail/huff_create_codemap took 5058ms
PROF: trail/encode_trails took 1762364ms
PROF: trail/store_codebook took 3ms
PROF: encoder/encode took 4345248ms
A good candidate for parallelization is encode_trails, which can be sharded by cookie (uuid) quite easily. Other functions may require more thought.