Prefix-compressing lexicon files #108

Open
tuulos opened this issue Sep 29, 2016 · 0 comments
TrailDB could handle high-cardinality fields more efficiently. We have faced two examples of high-cardinality fields recently:

  • IDs of format granular_timestamp + random ID (e.g. 144500000009837478)
  • Continuous-valued fields with a limited range (e.g. 100002, 100003, 100011)

Although the cardinality of these fields can be huge, they are not pure entropy: in both cases the values share a highly repetitive prefix. We could reduce the size of the lexicon files considerably by storing them as tries, compressing away the common prefixes.
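As a rough illustration of the idea (a minimal front-coding sketch, not TrailDB's actual lexicon format), a sorted lexicon can store each value as a (shared-prefix length, suffix) pair, so a repeated prefix like `1445000000` is stored only once per run:

```python
def front_code(values):
    """Encode a sorted list of strings as (prefix_len, suffix) pairs."""
    encoded, prev = [], ""
    for v in values:
        # Length of the prefix shared with the previous value.
        n = 0
        while n < min(len(v), len(prev)) and v[n] == prev[n]:
            n += 1
        encoded.append((n, v[n:]))
        prev = v
    return encoded

values = ["100002", "100003", "100011", "144500000009837478"]
print(front_code(values))
# [(0, '100002'), (5, '3'), (4, '11'), (1, '44500000009837478')]
```

A full trie generalizes this by sharing prefixes across all values rather than only between sorted neighbors, but the space saving on the examples above comes from the same repetition.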

The main downside is that functions that access the lexicon, such as tdb_get_item_value, would need to reconstruct the value on the fly, which makes it harder to return a stable pointer.
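The pointer-stability problem can be sketched like this (illustrative names, not the TrailDB API): with a flat lexicon a lookup can return a pointer into the stored data, but with front-coded pairs the full value only exists after replaying prefixes into a scratch buffer, which the next lookup would overwrite.

```python
def decode(encoded, i):
    """Rebuild the i-th value by replaying prefixes from the start."""
    out = ""
    for prefix_len, suffix in encoded[: i + 1]:
        # Keep the shared prefix of the previous value, append the suffix.
        out = out[:prefix_len] + suffix
    return out

encoded = [(0, "100002"), (5, "3"), (4, "11")]
print(decode(encoded, 2))  # "100011"
```

In C this would mean either copying into a caller-supplied buffer or maintaining a per-handle scratch area, both of which change the current contract of returning a pointer directly into the lexicon.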
