Prefix-compressing lexicon files #108

Open
tuulos opened this issue Sep 29, 2016 · 0 comments
TrailDB could handle high-cardinality fields more efficiently. We have faced two examples of high-cardinality fields recently:

  • IDs of format granular_timestamp + random ID (e.g. 144500000009837478)
  • Continuous-valued fields with a limited range (e.g. 100002, 100003, 100011)

Although the cardinality of these fields can be huge, they are not pure entropy: in both cases the values share a highly repetitive prefix. We could reduce the size of the lexicon files considerably by storing them as tries, compressing away the common prefixes.
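As a rough illustration of the idea (a minimal front-coding sketch, not TrailDB's actual lexicon format), a sorted lexicon can store each value as a (shared-prefix length, suffix) pair, so a repeated prefix like `1445000000` is stored only once per run:

```python
def front_code(values):
    """Encode a sorted list of strings as (prefix_len, suffix) pairs."""
    encoded, prev = [], ""
    for v in values:
        # Length of the prefix shared with the previous value.
        n = 0
        while n < min(len(v), len(prev)) and v[n] == prev[n]:
            n += 1
        encoded.append((n, v[n:]))
        prev = v
    return encoded

values = ["100002", "100003", "100011", "144500000009837478"]
print(front_code(values))
# [(0, '100002'), (5, '3'), (4, '11'), (1, '44500000009837478')]
```

A full trie generalizes this by sharing prefixes across all values rather than only between sorted neighbors, but the space saving on the examples above comes from the same repetition.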

The main downside is that functions that access the lexicon, such as tdb_get_item_value, would need to reconstruct the value on the fly, which makes it harder to return a stable pointer.
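The pointer-stability problem can be sketched like this (illustrative names, not the TrailDB API): with a flat lexicon a lookup can return a pointer into the stored data, but with front-coded pairs the full value only exists after replaying prefixes into a scratch buffer, which the next lookup would overwrite.

```python
def decode(encoded, i):
    """Rebuild the i-th value by replaying prefixes from the start."""
    out = ""
    for prefix_len, suffix in encoded[: i + 1]:
        # Keep the shared prefix of the previous value, append the suffix.
        out = out[:prefix_len] + suffix
    return out

encoded = [(0, "100002"), (5, "3"), (4, "11")]
print(decode(encoded, 2))  # "100011"
```

In C this would mean either copying into a caller-supplied buffer or maintaining a per-handle scratch area, both of which change the current contract of returning a pointer directly into the lexicon.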
