Skip to content

Persistent set of integers, implemented using PATRICIA tree data structure

License

Notifications You must be signed in to change notification settings

asmorodinov/PersistentSet

Repository files navigation

PersistentSet

This repository contains C++ implementation of a persistent integer set based on integer Patricia tree data structure.

Features

  • Persistency.
  • Big-endian order of traversal.
  • Support for bitmaps in the leaves of the tree.
  • Can store keys of any unsigned integer type.
  • Bitmaps can also be any unsigned integer types.
  • Allocator support (I would recommend to use TwoPoolsAllocator - one pool for leaf nodes, and one for branch nodes).
  • Not thread safe (but it's probably not very hard to implement thread-safe version).
  • No erase and union support yet.

Examples

Examples of how to use this data structure are provided in main.cpp file.

Interface

IntSet class defined in PatriciaSet.h implements following methods:

  • IntSet() - constructs empty set
  • void insert(T key) - inserts key into set. Note that T must be an unsigned integer type (satisfy std::unsigned_integral<T> concept).
  • bool contains(T key) const - checks whether key is present in the set, or not (i.e. key was previously inserted).
  • void clear() - clears the set.
  • Copy-constructor, assignment operator, etc are generated by compiler.

As you can see, void erase(T key) is not implemented yet, but pull requests are welcome (in my use case, erase method was not required, so it was not implemented).

Other methods, such as IntSet union(IntSet other) const (that returns union of two sets), should also be relatively easy to implement.

Performance

insert, contains have worst-case complexity of O(min(n, W)), where n is number of elements stored in a set, W is number of bits in T. In practice, it's reasonable to assume that they have complexity O(1).

clear has worst-case complexity of O(n) when it needs to delete all of the elements in the set.

Copying the data-structure has worst-case complexity of O(1).

Performance was not tested via microbenchmarks, however, in my use case it performed better than existing implementations of Patricia trees (NASA-SW-VnV/ikos and facebook/sparta) - see performance.md (3d vs 4th and 5th columns).

How to build

Originally, Microsoft Visual Studio 2022 was used to build this project (as a CMake project).

However, code should be crossplatform, and it should be possible to build this project using other methods (with CMake).

Alternatively, you can simply copy header files from PersistentSet and Allocators folders, as well as External folder (make sure to use git clone with --recurse-submodules option) and import them into your project.

Using custom allocators from Allocators folder is optional, but is recommended for performance reasons.

Dependencies

The only depencency is Immer, and it's only purpose is to provide implementation for FreeListAllocator. Feel free to remove this dependency, if you choose not to use FreeListAllocator (and for example use TwoPoolsAllocator instead).

Originally, immer was installed using vcpkg.

Implementation

Code is based on Haskell's Data.IntSet implementation.

Alternatives

  • NASA-SW-VnV/ikos - good implementation, but uses little-endian key traversal, and no support for bitmaps in leaves and allocators (so it's a little bit slower).
  • facebook/sparta - similar to ikos.
  • sikol/patricia - contains bugs as far as I know
  • libstdc++/pat_trie_.hpp - not sure how to build this and include in your own project.
  • arximboldi/immer - immer::set. Actually, it is implemented with CHAMP (Compressed Hash-Array Mapped Prefix-tree) and not Patricia tree, so it's more similar to Data.HashSet than Data.IntSet. However, it is very well implemented, and actually performs better than this implementation. It also supports erase operation, and non-integer keys. However, union operation would probably be more expensive for CHAMP compared to Patricia trees.
  • arximboldi/immer - immer::vector. It is implemented using Relaxed Radix Balanced Trees (see authors video and paper on the topic). If the keys are in some continuous range (e.g $0, \ldots, n - 1$), then you can store set of them in a vector<bool> of size $n$ ($x \in s \iff s[x] == true$), or vector<uint64_t> of size $n / 64$. This is the most efficient approach according to our benchmarks (see performance.md, 1st column).

Resources on Patricia trees

Papers

  • I recommend to read the paper:

    • Chris Okasaki and Andy Gill, "Fast Mergeable Integer Maps", Workshop on ML, September 1998, pages 77-86

    It's relatively short, explains Patricia trees very well, and includes implementations of methods in Standard ML language.

  • Original paper on Patricia trees:

    • D.R. Morrison, "PATRICIA -- Practical Algorithm To Retrieve Information Coded In Alphanumeric", Journal of the ACM, 15(4), October 1968, pages 514-534.

Books

Videos

Other

Patricia tree in other languages