Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve the performance of roaring bitmap under sparse cases #319

Open
stdpain opened this issue Sep 7, 2021 · 14 comments
Open

Improve the performance of roaring bitmap under sparse cases #319

stdpain opened this issue Sep 7, 2021 · 14 comments

Comments

@stdpain
Copy link
Contributor

stdpain commented Sep 7, 2021

In a sparse roaringbitmap, the size of array_container may often be 1, 2, 3. But we can actually reuse the container pointer (sizeof(void*)) as an array of uint16_t to avoid extra memory allocation.
eg:

typedef struct roaring_array_s {
    int32_t size;
    int32_t allocation_size;
    void **containers;
    uint16_t *keys;
    uint8_t *typecodes;
    uint8_t flags;
} roaring_array_t;

roaring_array_t ra;
// some init code
ra.typecodes[i] = SINGLE_CONTAIN_TYPE + cardinality << 6;
((uint16_t*)ra.containers[i])[1] = value1;
((uint16_t*)ra.containers[i])[2] = value2;
((uint16_t*)ra.containers[i])[3] = value3;

@lemire
Copy link
Member

lemire commented Sep 7, 2021

Yes. But before we add complexity, we would need some motivating benchmarks showing the the extra work is worth it.

@stdpain
Copy link
Contributor Author

stdpain commented Sep 8, 2021

@lemire
Hi ,I tested this optimization briefly in my application and the improvement for deserialization is quite obvious.

This is my test data generator:

 // the same key will add to the same bitmap
 static std::default_random_engine e;
 static std::uniform_int_distribution<int64_t> u(start_range, end_range);
 // start_range = 0
 // end_range = 5000000000
 // keys == 1000
 for(int k = 0;k < keys;k++) {
 // lines == 50000
    for(int i = 0;i < lines; ++i) {
        std::cout << k << "\t" << u(e) << '\n';
    }
 }

This is the change I made to CRoaring (an earlier version):

stdpain@a0f5049

I test It In my Application

before:
mysql> select bitmap_union_count(v) from bitmap_r50E_x;
+-------------------------+
| bitmap_union_count(`v`) |
+-------------------------+
|               100000000 |
+-------------------------+
1 row in set (8.63 sec)
after:
mysql> select bitmap_union_count(v) from bitmap_r50E_x;
+-------------------------+
| bitmap_union_count(`v`) |
+-------------------------+
|               100000000 |
+-------------------------+
1 row in set (3.67 sec)

@lemire
Copy link
Member

lemire commented Sep 8, 2021

Yes. I saw this and it makes sense to me. I'd still want someone to review it independently.

@Oppen
Copy link
Collaborator

Oppen commented Oct 22, 2021

I proposed a similar concept a while ago for rle but never got around writing the benchmarks. I'd advice trying to use a union for the array_container_t and infer whether we're using the pointer as array from the capacity and measuring again.
That way we avoid adding a new container type, which would either require breaking backwards compatibility for the frozen format or losing this optimization when loading from them (a valid option, but if it's not too expensive to keep it I think it's better to do so). Besides, that goes in line with some proposed changes to representation that involve embedding the structs in the array anyway.
Other than that, it seems OK to me, some style fixes and the like would be needed for the provided patch.

@Oppen
Copy link
Collaborator

Oppen commented Apr 26, 2022

@stdpain would you mind making a PR from your changes? Otherwise, at some unspecified point in the future would you like me to create one (with proper credit)?

@stdpain
Copy link
Contributor Author

stdpain commented May 1, 2022

@Oppen OK

@Oppen
Copy link
Collaborator

Oppen commented Aug 11, 2022

I have a very ugly suggestion but you may be interested in trying it. Since our arrays are always sorted and never* empty we could simply zero-terminate them for the low count case, gaining a few more elements or even eliminating the indirection altogether, not storing the sizes at all. If the first element is zero, then there's a zero, but every other zero signals end of container.

@stdpain
Copy link
Contributor Author

stdpain commented Aug 11, 2022

I have a very ugly suggestion but you may be interested in trying it. Since our arrays are always sorted and never* empty we could simply zero-terminate them for the low count case, gaining a few more elements or even eliminating the indirection altogether, not storing the sizes at all. If the first element is zero, then there's a zero, but every other zero signals end of container.

sounds great.

@Oppen
Copy link
Collaborator

Oppen commented Sep 21, 2022

More than a month later, but I put that * and never what the clarification was. There's an exception for lazy operations that may leave such a state, but you have to call repair to restore the invariant lest you want undefined behavior.

@madscientist
Copy link
Contributor

I would love the "small bitmap optimization" described in #319 (comment) We do something like this but with a wrapper around the type which is annoying since we have to translate all the calls. The wrapper has a union of a roaring::Roaring (this is C++) and an array of 32bit ints plus a "small count" which we overlay on the empty padding at the end of the roaring_array_t. Since sizeof(roaring_array_t) is 40 bytes and we need at least one byte for the count field, that gives us (40 - 1) / 4 = 9 values we can use to hold 32bit values, before we have to convert from the "small" bitmap to a full bitmap.

We've tested this and in our environment, at least, where we can have a wide variety of sizes of bitmap and it really helps performance in our use-cases, even with the one-time overhead of the conversion. I'm sure that if this was supported directly the overhead could be reduced further.

What we do currently is only support a simple subset of operations on this special type of bitmap. If we need to do any of the more complex operations, for example if we're not sure the result will fit into the constant size, we convert it to a normal bitmap first. What we're keeping is basically an array bitmap, but without the overhead of the size (compile-time constant) or capacity (same as the size). Maybe an embedded implementation could get some synergy there.

@josh08287
Copy link

Would love to see this incorporated.
We have running into this situation a lot when we have sparse bitmaps that need to be merged together by a parallel algorithm.

Our use case is :
We have parallel lock free algorithms creating many bitmaps that may have sparse membership and non-unique membership. We chunk many very large files into multiple segments that create their own bitmaps that are merged together at the end of processing all chunks resulting in most bitmaps being sparsely populated. Profiling revealed that insertion into these as roaring bitmaps is very slow compared to a simple bitarray.

@lemire
Copy link
Member

lemire commented Nov 14, 2023

@josh08287 Can you contribute a benchmark?

See https://github.com/RoaringBitmap/CRoaring/tree/master/microbenchmarks

The difficulty with this issue is that people refer to internal use cases, but there is no corresponding benchmark we can refer to. It means that this issue is unlikely to make progress.

I would never merge an optimization without a corresponding credible benchmark. It is too easy to think we are optimizing when we are not.

@lemire
Copy link
Member

lemire commented Nov 14, 2023

The PR by @stdpain did not match the current code, and lacked a benchmark to demonstrate the benefits. We need contributions, if only benchmarks, for this issue to make progress.

@stdpain
Copy link
Contributor Author

stdpain commented Dec 28, 2023

I'll start by providing a benchmark for this case

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants