
Create low-allocation immutable Roaring bitmaps #264

Open

lemire opened this issue Jul 28, 2020 · 23 comments

@lemire
Member

lemire commented Jul 28, 2020

Some users have many bitmaps on disk. Loading them temporarily for operations is expensive, due in part to the allocation of new instances in the process...

newrb := roaring.New()
newrb.ReadFrom(buf)

Related PR #263

Also related to issue #193

cc @guymolinari @jacksonrnewhouse

@joenall
Contributor

joenall commented Jul 28, 2020

That is our use case. We have many bitmaps in many boltdbs on disk. Bitmaps have about 100M bits. The whole collection has up to ~6B objects that are indexed, but none of the shards have more than 32-bit indices.

@lemire
Member Author

lemire commented Jul 28, 2020

@joenall Thanks for chiming in. It is really important to document the use case to motivate a public solution.

@kevinconaway
Contributor

We have a similar use case where we create a lot of immutable bitmaps in memory and only read from / iterate them.

It would be great if there were improvements that could be made for immutable bitmaps.

@lemire
Member Author

lemire commented May 31, 2021

@kevinconaway Just checking... do you create them in this manner...?

newrb := roaring.New()
newrb.ReadFrom(buf)

@kevinconaway
Contributor

No, for this use case we typically create a new Bitmap, add values to it, and then never modify it again for its lifetime:

rb := roaring.New()
rb.Add(1)
rb.Add(2)
// ...
rb.RunOptimize()

We notice these occupying a fair amount of heap space / heap objects, so I was curious whether there is a more optimized in-memory layout when we know that we won't modify the Bitmap again.

Apologies if I interjected on an unrelated ticket here.

@lemire
Member Author

lemire commented May 31, 2021

@kevinconaway

We notice these occupying a fair amount of heap space / heap objects, so I was curious whether there is a more optimized in-memory layout when we know that we won't modify the Bitmap again.

If you sum up GetSizeInBytes() on all bitmaps, you should get something close to your actual memory usage (though your actual memory usage will be a bit higher due to various overheads). If that's not the case, then you should report the issue as a performance bug.
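
For instance, a minimal sketch of this comparison (the bitmap contents here are illustrative, and the exact heap numbers will vary by workload):

package main

import (
    "fmt"
    "runtime"

    "github.com/RoaringBitmap/roaring"
)

func main() {
    // Hypothetical collection of bitmaps.
    bitmaps := []*roaring.Bitmap{
        roaring.BitmapOf(1, 2, 3, 100000),
        roaring.BitmapOf(5, 6, 7),
    }

    // Sum the theoretical size reported by the library.
    var total uint64
    for _, rb := range bitmaps {
        total += rb.GetSizeInBytes()
    }

    // Compare against the live Go heap to estimate per-object overhead.
    var ms runtime.MemStats
    runtime.ReadMemStats(&ms)
    fmt.Printf("GetSizeInBytes total: %d bytes, HeapAlloc: %d bytes\n", total, ms.HeapAlloc)
}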

@ajeetdsouza

We were facing this problem at Dgraph Labs too, so we built a library called sroar where the in-memory representation matches the on-disk representation (no need for marshalling / unmarshalling). Perhaps you will find this useful.

@lemire
Member Author

lemire commented Jun 1, 2021

Regarding the last comment, I created an issue specifically for it at #298

@kevinconaway
Contributor

If you sum up GetSizeInBytes() on all bitmaps, you should get something close to your actual memory usage

Apologies for the confusion; the main concern here is the number of objects on the heap (the various *roaringArray, *arrayContainer, and *runContainer16 values that occupy the heap).

@lemire
Member Author

lemire commented Jun 1, 2021

Apologies for the confusion; the main concern here is the number of objects on the heap (the various *roaringArray, *arrayContainer, and *runContainer16 values that occupy the heap).

I realize that... hence my question. The GetSizeInBytes() function will not include this overhead. We can compare this rather theoretical measure with the actual memory usage; the difference is what could be optimized away. We have done such an analysis in the past and found that the overhead was small... meaning that the benefits were modest. It could be that the story is different in some important cases, but optimizing without benchmarks is a fool's game. So we need good documentation of the issue first.

(To be clear, I am not arguing. I am just clarifying.)

@Oppen
Contributor

Oppen commented Jun 16, 2021

Would it help to reuse CRoaring's frozen bitmap format?

@lemire
Member Author

lemire commented Jun 16, 2021

Would it help to reuse CRoaring's frozen bitmap format?

For people reading this...

The issue is that you might have, say, an array of 64-bit integers on disk. In theory, you can just memory-map the file, point at the array, and load the 64-bit integers. This works at the system level on modern hardware.

But C/C++ programmers will insist that you only load 64-bit words if they are 64-bit aligned, because that is how systems used to work. (Some legacy or embedded systems still have this requirement, but I don't think they run Go.) You can side-step this constraint by calling memcpy each time you want to load a word, but it gets old really quickly.

There is no good portable all-around solution in C/C++, I am afraid. The compiler can assume alignment. Visual Studio allows you to declare an unaligned pointer, but I don't know of an equivalent in other compilers.

The frozen bitmap format assumes that the pointer where the bitmap is found has some alignment (if not, you get a failure), and if the proper alignment is found, it uses padding so that you can redirect pointers into the raw byte buffer... the assumption being that your system cannot load words unaligned (an incorrect assumption on modern hardware, but one that is still with us as undefined behaviour in C/C++). It has a few constraints, such as... you can't put the frozen serialized data just anywhere; it needs to be aligned.

Here is the specification in C code:
https://github.com/RoaringBitmap/CRoaring/blob/88c749244efe68dcaa5987c87b191574525461c5/src/roaring.c#L2759-L2782

In Go, we simply map the pointers directly...

func byteSliceAsUint64Slice(slice []byte) (result []uint64) {
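
For readers unfamiliar with the pattern, here is a minimal sketch of such a reinterpretation (the implementation in the library differs in detail; this version assumes Go 1.17+ for unsafe.Slice):

package main

import (
    "fmt"
    "unsafe"
)

// byteSliceAsUint64Slice reinterprets a byte slice as a []uint64 without
// copying, ignoring alignment. It assumes len(b) is a multiple of 8.
func byteSliceAsUint64Slice(b []byte) []uint64 {
    if len(b) == 0 {
        return nil
    }
    return unsafe.Slice((*uint64)(unsafe.Pointer(&b[0])), len(b)/8)
}

func main() {
    buf := make([]byte, 16)
    buf[0], buf[8] = 1, 2
    fmt.Println(byteSliceAsUint64Slice(buf)) // [1 2] on little-endian machines
}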

Thus, maybe ironically, Go allows us to be lower-level, and we do not need padding. It is possible that the frozen format could have benefits in Go, but the main motivation for creating the format in C (load alignment) is not present in Go. We already do direct pointer assignment in Go while ignoring alignment.

In effect, loading a frozen instance in C is very much like the FromBuffer call in Go.

@Oppen
Contributor

Oppen commented Jun 16, 2021

Right. I thought it was a format issue; that's why I suggested recycling the old one instead of creating a new one. So it's just a matter of implementing a new load?

@Oppen
Contributor

Oppen commented Jun 16, 2021

There are performance benefits to using aligned data AFAIR, though. Reads and writes of aligned data should hit the bus less often. Likewise, they should invalidate the cache less often. I assumed that's why we store bitsets aligned to match the natural alignment of AVX2 registers.

@lemire
Member Author

lemire commented Jun 16, 2021

There are performance benefits to using aligned data AFAIR, though. Reads and writes of aligned data should hit the bus less often. Likewise, they should invalidate the cache less often. I assumed that's why we store bitsets aligned to match the natural alignment of AVX2 registers.

Yes. People often claim that alignment is important for performance. Back in 2012, I wrote a blog post on the topic...

Data alignment for speed: myth or reality?
https://lemire.me/blog/2012/05/31/data-alignment-for-speed-myth-or-reality/

If you are scanning an array, you might hit one extra cache line due to lack of alignment. So you can build relatively contrived examples where this will impact your overall performance... but my claim is that by the time it is a concern, you will have hit many more low-hanging fruit.

It comes down to having simpler code. But if I could wrap the byte arrays in C with memcpy calls, without going insane, then we could do away entirely with the frozen format.

In C++, I could have a wrapper that calls memcpy... but in C, I'd probably need macros, and then it would get ugly.

@Oppen
Contributor

Oppen commented Jun 16, 2021

But if you do memcpy, you lose part of the benefits of using a frozen format in the first place, as a big one is the ability to simply mmap and view. The other benefit, no extra processing and not too many allocations, is still intact, however.
But true, the cache-hit argument is a bit moot if you scan the whole array. What about loading many files with mmap and computing an AND? In that case you could simply not load most of the bitmap, while having the view "complete" to the program's knowledge. The number of pages you'd load in that case could, I think, be up to 50% higher for not being aligned. It's contrived, but I think it's possible, and I think it's actually a likely case for the project I use it in.
I'll read the article; it sounds interesting.

@Oppen
Contributor

Oppen commented Jun 16, 2021

Off-topic: there's a case where it really is more important to align to cache lines: when atomics are involved. Atomic operations invalidate a cache line on all processors, so you want to avoid an atomic invalidating more than one line, or invalidating unrelated data (so you usually try to fit related data in the same line, since the line will be invalidated for other cores anyway).

@lemire
Member Author

lemire commented Jun 16, 2021

But if you do memcpy, you lose part of the benefits of using a frozen format in the first place, as a big one is the ability to simply mmap and view.

You do not.

You memcpy the individual words, which brings them into a register from an mmap'ed pointer. You do not have to memcpy the whole array. There is no difference between x[i] and uint64_t x; memcpy(&x, ...); it compiles to the same code, down to the autovectorization.

The number of pages you'd load in that case could, I think, be up to 50% higher for not being aligned.

If you are loading individual words and you put each of them at a page boundary, every other page, then you could, theoretically, have to double the number of pages you must load. But I can also come up with pathological cases where alignment triggers massive cache-aliasing issues (where all addresses share the same least-significant bits) that destroy your cache. Pathological cases should not guide your generic design decisions.

@lemire
Member Author

lemire commented Jun 16, 2021

Off-topic: there's a case where it really is more important to align to cache lines: when atomics are involved. Atomic operations invalidate a cache line on all processors, so you want to avoid an atomic invalidating more than one line, or invalidating unrelated data (so you usually try to fit related data in the same line, since the line will be invalidated for other cores anyway).

My blog post does allude to this issue. In a multithreaded context, being aware of cache lines can pay in specific instances.

@Oppen
Contributor

Oppen commented Jun 16, 2021

Oh, I thought you meant memcpying the whole contents at once.

@Oppen
Contributor

Oppen commented Aug 16, 2021

Coming back to this. I think an option would be to:

  1. Convert from portable to frozen in a buffer, without creating an intermediate bitmap;
  2. Create a view with FrozenView;
  3. Use this trick to keep the frozen buffer alive.

It's a bit convoluted, requires either a []byte or an io.ReaderAt (for flexibility) to be passed, and makes a few big allocations, but it avoids creating a lot of little objects and recycles some code. Opinions?
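
To make steps 2–3 concrete, here is a sketch using the library's Freeze/FrozenView pair (available in recent versions; step 1, a direct portable-to-frozen conversion without an intermediate bitmap, is the part that would need new code, so here the frozen buffer is produced from a bitmap purely for illustration):

package main

import (
    "fmt"

    "github.com/RoaringBitmap/roaring"
)

func main() {
    // Produce a frozen buffer; the proposal would derive this directly
    // from the portable bytes instead of materializing a bitmap first.
    src := roaring.BitmapOf(1, 2, 3, 1000000)
    frozen, err := src.Freeze()
    if err != nil {
        panic(err)
    }

    // Create a view over the frozen buffer. The buffer must stay alive
    // and unmodified for as long as the view is in use (step 3).
    view := roaring.New()
    if err := view.FrozenView(frozen); err != nil {
        panic(err)
    }
    fmt.Println(view.GetCardinality()) // 4
}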

@GiedriusS

GiedriusS commented Mar 12, 2024

Is there still interest in this? Could I work on it?

If we immediately perform intersection/merging after rb.ReadFrom(), it doesn't make sense to potentially allocate gigabytes of memory. Instead, we could allocate memory for at most one container at a time. That would be enough, right? We could call it a "lazy bitmap" or something like that.

@lemire
Member Author

lemire commented Mar 12, 2024

@GiedriusS Yes. There is still much work to do.

Instead, we could allocate memory for a maximum of one container.

We support frozen views, see

func (rb *Bitmap) FrozenView(buf []byte) error {

See also discussions above.

It is deliberately not very well documented because it is not very well tested (in our project, although it is used in production by some).
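
For the on-disk use case discussed above, a hedged sketch (Unix-only; the file name is hypothetical and error handling is abbreviated) of consuming a frozen bitmap from a memory-mapped file:

package main

import (
    "fmt"
    "os"
    "syscall"

    "github.com/RoaringBitmap/roaring"
)

func main() {
    f, err := os.Open("bitmap.frozen") // hypothetical file in the frozen format
    if err != nil {
        panic(err)
    }
    defer f.Close()

    fi, err := f.Stat()
    if err != nil {
        panic(err)
    }

    // Map the file read-only and view it in place; the container
    // payloads are not copied out of the mapped buffer.
    buf, err := syscall.Mmap(int(f.Fd()), 0, int(fi.Size()), syscall.PROT_READ, syscall.MAP_SHARED)
    if err != nil {
        panic(err)
    }
    defer syscall.Munmap(buf)

    rb := roaring.New()
    if err := rb.FrozenView(buf); err != nil {
        panic(err)
    }
    fmt.Println(rb.GetCardinality())
}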
