
Create low-allocation immutable Roaring bitmaps #264

Open

lemire opened this issue Jul 28, 2020 · 23 comments

@lemire
Member

lemire commented Jul 28, 2020

Some users have many bitmaps on disk. Loading them temporarily for operations is expensive, due in part to the allocation of new instances in the process...

newrb := roaring.New()
newrb.ReadFrom(buf)

Related PR #263

Also related to issue #193

cc @guymolinari @jacksonrnewhouse

@joenall
Contributor

joenall commented Jul 28, 2020

That is our use case. We have many bitmaps in many boltdbs on disk. Bitmaps have about 100M bits. The whole collection has up to ~6B objects that are indexed, but none of the shards have more than 32-bit indices.

@lemire
Member Author

lemire commented Jul 28, 2020

@joenall Thanks for chiming in. It is really important to document the use case to motivate a public solution.

@kevinconaway
Contributor

We have a similar use case where we create a lot of immutable bitmaps in memory and only read from / iterate them.

It would be great if there were improvements that could be made for immutable bitmaps.

@lemire
Member Author

lemire commented May 31, 2021

@kevinconaway Just checking... do you create them in this manner...?

newrb := roaring.New()
newrb.ReadFrom(buf)

@kevinconaway
Contributor

No, for this use case we typically create a new Bitmap, add values to it, and then never modify it again for its lifetime:

rb := roaring.New()
rb.Add(1)
rb.Add(2)
// ...
rb.RunOptimize()

We notice these occupying a fair amount of heap space / heap objects, so I was curious whether there is a more optimized in-memory layout when we know that we won't modify the Bitmap again.

Apologies if I interjected on an unrelated ticket here.

@lemire
Member Author

lemire commented May 31, 2021

@kevinconaway

We notice these occupying a fair amount of heap space / heap objects, so I was curious whether there is a more optimized in-memory layout when we know that we won't modify the Bitmap again.

If you sum up GetSizeInBytes() on all bitmaps, you should get something close to your actual memory usage (though your actual memory usage will be a bit higher due to various overheads). If that's not the case, then you should report the issue as a performance bug.
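
For instance, a minimal sketch of this comparison (the bitmap contents here are illustrative, and the exact heap numbers will vary by workload):

package main

import (
    "fmt"
    "runtime"

    "github.com/RoaringBitmap/roaring"
)

func main() {
    // Hypothetical collection of bitmaps.
    bitmaps := []*roaring.Bitmap{
        roaring.BitmapOf(1, 2, 3, 100000),
        roaring.BitmapOf(5, 6, 7),
    }

    // Sum the theoretical size reported by the library.
    var total uint64
    for _, rb := range bitmaps {
        total += rb.GetSizeInBytes()
    }

    // Compare against the live Go heap to estimate per-object overhead.
    var ms runtime.MemStats
    runtime.ReadMemStats(&ms)
    fmt.Printf("GetSizeInBytes total: %d bytes, HeapAlloc: %d bytes\n", total, ms.HeapAlloc)
}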

@ajeetdsouza

We were facing this problem at Dgraph Labs too, so we built a library called sroar where the in-memory representation matches the on-disk representation (no need for marshalling / unmarshalling). Perhaps you will find this useful.

@lemire
Member Author

lemire commented Jun 1, 2021

Regarding the last comment, I created an issue specifically for it at #298

@kevinconaway
Contributor

If you sum up GetSizeInBytes() on all bitmaps, you should get something close to your actual memory usage

Apologies for the confusion; the main concern here is the number of objects on the heap (the various *roaringArray, *arrayContainer, and *runContainer16 values that occupy the heap).

@lemire
Member Author

lemire commented Jun 1, 2021

Apologies for the confusion; the main concern here is the number of objects on the heap (the various *roaringArray, *arrayContainer, and *runContainer16 values that occupy the heap).

I realize that... hence my question. The GetSizeInBytes() function will not include this overhead. We can compare this rather theoretical measure with the actual memory usage; the difference is what could be optimized away. We have done such an analysis in the past and found that the overhead was small... meaning that the benefits were modest. It could be that the story is different in some important cases, but optimizing without benchmarks is a fool's game. So we need good documentation of the issue first.

(To be clear, I am not arguing. I am just clarifying.)

@Oppen
Contributor

Oppen commented Jun 16, 2021

Would it help to reuse CRoaring's frozen bitmap format?

@lemire
Member Author

lemire commented Jun 16, 2021

Would it help to reuse CRoaring's frozen bitmap format?

For people reading this...

The issue is that you might have, say, an array of 64-bit integers on disk. In theory, you can just memory-map the file, point at the array, and load the 64-bit integers. This works at the system level on modern hardware.

But C/C++ programmers will insist that you only load 64-bit words if they are 64-bit aligned, because that is how systems used to work. (Some legacy or embedded systems still have this requirement, but I don't think they run Go.) You can side-step this constraint by calling memcpy each time you want to load a word, but it gets old really quickly.

There is no good portable all-around solution in C/C++, I am afraid. The compiler can assume alignment. Visual Studio allows you to declare an unaligned pointer, but I don't know of an equivalent in other compilers.

The frozen bitmap format assumes that the pointer where the bitmap is found has some alignment (if not, you get a failure), and if the proper alignment is found, it uses padding so that you can redirect pointers into the raw byte buffer... the assumption being that your system cannot load words unaligned (an incorrect assumption on modern hardware, but one that is still with us as undefined behaviour in C/C++). It has a few constraints, such as... you can't put the frozen serialized data just anywhere; it needs to be aligned.

Here is the specification in C code:
https://github.com/RoaringBitmap/CRoaring/blob/88c749244efe68dcaa5987c87b191574525461c5/src/roaring.c#L2759-L2782

In Go, we simply map the pointers directly...

func byteSliceAsUint64Slice(slice []byte) (result []uint64) {
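
For readers unfamiliar with the pattern, here is a minimal sketch of such a reinterpretation (the implementation in the library differs in detail; this version assumes Go 1.17+ for unsafe.Slice):

package main

import (
    "fmt"
    "unsafe"
)

// byteSliceAsUint64Slice reinterprets a byte slice as a []uint64 without
// copying, ignoring alignment. It assumes len(b) is a multiple of 8.
func byteSliceAsUint64Slice(b []byte) []uint64 {
    if len(b) == 0 {
        return nil
    }
    return unsafe.Slice((*uint64)(unsafe.Pointer(&b[0])), len(b)/8)
}

func main() {
    buf := make([]byte, 16)
    buf[0], buf[8] = 1, 2
    fmt.Println(byteSliceAsUint64Slice(buf)) // [1 2] on little-endian machines
}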

Thus, maybe ironically, Go allows us to be lower-level, and we do not need padding. It is possible that the frozen format could have benefits in Go, but the main motivation for creating the format in C (load alignment) is not present in Go. We already do direct pointer assignment in Go while ignoring alignment.

In effect, loading a frozen instance in C is very much like the FromBuffer call in Go.

@Oppen
Contributor

Oppen commented Jun 16, 2021

Right. I thought it was a format issue; that's why I suggested recycling the old one instead of creating a new one. So it's just a matter of implementing a new load?

@Oppen
Contributor

Oppen commented Jun 16, 2021

There are performance benefits to using aligned data AFAIR, though. Reads and writes of aligned data should hit the bus less often. Likewise, they should invalidate the cache less often. I assumed that's why we store bitsets aligned to match the natural alignment of AVX2 registers.

@lemire
Member Author

lemire commented Jun 16, 2021

There are performance benefits to using aligned data AFAIR, though. Reads and writes of aligned data should hit the bus less often. Likewise, they should invalidate the cache less often. I assumed that's why we store bitsets aligned to match the natural alignment of AVX2 registers.

Yes. People often claim that alignment is important for performance. Back in 2012, I wrote a blog post on the topic...

Data alignment for speed: myth or reality?
https://lemire.me/blog/2012/05/31/data-alignment-for-speed-myth-or-reality/

If you are scanning an array, you might hit one extra cache line due to lack of alignment. So you can build relatively contrived examples where this will impact your overall performance... but my claim is that by the time it is a concern, you will have hit many more low-hanging fruit.

It comes down to having simpler code. But if I could wrap the byte arrays in C with memcpy calls, without going insane, then we could do away entirely with the frozen format.

In C++, I could have a wrapper that calls memcpy... but in C, I'd probably need macros, and then it would get ugly.

@Oppen
Contributor

Oppen commented Jun 16, 2021

But if you do memcpy, you lose part of the benefits of using a frozen format in the first place, as a big one is the ability to simply mmap and view. The other benefit, no extra processing and not too many allocations, is still intact, however.
But true, the cache-hit argument is a bit moot if you scan the whole array. What about loading many files with mmap and computing an AND? In that case you could simply not load most of the bitmap, while having the view "complete" to the program's knowledge. The number of pages you'd load in that case could, I think, be up to 50% higher for not being aligned. It's contrived, but I think it's possible, and I think it's actually a likely case for the project I use it in.
I'll read the article; it sounds interesting.

@Oppen
Contributor

Oppen commented Jun 16, 2021

Off-topic: there's a case where it really is more important to align to cache lines: when atomics are involved. Atomic operations invalidate a cache line on all processors, so you want to avoid an atomic invalidating more than one line, or invalidating unrelated data (so you usually try to fit related data in the same line, since the line will be invalidated for other cores anyway).

@lemire
Member Author

lemire commented Jun 16, 2021

But if you do memcpy, you lose part of the benefits of using a frozen format in the first place, as a big one is the ability to simply mmap and view.

You do not.

You memcpy the individual words, which brings them into a register from an mmap'ed pointer. You do not have to memcpy the whole array. There is no difference between x[i] and uint64_t x; memcpy(&x, ...); it compiles to the same code, down to the autovectorization.

The number of pages you'd load in that case could, I think, be up to 50% higher for not being aligned.

If you are loading individual words and you put each of them at a page boundary, every other page, then you could, theoretically, have to double the number of pages you must load. But I can also come up with pathological cases where alignment triggers massive cache-aliasing issues (where all addresses share the same least-significant bits) that destroy your cache. Pathological cases should not guide your generic design decisions.

@lemire
Member Author

lemire commented Jun 16, 2021

Off-topic: there's a case where it really is more important to align to cache lines: when atomics are involved. Atomic operations invalidate a cache line on all processors, so you want to avoid an atomic invalidating more than one line, or invalidating unrelated data (so you usually try to fit related data in the same line, since the line will be invalidated for other cores anyway).

My blog post does allude to this issue. In a multithreaded context, being aware of cache lines can pay in specific instances.

@Oppen
Contributor

Oppen commented Jun 16, 2021

Oh, I thought you meant memcpying the whole contents at once.

@Oppen
Contributor

Oppen commented Aug 16, 2021

Coming back to this. I think an option would be to:

  1. Convert from portable to frozen in a buffer, without creating an intermediate bitmap;
  2. Create a view with FrozenView;
  3. Use this trick to keep the frozen buffer alive.

It's a bit convoluted, requires either a []byte or an io.ReaderAt (for flexibility) to be passed, and makes a few big allocations, but it avoids creating a lot of little objects and recycles some code. Opinions?
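
To make steps 2–3 concrete, here is a sketch using the library's Freeze/FrozenView pair (available in recent versions; step 1, a direct portable-to-frozen conversion without an intermediate bitmap, is the part that would need new code, so here the frozen buffer is produced from a bitmap purely for illustration):

package main

import (
    "fmt"

    "github.com/RoaringBitmap/roaring"
)

func main() {
    // Produce a frozen buffer; the proposal would derive this directly
    // from the portable bytes instead of materializing a bitmap first.
    src := roaring.BitmapOf(1, 2, 3, 1000000)
    frozen, err := src.Freeze()
    if err != nil {
        panic(err)
    }

    // Create a view over the frozen buffer. The buffer must stay alive
    // and unmodified for as long as the view is in use (step 3).
    view := roaring.New()
    if err := view.FrozenView(frozen); err != nil {
        panic(err)
    }
    fmt.Println(view.GetCardinality()) // 4
}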

@GiedriusS

GiedriusS commented Mar 12, 2024

Is there still interest in this? Could I work on it?

If we immediately perform intersection/merging after rb.ReadFrom(), it doesn't make sense to potentially allocate gigabytes of memory. Instead, we could allocate memory for at most one container at a time. That would be enough, right? We could call it a "lazy bitmap" or something like that.

@lemire
Member Author

lemire commented Mar 12, 2024

@GiedriusS Yes. There is still much work to do.

Instead, we could allocate memory for a maximum of one container.

We support frozen views, see

func (rb *Bitmap) FrozenView(buf []byte) error {

See also discussions above.

It is deliberately not very well documented because it is not very well tested (in our project, although it is used in production by some).
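
For the on-disk use case discussed above, a hedged sketch (Unix-only; the file name is hypothetical and error handling is abbreviated) of consuming a frozen bitmap from a memory-mapped file:

package main

import (
    "fmt"
    "os"
    "syscall"

    "github.com/RoaringBitmap/roaring"
)

func main() {
    f, err := os.Open("bitmap.frozen") // hypothetical file in the frozen format
    if err != nil {
        panic(err)
    }
    defer f.Close()

    fi, err := f.Stat()
    if err != nil {
        panic(err)
    }

    // Map the file read-only and view it in place; the container
    // payloads are not copied out of the mapped buffer.
    buf, err := syscall.Mmap(int(f.Fd()), 0, int(fi.Size()), syscall.PROT_READ, syscall.MAP_SHARED)
    if err != nil {
        panic(err)
    }
    defer syscall.Munmap(buf)

    rb := roaring.New()
    if err := rb.FrozenView(buf); err != nil {
        panic(err)
    }
    fmt.Println(rb.GetCardinality())
}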
