Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Question/advice: iterating over an a.and(b) bitmap whilst keeping ranks of elements from 'a' #425

Open
touisteur opened this issue Dec 1, 2020 · 11 comments

Comments

@touisteur
Copy link

Hi,

Sorry if it's not the place for questions or advice.

I'm using RoaringBitmap as indexes in an internal tool, and the performance kick I get from them is amazing compared to sparse matrices or my homemade clumsy solutions...

Still there's one operation I'm doing very often that I'm not sure how to implement efficiently.

I have 2 bitmaps : blocks and index. blocks usually has very high cardinality, index usually lower, sometimes the same, sometimes higher.

I want to iterate over blocks.and(index) BUT I need to keep the positions/ranks of elements from blocks. So for each element of blocks.and(index) I'd like to have that element's rank/position in blocks.

What I do today is I iterate (forEach) over blocks.and(index) and for each element I get their rank from blocks through rank() or select(). The performance is... not great.

I'm wondering whether there'd be a better, more idiomatic way of doing this? Should I be writing a specific andIterator that would give me this info? Anything I should be aware whilst going this path? Am I completely out of RoaringBitmaps' use case?

Thanks in advance for your help.
Lionel

@touisteur touisteur changed the title Question/advice: iterating over an a.and(b) bitmap whilst keeping a ranks Question/advice: iterating over an a.and(b) bitmap whilst keeping ranks of elements from 'a' Dec 1, 2020
@ppiotrow
Copy link
Member

ppiotrow commented Dec 1, 2020

Can you give some simple examples of what you're trying to achieve? It sounds similar to my code where one bitmap represents ranges (blocks?) and second actual values. I want to estimate how many ranges contains at least one value.
I made few implementations but the fastest one is using two getIntIterator iterators and advanceIfNeeded jumps.

Example 1

val ranges = bitmapOf(0, 5, 6, 22, 29, 50)
val source = bitmapOf(0, 1, 2, 3, 4, 8)
val distinct = NonEmptyRangesCounter.count(source, ranges)

distinct shouldBe 2 //range 0-5, and 6-22 contains at least one value

Example 2

val ranges = bitmapOf(0, 15, 16, 19, 23, 29, 47, 80)
val source = bitmapOf(0, 2, 3, 4, 5, 15, 16, 18, 24, 30, 60, 61, 64, 82, 89, 90)
val distinct = NonEmptyRangesCounter.count(source, ranges)

distinct shouldBe 7

@richardstartin
Copy link
Member

@ppiotrow I doubt @touisteur has access to NonEmptyRangesCounter to see how it was implemented. Could you share its implementation to make the example more revealing?

@ppiotrow
Copy link
Member

ppiotrow commented Dec 1, 2020

I can share, but still not sure if he has similar problem to be solved. The main hint was to consider using iterator instead of AND, OR if you iterate very long loop.
The code is actually a little bit more complicated, as it works in the partitioned universe:( If you de-partition the code it gets very simple loop with jumping iterators.
code tests

We are using it on giant dataset to estimate number of users performing some activities. Each range groups activity of single user.

Edit: code without partitions

@touisteur
Copy link
Author

Hi thanks for all the kind and interesting answers!

My use case is : I'm given a (sometimes) huge list of blocks. Each block is a variable-sized ByteBuffer that, once sloooooowly deserialized, may or may not contain a specific field. So I perform, offline, a first veeeery slow pass where I note the presence of each field in each block, and I thus end up with a roaringbitmap of all the blocks containing this field. That's my 'presence' bitmap. The 'blocks' bitmap is given by the user as a list of blocks to look-up.

The result of the operation must have the same indexes/ranks than the 'blocks' input.

I think the double iterator approach would fit. I was worried about its performance. Didn't know about advance methods.

@touisteur
Copy link
Author

Hi sorry I just reread the blocks=ranges comment, I'm not working with ranges, just discrete indices (like a unique increasing ID in a file).

As an example :
blocks = {0, 1, 6, 7, 8, 230}
presence = {0, 6}

I'd want to have on my final structure
result = {0, null, 6, null, null}
or
result.add(0, 0)
result.add(2, 6)

@ppiotrow
Copy link
Member

ppiotrow commented Dec 1, 2020

blocks& presence equals {0,6} not {0, null, 6, null, null}. You cannot express reserved positions with nulls in Roaring Bitmap or any Set implementation.
You can however calculate blocks.andNot(presence) = {1,7,8,230} for "null" positions. Not sure if it helps.

@touisteur
Copy link
Author

Oh yes thanks I understand this is not directly possible with sets.

I was looking for the 'best' way to iterate over the result of the 'and' operation with the minimum overhead for the rank/select operation on blocks /for each element of blocks.and(presence)/

It seems having two running iterators might be the 'best' way to do this? Is there an example somewhere of this approach?

@lemire
Copy link
Member

lemire commented Dec 1, 2020

It seems having two running iterators might be the 'best' way to do this?

I would not call this approach "minimum overhead". Using iterators brings about a lot of overhead. To get good performance, you typically want to delay using iterators as much as possible.

I recommend you run benchmarks.

@touisteur
Copy link
Author

It seems your code @ppiotrow does almost what i want with the advanceIfNeeded() I'll need to try that and compare with the naive blocks.and(presence).foreach (i => result.add (blocks.select(i), i)

@touisteur
Copy link
Author

It seems having two running iterators might be the 'best' way to do this?

I would not call this approach "minimum overhead". Using iterators brings about a lot of overhead. To get good performance, you typically want to delay using iterators as much as possible.

I recommend you run benchmarks.

Indeed thanks everyone for all the insights. Will report with benchmark results when I'm done implementing both.

BTW that was a very friendly and instructive and helpful interaction, thanks a lot.

@touisteur
Copy link
Author

blocks& presence equals {0,6} not {0, null, 6, null, null}. You cannot express reserved positions with nulls in Roaring Bitmap or any Set implementation.
You can however calculate blocks.andNot(presence) = {1,7,8,230} for "null" positions. Not sure if it helps.

Oh thanks it took me 3 takes to understand your idea. I'll try that also.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

4 participants