Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mutable RangeBitmap appender #688

Open
bastrich opened this issue Nov 22, 2023 · 5 comments
Open

Mutable RangeBitmap appender #688

bastrich opened this issue Nov 22, 2023 · 5 comments

Comments

@bastrich
Copy link

bastrich commented Nov 22, 2023

Is your feature request related to a problem? Please describe.
I have the case, when I need to update RangeBitmap on the fly, but it is immutable, so I have to rebuilt it from scratch every time.

Describe the solution you'd like
I researched the source code, conducted some experiments a bit, and found that implementing a mutable version of RangeBitmap.Appender and adding the 'deserialize' method for it can be a good solution here. For the mutable version of the Appender we can add a method with signature like
public void replace(index: int, prevValue: long, newValue: long). The implementation can be based on the changing values in the existing maskBuffer or mask (depending on the current key and rid values). In order to calculate new mask state in the correct way we should store an array of char with the size equal to the maskBuffer size, let's call it maskTrackingBuffer (and the corresponding variable maskTracking). And this new array should contain number of '1' for the corresponding position in maskBuffer. Then, the 'replace' algorithm can be similar to the following pseudocode:

// prevValue[i] and newValue[i] mean i-th bit

if (mask[i] == 0 && newValue[i] == 0) {
  do nothing;
  return;
}

if (mask[i] == 0 && newValue[i] == 1) {
  mask[i] = 1;
  return;
}

if (mask[i] == 1 && newValue[i] == 0) {
  if (prevValue[i] == 0) {
    do nothing;
    return;
  }

  if (maskTracking[i] <= 1) {
    mask[i] = 0;
    return;
  } else {
    do nothing;
    return;
  }
  maskTracking[i]--;
}

if (mask[i] == 1 && newValue[i] == 1) {
  do nothing;
  return;
}

Of course, there will be memory overhead for appender, I estimate it like (maskBuffer size) * 16 bits. But for relatively small ranges like, for example, 0-200_000 it should be almost unnoticeable. And anyway there will be choice between immutable and mutable Appender.

I tried to find the solution with less memory consumption but couldn't figure out something more optimal.

Does it sound ok? Is it ok if I implement this logic and create a pull request?

@bastrich
Copy link
Author

For the case when the range is 0 - Long.MAX_VALUE, the memory overhead should be around 2Mb which looks big but may be acceptable for most of the cases.

@lemire
Copy link
Member

lemire commented Nov 23, 2023

@richardstartin Do you want to take this one?

@richardstartin
Copy link
Member

richardstartin commented Nov 23, 2023

hi @bastrich - there's a couple of things to consider

  1. the existing immutable implementation is used for range indexes on offline segments in Apache Pinot, and I don't want to risk performance regressions because it's used in some large systems. So I particularly like that your proposal introduces an entirely separate implementation of the appender.
  2. the existing implementation is somewhat over-optimised for the workflow of building indexes and mapping them from disk on offline segments in Pinot, and along with being immutable, the data structure is densely packed which means there is no space for containers to grow into. Combined with roaring container compression for regions of the bit slices, this makes the feasibility of non-append updates somewhat complicated, because flipping a single bit can have significant ramifications for layout. In the best case, flipping a bit in a bitmap container does not affect layout. In run or array containers the container may need to grow or shrink, which would require shifting all data after the site of the modification. If the container type has to change as a result of the modification, then the update is even more disruptive to layout. I'm not sure that your proposal deals with this complexity - but if I have missed something please let me know and we can discuss it further.

Despite the complexity, several users have now requested mutability in various issues, so I think it's worth implementing. Point 1 above warrants a separate implementation (leaving the existing code alone means users can't be affected).

Point 2 above requires a different in-memory layout using Container objects which can manage all the complexities of growing/changing locally without invalidating the rest of the data structure. A new class MutableRangeBitmap could have all of the same public instance methods as RangeBitmap (eq, lt and so on) but include methods to update values given a row index, and could be implemented in terms of a List<Container[]> or similar. This is very similar to the way ImmutableRoaringBitmap and RoaringBitmap are different implementations of the same data structure but the first one supports memory mapping at the expense of updates, and the second one is mutable but not mapped from disk.

At some point in the future we could consider bringing the two implementations under the same interface (which would probably end up being a breaking change because the best name for that interface might be RangeBitmap and given that there is only one major user, this can be managed). This would eventually lead to things like test reuse across implementations.

Do you want to have a go at implementing MutableRangeBitmap using List<Container[]> instead? I'm happy to discuss further and help with review.

@bastrich
Copy link
Author

Hi @richardstartin, thank you for the detailed response.

I didn't think of the solution you proposed but I will check it, thank you!
Also I will try to implement the solution I proposed to better show what I mean. Perhaps I'm wrong, then I can go only with the solution you proposed.

I can create 2 PRs so we can look at both options and compare them.

Are there any expected timelines?

@bastrich
Copy link
Author

Hi @richardstartin, sorry for my original post, I researched the source code a bit more and can see that my proposal doesn't work and doesn't make any sense (or at least is not full). Indeed, I have to work with containers and their layout. Will try to research more and implement some working solution.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants