Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Method advanceIfNeeded returns a count of skipped items #562

Open
novoj opened this issue May 27, 2022 · 9 comments
Open

Method advanceIfNeeded returns a count of skipped items #562

novoj opened this issue May 27, 2022 · 9 comments

Comments

@novoj
Copy link

novoj commented May 27, 2022

Is your feature request related to a problem? Please describe.

My use-case is that I have a very large array of objects which contain an ID in their property and I need to reduce the array to a smaller one with elements which id match some requested set. I may do a binary search and use a "comparator" on id property in the original array, but each comparison will require "hop" to a memory location where the object resides to retrieve the id property (which is rather slow).

That's why I created a RoaringBitmap with the ids that are extracted from the records in the large array and search ids in it using org.roaringbitmap.RoaringBatchIterator. The problem is that I need to "materialize" all records in RoaringBitmap via. nextBatch in order to count all examined elements of the RoaringBitmap. I'd like to use advanceIfNeeded approach to skip entire segments if the currently looked up id is not within them. The problem is that the advanceIfNeeded method doesn't return the count of elements, that has been skipped which is neccessary for me to compute the correct index in the original "large array of fat objects".

Describe the solution you'd like

Adding a new method or changing the return type of advanceIfNeeded to int (the latter would probably break backward binary compatibility of RoaringBitmaps) that would represent a number of elements that has been skipped during "fast forward" of the iterator.

Does it make sense to you?

Code snippet reflecting the situation

@Nonnull
default PriceRecordContract[] getPriceRecords(@Nonnull Bitmap priceIds) {
	final RoaringBatchIterator lookupIterator = RoaringBitmapBackedBitmap.getRoaringBitmap(priceIds).getBatchIterator();
	final RoaringBatchIterator superSetIterator = RoaringBitmapBackedBitmap.getRoaringBitmap(getIndexedPriceIds()).getBatchIterator();
	final PriceRecordContract[] priceRecords = getPriceRecords();

	final CompositeObjectArray<PriceRecordContract> filteredPriceRecords = new CompositeObjectArray<>(PriceRecordContract.class);
	int lastPriceId = 0;
	int superSetBatchPeek = 0;
	int lastSuperSetBatchRecord = 0;
	int relativeSuperSetIndex = -1;
	int[] lookupBatch = new int[512];
	int[] superSetBatch = new int[521];
	int superSetIteratorPeek = 0;

	while (lookupIterator.hasNext()) {
		final int lookupIteratorPeek = lookupIterator.nextBatch(lookupBatch);
		for (int i = 0; i < lookupIteratorPeek; i++) {
			int lookedUpId = lookupBatch[i];

			// fast-forward if the last super set batch record is lesser than looked up id
			while (lookedUpId > lastSuperSetBatchRecord && superSetIterator.hasNext()) {
				superSetBatchPeek += superSetIteratorPeek;
				superSetIteratorPeek = superSetIterator.nextBatch(superSetBatch);
				relativeSuperSetIndex = -1;
				if (superSetIteratorPeek > 0) {
					lastSuperSetBatchRecord = superSetBatch[superSetIteratorPeek - 1];
				} else {
					lastSuperSetBatchRecord = 0;
				}
			}
			
			// look for the index of price detail (searched block is getting smaller and smaller with each match)
			final int fromIndex = relativeSuperSetIndex + 1;
			final int toIndex = Math.min(fromIndex + lookedUpId - lastPriceId, superSetBatchPeek);

			// look for the price in currently read super set batch
			relativeSuperSetIndex = Arrays.binarySearch(superSetBatch, fromIndex, toIndex, lookedUpId);
			if (relativeSuperSetIndex < 0) {
				relativeSuperSetIndex = -1 * (relativeSuperSetIndex) - 2;
			} else {
				final PriceRecordContract priceRecord = priceRecords[superSetBatchPeek + relativeSuperSetIndex];
				filteredPriceRecords.add(priceRecord);
				lastPriceId = lookedUpId;
			}
		}
	}

	// return found prices as array
	return filteredPriceRecords.toArray();
}
@lemire
Copy link
Member

lemire commented May 27, 2022

The problem is that the advanceIfNeeded method doesn't return the count of elements, that has been skipped which is neccessary for me to compute the correct index in the original "large array of fat objects".

It would almost surely require additional non-trivial computations. The function advanceIfNeeded can just skip entire blocks of data without even looking at them. Your request means that we need to look at the data we skip (in general) and do a cardinality count. It is not just a matter of overhead, the computational complexity is different.

And given that current users do not need this functionality, it does not seem like a good idea to modify the existing advanceIfNeeded function.

It would be more reasonable to implement a new method, e.g., advanceAndCountIfNeeded. It would be slower than advanceIfNeeded but it is reasonable that it could help in some cases like yours.

A pull request is invited.

@richardstartin
Copy link
Member

You can do this already with an iterator and call rank for each value yielded by the iterator. Unfortunately the loop would have quadratic complexity. A new kind of skippable iterator which produces the value and the rank together, could be generally useful.

@novoj
Copy link
Author

novoj commented May 27, 2022

Thank you both for replying. I'm not familiar with the internals of the containers, so I'm not able to prepare a pull request. I thought the information of container cardinality and hi-low bounds are readily available in the containers so that you can switch among the types (array/bitmap/run). And therefore, when you advance over entire container, you know what has been skipped. But this idea surely comes with my lack of knowledge - you probably change the type of the container during add/remove operations which require "looking into" the data of the container.

@lemire
Copy link
Member

lemire commented May 27, 2022

And therefore, when you advance over entire container, you know what has been skipped.

Yes. That part is easy. But you don't typically skip over an entire containers. So think about a bitset... it is just an array of bits... 010001101110000001110000111000001. In general, the function you request requires you to count the bits in a range of indexes. We can do that fast, but it definitively is not free.

@novoj
Copy link
Author

novoj commented May 27, 2022

When I think about it - the reason I thought advanceIfNeeded skips entire containes was connected with behaviour of org.roaringbitmap.BatchIterator#nextBatch that fills only part of the provided buffer that (probably) reflect exactly contents of the single internal container. I'm not sure, but I tested it and even when I provide buffer of size 512B, only fragment of it is sometimes filled in. Maybe this fact made me think that the advanceIfNeeded on the same interface method works the same way. But that's not the case and thank you for clarification.

@lemire
Copy link
Member

lemire commented May 27, 2022

@novoj That would be a bug. If you provide a buffer, it should fill it up as much as possible. If you can identify such a case, please report it (as a distinct issue).

@richardstartin
Copy link
Member

IIRC that iterator yields at container boundaries, but returns how many elements were filled.

@novoj
Copy link
Author

novoj commented May 27, 2022

I observed the behaviour confirmed by @richardstartin - when iterating RoaringBitmap with 5000 elements using int buffer of size 500 elements I usually need more than 10 iterations to go through. But I can do it in a safe way thanks to the information about the peak record filled in the buffer. I didn't consider this a bug but a feature :).

@lemire
Copy link
Member

lemire commented May 27, 2022

IIRC that iterator yields at container boundaries, but returns how many elements were filled.

Ah. Interesting.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants