Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: cardinality aggregation #2337

Open
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

raphaelcoeffic
Copy link

@raphaelcoeffic raphaelcoeffic commented Apr 1, 2024

Implements #2248


let col_block_accessor = &bucket_agg_accessor.column_block_accessor;
if self.column_type == ColumnType::Str {
for term_id in col_block_accessor.iter_vals() {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
for term_id in col_block_accessor.iter_vals() {
for term_ord in col_block_accessor.iter_vals() {

We call ordered ids ordinals in tantivy.
So here that would be term_ord.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

.expect("Found placeholder term_id but `missing` is None");
match missing_key {
Key::Str(missing) => {
self.cardinality.sketch.insert_any(&missing);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It might be worth treating the missing value as special as it can be very frequent, and avoid pushing it in the sketch more than once.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

) -> crate::Result<IntermediateMetricResult> {
if self.column_type == ColumnType::Str {
let mut buffer = String::new();
let entries: Vec<u64> = self.entries.into_iter().collect();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no need to collect it here.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fixed

"Couldn't find term_id {term_id} in dict"
)));
}
self.cardinality.sketch.insert_any(&buffer);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the result woudl be more accurate if we pushed a type discriminant in the sketch.

The most efficient way to do this would be to use a salted hasher.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I understand here. Is that about fields that hold multiple value types?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done. Test case added as well.

Copy link
Contributor

@PSeitz PSeitz May 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's about different types, we support mixed types on the same column name. Internally they we differentiate them via (ColumnType, Column).
e.g. the bytes of the number [1u8,2,3,4,5,6,7,8] are not equal to the bytes of a string [1u8,2,3,4,5,6,7,8]. It's not a problem during collection, because we generate two collectors, one for numbers and one for strings. But it may cause collisions during merging.
If you prefix the data with e.g. ColumnType discriminator, you won't get the collision

Edit: Ah it's already done, I just saw the default value which is set to 0

}
}

impl PartialEq for CardinalityCollector {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is Eq required? @PSeitz

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, we store the intermediate results in a hashmap

@fulmicoton fulmicoton requested a review from PSeitz April 2, 2024 07:05
@fulmicoton
Copy link
Collaborator

fulmicoton commented Apr 2, 2024

@PSeitz I leave you the rest of the code review. Also can you open a ticket to one or another isolate aggregation? I think it is a bit overkill for most tantivy user to have to depend on hyperloglog.

@raphaelcoeffic raphaelcoeffic changed the title Draft: Cardinality aggregation feat: cardinality aggregation Apr 3, 2024
@raphaelcoeffic raphaelcoeffic marked this pull request as ready for review April 3, 2024 09:29
) -> crate::Result<IntermediateMetricResult> {
if self.column_type == ColumnType::Str {
let mut buffer = String::new();
let term_dict = agg_with_accessor.str_dict_column.as_ref().cloned().unwrap();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

            .unwrap_or_else(|| {
                StrColumn::wrap(BytesColumn::empty(agg_with_accessor.accessor.num_docs()))
            });

}
}

fn collect_block_with_field(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this does not collect, but just fetches the data


use columnar::MonotonicallyMappableToU64;

use crate::aggregation::agg_req::Aggregations;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add this aggregation to test_aggregation_flushing

@PSeitz
Copy link
Contributor

PSeitz commented May 28, 2024

Looks good so far, I left some comments.

I don't think hyperloglog is really big on its own, but all aggregations together may cost quite a bit if you don't use it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants