-
-
Notifications
You must be signed in to change notification settings - Fork 607
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: cardinality aggregation #2337
base: main
Are you sure you want to change the base?
feat: cardinality aggregation #2337
Conversation
|
||
let col_block_accessor = &bucket_agg_accessor.column_block_accessor; | ||
if self.column_type == ColumnType::Str { | ||
for term_id in col_block_accessor.iter_vals() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for term_id in col_block_accessor.iter_vals() { | |
for term_ord in col_block_accessor.iter_vals() { |
We call ordered ids ordinals in tantivy.
So here that would be term_ord
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
.expect("Found placeholder term_id but `missing` is None"); | ||
match missing_key { | ||
Key::Str(missing) => { | ||
self.cardinality.sketch.insert_any(&missing); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be worth treating the missing value as special as it can be very frequent, and avoid pushing it in the sketch more than once.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
) -> crate::Result<IntermediateMetricResult> { | ||
if self.column_type == ColumnType::Str { | ||
let mut buffer = String::new(); | ||
let entries: Vec<u64> = self.entries.into_iter().collect(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
no need to collect it here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
"Couldn't find term_id {term_id} in dict" | ||
))); | ||
} | ||
self.cardinality.sketch.insert_any(&buffer); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the result woudl be more accurate if we pushed a type discriminant in the sketch.
The most efficient way to do this would be to use a salted hasher.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I understand here. Is that about fields that hold multiple value types?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done. Test case added as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it's about different types, we support mixed types on the same column name. Internally they we differentiate them via (ColumnType, Column)
.
e.g. the bytes of the number [1u8,2,3,4,5,6,7,8]
are not equal to the bytes of a string [1u8,2,3,4,5,6,7,8]
. It's not a problem during collection, because we generate two collectors, one for numbers and one for strings. But it may cause collisions during merging.
If you prefix the data with e.g. ColumnType discriminator, you won't get the collision
Edit: Ah it's already done, I just saw the default value which is set to 0
} | ||
} | ||
|
||
impl PartialEq for CardinalityCollector { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is Eq required? @PSeitz
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, we store the intermediate results in a hashmap
@PSeitz I leave you the rest of the code review. Also can you open a ticket to one or another isolate aggregation? I think it is a bit overkill for most tantivy user to have to depend on hyperloglog. |
- insert `missing` value at most once - `term_id` -> `term_ord` - iterate directly over entries without collecting first
476ddd9
to
700413c
Compare
) -> crate::Result<IntermediateMetricResult> { | ||
if self.column_type == ColumnType::Str { | ||
let mut buffer = String::new(); | ||
let term_dict = agg_with_accessor.str_dict_column.as_ref().cloned().unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.unwrap_or_else(|| {
StrColumn::wrap(BytesColumn::empty(agg_with_accessor.accessor.num_docs()))
});
} | ||
} | ||
|
||
fn collect_block_with_field( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this does not collect, but just fetches the data
|
||
use columnar::MonotonicallyMappableToU64; | ||
|
||
use crate::aggregation::agg_req::Aggregations; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add this aggregation to test_aggregation_flushing
Looks good so far, I left some comments. I don't think |
Implements #2248