feat: cardinality aggregation #2337

raphaelcoeffic · 2024-04-01T13:28:29Z

Implements #2248

fulmicoton · 2024-04-02T06:54:08Z

src/aggregation/metric/cardinality.rs

+
+        let col_block_accessor = &bucket_agg_accessor.column_block_accessor;
+        if self.column_type == ColumnType::Str {
+            for term_id in col_block_accessor.iter_vals() {


Suggested change

-            for term_id in col_block_accessor.iter_vals() {
+            for term_ord in col_block_accessor.iter_vals() {

We call ordered ids ordinals in tantivy.
So here that would be term_ord.

fulmicoton · 2024-04-02T06:57:49Z

src/aggregation/metric/cardinality.rs

+                        .expect("Found placeholder term_id but `missing` is None");
+                    match missing_key {
+                        Key::Str(missing) => {
+                            self.cardinality.sketch.insert_any(&missing);


It might be worth special-casing the missing value: it can be very frequent, so we should avoid pushing it into the sketch more than once.
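A minimal sketch of the idea, assuming a simplified collector. `CountingSketch` is a hypothetical stand-in for the real sketch (it only counts `insert` calls so the saving is visible), and `collect` is an illustrative name, not the PR's method:

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for a cardinality sketch. It counts how often
/// `insert` is called so the effect of the special case is observable.
#[derive(Default)]
struct CountingSketch {
    values: HashSet<String>,
    insert_calls: usize,
}

impl CountingSketch {
    fn insert(&mut self, value: &str) {
        self.insert_calls += 1;
        self.values.insert(value.to_string());
    }
}

/// Feed a column into the sketch. `None` marks a document without a value,
/// which is replaced by the configured `missing` value.
fn collect(sketch: &mut CountingSketch, column: &[Option<&str>], missing: &str) {
    let mut missing_inserted = false;
    for value in column.iter().copied() {
        match value {
            Some(v) => sketch.insert(v),
            // Push the `missing` value only the first time it is needed.
            None if !missing_inserted => {
                sketch.insert(missing);
                missing_inserted = true;
            }
            // Already accounted for: skip the sketch entirely.
            None => {}
        }
    }
}
```

With many documents lacking the field, the sketch is touched once for the placeholder instead of once per document, and the estimate is unchanged because repeated inserts of the same value do not affect cardinality.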

fulmicoton · 2024-04-02T06:59:56Z

src/aggregation/metric/cardinality.rs

+    ) -> crate::Result<IntermediateMetricResult> {
+        if self.column_type == ColumnType::Str {
+            let mut buffer = String::new();
+            let entries: Vec<u64> = self.entries.into_iter().collect();


no need to collect it here.
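As an illustration only, the set of ordinals can be consumed directly. Here `dict` is a hypothetical stand-in for a sorted term dictionary (a term's ordinal is its index), and `resolve_ordinals` is an assumed name, not a tantivy API:

```rust
use std::collections::HashSet;

/// Resolve each unique term ordinal to its term and hand it to `sink`.
/// `dict` stands in for a sorted term dictionary: ordinal == index.
fn resolve_ordinals<F: FnMut(&str)>(
    entries: HashSet<u64>,
    dict: &[&str],
    mut sink: F,
) -> Result<(), String> {
    // Iterate over the set directly; no intermediate Vec<u64> is needed.
    for term_ord in entries {
        let term = dict
            .get(term_ord as usize)
            .ok_or_else(|| format!("Couldn't find term_ord {term_ord} in dict"))?;
        sink(*term);
    }
    Ok(())
}
```

The `sink` closure plays the role of the sketch insert; iteration order over a `HashSet` is unspecified, which is fine because insertion order does not matter for a cardinality estimate.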

fulmicoton · 2024-04-02T07:00:57Z

src/aggregation/metric/cardinality.rs

+                            "Couldn't find term_id {term_id} in dict"
+                        )));
+                    }
+                    self.cardinality.sketch.insert_any(&buffer);


I think the result would be more accurate if we pushed a type discriminant into the sketch.

The most efficient way to do this would be to use a salted hasher.

Not sure I understand here. Is that about fields that hold multiple value types?

done. Test case added as well.

It's about different types: we support mixed types on the same column name. Internally we differentiate them via (ColumnType, Column).
E.g. the bytes of the number [1u8,2,3,4,5,6,7,8] should not be treated as equal to the bytes of a string [1u8,2,3,4,5,6,7,8]. It's not a problem during collection, because we generate two collectors, one for numbers and one for strings. But it may cause collisions during merging.
If you prefix the data with e.g. a ColumnType discriminator, you won't get the collision.

Edit: Ah it's already done, I just saw the default value which is set to 0
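The collision described above can be sketched with the standard library alone. This is not the PR's implementation: `ColType`, `plain_hash`, and `salted_hash` are hypothetical names, and feeding a one-byte discriminant into `DefaultHasher` is just one assumed way of salting by type:

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Hypothetical column-type tags, used only for this illustration.
#[derive(Clone, Copy)]
enum ColType {
    U64 = 0,
    Str = 1,
}

/// Hash raw value bytes with no type information: a number and a string
/// with identical bytes produce the same hash and merge into one entry.
fn plain_hash(bytes: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    h.write(bytes);
    h.finish()
}

/// Hash the same bytes after feeding a one-byte type discriminant,
/// so identical bytes from different column types hash differently.
fn salted_hash(col_type: ColType, bytes: &[u8]) -> u64 {
    let mut h = DefaultHasher::new();
    h.write_u8(col_type as u8);
    h.write(bytes);
    h.finish()
}
```

Calling `salted_hash(ColType::U64, &bytes)` and `salted_hash(ColType::Str, &bytes)` on the same eight bytes yields two distinct sketch entries after merging, whereas `plain_hash` would count them as one value.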

fulmicoton · 2024-04-02T07:04:26Z

src/aggregation/metric/cardinality.rs

+    }
+}
+
+impl PartialEq for CardinalityCollector {


is Eq required? @PSeitz

yes, we store the intermediate results in a hashmap

fulmicoton · 2024-04-02T07:06:07Z

@PSeitz I leave you the rest of the code review. Also, can you open a ticket to isolate aggregation one way or another? I think it is a bit overkill for most tantivy users to have to depend on hyperloglog.

- insert `missing` value at most once
- `term_id` -> `term_ord`
- iterate directly over entries without collecting first

PSeitz · 2024-05-28T08:08:57Z

src/aggregation/metric/cardinality.rs

+    ) -> crate::Result<IntermediateMetricResult> {
+        if self.column_type == ColumnType::Str {
+            let mut buffer = String::new();
+            let term_dict = agg_with_accessor.str_dict_column.as_ref().cloned().unwrap();


.unwrap_or_else(|| { StrColumn::wrap(BytesColumn::empty(agg_with_accessor.accessor.num_docs())) });
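The pattern behind this suggestion, shown with stand-in types rather than tantivy's: `TermDict` and `Accessor` are hypothetical, and the point is only that a missing string column falls back to an empty dictionary instead of panicking on `unwrap()`:

```rust
use std::sync::Arc;

/// Hypothetical stand-in for a per-segment term dictionary.
#[derive(Default)]
struct TermDict {
    terms: Vec<String>,
}

/// Hypothetical accessor: `str_dict` is `None` when the segment
/// has no string column for the field.
struct Accessor {
    str_dict: Option<Arc<TermDict>>,
}

fn term_dict(accessor: &Accessor) -> Arc<TermDict> {
    accessor
        .str_dict
        .as_ref()
        .cloned()
        // Fall back to an empty dictionary instead of panicking.
        .unwrap_or_else(|| Arc::new(TermDict::default()))
}
```

A segment without the column then simply contributes no terms to the result.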

PSeitz · 2024-05-28T08:31:32Z

src/aggregation/metric/cardinality.rs

+        }
+    }
+
+    fn collect_block_with_field(


this does not collect, but just fetches the data

PSeitz · 2024-05-28T08:53:54Z

src/aggregation/metric/cardinality.rs

+
+    use columnar::MonotonicallyMappableToU64;
+
+    use crate::aggregation::agg_req::Aggregations;


Can you add this aggregation to test_aggregation_flushing?

PSeitz · 2024-05-28T09:13:02Z

Looks good so far, I left some comments.

I don't think hyperloglog is really big on its own, but all aggregations together may cost quite a bit if you don't use it.

fulmicoton reviewed Apr 2, 2024


fulmicoton requested a review from PSeitz April 2, 2024 07:05

raphaelcoeffic changed the title ~~Draft: Cardinality aggregation~~ feat: cardinality aggregation Apr 3, 2024

raphaelcoeffic marked this pull request as ready for review April 3, 2024 09:29

raphaelcoeffic added 7 commits April 12, 2024 08:12

WiP: cardinality aggregation

ca0c6d4

Collect unique entries first, then insert into HyperLogLog

06b641d

Handle missing

62ca5b7

Hybrid approach

b2e97d3

Review changes

1a985f8

- insert `missing` value at most once
- `term_id` -> `term_ord`
- iterate directly over entries without collecting first

Use salted hasher to include column type

cd284d1

fix: formatting

700413c

raphaelcoeffic force-pushed the cardinality_aggregation branch from 476ddd9 to 700413c Compare April 12, 2024 06:14

PSeitz reviewed May 28, 2024
