
Issue 1787 extended stats #2247

Open
wants to merge 32 commits into main

Conversation

giovannicuccu
Contributor

Here is a pull request for #1787
Some notes:

  • the "simple" stats are calculated via ExtendedStats (with the same logic used for Min, Max, etc)
  • the tests use an approximate comparison using an extended versione of the macro assert_nearly_equals which accepts an extra parameter specifying the max allowed difference between the two values
  • The algorithm for calculating the variance is Welford's online algorithm with a modification for calculating the mean since the code can use the sum (there is a comment in the code)
  • The sum is computed using the Kahan algorithm (as in elastic search)
  • The value sum_of_squares of extended stats is the same value computed by elastic search even if sum of squares has a different meaning. The "real" sum of squares is in the code so if you decide to expose it the change is straightforward
  • For a specific data set I made a comparison with the numbers returned by elastic search, here is a simple summary:
                        ES                    Tantivy
variance                9.138888888888891     9.138888888888888
variance_sampling       10.966666666666669    10.966666666666663
std_deviation           3.023059524536176     3.023059524536176
std_deviation_sampling  3.3115957885386114    3.311595788538611
lower                   -0.8794523824056855   -0.8794523824056837
lower_sampling          -1.4565249104105558   -1.4565249104105549
upper                   11.21278571573902     11.212785715739017
upper_sampling          11.78985824374389     11.78985824374389
sum of squares          215.0                 215.0
  • I also did a comparison using values where the mean is very close to the variance; Tantivy's calculations are more precise (I used Excel as a reference for obtaining the value) because Elasticsearch does not implement Welford's algorithm.
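For reference, a minimal textbook sketch of the Welford update (the actual implementation derives the mean from the running sum instead, so the names and details below are illustrative only):

#[derive(Default)]
struct WelfordSketch {
    count: u64,
    mean: f64,
    m2: f64, // sum of squared deviations from the current mean
}

impl WelfordSketch {
    fn collect(&mut self, value: f64) {
        self.count += 1;
        let delta = value - self.mean;
        self.mean += delta / self.count as f64;
        let delta2 = value - self.mean;
        self.m2 += delta * delta2;
    }

    /// Population variance (None for an empty stream).
    fn variance(&self) -> Option<f64> {
        (self.count > 0).then(|| self.m2 / self.count as f64)
    }

    /// Sample variance (None for fewer than two values).
    fn variance_sampling(&self) -> Option<f64> {
        (self.count > 1).then(|| self.m2 / (self.count - 1) as f64)
    }
}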

@fulmicoton
Collaborator

@PSeitz can you review?

pub struct IntermediateStats {
    stats: IntermediateExtendedStats,
Contributor

I think the memory overhead for e.g. a terms -> avg aggregation is too high.

Performance may also be impacted:
cargo +nightly bench --features unstable aggregation::agg_bench

@giovannicuccu
Contributor Author

This is just a ping. I'm working on it: I created a dataset and a simple application that uses several GB of memory to compute some statistics. I'm using the main branch as the memory consumption reference and working on a solution for keeping it the same.

@PSeitz
Contributor

PSeitz commented Nov 19, 2023

This is just a ping. I'm working on it: I created a dataset and a simple application that uses several GB of memory to compute some statistics. I'm using the main branch as the memory consumption reference and working on a solution for keeping it the same.

Just as a heads up, please no macros. Generics would be fine if the complexity stays low. Otherwise we can just keep it separate.

@giovannicuccu
Contributor Author

Some notes:

  • I used a Box<ExtendedStats> because the size of the enum without the box was impacting the memory usage of all the stats searches. In my experiments, a term aggregation without the box was consuming about 15% more RAM than the current version (see the sketch after this list).
  • I split Stats and ExtendedStats; the latter uses the former.
  • I introduced a new intermediate struct that makes SegmentStatsCollector more flexible, so it works with both stats and extended stats; in the future, if needed, the single stats (min, max, avg) can be implemented independently with minimal effort.
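To illustrate the first point, a sketch (made-up struct names and field counts, not the actual types) of why boxing the large variant keeps every collector small:

use std::mem::size_of;

// Made-up stand-ins for the two payload sizes: the extended-stats state is
// much larger than the plain stats state.
struct PlainStats([f64; 5]);
struct ExtendedStatsState([f64; 12]);

enum UnboxedCollector {
    Stats(PlainStats),
    ExtendedStats(ExtendedStatsState),
}

enum BoxedCollector {
    Stats(PlainStats),
    ExtendedStats(Box<ExtendedStatsState>),
}

fn main() {
    // An enum value is as large as its largest variant, so without the Box
    // even plain stats collectors would carry the extended-stats payload.
    println!("unboxed: {} bytes", size_of::<UnboxedCollector>()); // ~104 on a typical 64-bit target
    println!("boxed:   {} bytes", size_of::<BoxedCollector>());   // ~48 on a typical 64-bit target
}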

@giovannicuccu
Contributor Author

Hello, I'd like to know how to proceed with this pull request; it has been pending for a while.

@PSeitz
Contributor

PSeitz commented Apr 15, 2024

Hello, I'd like to know how to proceed with this pull request; it has been pending for a while.

Sorry, I forgot about this PR. It's fine to ping occasionally. Can you resolve the merge conflicts?
I'm considering adding a benchmark that also reports memory consumption for the aggregation benchmarks.

@giovannicuccu
Contributor Author

The merge is done. Do you have an example of a benchmark that also reports memory consumption? I ran cargo bench but saw no indication of memory consumption.

@PSeitz
Contributor

PSeitz commented Apr 26, 2024

The merge is done. Do you have an example of a benchmark that also reports memory consumption? I ran cargo bench but saw no indication of memory consumption.

Thanks! Memory reporting is coming with this PR:
#2378

@giovannicuccu
Contributor Author

Hi, quick update: I added an extended stats bench in the corresponding file and my fork is merged with main; I'm waiting for #2378 to be merged into main before aligning with it and then pushing to this PR.

@giovannicuccu
Contributor Author

New version merged with main, plus an updated bench for extended stats.

@PSeitz
Contributor

PSeitz commented May 15, 2024

The memory consumption increased by 10% for those two benchmarks. I don't think you should pay for extended stats if you are not using it:

terms_many_with_avg_sub_agg                     Memory: 29.0 MB  (+9.01%)    Avg: 131.7313ms  (-3.79%)     Median: 125.7534ms  (+3.04%)    [118.0482ms .. 180.8836ms]
terms_many_json_mixed_type_with_sub_agg_card    Memory: 44.0 MB  (+9.95%)    Avg: 176.1016ms  (+13.39%)    Median: 161.8976ms  (+8.52%)    [151.0041ms .. 251.0137ms]

@giovannicuccu
Contributor Author

Hi, I dug into the issue and there are two reasons for the additional memory consumption:

  1. I created a unified SegmentStatsCollector (for both stats and extended stats) that requires more memory. I developed a different solution, i.e. a dedicated SegmentExtendedStatsCollector that duplicates part of the code but does not allocate more memory if you use the regular stats. This change reduces the memory increase to 5% (i.e. it halves the memory increase compared to main).
  2. I modified the stats to use the Kahan summation algorithm because it's more precise. The new algorithm requires an extra f64, which is responsible for the remaining 5% memory increase (see the sketch below). I can revert the stats implementation, but some precision would be lost (Elasticsearch uses Kahan summation).
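For reference, an illustrative sketch of Kahan (compensated) summation; the compensation field is the extra f64 per collector mentioned in point 2, but the names here are placeholders, not the actual ones in the PR:

#[derive(Default)]
struct KahanSum {
    sum: f64,
    compensation: f64, // running low-order error lost by `sum`
}

impl KahanSum {
    fn add(&mut self, value: f64) {
        let corrected = value - self.compensation;
        let new_sum = self.sum + corrected;
        // (new_sum - sum) keeps only the high-order bits of `corrected`;
        // subtracting `corrected` recovers the low-order error for next time.
        self.compensation = (new_sum - self.sum) - corrected;
        self.sum = new_sum;
    }

    fn value(&self) -> f64 {
        self.sum
    }
}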

Should I also revert the modification to stats in order to restore the previous memory usage?

Thanks

/// Extends stats by adding variance, standard deviation,
/// and bounds information.
#[derive(Clone, Debug, PartialEq, Serialize, Deserialize)]
pub struct ExtendedStats {
Contributor

Can you move the extended stats stuff to extended_stats.rs?

Contributor Author

Yes, but I need to change the visibility of IntermediateStats members and its collect method. If that's OK, I can refactor; do you prefer that I add accessor methods or declare the fields as pub?

Contributor

you can add pub(crate)

}
/// Merges the other stats intermediate result into self.
pub fn merge_fruits(&mut self, other: IntermediateExtendedStats) {
    if other.intermediate_stats.count != 0 {
Contributor

this should be an early exit, not nesting everything

/// Merges the other stats intermediate result into self.
pub fn merge_fruits(&mut self, other: IntermediateExtendedStats) {
    if other.intermediate_stats.count != 0 {
        if self.intermediate_stats.count == 0 {
Contributor

this should be a mem::replace + early exit, not nesting everything
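Something like this illustrative shape (a made-up accumulator with two fields, not the real IntermediateExtendedStats):

use std::mem;

#[derive(Default)]
struct Acc {
    count: u64,
    sum: f64,
}

impl Acc {
    /// Early-exit merge: handle the empty cases up front, then fall through
    /// to the general path without nesting.
    fn merge(&mut self, other: Acc) {
        if other.count == 0 {
            return; // nothing to merge
        }
        if self.count == 0 {
            // `self` is empty: take over `other`'s state wholesale;
            // mem::replace also hands back the old (empty) value.
            let _empty = mem::replace(self, other);
            return;
        }
        // General merge path for two non-empty accumulators.
        self.count += other.count;
        self.sum += other.sum;
    }
}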

Comment on lines 372 to 382
if other.intermediate_stats.count == 1 {
    self.collect(other.intermediate_stats.sum);
} else if self.intermediate_stats.count == 1 {
    let sum = self.intermediate_stats.sum;
    self.intermediate_stats = other.intermediate_stats;
    self.sum_of_squares = other.sum_of_squares;
    self.sum_of_squares_elastic = other.sum_of_squares_elastic;
    self.mean = other.mean;
    self.delta_sum_for_squares_elastic = other.delta_sum_for_squares_elastic;
    self.collect(sum);
} else {
Contributor

these cases are way too special and should be handled by the generic version

Contributor Author

Sorry, but I don't understand; let me clarify the current situation.

I created a version with a new segment collector for extended stats which reduces the extra memory increase, but I have not pushed it yet because I'm waiting for a response about Kahan summation. In the new version the generic component is gone, so I cannot apply the change.

If you were referring to some other modification, please give me some more details.

Contributor

Yes, we can use Kahan summation.

What I mean is that the complexity these branches add, instead of using the version that can handle all counts, needs to be justified.

if other.intermediate_stats.count == 1 {
    self.collect(other.intermediate_stats.sum);
} else if self.intermediate_stats.count == 1 {
    let sum = self.intermediate_stats.sum;
    self.intermediate_stats = other.intermediate_stats;
    self.sum_of_squares = other.sum_of_squares;
    self.sum_of_squares_elastic = other.sum_of_squares_elastic;
    self.mean = other.mean;
    self.delta_sum_for_squares_elastic = other.delta_sum_for_squares_elastic;
    self.collect(sum);

Comment on lines 415 to 430
let max = if self.intermediate_stats.count == 0 {
    None
} else {
    Some(self.intermediate_stats.max)
};
let avg = if self.intermediate_stats.count == 0 {
    None
} else {
    Some(self.mean)
};
let sum_of_squares = if self.intermediate_stats.count == 0 {
    None
} else {
    Some(self.sum_of_squares_elastic)
};
let variance = if self.intermediate_stats.count <= 1 {
Contributor

all these different-but-identical ifs are error prone and should be collapsed into one condition
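For example, roughly this shape (placeholder struct and fields, not the PR's actual types): check the count once and return early with all-None.

struct Summary {
    min: Option<f64>,
    max: Option<f64>,
    avg: Option<f64>,
}

fn finalize(count: u64, min: f64, max: f64, sum: f64) -> Summary {
    // Single guard instead of repeating `count == 0` per field.
    if count == 0 {
        return Summary { min: None, max: None, avg: None };
    }
    Summary {
        min: Some(min),
        max: Some(max),
        avg: Some(sum / count as f64),
    }
}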

Contributor Author

OK, my code was inspired by the IntermediateStats one:

let min = if self.count == 0 {
    None
} else {
    Some(self.min)
};
let max = if self.count == 0 {
    None
} else {
    Some(self.max)
};
let avg = if self.count == 0 {
    None
} else {
    Some(self.sum / (self.count as f64))
};

Should I refactor that one as well?

Contributor

This is also not nice, but they are all the same check, so it's easy to proofread.

@giovannicuccu
Contributor Author

Hi, thanks a lot for your review; this submission contains the code modifications.
This version still shows a memory usage increase (halved since the last submission) because of the introduction of Kahan summation in stats.

(left_val, right_val, epsilon_val) => {
    let diff = (left_val - right_val).abs();

    if diff > *epsilon_val {
Contributor

Can the code above call this part with epsilon 0.0005?

Contributor Author

I'm not sure I got it. Are you asking for a change of epsilon in the extended_stats tests, or for a change in the macro using a fixed value?

Contributor

The macro has two code paths, one with an epsilon parameter and one with epsilon fixed at 0.0005. The 0.0005 path can call the variant with the parameter.
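Roughly this delegation shape (a sketch mirroring assert_nearly_equals, not its exact code):

macro_rules! assert_nearly_equals_sketch {
    // Default-epsilon arm: forward to the explicit-epsilon arm instead of
    // duplicating the comparison.
    ($left:expr, $right:expr) => {
        assert_nearly_equals_sketch!($left, $right, 0.0005)
    };
    ($left:expr, $right:expr, $epsilon:expr) => {{
        let (left_val, right_val, epsilon_val) = ($left, $right, $epsilon);
        let diff = (left_val - right_val).abs();
        assert!(
            diff <= epsilon_val,
            "values differ by {} (> {})",
            diff,
            epsilon_val
        );
    }};
}

fn main() {
    assert_nearly_equals_sketch!(1.0_f64, 1.0002);      // default epsilon
    assert_nearly_equals_sketch!(1.0_f64, 1.4, 0.5);    // explicit epsilon
}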

Contributor Author

OK, done, thank you.

sum_of_squares_elastic: 0.0,
delta_sum_for_squares_elastic: 0.0,
mean: 0.0,
sigma: 2.0,
Contributor

Can you document why this is 2.0?

sum_of_squares_elastic: f64,
/// delta for sum of squares as computed by elastic search needed for the Kahan algorithm
delta_sum_for_squares_elastic: f64,
// The mean an intermediate value need for calculating the variance
Contributor

That comment seems broken

/// The variance of the fast field values. `None` if count is less then 2.
pub variance: Option<f64>,
/// The variance population of the fast field values, always equal to variance. `None` if count
/// is less then 2.
Contributor

"less than"

"if count is 0 or 1" is maybe easier to read

"std_deviation" => Ok(self.std_deviation),
"std_deviation_sampling" => Ok(self.std_deviation_sampling),
"std_deviation_population" => Ok(self.std_deviation_population),
"std_deviation_bounds.lower" => Ok(self
Contributor

This should match the Elasticsearch name, not the internal name of the property.

Contributor Author

Perhaps I'm missing something, but the names seem to match the Elasticsearch ones: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-extendedstats-aggregation.html

Comment on lines +283 to +284
variance,
variance_population: variance,
Contributor

Why are there two fields with the same value?

Contributor Author

Basically because of the Elasticsearch implementation; the documentation (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-extendedstats-aggregation.html) states: "The std_deviation and variance are calculated as population metrics so they are always the same as std_deviation_population and variance_population respectively."

};
let std_deviation = variance.map(|v| v.sqrt());
let std_deviation_sampling = variance_sampling.map(|v| v.sqrt());
let std_deviation_bounds = if std_deviation.is_none() {
Contributor

Can you use if let Some(std_deviation, std_deviation_sampling) = ... instead of is_some + unwrap?
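Something like this illustrative shape (placeholder types and signature, not the PR's): compute the bounds only when both deviations are present, without is_none()/unwrap().

struct Bounds {
    lower: f64,
    upper: f64,
}

fn std_deviation_bounds(
    mean: f64,
    sigma: f64,
    std_deviation: Option<f64>,
    std_deviation_sampling: Option<f64>,
) -> Option<(Bounds, Bounds)> {
    // Pattern-match both options at once instead of checking and unwrapping.
    if let (Some(sd), Some(sd_sampling)) = (std_deviation, std_deviation_sampling) {
        Some((
            Bounds { lower: mean - sigma * sd, upper: mean + sigma * sd },
            Bounds { lower: mean - sigma * sd_sampling, upper: mean + sigma * sd_sampling },
        ))
    } else {
        None
    }
}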

@giovannicuccu
Contributor Author

This version should reflect the changes you requested, with the exception of the stat labels; they seem to me to be the same as Elasticsearch's.
