
Issue 1787 extended stats #2247

Open
wants to merge 32 commits into main

Conversation

giovannicuccu
Contributor

Here is a pull request for #1787
Some notes:

  • the "simple" stats are calculated via ExtendedStats (with the same logic used for Min, Max, etc)
  • the tests use an approximate comparison using an extended versione of the macro assert_nearly_equals which accepts an extra parameter specifying the max allowed difference between the two values
  • The algorithm for calculating the variance is Welford's online algorithm with a modification for calculating the mean since the code can use the sum (there is a comment in the code)
  • The sum is computed using the Kahan algorithm (as in elastic search)
  • The value sum_of_squares of extended stats is the same value computed by elastic search even if sum of squares has a different meaning. The "real" sum of squares is in the code so if you decide to expose it the change is straightforward
  • For a specific data set I made a comparison with the numbers returned by elastic search, here is a simple summary:
                        ES                    Tantivy
variance                9.138888888888891     9.138888888888888
variance_sampling       10.966666666666669    10.966666666666663
std_deviation           3.023059524536176     3.023059524536176
std_deviation_sampling  3.3115957885386114    3.311595788538611
lower                   -0.8794523824056855   -0.8794523824056837
lower_sampling          -1.4565249104105558   -1.4565249104105549
upper                   11.21278571573902     11.212785715739017
upper_sampling          11.78985824374389     11.78985824374389
sum of squares          215.0                 215.0
  • I also did a comparison using values where the mean is very close to the variance; Tantivy's calculations are more precise (I used Excel as a reference for obtaining the value) because Elasticsearch does not implement Welford's algorithm.
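For reference, a minimal textbook sketch of the Welford update (the actual implementation derives the mean from the running sum instead, so the names and details below are illustrative only):

#[derive(Default)]
struct WelfordSketch {
    count: u64,
    mean: f64,
    m2: f64, // sum of squared deviations from the current mean
}

impl WelfordSketch {
    fn collect(&mut self, value: f64) {
        self.count += 1;
        let delta = value - self.mean;
        self.mean += delta / self.count as f64;
        let delta2 = value - self.mean;
        self.m2 += delta * delta2;
    }

    /// Population variance (None for an empty stream).
    fn variance(&self) -> Option<f64> {
        (self.count > 0).then(|| self.m2 / self.count as f64)
    }

    /// Sample variance (None for fewer than two values).
    fn variance_sampling(&self) -> Option<f64> {
        (self.count > 1).then(|| self.m2 / (self.count - 1) as f64)
    }
}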

@fulmicoton
Collaborator

@PSeitz can you review?

pub struct IntermediateStats {
    stats: IntermediateExtendedStats,
Contributor

I think the memory overhead for e.g. a terms -> avg aggregation is too high.

Performance may also be impacted:
cargo +nightly bench --features unstable aggregation::agg_bench

@giovannicuccu
Contributor Author

This is just a ping. I'm working on it: I created a dataset and a simple application that uses several GB of memory to compute some statistics. I'm using the main branch as the memory consumption reference and working on a solution for keeping it the same.

@PSeitz
Contributor

PSeitz commented Nov 19, 2023

This is just a ping. I'm working on it: I created a dataset and a simple application that uses several GB of memory to compute some statistics. I'm using the main branch as the memory consumption reference and working on a solution for keeping it the same.

Just as a heads up, please no macros. Generics would be fine if the complexity stays low. Otherwise we can just keep it separate.

@giovannicuccu
Contributor Author

Some notes:

  • I used a Box<ExtendedStats> because the size of the enum without the box was impacting the memory usage of all the stats searches. In my experiments, a term aggregation without the box was consuming about 15% more RAM than the current version (see the sketch after this list).
  • I split Stats and ExtendedStats; the latter uses the former.
  • I introduced a new intermediate struct that makes SegmentStatsCollector more flexible, so it works with both stats and extended stats; in the future, if needed, the single stats (min, max, avg) can be implemented independently with minimal effort.
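To illustrate the first point, a sketch (made-up struct names and field counts, not the actual types) of why boxing the large variant keeps every collector small:

use std::mem::size_of;

// Made-up stand-ins for the two payload sizes: the extended-stats state is
// much larger than the plain stats state.
struct PlainStats([f64; 5]);
struct ExtendedStatsState([f64; 12]);

enum UnboxedCollector {
    Stats(PlainStats),
    ExtendedStats(ExtendedStatsState),
}

enum BoxedCollector {
    Stats(PlainStats),
    ExtendedStats(Box<ExtendedStatsState>),
}

fn main() {
    // An enum value is as large as its largest variant, so without the Box
    // even plain stats collectors would carry the extended-stats payload.
    println!("unboxed: {} bytes", size_of::<UnboxedCollector>()); // ~104 on a typical 64-bit target
    println!("boxed:   {} bytes", size_of::<BoxedCollector>());   // ~48 on a typical 64-bit target
}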

@giovannicuccu
Contributor Author

Hello, I'd like to know how to proceed with this pull request; it has been pending for a while.

@PSeitz
Contributor

PSeitz commented Apr 15, 2024

Hello, I'd like to know how to proceed with this pull request; it has been pending for a while.

Sorry, I forgot about this PR. It's fine to ping occasionally. Can you resolve the merge conflicts?
I'm considering adding a benchmark that also reports memory consumption for the aggregation benchmarks.

@giovannicuccu
Contributor Author

The merge is done. Do you have an example of a benchmark that also reports memory consumption? I ran cargo bench but saw no indication of memory consumption.

@PSeitz
Contributor

PSeitz commented Apr 26, 2024

The merge is done. Do you have an example of a benchmark that also reports memory consumption? I ran cargo bench but saw no indication of memory consumption.

Thanks! Memory reporting is coming with this PR:
#2378

@giovannicuccu
Contributor Author

Hi, quick update: I added an extended stats bench in the corresponding file and my fork is merged with main; I'm waiting for #2378 to be merged into main before aligning with it and then pushing to this PR.

@giovannicuccu
Contributor Author

New version merged with main, plus an updated bench for extended stats.

@PSeitz
Contributor

PSeitz commented May 15, 2024

The memory consumption increased by 10% for those two benchmarks. I don't think you should pay for extended stats if you are not using it:

terms_many_with_avg_sub_agg                     Memory: 29.0 MB  (+9.01%)    Avg: 131.7313ms  (-3.79%)     Median: 125.7534ms  (+3.04%)    [118.0482ms .. 180.8836ms]
terms_many_json_mixed_type_with_sub_agg_card    Memory: 44.0 MB  (+9.95%)    Avg: 176.1016ms  (+13.39%)    Median: 161.8976ms  (+8.52%)    [151.0041ms .. 251.0137ms]

@giovannicuccu
Contributor Author

Hi, I dug into the issue and there are two reasons for the additional memory consumption:

  1. I created a unified SegmentStatsCollector (for both stats and extended stats) that requires more memory. I developed a different solution, i.e. a dedicated SegmentExtendedStatsCollector that duplicates part of the code but does not allocate more memory if you use the regular stats. This change reduces the memory increase to 5% (i.e. it halves the memory increase compared to main).
  2. I modified the stats to use the Kahan summation algorithm because it's more precise. The new algorithm requires an extra f64, which is responsible for the remaining 5% memory increase (see the sketch below). I can revert the stats implementation, but some precision would be lost (Elasticsearch uses Kahan summation).
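For reference, an illustrative sketch of Kahan (compensated) summation; the compensation field is the extra f64 per collector mentioned in point 2, but the names here are placeholders, not the actual ones in the PR:

#[derive(Default)]
struct KahanSum {
    sum: f64,
    compensation: f64, // running low-order error lost by `sum`
}

impl KahanSum {
    fn add(&mut self, value: f64) {
        let corrected = value - self.compensation;
        let new_sum = self.sum + corrected;
        // (new_sum - sum) keeps only the high-order bits of `corrected`;
        // subtracting `corrected` recovers the low-order error for next time.
        self.compensation = (new_sum - self.sum) - corrected;
        self.sum = new_sum;
    }

    fn value(&self) -> f64 {
        self.sum
    }
}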

Should I also revert the modification to stats in order to restore the previous memory usage?

Thanks

/// Extends stats by adding variance, standard deviation,
/// and bounds information.
#[derive(Clone, Debug, PartialEq, Serialize, Deserialize)]
pub struct ExtendedStats {
Contributor

Can you move the extended stats stuff to extended_stats.rs?

Contributor Author

Yes, but I need to change the visibility of IntermediateStats members and its collect method. If that's OK, I can refactor; do you prefer that I add accessor methods or declare the fields as pub?

Contributor

you can add pub(crate)

}
/// Merges the other stats intermediate result into self.
pub fn merge_fruits(&mut self, other: IntermediateExtendedStats) {
    if other.intermediate_stats.count != 0 {
Contributor

this should be an early exit, not nesting everything

/// Merges the other stats intermediate result into self.
pub fn merge_fruits(&mut self, other: IntermediateExtendedStats) {
    if other.intermediate_stats.count != 0 {
        if self.intermediate_stats.count == 0 {
Contributor

this should be a mem::replace + early exit, not nesting everything
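Something like this illustrative shape (a made-up accumulator with two fields, not the real IntermediateExtendedStats):

use std::mem;

#[derive(Default)]
struct Acc {
    count: u64,
    sum: f64,
}

impl Acc {
    /// Early-exit merge: handle the empty cases up front, then fall through
    /// to the general path without nesting.
    fn merge(&mut self, other: Acc) {
        if other.count == 0 {
            return; // nothing to merge
        }
        if self.count == 0 {
            // `self` is empty: take over `other`'s state wholesale;
            // mem::replace also hands back the old (empty) value.
            let _empty = mem::replace(self, other);
            return;
        }
        // General merge path for two non-empty accumulators.
        self.count += other.count;
        self.sum += other.sum;
    }
}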

Comment on lines 372 to 382
if other.intermediate_stats.count == 1 {
    self.collect(other.intermediate_stats.sum);
} else if self.intermediate_stats.count == 1 {
    let sum = self.intermediate_stats.sum;
    self.intermediate_stats = other.intermediate_stats;
    self.sum_of_squares = other.sum_of_squares;
    self.sum_of_squares_elastic = other.sum_of_squares_elastic;
    self.mean = other.mean;
    self.delta_sum_for_squares_elastic = other.delta_sum_for_squares_elastic;
    self.collect(sum);
} else {
Contributor

these cases are way too special and should be handled by the generic version

Contributor Author

Sorry, but I don't understand; let me clarify the current situation.

I created a version with a new segment collector for extended stats which reduces the extra memory increase, but I have not pushed it yet because I'm waiting for a response about Kahan summation. In the new version the generic component is gone, so I cannot apply the change.

If you were referring to some other modification, please give me some more details.

Contributor

Yes, we can use Kahan summation.

What I mean is that the complexity these branches add, instead of using the version that can handle all counts, needs to be justified.

if other.intermediate_stats.count == 1 {
    self.collect(other.intermediate_stats.sum);
} else if self.intermediate_stats.count == 1 {
    let sum = self.intermediate_stats.sum;
    self.intermediate_stats = other.intermediate_stats;
    self.sum_of_squares = other.sum_of_squares;
    self.sum_of_squares_elastic = other.sum_of_squares_elastic;
    self.mean = other.mean;
    self.delta_sum_for_squares_elastic = other.delta_sum_for_squares_elastic;
    self.collect(sum);

Comment on lines 415 to 430
let max = if self.intermediate_stats.count == 0 {
    None
} else {
    Some(self.intermediate_stats.max)
};
let avg = if self.intermediate_stats.count == 0 {
    None
} else {
    Some(self.mean)
};
let sum_of_squares = if self.intermediate_stats.count == 0 {
    None
} else {
    Some(self.sum_of_squares_elastic)
};
let variance = if self.intermediate_stats.count <= 1 {
Contributor

all these different-but-identical ifs are error prone and should be collapsed into one condition
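For example, roughly this shape (placeholder struct and fields, not the PR's actual types): check the count once and return early with all-None.

struct Summary {
    min: Option<f64>,
    max: Option<f64>,
    avg: Option<f64>,
}

fn finalize(count: u64, min: f64, max: f64, sum: f64) -> Summary {
    // Single guard instead of repeating `count == 0` per field.
    if count == 0 {
        return Summary { min: None, max: None, avg: None };
    }
    Summary {
        min: Some(min),
        max: Some(max),
        avg: Some(sum / count as f64),
    }
}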

Contributor Author

OK, my code was inspired by the IntermediateStats one:

let min = if self.count == 0 {
    None
} else {
    Some(self.min)
};
let max = if self.count == 0 {
    None
} else {
    Some(self.max)
};
let avg = if self.count == 0 {
    None
} else {
    Some(self.sum / (self.count as f64))
};

Should I refactor that one as well?

Contributor

This is also not nice, but they are all the same check, so it's easy to proofread.

@giovannicuccu
Contributor Author

Hi, thanks a lot for your review; this submission contains the code modifications.
This version still shows a memory usage increase (halved since the last submission) because of the introduction of Kahan summation in stats.

(left_val, right_val, epsilon_val) => {
    let diff = (left_val - right_val).abs();

    if diff > *epsilon_val {
Contributor

Can the code above call this part with epsilon 0.0005?

Contributor Author

I'm not sure I got it. Are you asking for a change of epsilon in the extended_stats tests, or for a change in the macro using a fixed value?

Contributor

The macro has two code paths, one with an epsilon parameter and one with epsilon fixed at 0.0005. The 0.0005 path can call the variant with the parameter.
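Roughly this delegation shape (a sketch mirroring assert_nearly_equals, not its exact code):

macro_rules! assert_nearly_equals_sketch {
    // Default-epsilon arm: forward to the explicit-epsilon arm instead of
    // duplicating the comparison.
    ($left:expr, $right:expr) => {
        assert_nearly_equals_sketch!($left, $right, 0.0005)
    };
    ($left:expr, $right:expr, $epsilon:expr) => {{
        let (left_val, right_val, epsilon_val) = ($left, $right, $epsilon);
        let diff = (left_val - right_val).abs();
        assert!(
            diff <= epsilon_val,
            "values differ by {} (> {})",
            diff,
            epsilon_val
        );
    }};
}

fn main() {
    assert_nearly_equals_sketch!(1.0_f64, 1.0002);      // default epsilon
    assert_nearly_equals_sketch!(1.0_f64, 1.4, 0.5);    // explicit epsilon
}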

Contributor Author

OK, done, thank you.

sum_of_squares_elastic: 0.0,
delta_sum_for_squares_elastic: 0.0,
mean: 0.0,
sigma: 2.0,
Contributor

Can you document why this is 2.0?

sum_of_squares_elastic: f64,
/// delta for sum of squares as computed by elastic search needed for the Kahan algorithm
delta_sum_for_squares_elastic: f64,
// The mean an intermediate value need for calculating the variance
Contributor

That comment seems broken

/// The variance of the fast field values. `None` if count is less then 2.
pub variance: Option<f64>,
/// The variance population of the fast field values, always equal to variance. `None` if count
/// is less then 2.
Contributor

"less than"

"if count is 0 or 1" is maybe easier to read

"std_deviation" => Ok(self.std_deviation),
"std_deviation_sampling" => Ok(self.std_deviation_sampling),
"std_deviation_population" => Ok(self.std_deviation_population),
"std_deviation_bounds.lower" => Ok(self
Contributor

This should match the Elasticsearch name, not the internal name of the property.

Contributor Author

Perhaps I'm missing something, but the names seem to match the Elasticsearch ones: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-extendedstats-aggregation.html

Comment on lines +283 to +284
variance,
variance_population: variance,
Contributor

Why are there two fields with the same value?

Contributor Author

Basically because of the Elasticsearch implementation; the documentation (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics-extendedstats-aggregation.html) states: "The std_deviation and variance are calculated as population metrics so they are always the same as std_deviation_population and variance_population respectively."

};
let std_deviation = variance.map(|v| v.sqrt());
let std_deviation_sampling = variance_sampling.map(|v| v.sqrt());
let std_deviation_bounds = if std_deviation.is_none() {
Contributor

Can you use if let Some(std_deviation, std_deviation_sampling) = ... instead of is_some + unwrap?
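Something like this illustrative shape (placeholder types and signature, not the PR's): compute the bounds only when both deviations are present, without is_none()/unwrap().

struct Bounds {
    lower: f64,
    upper: f64,
}

fn std_deviation_bounds(
    mean: f64,
    sigma: f64,
    std_deviation: Option<f64>,
    std_deviation_sampling: Option<f64>,
) -> Option<(Bounds, Bounds)> {
    // Pattern-match both options at once instead of checking and unwrapping.
    if let (Some(sd), Some(sd_sampling)) = (std_deviation, std_deviation_sampling) {
        Some((
            Bounds { lower: mean - sigma * sd, upper: mean + sigma * sd },
            Bounds { lower: mean - sigma * sd_sampling, upper: mean + sigma * sd_sampling },
        ))
    } else {
        None
    }
}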

@giovannicuccu
Contributor Author

This version should reflect the changes you requested, with the exception of the stat labels; they seem to me to be the same as Elasticsearch's.
