Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Histogram reuse bug in query evaluation #14072

Closed
faustuzas opened this issue May 9, 2024 · 4 comments
Closed

Histogram reuse bug in query evaluation #14072

faustuzas opened this issue May 9, 2024 · 4 comments

Comments

@faustuzas
Copy link
Contributor

faustuzas commented May 9, 2024

What did you do?

I have noticed a bug in histogram memory reuse when (chunkenc.Iterator).AtFloatHistogram returns multiple histograms with the same address and does not use the newly added destination histogram functionality.

Exemplary way to reproduce the bug:

  1. The Iterator returns two consecutive histograms that point to the same address X.
  2. The histograms slice now looks like this: [X, X (.....)], () denotes part of the slice where len <= i < cap.
  3. The iterator returns another histogram with the address Y where Y != X.
  4. The first slot in histograms slice which is pointing to X gets pivoted to the unused back of the slice in:

    prometheus/promql/engine.go

    Lines 2134 to 2138 in 312e3fd

    tail := make([]HPoint, drop)
    copy(tail, histograms[:drop])
    copy(histograms, histograms[drop:])
    copy(histograms[len(histograms)-drop:], tail)
    histograms = histograms[:len(histograms)-drop]
  5. Now the histograms slice is [X, Y (X, .....)].
  6. On arrival of another histogram, slice length is increased and the new histogram is copied to the address X, mutating histograms at i=0 and i=2:

    prometheus/promql/engine.go

    Lines 2169 to 2175 in 312e3fd

    n := len(histograms)
    if n < cap(histograms) {
    histograms = histograms[:n+1]
    } else {
    histograms = append(histograms, HPoint{H: &histogram.FloatHistogram{}})
    }
    histograms[n].T, histograms[n].H = buf.AtFloatHistogram(histograms[n].H)

Demo unit test reproducing the bug: faustuzas#1

Debug logs from the test above printing the inputs to the rate function inside the PromQL engine:

Input into function:
 => [2024-05-09 12:00:00 +0300 EEST] 300000.00 (0x14000130240)
Input into function:
 => [2024-05-09 12:00:00 +0300 EEST] 300000.00 (0x14000130240)
Input into function:
 => [2024-05-09 12:00:00 +0300 EEST] 300000.00 (0x14000130240)
 => [2024-05-09 12:02:00 +0300 EEST] 300000.00 (0x14000130240)
Input into function:
 => [2024-05-09 12:00:00 +0300 EEST] 300000.00 (0x14000130240)
 => [2024-05-09 12:02:00 +0300 EEST] 300000.00 (0x14000130240)
Input into function:
 => [2024-05-09 12:00:00 +0300 EEST] 300000.00 (0x14000130240)
 => [2024-05-09 12:02:00 +0300 EEST] 300000.00 (0x14000130240)
Input into function:
 => [2024-05-09 12:00:00 +0300 EEST] 300000.00 (0x14000130240)
 => [2024-05-09 12:02:00 +0300 EEST] 300000.00 (0x14000130240)
 => [2024-05-09 12:05:00 +0300 EEST] 600000.00 (0x140001302d0)
Input into function:
 => [2024-05-09 12:02:00 +0300 EEST] 750000.00 (0x14000130240) <-- bug happens and rate function detects a reset
 => [2024-05-09 12:05:00 +0300 EEST] 600000.00 (0x140001302d0)
 => [2024-05-09 12:05:30 +0300 EEST] 750000.00 (0x14000130240)
Input into function:
 => [2024-05-09 12:02:00 +0300 EEST] 750000.00 (0x14000130240)
 => [2024-05-09 12:05:00 +0300 EEST] 600000.00 (0x140001302d0)
 => [2024-05-09 12:05:30 +0300 EEST] 750000.00 (0x14000130240)
Input into function:
 => [2024-05-09 12:05:00 +0300 EEST] 600000.00 (0x140001302d0)
 => [2024-05-09 12:05:30 +0300 EEST] 750000.00 (0x14000130240)

The problem is absent when line

histograms[n].T, histograms[n].H = it.AtFloatHistogram(histograms[n].H)
is changed to:

t, fh := it.AtFloatHistogram(histograms[n].H)
histograms[n].T, histograms[n].H = t, fh.Copy()

However, this defeats the whole purpose of having a destination histogram in the iterator.

What did you expect to see?

No response

What did you see instead? Under which circumstances?


System information

No response

Prometheus version

main branch

Prometheus configuration file

No response

Alertmanager version

No response

Alertmanager configuration file

No response

Logs

No response

@faustuzas
Copy link
Contributor Author

Hey @fpetkovski @krajorama, I've seen you working on optimisations around histograms in PromQL engine recently. Perhaps you could take a look at this bug?

I've tried to come up with a fix, however, promql engine code is very complex and this bug requires deep knowledge about reusing histograms in the iterators to fix it properly.

@fpetkovski
Copy link
Contributor

I will take a look!

@fpetkovski
Copy link
Contributor

@faustuzas I took at look at the bug and I don't think that the iterator stub used in tests is implemented correctly. I've made a suggestion for the correct implementation: https://github.com/faustuzas/prometheus/pull/1/files#r1597967140.

I am curious how you discovered the bug. Did you see it inside Prometheus or are you using the engine or some other Prometheus package in a different project?

@faustuzas
Copy link
Contributor Author

Hey, @fpetkovski, thank you for taking a look. Yes, this bug was discovered outside the Prometheus.

Your suggested change to the stub iterator makes sense, the semantics of the engine have changed a little with the addition of histogram re-use and the reduction of allocations.

I will close the issue, thanks for the help again!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants