
ElasticSearch 7.x too_many_buckets_exception #17327

Closed
bhozar opened this issue May 28, 2019 · 56 comments
Assignees
Labels
datasource/Elasticsearch prio/high Must be staffed and worked on either currently, or very soon, ideally in time for the next release. type/feature-request

Comments

bhozar commented May 28, 2019

What happened:
Upgraded to ES 7.x and Grafana 6.2.x. Some panels relying on the ES datasource were showing "Unknown elastic error response" in the top left corner.

Query inspector displayed this error:

caused_by:Object
type:"too_many_buckets_exception"
reason:"Trying to create too many buckets. Must be less than or equal to: [10000] but was [10001]. This limit can be set by changing the [search.max_buckets] cluster level setting."
max_buckets:10000

What you expected to happen:
Graph to display 3 hours of data from front end proxy logs stored in ElasticSearch 7.x.

How to reproduce it (as minimally and precisely as possible):
Query a lot of data

Environment:

  • Grafana version: 6.2.1
  • Data source type & version: ES 7.0
  • OS Grafana is installed on: Ubuntu 18.04
  • User OS & Browser: Win10/Chrome
marefr (Member) commented May 28, 2019

As the error message from Elasticsearch says, "This limit can be set by changing the [search.max_buckets] cluster level setting." I don't see what Grafana can do to resolve this.

To minimize the number of buckets, either raise the min time interval at the datasource or panel level, or set min doc count on the date histogram to 1.
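As a hedged sketch, the two suggested settings look roughly like this in the query body sent to Elasticsearch (the agg name and field are illustrative, not the exact query Grafana generates):

```python
import json

# Illustrative date histogram body: a raised interval plus min_doc_count=1,
# so empty buckets are dropped from the response. Not Grafana's actual query.
def date_histogram_agg(field="@timestamp", interval="1m", min_doc_count=1):
    return {
        "size": 0,
        "aggs": {
            "2": {
                "date_histogram": {
                    "field": field,
                    "fixed_interval": interval,   # the "min time interval"
                    "min_doc_count": min_doc_count,
                }
            }
        },
    }

body = date_histogram_agg(interval="10m")
print(json.dumps(body["aggs"]["2"]["date_histogram"], sort_keys=True))
```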

@marefr marefr closed this as completed May 28, 2019
rnd-ash commented May 29, 2019

Surely Grafana can do something here.

I've noticed that since Elasticsearch 7.x, the terms aggregation now counts towards the bucket limit, rather than just the date histogram. Kibana prevents this error by automatically widening the date histogram resolution when selecting a larger time interval. I found Kibana does this for the visual builder:

Panel time range -> Date histogram resolution
15 minutes -> 10 seconds
30 minutes -> 15 seconds
1 hour -> 30 seconds
4 hours -> 1 minute
12 hours -> 1 minute
24 hours -> 5 minutes
48 hours -> 10 minutes
7 days -> 1 hour

It appears that although Grafana can automatically widen the date histogram interval, it is still making Elasticsearch return too many buckets.

Maybe there could be a way for us to specify time resolutions based on our date picker's time range?
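The widening table above can be sketched as a simple lookup (illustrative only; these thresholds are the ones observed above, not an official Kibana or Grafana algorithm):

```python
from datetime import timedelta

# Kibana-style interval widening, per the observed table above (illustrative).
WIDENING = [
    (timedelta(minutes=15), "10s"),
    (timedelta(minutes=30), "15s"),
    (timedelta(hours=1), "30s"),
    (timedelta(hours=4), "1m"),
    (timedelta(hours=12), "1m"),
    (timedelta(hours=24), "5m"),
    (timedelta(hours=48), "10m"),
    (timedelta(days=7), "1h"),
]

def pick_interval(time_range: timedelta) -> str:
    """Return the first interval whose panel-range threshold covers the request."""
    for limit, interval in WIDENING:
        if time_range <= limit:
            return interval
    return "1d"  # fall back to daily buckets for very large ranges

print(pick_interval(timedelta(hours=3)))  # within the 4h threshold -> "1m"
```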

bhozar (Author) commented Jun 5, 2019

I'm guessing I'm one of very few either experiencing this issue, or not many are running ES 7 yet.

Changing the min doc count to something much higher has little effect, and changing the minimum time interval works fine if you are only looking at an hour of data, but it fails again as you expand the time range. I also changed the ES setting to 100k, but Grafana is still requesting too fine a time grain.

If there was an option to map not only the minimum time value but the full time range to a histogram resolution, it would probably work.
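A rough back-of-the-envelope estimate shows why expanding the time range fails: buckets grow with the time range divided by the interval, multiplied by the cardinality of any terms sub-aggregation (a hedged sketch, not Elasticsearch's exact accounting):

```python
import math

def estimated_buckets(range_seconds: float, interval_seconds: float,
                      terms_cardinality: int = 1) -> int:
    """Rough bucket estimate: one per histogram interval, multiplied by the
    number of terms in any terms sub-aggregation. Illustrative only."""
    return math.ceil(range_seconds / interval_seconds) * terms_cardinality

# 3 hours of data at a 1-second grain, split by 10 terms:
print(estimated_buckets(3 * 3600, 1, 10))  # 108000, far over the 10000 limit
```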

bh9 commented Jun 5, 2019

Grafana should be using Elasticsearch's scroll API (https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-scroll.html) for this. Increasing search.max_buckets above 10000 has no effect because Elasticsearch hard-caps it at 10000.

Ivan-Strahovsky commented:
I'm surprised how underrated this issue is. I'm facing the same problem; changing the interval in the panel or data source helps. But usually we look at metrics daily and want to see them with small granularity, and we also want to look at metrics weekly/monthly, etc. To achieve this I have to change the min interval in the datasource/panel or keep different dashboards with different intervals set, which is not convenient.

marefr (Member) commented Jun 13, 2019

More and more people seem to be hitting this problem, so I'm reopening the issue.

marefr (Member) commented Jun 13, 2019

I'm not exactly sure, though, that it's as simple as extending the automatic intervals. As far as I understand, this also depends on how many terms aggregations and buckets you get in total, so it's not easy to solve in Grafana.

Some context to why they added the search.max_bucket setting: https://discuss.elastic.co/t/requesting-background-info-on-search-max-buckets-change/130334

To me it sounds like you should still be able to configure search.max_buckets to -1 in ES7, similar to how it behaved by default in ES6, but I haven't had time to confirm this. Please try this out and let me know the result.

Looking at Kibana seems like they still have similar problems in at least some parts: elastic/kibana#36892

One of the commenters suggests:

> Run the aggregation via a composite aggregation in order to be able to paginate through results.

Kibana has this related issue open regarding composite aggregations: elastic/kibana#36358

I have never used composite aggregations and I currently know too little about them and why they would be a better alternative to the regular aggregations. It also seems composite aggregations are only supported from ES 6.1 onwards.
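For reference, a composite aggregation wrapping a date histogram looks roughly like this (a hedged sketch based on the Elasticsearch docs; field names, interval, and page size are illustrative, not what Grafana would generate):

```python
# Illustrative composite aggregation body; paginated via the "after" key
# that Elasticsearch returns with each page.
def composite_histogram(after_key=None, size=1000):
    composite = {
        "size": size,  # buckets per page instead of one huge response
        "sources": [{
            "time": {"date_histogram": {"field": "@timestamp",
                                        "fixed_interval": "30s"}}
        }],
    }
    if after_key is not None:
        composite["after"] = after_key  # resume from the previous page
    return {"size": 0, "aggs": {"histo": {"composite": composite}}}

first_page = composite_histogram()
next_page = composite_histogram(after_key={"time": 1560000000000})
```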

marefr (Member) commented Jun 26, 2019

Just to verify, does changing the max concurrent shard request setting to 5 make this better?

DenKn commented Jul 2, 2019

It seems I have the same issue, "Unknown elastic error response".
I have events in ES from 27.05 until now.
If I set the quick range to Last 90 days (in Grafana) I get the error, but if I set Last 30 days or Last 6 months there is no error.

lstyles commented Jul 3, 2019

@marefr -1 isn't a valid option for the search.max_buckets setting. It returns an error that says it needs a value >= 0.

Setting it to 0 is possible, but then nothing seems to be returned.

cpmoore commented Jul 13, 2019

I'm facing the same issue with a count panel grouped by two terms and a date histogram.
It works fine up to the last 5 hours; when I attempt to view the last 6 hours it gives the error regarding the 10000 buckets.
I attempted to change the search.max_buckets setting on my cluster to 15000, but then the error said
Must be less than or equal to: [15000] but was [15001], still 1 more than my cluster setting.

Setting the max concurrent shard to 5 did not help.

It does appear that setting a higher min time interval allows the graph to work, but it also groups more points together and reduces the precision of the data. I have the default set to 30s; changing it to 60s lets the last 6 hours work.

RedStalker commented:
Hello everyone. We also faced this problem after starting the migration to version 7.1 of ELK.
Increasing the search.max_buckets value doesn't help much; it always results in an error that the limit was exceeded.

Akaoni commented Aug 13, 2019

+1.

torkelo (Member) commented Aug 13, 2019

> Kibana prevents this error by automatically widening the date histogram resolution when selecting a larger time interval. I found Kibana does this for the visual builder:

Grafana does the exact same thing if you set the date histogram interval to auto.

Cylox commented Aug 14, 2019

Having the same problem. Setting the date histogram interval to auto does not help. I cannot create a histogram that aggregates data from the last few days, while before the update it was possible to view basically arbitrary time ranges. Interestingly enough, a table panel with the exact same data source does work.

M0rdecay commented:
Having the same problem too.
Grafana - 6.2.5
ES - 7.3.0

WeilunZ commented Sep 9, 2019

Increasing the min time interval works, but when you increase your time range you must change the min time interval value again.

fjlour commented Sep 9, 2019

Also reporting the same problem as described here. Expanding the time range on a high time resolution (fine-grained) dataset will cause this error. Perhaps Grafana should adapt the group-by time to a wider window as the user expands the time range, in order to return the data more aggregated.

If my data has a minimum resolution of milliseconds, there's no need to bring millions of documents to be displayed in a 3-month chart. Data should be aggregated by ES at the query level.

Theoooooo commented Oct 2, 2019

I'm also seeing this issue in the Explore panel in Grafana 6.4.1.
When choosing a greater time range (more than 1 hour) I get an "Unknown elastic error response" because the query returns too many buckets to aggregate or display. But this occurs only in the "Logs" tab, not the "Metrics" tab.

There is also something to be done there to help display the information without having to modify options in Elasticsearch.

UkrZilla commented Oct 7, 2019

Having the same problem too.
Grafana - 6.4.1
ES - 7.4.0

I also think Grafana should use the scroll API.

cpmoore commented Oct 9, 2019

It may be possible to wrap the date histogram aggregation in a composite aggregation, then paginate through the results and combine them client-side.
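A client-side pagination loop over such a composite aggregation could be sketched like this (illustrative; `search` stands in for whatever function posts the body to Elasticsearch and returns the parsed JSON response, e.g. via the official client):

```python
# Sketch of client-side pagination over a composite aggregation: keep
# re-querying with the returned "after_key" until no further pages remain.
def collect_all_buckets(search, build_body):
    buckets, after_key = [], None
    while True:
        resp = search(build_body(after_key))
        agg = resp["aggregations"]["histo"]
        buckets.extend(agg["buckets"])
        after_key = agg.get("after_key")
        if not after_key:  # last page omits after_key
            break
    return buckets

# Demo against two fake response pages, so no live cluster is needed:
pages = iter([
    {"aggregations": {"histo": {"buckets": [{"key": 1}, {"key": 2}],
                                "after_key": {"time": 1}}}},
    {"aggregations": {"histo": {"buckets": [{"key": 3}]}}},
])
result = collect_all_buckets(lambda body: next(pages), lambda ak: {"after": ak})
print([b["key"] for b in result])  # [1, 2, 3]
```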

flunda commented Oct 10, 2019

Same problem here.
ES - 7.2.0
Grafana - 6.4.2

unglaublicherdude commented Oct 10, 2019

Same problem here:
ES - 7.3.2
Grafana - 6.4.2

CRad14 commented Oct 11, 2019

Same Problem here
ES 7.1
Grafana 6.4.2

ywsong219 commented:
Same issue.
ES 7.3.2
Grafana 6.4.0

marefr (Member) commented Dec 30, 2019

@redNixon that's definitely a bug. Thanks for reporting.

berglh commented Mar 25, 2020

> Surely Grafana can do something here.
>
> I've noticed that since Elasticsearch 7.x, the terms aggregation now counts towards the bucket limit, rather than just the date histogram. Kibana prevents this error by automatically widening the date histogram resolution when selecting a larger time interval. I found Kibana does this for the visual builder:
>
> Panel time range -> Date histogram resolution
> 15 minutes -> 10 seconds
> 30 minutes -> 15 seconds
> 1 hour -> 30 seconds
> 4 hours -> 1 minute
> 12 hours -> 1 minute
> 24 hours -> 5 minutes
> 48 hours -> 10 minutes
> 7 days -> 1 hour
>
> It appears that although Grafana can automatically widen the date histogram interval, it is still making Elasticsearch return too many buckets.
>
> Maybe there could be a way for us to specify time resolutions based on our date picker's time range?

Elasticsearch will return whatever you ask it to. The interval is a client-side parameter. The problem with the implementation in Grafana is that it uses the "Date Histogram" aggregation and then doesn't scale the "interval" parameter. The whole concept of Auto isn't supported by the Date Histogram aggregation method in the Search API at all, and it is misleading when this option is presented in Grafana. I believe this confusion occurs because when people choose Auto in Kibana, it dynamically changes the interval size, but perhaps not how you might think.

https://www.elastic.co/guide/en/elasticsearch/reference/6.8/search-aggregations-bucket-datehistogram-aggregation.html

Looking at the API documentation, the interval is supplied by the client at query time. There is no feature in this part of the search API to auto-scale the time interval.

The problem when looking at large time series is that even though you may have < 10000 buckets, those buckets span many large shards, or you are performing terms sub-aggregations along with the date histogram, which adds more total buckets (sub-queries) to the parent aggregation. For me that results in Java OOM errors in Elasticsearch. If your query generates more than 10000 buckets, you will hit the too-many-buckets exception as in the OP. As people have mentioned, if you manually set the min time interval, you basically increase the stability of the query by reducing the total aggregation buckets. While this might work in some limited situations, it is always a trade-off when zooming in to small time periods (very large time buckets reduce the resolution of the visualisation) or zooming out to larger time frames (OOM/too many buckets).

While a solution could be coded into Grafana to scale the time interval to something sensible per quick time range pick, the obvious solution, in my humble opinion, is to expose the Auto Date Histogram aggregation method from the Elasticsearch Search API in the Group By section in Grafana. This would allow the user to define the maximum number of time buckets a given visualisation should return, similar to the auto time interval in Kibana. You can check out the examples here.

https://www.elastic.co/guide/en/elasticsearch/reference/6.8/search-aggregations-bucket-autodatehistogram-aggregation.html

The user is then in control of selecting the maximum time buckets per query which allows the user to control how heavy/detailed each query is and then have Elasticsearch scale the buckets over larger time frames. I think this would be a killer feature for the Elasticsearch data source in Grafana and provide a similar experience to the default Date Aggregations in Kibana :)
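For reference, the auto_date_histogram request described above looks roughly like this (a hedged sketch; the field name and bucket cap are illustrative):

```python
# Illustrative auto_date_histogram body: the client caps the bucket count and
# Elasticsearch picks the interval, reporting the one it used in the response.
def auto_histogram(field="@timestamp", max_buckets=200):
    return {
        "size": 0,
        "aggs": {
            "histo": {
                "auto_date_histogram": {
                    "field": field,
                    "buckets": max_buckets,  # ES widens the interval to fit
                }
            }
        },
    }

body = auto_histogram(max_buckets=500)
```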

esseti commented Apr 17, 2020

Jumping into the discussion: is there an easy way to change the interval value in a dashboard? The auto method does not work for me, so I would be happy to have a single place to change it.

Augustin-FL commented Apr 18, 2020

Hi all,

For those who want to get rid of search.max_buckets:

  • Setting the value to -1 doesn't work (you get the error `Failed to parse value [-1] for setting [search.max_buckets] must be >= 0` when you try)
  • However, you can set it to the maximum accepted value (2^31-1):

PUT _cluster/settings
{
  "persistent": {
    "search.max_buckets": "2147483647"
  }
}

That effectively disables the setting.

For information, this setting is currently being deprecated (see elastic/elasticsearch#51731)

JonasDeGendt commented:
> For those who want to get rid of search.max_buckets:
>
> • Setting the value to -1 doesn't work (you get the error `Failed to parse value [-1] for setting [search.max_buckets] must be >= 0` when you try)
> • However, you can set it to the maximum accepted value (2^31-1):
>
> PUT _cluster/settings
> {
>   "persistent": {
>     "search.max_buckets": "2147483647"
>   }
> }
>
> That effectively disables the setting.

This works flawlessly, thanks a lot!

UkrZilla commented:
Hi guys,

I have good news for you:
elastic/elasticsearch#46751

According to elastic/elasticsearch#55266:

> We introduced a new search.check_buckets_step_size setting to better control how the coordinating node allocates memory when aggregating buckets. The allocation of buckets is now done in steps, each step allocating a number of buckets equal to this setting. To avoid an OutOfMemory error, a parent circuit breaker check is performed on allocation.

s1sfa commented Apr 21, 2020

I think it would be ideal if Grafana handled the time interval dynamically based on the time range, like Kibana. If you want per-second values over multiple days, it doesn't make computational sense to request every single second of multiple days from Elasticsearch.

frittentheke commented:
> I think it would be ideal if grafana handled time interval dynamically based on time range like kibana. If you want per second values over multiple days. It doesn't make computational sense to request every 1 second of multiple days from elasticsearch.

Exactly that (as I also suggested in my comment above, if I may say so :-) ).

berglh commented Apr 21, 2020

@frittentheke @s1sfa I think Grafana shouldn't be responsible for managing the scaling when this feature is already available in the Elasticsearch Search API. We just need to add the auto-date histogram aggregation alongside the regular date histogram aggregation in the Elasticsearch data source in Grafana; then Elasticsearch will scale the buckets according to the requested time range.

frittentheke commented Apr 22, 2020

@berglh while the new functionality might be helpful, very helpful even, it's not as simple as just "use the right query or function". The auto-interval date histogram aggregation will comfortably create buckets at an interval sensible for drawing the graph. But even when using it there could be cases (e.g. querying for counts of individual terms) in which Grafana still needs to deal with the selected time interval not being queryable without causing too many buckets to be created. But certainly it's best to use as much of the storage backend's functionality as possible to optimize the querying server-side. Sorry for not properly diving into the discussion with my last post.

s1sfa commented Apr 22, 2020

@berglh I like the idea, but I think it would need a bit of testing in Grafana's Elasticsearch query building. I tried to simply swap out date_histogram for auto_date_histogram and it appears to not work with other aggregations, like sum ({'reason': 'The first aggregation in buckets_path must be a multi-bucket aggregation'}). Secondly, Elastic's way or not, the ability to scale to a specified interval is pretty important. If I want a per-second rate or something, auto_date_histogram doesn't have any parameter for that, but it does return the interval it used, which would be pretty similar to Grafana just changing the interval on a regular query and then doing the division to get the intended values.

The max_buckets limit being at a low threshold is mostly an Elasticsearch problem, which it looks like they are improving in new versions. But if we think about trying to get one month's worth of per-second data on a graph, some sort of auto-scaling needs to exist, whether Grafana makes the decision based on some source parameters, or auto_date_histogram is figured out and Grafana gets the ability to do a calculation on the returned interval to produce values in the desired unit, like per second.
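The division mentioned above could be sketched like this: auto_date_histogram reports the interval it actually used, so counts can be normalized to a per-second rate client-side (an illustrative helper, not existing Grafana code):

```python
# Normalize a bucket count to a per-second rate, given the interval string
# the aggregation reports back (e.g. "30s", "5m", "1h"). Illustrative only.
INTERVAL_SECONDS = {"s": 1, "m": 60, "h": 3600, "d": 86400}

def per_second_rate(count: float, used_interval: str) -> float:
    value, unit = int(used_interval[:-1]), used_interval[-1]
    return count / (value * INTERVAL_SECONDS[unit])

print(per_second_rate(1800, "5m"))  # 1800 events in 5 minutes -> 6.0 per second
```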

berglh commented Apr 22, 2020

@s1sfa Thanks for trying it out :) Just to clarify my position if I wasn't clear, I'm not suggesting we swap it out directly. I think both types of aggregations are useful depending on the case of the visualisation. I just think providing it as an option as a query type for Elasticsearch data source would be useful for people wanting a more Kibana like experience when creating a dashboard in Grafana.

The max_bucket being at a low threshold is mostly an elasticsearch problem which it looks like they are improving in new versions.

I read that the recent improvement in this area is about handling the circuit breaking of long-running queries more reliably to prevent out-of-memory errors. The performance of Elasticsearch has always been improving, increasing stability under larger queries over time, so you are probably right.

Still, I doubt there will never be a condition where a query hits a circuit breaker and returns a different error, like "unable to service the query due to exceeding circuit breaker", with the cluster effectively determining that too many buckets are the cause of the issue. These types of problems will probably occur less with solid-state storage; spinning-disk clusters with datasets many times larger than the combined JVM heap of the cluster, or histograms split by a big number of terms sub-aggregations, will always run into issues with buckets one way or another.

@frittentheke Giving the user the ability to set a specific integer for the "buckets" parameter of the auto-date histogram query method would let the user tune the graph to the performance characteristics of the dataset and hardware performance. There is nothing stopping a user requesting the last 10 years of data and the query still timing out or hitting some other Elasticsearch performance issue - I figure there is only so much hand holding Grafana can do. I still think there is benefit at least we can give the user an option for an auto-interval scaling solution.

It's up to the Grafana community and data source maintainers to decide if an auto-interval scaling solution should be handled by Grafana, and if there are any trade-offs with metrics-style aggregations as you pointed out @s1sfa. I don't have enough experience with this query type to say whether it's even worth implementing; I just read the manual and was voicing an opinion based on that limited information. It reads like an easy win to give the user control over auto-scaling the interval to a sensible bucket limit on a case-by-case basis 😳

@aocenas aocenas added needs investigation for unconfirmed bugs. use type/bug for confirmed bugs, even if they "need" more investigating prio/high Must be staffed and worked on either currently, or very soon, ideally in time for the next release. labels Aug 26, 2020
narkisr pushed a commit to re-ops/re-dock that referenced this issue Sep 20, 2020
berglh commented Sep 25, 2020

I believe this issue is actually closed by this commit: #21937. You can now set the maximum data points per visualisation, which then automatically calculates the time interval of the aggregation buckets. Between setting your maximum sub-aggregation size limits and the max data points, you get a nicely scaling solution with the aggregation filter. 🎉 I am running Grafana latest from Docker Hub, v7.2.0 (efe4941).


@Elfo404 Elfo404 self-assigned this Sep 28, 2020
Elfo404 (Member) commented Oct 5, 2020

@berglh Thanks for bringing this up, and you are right, this should be fixed since the 6.6.2 release with #21937.
I'm closing this issue; if someone is still facing this problem we can reopen it 🙂

@Elfo404 Elfo404 closed this as completed Oct 5, 2020
Observability (deprecated, use Observability Squad) automation moved this from Backlog features to Done Oct 5, 2020
@zoltanbedi zoltanbedi removed the needs investigation for unconfirmed bugs. use type/bug for confirmed bugs, even if they "need" more investigating label Nov 10, 2020
eertul commented Nov 24, 2023

Hello, we have Elasticsearch 7.15 and Grafana 9.4.7, and we still face this problem.

Elfo404 (Member) commented Nov 24, 2023

9.4.x is way past EOL and not supported anymore. Does this still happen with a more recent (supported) version?
