Monstache did not back off writing data when ElasticSearch disk was full (http code 429), causing log spam #702

Open
ManuelSchmitzberger opened this issue Nov 27, 2023 · 3 comments

Comments


ManuelSchmitzberger commented Nov 27, 2023

Problem: Monstache does not back off writing data to Elasticsearch even on HTTP status code 429.

Details:

  • What's happening: Monstache keeps sending data to Elasticsearch even when the cluster's disk is full. Elasticsearch signals this with HTTP 429 responses, but Monstache ignores them and keeps retrying, effectively flooding Elasticsearch.
  • Problem caused: This produces a huge volume of log entries, which fills up internal monitoring systems.

What Should Happen:
Monstache should back off and wait before retrying when it receives a 429 error from Elasticsearch. That would keep the log volume down and avoid flooding internal monitoring systems.
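For illustration, here is a rough Go sketch of the kind of behavior we mean: a hypothetical flush callback retried with exponential backoff whenever Elasticsearch answers 429. This is not Monstache's actual indexing code, just a sketch of the requested behavior.

// A rough sketch (not Monstache's actual code) of the requested behavior:
// back off with increasing delays when a bulk flush is rejected with 429.
package main

import (
	"errors"
	"log"
	"time"
)

// errTooManyRequests stands in for a bulk item rejected with HTTP 429.
var errTooManyRequests = errors.New("elasticsearch returned 429 Too Many Requests")

// indexWithBackoff retries flush(), sleeping with exponential backoff while
// Elasticsearch keeps rejecting it, and logs once per attempt instead of once
// per rejected document.
func indexWithBackoff(flush func() error) error {
	delay := time.Second
	const maxDelay = 5 * time.Minute
	for {
		err := flush()
		if err == nil {
			return nil
		}
		if !errors.Is(err, errTooManyRequests) {
			return err // only throttle on 429-style rejections
		}
		log.Printf("bulk flush rejected with 429, backing off for %s", delay)
		time.Sleep(delay)
		if delay < maxDelay {
			delay *= 2
		}
	}
}

func main() {
	attempts := 0
	// Fake flush that is rejected twice before succeeding.
	flush := func() error {
		attempts++
		if attempts < 3 {
			return errTooManyRequests
		}
		return nil
	}
	if err := indexWithBackoff(flush); err != nil {
		log.Fatal(err)
	}
	log.Printf("flush succeeded after %d attempts", attempts)
}

The key point is that the delay grows between retries and only one log line is emitted per rejected flush, not one per rejected document.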


mologie commented Nov 27, 2023

Hi, colleague of Manuel here. The specific error message we got was

ERROR 2023/11/24 15:43:43 Bulk response item: {"_index":"main.<col>","_id":"<id>","status":429,"error":{"type":"cluster_block_exception","reason":"index [main.<col>] blocked by: [TOO_MANY_REQUESTS/12/disk usage exceeded flood-stage watermark, index has read-only-allow-delete block];"}}

It was repeated roughly 24,500,000 times within 10 minutes, totaling roughly 4 GiB of logs.

The steps to reproduce are as follows (we have not yet investigated whether they can be minimized):

  1. Deny access to the monstache user, so that some data is queued up
  2. Let Elasticsearch run almost full
  3. Stop monstache
  4. Restore access for monstache
  5. Restart monstache
  6. Let Elasticsearch run completely full (up to the flood-stage watermark; a sketch for simulating this without actually filling the disk follows this list)
  7. Observe that monstache begins to rapidly generate log events (2+ million log entries per minute)
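For step 6, one way to simulate the flood-stage condition without actually filling a disk is to lower Elasticsearch's standard disk watermark settings via the cluster settings API. The following is only a rough Go sketch; it assumes an unauthenticated test cluster on localhost:9200, and the settings should be reverted afterwards.

// Rough sketch for step 6: lower the standard disk watermark settings so the
// flood-stage block triggers without actually filling the disk.
// Assumes an unauthenticated test cluster on localhost:9200; revert afterwards.
package main

import (
	"fmt"
	"net/http"
	"strings"
)

func main() {
	// low <= high <= flood_stage must stay ordered; with these values any node
	// whose disk is more than 10% full hits the flood-stage watermark.
	body := `{
	  "transient": {
	    "cluster.routing.allocation.disk.watermark.low": "5%",
	    "cluster.routing.allocation.disk.watermark.high": "7%",
	    "cluster.routing.allocation.disk.watermark.flood_stage": "10%"
	  }
	}`

	req, err := http.NewRequest(http.MethodPut,
		"http://localhost:9200/_cluster/settings", strings.NewReader(body))
	if err != nil {
		panic(err)
	}
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("cluster settings update:", resp.Status)
}

Indices on nodes above the flood-stage threshold then get the read-only-allow-delete block, and bulk writes are rejected with the same 429 cluster_block_exception shown in the log above.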


mologie commented Nov 27, 2023

Additionally, here is a redacted copy of the config file with which we observed the issue:

mongo-url = "mongodb://monstache:<snip:url>"
elasticsearch-urls = ["http://<snip>:9200"]
direct-read-namespaces = ["main.<snip:col>"]
change-stream-namespaces = ["main.<snip:col>"]
workers = ["worker-0", "worker-1"]
gzip = false
stats = true
index-stats = true
elasticsearch-user = "monstache"
elasticsearch-password = "<snip>"
elasticsearch-max-conns = 4
elasticsearch-validate-pem-file = false
elasticsearch-healthcheck-timeout-startup = 200
elasticsearch-healthcheck-timeout = 200
dropped-collections = true
dropped-databases = true
replay = true
resume = true
resume-write-unsafe = false
resume-name = "default"
resume-strategy = 1
index-files = true
file-highlighting = true
file-namespaces = ["users.fs.files"]
verbose = false
cluster-name = 'elasticsearch'
exit-after-direct-reads = false

I'm curious and am investigating possible causes in the source code right now. A brief look suggests that the Elasticsearch client library indiscriminately calls the error handler for every item submitted via Add(), so as long as the ingress side works and keeps providing data, we end up with one error per ingested item. It's unclear to me, however, at which point throttling would best take place.
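For reference, this is roughly where a hook could sit: the olivere/elastic BulkProcessor takes an After callback that sees each bulk response, so 429 rejections could be detected per batch there and used to slow things down before more items go through Add(). The following is only a sketch of that idea, not Monstache's actual wiring, and the sleep is a placeholder for a real backoff or pause mechanism.

// Rough sketch of where 429 detection could hook in, using olivere/elastic's
// BulkProcessor After callback; not Monstache's actual wiring.
package main

import (
	"context"
	"log"
	"time"

	"github.com/olivere/elastic/v7"
)

func main() {
	client, err := elastic.NewClient(elastic.SetURL("http://localhost:9200"))
	if err != nil {
		log.Fatal(err)
	}

	// After runs once per bulk flush with the full response, so rejections can
	// be counted per batch instead of being logged once per item.
	after := func(id int64, reqs []elastic.BulkableRequest, resp *elastic.BulkResponse, err error) {
		if resp == nil {
			return
		}
		throttled := 0
		for _, item := range resp.Failed() {
			if item.Status == 429 {
				throttled++
			}
		}
		if throttled > 0 {
			log.Printf("bulk flush %d: %d items rejected with 429, backing off", id, throttled)
			time.Sleep(30 * time.Second) // placeholder: a real fix would pause the feed to Add()
		}
	}

	bulk, err := client.BulkProcessor().
		Workers(2).
		BulkActions(500).
		FlushInterval(5 * time.Second).
		After(after).
		Do(context.Background())
	if err != nil {
		log.Fatal(err)
	}
	defer bulk.Close()

	// Documents would be queued here via bulk.Add(...).
}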


rwynn (owner) commented Dec 2, 2023

Hi, I pushed a new release that backs off when indexing errors happen, to mitigate the log flooding.
