Monstache did not back off writing data when ElasticSearch disk was full (http code 429), causing log spam #702

Open
ManuelSchmitzberger opened this issue Nov 27, 2023 · 3 comments

Comments


ManuelSchmitzberger commented Nov 27, 2023

Problem: Monstache does not back off writing data to Elasticsearch even on HTTP status code 429.

Details:

  • What's happening: Monstache keeps sending data to Elasticsearch even when the cluster's disk is full. Elasticsearch signals this with HTTP 429 responses, but Monstache ignores them and keeps retrying, effectively flooding Elasticsearch.
  • Problem caused: This produces a huge volume of log entries, which fills up internal monitoring systems.

What Should Happen:
Monstache should back off and wait before retrying when it receives a 429 error from Elasticsearch. That would keep the log volume down and avoid flooding internal monitoring systems.
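For illustration, here is a rough Go sketch of the kind of behavior we mean: a hypothetical flush callback retried with exponential backoff whenever Elasticsearch answers 429. This is not Monstache's actual indexing code, just a sketch of the requested behavior.

// A rough sketch (not Monstache's actual code) of the requested behavior:
// back off with increasing delays when a bulk flush is rejected with 429.
package main

import (
	"errors"
	"log"
	"time"
)

// errTooManyRequests stands in for a bulk item rejected with HTTP 429.
var errTooManyRequests = errors.New("elasticsearch returned 429 Too Many Requests")

// indexWithBackoff retries flush(), sleeping with exponential backoff while
// Elasticsearch keeps rejecting it, and logs once per attempt instead of once
// per rejected document.
func indexWithBackoff(flush func() error) error {
	delay := time.Second
	const maxDelay = 5 * time.Minute
	for {
		err := flush()
		if err == nil {
			return nil
		}
		if !errors.Is(err, errTooManyRequests) {
			return err // only throttle on 429-style rejections
		}
		log.Printf("bulk flush rejected with 429, backing off for %s", delay)
		time.Sleep(delay)
		if delay < maxDelay {
			delay *= 2
		}
	}
}

func main() {
	attempts := 0
	// Fake flush that is rejected twice before succeeding.
	flush := func() error {
		attempts++
		if attempts < 3 {
			return errTooManyRequests
		}
		return nil
	}
	if err := indexWithBackoff(flush); err != nil {
		log.Fatal(err)
	}
	log.Printf("flush succeeded after %d attempts", attempts)
}

The key point is that the delay grows between retries and only one log line is emitted per rejected flush, not one per rejected document.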


mologie commented Nov 27, 2023

Hi, colleague of Manuel here. The specific error message we got was

ERROR 2023/11/24 15:43:43 Bulk response item: {"_index":"main.<col>","_id":"<id>","status":429,"error":{"type":"cluster_block_exception","reason":"index [main.<col>] blocked by: [TOO_MANY_REQUESTS/12/disk usage exceeded flood-stage watermark, index has read-only-allow-delete block];"}}

It was repeated roughly 24,500,000 times within 10 minutes, totaling roughly 4 GiB of logs.

The steps to reproduce are as follows (we have not yet investigated whether they can be minimized):

  1. Deny access to the monstache user, so that some data is queued up
  2. Let Elasticsearch run almost full
  3. Stop monstache
  4. Restore access for monstache
  5. Restart monstache
  6. Let Elasticsearch run completely full (up to the flood-stage watermark; a sketch for simulating this without actually filling the disk follows this list)
  7. Observe that monstache begins to rapidly generate log events (2+ million log entries per minute)
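For step 6, one way to simulate the flood-stage condition without actually filling a disk is to lower Elasticsearch's standard disk watermark settings via the cluster settings API. The following is only a rough Go sketch; it assumes an unauthenticated test cluster on localhost:9200, and the settings should be reverted afterwards.

// Rough sketch for step 6: lower the standard disk watermark settings so the
// flood-stage block triggers without actually filling the disk.
// Assumes an unauthenticated test cluster on localhost:9200; revert afterwards.
package main

import (
	"fmt"
	"net/http"
	"strings"
)

func main() {
	// low <= high <= flood_stage must stay ordered; with these values any node
	// whose disk is more than 10% full hits the flood-stage watermark.
	body := `{
	  "transient": {
	    "cluster.routing.allocation.disk.watermark.low": "5%",
	    "cluster.routing.allocation.disk.watermark.high": "7%",
	    "cluster.routing.allocation.disk.watermark.flood_stage": "10%"
	  }
	}`

	req, err := http.NewRequest(http.MethodPut,
		"http://localhost:9200/_cluster/settings", strings.NewReader(body))
	if err != nil {
		panic(err)
	}
	req.Header.Set("Content-Type", "application/json")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	fmt.Println("cluster settings update:", resp.Status)
}

Indices on nodes above the flood-stage threshold then get the read-only-allow-delete block, and bulk writes are rejected with the same 429 cluster_block_exception shown in the log above.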


mologie commented Nov 27, 2023

Additionally, here is a redacted copy of the config file with which we observed the issue:

mongo-url = "mongodb://monstache:<snip:url>"
elasticsearch-urls = ["http://<snip>:9200"]
direct-read-namespaces = ["main.<snip:col>"]
change-stream-namespaces = ["main.<snip:col>"]
workers = ["worker-0", "worker-1"]
gzip = false
stats = true
index-stats = true
elasticsearch-user = "monstache"
elasticsearch-password = "<snip>"
elasticsearch-max-conns = 4
elasticsearch-validate-pem-file = false
elasticsearch-healthcheck-timeout-startup = 200
elasticsearch-healthcheck-timeout = 200
dropped-collections = true
dropped-databases = true
replay = true
resume = true
resume-write-unsafe = false
resume-name = "default"
resume-strategy = 1
index-files = true
file-highlighting = true
file-namespaces = ["users.fs.files"]
verbose = false
cluster-name = 'elasticsearch'
exit-after-direct-reads = false

I'm curious and am investigating possible causes in the source code right now. A brief look suggests that the Elasticsearch client library indiscriminately calls the error handler for every item submitted via Add(), so as long as the ingress side works and keeps providing data, we end up with one error per ingested item. It's unclear to me, however, at which point throttling would best take place.
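For reference, this is roughly where a hook could sit: the olivere/elastic BulkProcessor takes an After callback that sees each bulk response, so 429 rejections could be detected per batch there and used to slow things down before more items go through Add(). The following is only a sketch of that idea, not Monstache's actual wiring, and the sleep is a placeholder for a real backoff or pause mechanism.

// Rough sketch of where 429 detection could hook in, using olivere/elastic's
// BulkProcessor After callback; not Monstache's actual wiring.
package main

import (
	"context"
	"log"
	"time"

	"github.com/olivere/elastic/v7"
)

func main() {
	client, err := elastic.NewClient(elastic.SetURL("http://localhost:9200"))
	if err != nil {
		log.Fatal(err)
	}

	// After runs once per bulk flush with the full response, so rejections can
	// be counted per batch instead of being logged once per item.
	after := func(id int64, reqs []elastic.BulkableRequest, resp *elastic.BulkResponse, err error) {
		if resp == nil {
			return
		}
		throttled := 0
		for _, item := range resp.Failed() {
			if item.Status == 429 {
				throttled++
			}
		}
		if throttled > 0 {
			log.Printf("bulk flush %d: %d items rejected with 429, backing off", id, throttled)
			time.Sleep(30 * time.Second) // placeholder: a real fix would pause the feed to Add()
		}
	}

	bulk, err := client.BulkProcessor().
		Workers(2).
		BulkActions(500).
		FlushInterval(5 * time.Second).
		After(after).
		Do(context.Background())
	if err != nil {
		log.Fatal(err)
	}
	defer bulk.Close()

	// Documents would be queued here via bulk.Add(...).
}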


rwynn (owner) commented Dec 2, 2023

Hi, I pushed a new release that backs off when indexing errors happen, to mitigate the log flooding.
