
Perf regression, too many small bulk requests #13024

Open
StephanErb opened this issue Apr 22, 2024 · 5 comments

@StephanErb

StephanErb commented Apr 22, 2024

APM Server version (apm-server version):
8.13.2

Description of the problem including expected versus actual behavior:

I believe there is a regression in the number of bulk requests/s emitted by APM Server towards Elasticsearch. I don't have complete data for it, but I believe the issue started in 8.12.x and then got worse with the upgrade to 8.13.x.

This is from the Elastic Cloud console: on otherwise identical clients/agents, we see that the number of indexing requests/s jumped noticeably with the upgrade from 8.12 to 8.13.

[Screenshot 2024-04-16: indexing requests/s in the Elastic Cloud console]

The number of documents/s that get ingested didn't really change much, but our hot nodes started to run under higher load. Fortunately, we observed that setting index.translog.durability=async on all APM indices drops the IOPS per hot node back to acceptable levels:

[Screenshot 2024-04-21: IOPS per hot node after setting index.translog.durability=async]
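
(For reference, roughly how such an override can be applied; the data stream and component template names below are placeholders and would need to be adjusted per deployment. index.translog.durability is a dynamic setting, so it can also be set on existing backing indices.)

# Apply to the backing indices of an existing APM data stream (dynamic setting)
PUT metrics-apm.app.all-default/_settings
{
  "index.translog.durability": "async"
}

# Persist it for future backing indices via a @custom component template
# (template name assumed; adjust it to the matching APM index template)
PUT _component_template/metrics-apm.app@custom
{
  "template": {
    "settings": {
      "index.translog.durability": "async"
    }
  }
}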

With a decently tuned bulk_max_size setting I would not expect such a big drop in IOPS. I am thus wondering if the APM Server is misconfigured in newer versions.

It almost feels like the server uses the agent defaults instead of its custom bulk size option. Was setting preset: custom in the server config perhaps forgotten once elastic/elastic-agent#3797 got merged into the agent?

@StephanErb StephanErb added the bug label Apr 22, 2024
@StephanErb StephanErb changed the title from "Perf regression, too many requests small bulk requests" to "Perf regression, too many small bulk requests" Apr 22, 2024
@axw
Member

axw commented Apr 23, 2024

@StephanErb thanks for raising this. I've had a look at the code changes, and nothing obviously stands out. There were some changes in the ES output code, but they're not enabled in APM Server 8.13.x. We'll need to try and reproduce the issue and work back from there.

The number of documents/s that get ingested didn't really change much

Just to clarify: are you referring to the number of documents successfully indexed in Elasticsearch?

It almost feels like the server uses the agent defaults instead of its custom bulk size option. Was setting preset: custom in the server config perhaps forgotten once elastic/elastic-agent#3797 got merged into the agent?

APM Server has its own ES output implementation, completely independent of Elastic Agent/Beats.

@StephanErb
Author

StephanErb commented Apr 23, 2024

The number of documents/s that get ingested didn't really change much

Just to clarify: are you referring to the number of documents successfully indexed in Elasticsearch?

We have upgraded the APM Server and Elasticsearch, but the APM clients remained unchanged. There was thus no change in the data forwarded to APM Server and ultimately indexed into Elasticsearch.

Some stats on ingestion from stack monitoring:

  • Ingestion of 20000 docs/s into .ds-metrics-apm.app.all-default-2024.04.23-004117
  • Ingestion of 4000 docs/s into .ds-metrics-apm.internal-default-2024.04.21-000931
  • Ingestion of 2000 docs/s into .ds-traces-apm-default-2024.04.23-001742
  • Ingestion of 1000 docs/s into .ds-logs-apm.app.all-default-2024.04.17-000162

Please note we are running a reroute pipeline that routes all services into a single all data stream instead of per-service data streams, to prevent a shard explosion.
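
(For context, such a reroute can be done with Elasticsearch's reroute processor; the pipeline name and dataset below are illustrative, not our exact setup:)

PUT _ingest/pipeline/metrics-apm-reroute-all
{
  "processors": [
    {
      "reroute": {
        "dataset": "apm.app.all",
        "namespace": "default"
      }
    }
  ]
}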

When checking GET _tasks?pretty&human&detailed=true&actions=indices:data/write/bulk we see a few hundred parallel tasks such as:

"iaCFp3aFSnuliL3GpI_fnQ:1325096688": {
      "node": "iaCFp3aFSnuliL3GpI_fnQ",
      "id": 1325096688,
      "type": "transport",
      "action": "indices:data/write/bulk",
      "description": "requests[211], indices[metrics-apm <SNIP not to leak service names> traces-apm-default]",
      "start_time": "2024-04-23T20:26:19.872Z",
      "start_time_in_millis": 1713903979872,
      "running_time": "299.7ms",
      "running_time_in_nanos": 299780808,
      "cancellable": false,
      "headers": {
        "X-elastic-product-origin": "observability"
      }
},
"iaCFp3aFSnuliL3GpI_fnQ:1325097388": {
      "node": "iaCFp3aFSnuliL3GpI_fnQ",
      "id": 1325097388,
      "type": "transport",
      "action": "indices:data/write/bulk",
      "description": "requests[261], indices[metrics-apm.internal-default, metrics-apm.app <SNIP not to leak service names>]",
      "start_time": "2024-04-23T20:26:20.165Z",
      "start_time_in_millis": 1713903980165,
      "running_time": "61.3ms",
      "running_time_in_nanos": 61304826,
      "cancellable": false,
      "headers": {
        "X-elastic-product-origin": "observability"
      }
}

As per requests[211] and requests[261], my interpretation is that we are seeing a batch size of ~200 documents, which appears relatively low to me given the overall throughput of the system. (I hope this is correct; I could not find any docs on the format.)

This would again be in line with the big drop in IOPS once we moved over to index.translog.durability=async: with the default request durability every bulk request forces a translog fsync, so many small bulk requests translate directly into more IOPS.

@kruskall
Member

kruskall commented Apr 29, 2024

Hi 👋

go-docappender (that's where the indexing logic is implemented) was bumped between those versions.
The diff of the changes is at elastic/go-docappender@c69a1cf...v1.0.0

I'd like to confirm a few things before going deeper:

  • Can you confirm no config option was changed between versions? Specifically the flush interval, compression level, flush bytes, max requests, and retry backoff settings (see the sketch after this list).
  • Was there any change to the max memory assigned to APM Server? (Memory limits affect the max buffer size.)
  • How is the number of requests measured? (There were some fixes to some specific metrics.)
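
For reference, these correspond to the APM Server Elasticsearch output settings, roughly the following apm-server.yml block (the values here are placeholders for illustration, not recommendations or exact defaults):

output.elasticsearch:
  compression_level: 5    # gzip level used for bulk request bodies
  flush_bytes: "1mb"      # send a bulk request once the buffer reaches this size
  flush_interval: "1s"    # ... or at the latest after this interval
  max_requests: 10        # maximum number of concurrent bulk requests
  backoff.init: "1s"      # initial wait before retrying a failed request
  backoff.max: "60s"      # upper bound for the retry backoff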

As mentioned above, the APM Server has a custom ES output implementation, so it is not using the Beats presets.
One thing that could affect the number of requests is the set of status codes covered by the retry config.
We're retrying on 429, 502, 503, and 504. Could you check if there is an increase in the number of failed requests or in any specific status code?

@StephanErb
Author

Thanks for picking this up.

We just upgraded in the Cloud console without any additional configuration changes for the APM Server (flush interval, etc.) or APM Server memory changes. We are running two APM Servers in the max configuration (2x 30 GB).

This is what the load looks like as per Stack Monitoring (incl. request stats):
[Screenshots 2024-04-29: Stack Monitoring load and request stats]

Unfortunately I don't have enough history retention on the monitoring cluster to look at what happened around the time we upgraded.

In our logs we see a relatively large number of indexing errors:

10:23:19.114 agent.log [agent.log][error] failed to index documents in '.ds-metrics-apm.app.all-default-2024.04.29-004154' (document_parsing_exception): [1:366] object mapping for [kafka.consumer.fetch.manager.records.lag] tried to parse field [lag] as object, but found a concrete value
10:23:19.115 agent.log [agent.log][error] failed to index documents in '.ds-metrics-apm.app.all-default-2024.04.29-004154' (document_parsing_exception): [1:366] object mapping for [kafka.consumer.fetch.manager.records.lag] tried to parse field [lag] as object, but found a concrete value
10:23:19.116 agent.log [][access]   "POST /intake/v2/events HTTP/" 202
10:23:19.117 agent.log [][access]   "POST /intake/v2/events HTTP/" 202
10:23:19.118 agent.log [agent.log][error] failed to index documents in '.ds-metrics-apm.app.all-default-2024.04.29-004154' (document_parsing_exception): [1:366] object mapping for [kafka.consumer.fetch.manager.records.lag] tried to parse field [lag] as object, but found a concrete value
10:23:19.118 agent.log [agent.log][error] failed to index documents in '.ds-metrics-apm.app.all-default-2024.04.29-004154' (document_parsing_exception): [1:156] failed to parse field [hikaricp.connections] of type [double] in document with id 'yIhhKY8BTYD0o0NlX6M_'

Could those errors lead to a higher ingest load now that elastic/go-docappender#99 has landed?

@kruskall
Member

Thank you for the follow-up!

Are you using any sampling config? Tail-based sampling or head-based sampling?


Could those errors lead to a higher ingest load now that elastic/go-docappender#99 has landed?

That specific PR shouldn't impact the number of requests, because per-document retries are not enabled in APM Server. It is possible for the ingest load to increase if APM Server receives specific status codes and the request retry code is executed.

There are two retry mechanisms (a simplified sketch follows below):

  • document retries: this one retries single documents in a bulk request if they return specific status codes. This is not enabled in APM Server and cannot be enabled by config.
  • request retries: this one retries the HTTP request (bulk index) if ES responds with specific status codes (429, 502, 503, 504).
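
To make the distinction concrete, here is a simplified, illustrative sketch of the request-level decision in Go (not the actual go-docappender code):

// Illustrative sketch only, not the actual go-docappender implementation.
package retry

import "net/http"

// shouldRetryRequest reports whether a whole bulk request should be retried,
// based on the status code Elasticsearch returned for it.
func shouldRetryRequest(statusCode int) bool {
    switch statusCode {
    case http.StatusTooManyRequests, // 429
        http.StatusBadGateway, // 502
        http.StatusServiceUnavailable, // 503
        http.StatusGatewayTimeout: // 504
        return true
    default:
        return false
    }
}

// Document-level retries would instead inspect the per-item status codes inside
// the bulk response body; as noted above, that path is disabled in APM Server.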
