
Perf regression, too many small bulk requests #13024

Open
StephanErb opened this issue Apr 22, 2024 · 5 comments

@StephanErb

StephanErb commented Apr 22, 2024

APM Server version (apm-server version):
8.13.2

Description of the problem including expected versus actual behavior:

I believe there is a regression in the number of bulk requests/s emitted by APM Server towards Elasticsearch. I don't have complete data for it, but I believe the issue started in 8.12.x and then got worse with the upgrade to 8.13.x.

This is from the Elastic Cloud console: on otherwise identical clients/agents, we see that the number of indexing requests/s jumped noticeably with the upgrade from 8.12 to 8.13.

[Screenshot 2024-04-16: indexing requests/s in the Elastic Cloud console]

The number of documents/s that get ingested didn't really change much, but our hot nodes started to run under higher load. Fortunately, we observed that setting index.translog.durability=async on all APM indices drops the IOPS per hot node back to acceptable levels:

[Screenshot 2024-04-21: IOPS per hot node after setting index.translog.durability=async]
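
(For reference, roughly how such an override can be applied; the data stream and component template names below are placeholders and would need to be adjusted per deployment. index.translog.durability is a dynamic setting, so it can also be set on existing backing indices.)

# Apply to the backing indices of an existing APM data stream (dynamic setting)
PUT metrics-apm.app.all-default/_settings
{
  "index.translog.durability": "async"
}

# Persist it for future backing indices via a @custom component template
# (template name assumed; adjust it to the matching APM index template)
PUT _component_template/metrics-apm.app@custom
{
  "template": {
    "settings": {
      "index.translog.durability": "async"
    }
  }
}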

With a decently tuned bulk_max_size setting I would not expect such a big drop in IOPS. I am thus wondering if the APM Server is misconfigured in newer versions.

It almost feels like the server uses the agent defaults instead of its custom bulk size option. Was setting preset: custom in the server config perhaps forgotten once elastic/elastic-agent#3797 got merged into the agent?

@StephanErb StephanErb added the bug label Apr 22, 2024
@StephanErb StephanErb changed the title from "Perf regression, too many requests small bulk requests" to "Perf regression, too many small bulk requests" Apr 22, 2024
@axw
Member

axw commented Apr 23, 2024

@StephanErb thanks for raising this. I've had a look at the code changes, and nothing obviously stands out. There were some changes in the ES output code, but they're not enabled in APM Server 8.13.x. We'll need to try and reproduce the issue and work back from there.

The number of documents/s that get ingested didn't really change much

Just to clarify: are you referring to the number of documents successfully indexed in Elasticsearch?

It almost feels like the server uses the agent defaults instead of its custom bulk size option. Was setting preset: custom in the server config perhaps forgotten once elastic/elastic-agent#3797 got merged into the agent?

APM Server has its own ES output implementation, completely independent of Elastic Agent/Beats.

@StephanErb
Author

StephanErb commented Apr 23, 2024

The number of documents/s that get ingested didn't really change much

Just to clarify: are you referring to the number of documents successfully indexed in Elasticsearch?

We have upgraded the APM Server and Elasticsearch, but the APM clients remained unchanged. There was thus no change in the data forwarded to APM Server and ultimately indexed into Elasticsearch.

Some stats on ingestion from stack monitoring:

  • Ingestion of 20000 docs/s into .ds-metrics-apm.app.all-default-2024.04.23-004117
  • Ingestion of 4000 docs/s into .ds-metrics-apm.internal-default-2024.04.21-000931
  • Ingestion of 2000 docs/s into .ds-traces-apm-default-2024.04.23-001742
  • Ingestion of 1000 docs/s into .ds-logs-apm.app.all-default-2024.04.17-000162

Please note we are running a reroute pipeline that routes all services into a single all data stream instead of per-service data streams, to prevent a shard explosion.
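
(For context, such a reroute can be done with Elasticsearch's reroute processor; the pipeline name and dataset below are illustrative, not our exact setup:)

PUT _ingest/pipeline/metrics-apm-reroute-all
{
  "processors": [
    {
      "reroute": {
        "dataset": "apm.app.all",
        "namespace": "default"
      }
    }
  ]
}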

When checking GET _tasks?pretty&human&detailed=true&actions=indices:data/write/bulk we see a few hundred parallel tasks such as:

"iaCFp3aFSnuliL3GpI_fnQ:1325096688": {
      "node": "iaCFp3aFSnuliL3GpI_fnQ",
      "id": 1325096688,
      "type": "transport",
      "action": "indices:data/write/bulk",
      "description": "requests[211], indices[metrics-apm <SNIP not to leak service names> traces-apm-default]",
      "start_time": "2024-04-23T20:26:19.872Z",
      "start_time_in_millis": 1713903979872,
      "running_time": "299.7ms",
      "running_time_in_nanos": 299780808,
      "cancellable": false,
      "headers": {
        "X-elastic-product-origin": "observability"
      }
},
"iaCFp3aFSnuliL3GpI_fnQ:1325097388": {
      "node": "iaCFp3aFSnuliL3GpI_fnQ",
      "id": 1325097388,
      "type": "transport",
      "action": "indices:data/write/bulk",
      "description": "requests[261], indices[metrics-apm.internal-default, metrics-apm.app <SNIP not to leak service names>]",
      "start_time": "2024-04-23T20:26:20.165Z",
      "start_time_in_millis": 1713903980165,
      "running_time": "61.3ms",
      "running_time_in_nanos": 61304826,
      "cancellable": false,
      "headers": {
        "X-elastic-product-origin": "observability"
      }
}

As per requests[211] and requests[261], my interpretation is that we are seeing a batch size of ~200 documents, which appears relatively low to me given the overall throughput of the system. (I hope this is correct; I could not find any docs on the format.)

This would again be in line with the big drop in IOPS once we moved over to index.translog.durability=async: with the default request durability every bulk request forces a translog fsync, so many small bulk requests translate directly into more IOPS.

@kruskall
Member

kruskall commented Apr 29, 2024

Hi 👋

go-docappender (that's where the indexing logic is implemented) was bumped between those versions.
The diff of the changes is at elastic/go-docappender@c69a1cf...v1.0.0

I'd like to confirm a few things before going deeper:

  • Can you confirm no config option was changed between versions? Specifically the flush interval, compression level, flush bytes, max requests, and retry backoff settings (see the sketch after this list).
  • Was there any change to the max memory assigned to APM Server? (Memory limits affect the max buffer size.)
  • How is the number of requests measured? (There were some fixes to some specific metrics.)
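
For reference, these correspond to the APM Server Elasticsearch output settings, roughly the following apm-server.yml block (the values here are placeholders for illustration, not recommendations or exact defaults):

output.elasticsearch:
  compression_level: 5    # gzip level used for bulk request bodies
  flush_bytes: "1mb"      # send a bulk request once the buffer reaches this size
  flush_interval: "1s"    # ... or at the latest after this interval
  max_requests: 10        # maximum number of concurrent bulk requests
  backoff.init: "1s"      # initial wait before retrying a failed request
  backoff.max: "60s"      # upper bound for the retry backoff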

As mentioned above, the APM Server has a custom ES output implementation, so it is not using the Beats presets.
One thing that could affect the number of requests is the set of status codes covered by the retry config.
We're retrying on 429, 502, 503, and 504. Could you check if there is an increase in the number of failed requests or in any specific status code?

@StephanErb
Author

Thanks for picking this up.

We just upgraded in the Cloud console without any additional configuration changes for the APM Server (flush interval, etc.) or APM Server memory changes. We are running two APM Servers in the max configuration (2x 30 GB).

This is what the load looks like as per Stack Monitoring (incl. request stats):
[Screenshots 2024-04-29: Stack Monitoring load and request stats]

Unfortunately I don't have enough history retention on the monitoring cluster to look at what happened around the time we upgraded.

In our logs we see a relatively large number of indexing errors:

10:23:19.114 agent.log [agent.log][error] failed to index documents in '.ds-metrics-apm.app.all-default-2024.04.29-004154' (document_parsing_exception): [1:366] object mapping for [kafka.consumer.fetch.manager.records.lag] tried to parse field [lag] as object, but found a concrete value
10:23:19.115 agent.log [agent.log][error] failed to index documents in '.ds-metrics-apm.app.all-default-2024.04.29-004154' (document_parsing_exception): [1:366] object mapping for [kafka.consumer.fetch.manager.records.lag] tried to parse field [lag] as object, but found a concrete value
10:23:19.116 agent.log [][access]   "POST /intake/v2/events HTTP/" 202
10:23:19.117 agent.log [][access]   "POST /intake/v2/events HTTP/" 202
10:23:19.118 agent.log [agent.log][error] failed to index documents in '.ds-metrics-apm.app.all-default-2024.04.29-004154' (document_parsing_exception): [1:366] object mapping for [kafka.consumer.fetch.manager.records.lag] tried to parse field [lag] as object, but found a concrete value
10:23:19.118 agent.log [agent.log][error] failed to index documents in '.ds-metrics-apm.app.all-default-2024.04.29-004154' (document_parsing_exception): [1:156] failed to parse field [hikaricp.connections] of type [double] in document with id 'yIhhKY8BTYD0o0NlX6M_'

Could those errors lead to a higher ingest load now that elastic/go-docappender#99 has landed?

@kruskall
Member

Thank you for the follow-up!

Are you using any sampling config? Tail-based sampling or head-based sampling?


Could those errors lead to a higher ingest load now that elastic/go-docappender#99 has landed?

That specific PR shouldn't impact the number of requests, because per-document retries are not enabled in APM Server. It is possible for the ingest load to increase if APM Server receives specific status codes and the request retry code is executed.

There are two retry mechanisms (a simplified sketch follows below):

  • document retries: this one retries single documents in a bulk request if they return specific status codes. This is not enabled in APM Server and cannot be enabled by config.
  • request retries: this one retries the HTTP request (bulk index) if ES responds with specific status codes (429, 502, 503, 504).
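
To make the distinction concrete, here is a simplified, illustrative sketch of the request-level decision in Go (not the actual go-docappender code):

// Illustrative sketch only, not the actual go-docappender implementation.
package retry

import "net/http"

// shouldRetryRequest reports whether a whole bulk request should be retried,
// based on the status code Elasticsearch returned for it.
func shouldRetryRequest(statusCode int) bool {
    switch statusCode {
    case http.StatusTooManyRequests, // 429
        http.StatusBadGateway, // 502
        http.StatusServiceUnavailable, // 503
        http.StatusGatewayTimeout: // 504
        return true
    default:
        return false
    }
}

// Document-level retries would instead inspect the per-item status codes inside
// the bulk response body; as noted above, that path is disabled in APM Server.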
