Delete by query causing fielddata cache spike leading to 429 #2550

Open

blacar opened this issue May 12, 2023 · 2 comments
Labels
status: feedback-provided Feedback has been provided

@blacar

blacar commented May 12, 2023

This ticket is the result of two weeks of experiments.

I'll try to include all the information, because the problem might be something wrong with how RestHighLevelClient performs deleteByQuery.
For two weeks I assumed it had to be a problem on my side or on the Elasticsearch side (performance, configuration), but after several experiments I have no explanation left, so I'm bringing it to you.

First of all, I have prior knowledge of Elasticsearch and I am aware that updates and deletes are expensive operations; this issue is not about that.

CONTEXT

  • This is a Spring Boot microservice running on Java 11, using spring-data-elasticsearch 4.2.11 to run operations against an Elasticsearch cluster.
  • We are in pre-launch experiments, and I have an environment that mirrors our production traffic but gives me total control over it.
  • We have a lot of ingest operations, a lot of query operations, a significant rate of update operations, and a few delete operations.

We are using RestHighLevelClient configured like this:

  // eshostname, esport, username and password are configuration values injected into the service
  public RestHighLevelClient elasticsearchClient() {
    // REST API compatibility headers (compatible-with=7) used with the 7.x high level client
    final HttpHeaders compatibilityHeaders = new HttpHeaders();
    compatibilityHeaders.add("Accept", "application/vnd.elasticsearch+json;compatible-with=7");
    compatibilityHeaders.add("Content-Type", "application/vnd.elasticsearch+json;"
      + "compatible-with=7");
    final ClientConfiguration clientConfiguration = ClientConfiguration.builder()
      .connectedTo(eshostname + ":" + esport)
      .usingSsl()
      .withBasicAuth(username, password)
      .withDefaultHeaders(compatibilityHeaders)
      .build();
    return RestClients.create(clientConfiguration).rest();
  }

As said, we run many ingest and query operations. A typical query looks like this:

    final BoolQueryBuilder boolQuery = QueryBuilders
      .boolQuery()
      .filter(QueryBuilders.matchQuery(SEARCH_FIELD_1, s1))
      .filter(QueryBuilders.matchQuery(SEARCH_FIELD_2, s2))
      .filter(QueryBuilders.rangeQuery(SEARCH_FIELD_3).lte(s3));
    final NativeSearchQuery nsq = new NativeSearchQuery(boolQuery);
    nsq.addSort(Sort.by(Direction.DESC, CREATED_SEARCH_FIELD));
    nsq.setMaxResults(size);
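
For reference, such a query ends up being executed roughly like this (a sketch only: MyDocument, "my-index" and the elasticsearchOperations bean are placeholder names, not the actual production ones):

    // Sketch: elasticsearchOperations is the injected ElasticsearchRestTemplate / ElasticsearchOperations,
    // MyDocument and "my-index" are placeholder names.
    final SearchHits<MyDocument> hits = elasticsearchOperations
      .search(nsq, MyDocument.class, IndexCoordinates.of("my-index"));
    final List<MyDocument> results = hits.stream()
      .map(SearchHit::getContent)
      .collect(Collectors.toList());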

We also run updateByQuery operations, like this:

    final BoolQueryBuilder boolQuery = QueryBuilders.boolQuery()
      .filter(QueryBuilders.matchQuery(SEARCH_FIELD_1, s1))
      .filter(QueryBuilders.rangeQuery(SEARCH_FIELD_3).lt(s3))
      .filter(QueryBuilders.matchQuery(SEARCH_FIELD_2, s2));
    final NativeSearchQuery nsq = new NativeSearchQuery(boolQuery);
    return UpdateQuery.builder(nsq)
      .withScriptType(ScriptType.INLINE)
      .withScript(UPDATE_SCRIPT)
      .withParams(UPDATE_PARAMS)
      .build();

The update script looks like this:

"ctx._source.FIELD_4 = params.FIELD_4; ctx._source.FIELD_5 = params.FIELD_5; ctx._source.FIELD_6 = params.FIELD_6; ctx._source.FIELD_3 = params.FIELD_3"

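The UpdateQuery built above is then executed along these lines (again just a sketch; "my-index" and the operations bean are placeholders, assuming the updateByQuery method available in 4.2):

    // Sketch: run the update by query against the index;
    // the response carries the by-query result counts.
    final ByQueryResponse response = elasticsearchOperations
      .updateByQuery(updateQuery, IndexCoordinates.of("my-index"));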
Finally, we run deleteByQuery operations using the same query as the update operations.
Of course there is no script in that case; a sketch of the call is below.
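
As a sketch (placeholder names again), the delete call is essentially:

    // Sketch: same bool query as for updates, executed as a delete by query.
    final NativeSearchQuery nsq = new NativeSearchQuery(boolQuery);
    elasticsearchOperations.delete(nsq, MyDocument.class, IndexCoordinates.of("my-index"));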

ISSUE

All operations run like a charm except deleteByQuery. The moment deleteByQuery is enabled (even though deletes are just a fraction of the traffic, far fewer than the UPDATE operations) the cluster starts to get into trouble. ALL delete operations time out, although the documents are removed from the cluster. The fielddata cache grows significantly, which drives GC usage and duration up, then the CPU spikes, and finally the [parent] circuit breaker trips and the cluster starts responding 429 Too Many Requests to our operations.

This happens regardless of the size of the delete query's result set; delete queries matching just 1 or 2 documents cause the same effect.
Please remember that the number of delete queries is small.

This only happens on deletes. If I replace the deletes with updates (using the same query and a script that updates four fields) the cluster is stable. This alone is very weird to me, since updates are expected to be more expensive than deletes.

NOTE: If I bypass spring-data-elasticsearch and use a Feign client that sends the delete operations directly as plain HTTP POST requests, without the RestHighLevelClient, the cluster is stable. This leads me to think that there might be something wrong with the delete requests that RestHighLevelClient sends. It feels like something is not being closed (connection timeout). A sketch of the Feign workaround follows.
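
The Feign-based workaround looks roughly like this (a sketch; the interface name, index name and JSON body are illustrative, not the exact production code):

    // Sketch of the direct _delete_by_query call via Feign; names and body are illustrative.
    interface ElasticsearchDeleteClient {

      @RequestLine("POST /{index}/_delete_by_query")
      @Headers("Content-Type: application/json")
      String deleteByQuery(@Param("index") String index, String jsonBody);
    }

    // the body is the same bool query, serialized as JSON
    deleteClient.deleteByQuery("my-index",
      "{\"query\":{\"bool\":{\"filter\":[{\"match\":{\"FIELD_1\":\"s1\"}}]}}}");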

Here are some screenshots:

Timeout exception on ALL delete operations

org.springframework.dao.DataAccessResourceFailureException: 5,000 milliseconds timeout on connection http-outgoing-222 [ACTIVE]; nested exception is java.lang.RuntimeException: 5,000 milliseconds timeout on connection http-outgoing-222 [ACTIVE]
	at org.springframework.data.elasticsearch.core.ElasticsearchExceptionTranslator.translateExceptionIfPossible(ElasticsearchExceptionTranslator.java:75)
	at org.springframework.data.elasticsearch.core.ElasticsearchRestTemplate.translateException(ElasticsearchRestTemplate.java:402)
	at org.springframework.data.elasticsearch.core.ElasticsearchRestTemplate.execute(ElasticsearchRestTemplate.java:385)
	at org.springframework.data.elasticsearch.core.ElasticsearchRestTemplate.delete(ElasticsearchRestTemplate.java:224)
	at com.xxx.xxx.service.xxx.deleteByQuery(xxx.java:380)

Metrics when deletes are enabled
(we disable updates at the same time so 100% of the spikes are related to deletes)
[screenshot: cluster metrics during the delete spike]

@spring-projects-issues spring-projects-issues added the status: waiting-for-triage An issue we've not yet triaged label May 12, 2023
@sothawo
Collaborator

sothawo commented May 14, 2023

It might be worth trying to add an intercepting proxy to the setup to capture the exact request that is sent out for the delete by query.
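
For example (a sketch, assuming an intercepting proxy such as mitmproxy listening on localhost:8080; host, port and scheme are placeholders), the client could be built through the low-level RestClientBuilder and pointed at the proxy:

    // Sketch: route the RestHighLevelClient through a local intercepting proxy
    // so the exact _delete_by_query request can be captured.
    final RestClientBuilder builder = RestClient.builder(new HttpHost("eshostname", 9200, "https"))
      .setHttpClientConfigCallback(httpClientBuilder ->
        httpClientBuilder.setProxy(new HttpHost("localhost", 8080)));
    final RestHighLevelClient client = new RestHighLevelClient(builder);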

Spring Data Elasticsearch 4.2 is outdated and has been out of maintenance for over a year now. The last of the 4.x releases (4.4.x) reached EOL last week.

Looking at the code in the 5.0 branch that still uses the (by then already deprecated) RestHighLevelClient, I can see that the refresh parameter for the delete request is set to true; that might be causing the problem.
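
To check whether that flag is really the trigger, a quick sketch like the following could issue the same delete by query directly with refresh disabled and compare the cluster behaviour (index name and query are placeholders):

    // Sketch: same query, sent directly through the RestHighLevelClient with refresh disabled.
    final DeleteByQueryRequest request = new DeleteByQueryRequest("my-index");
    request.setQuery(boolQuery);
    request.setRefresh(false);
    client.deleteByQuery(request, RequestOptions.DEFAULT);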

Can you reproduce this in a setup using the maintained versions (5.1 or 5.0)? They both still allow the old client to be used. Or, better, can you switch to a supported version and use the current Elasticsearch client?

@sothawo sothawo added status: waiting-for-feedback We need additional information before we can continue and removed status: waiting-for-triage An issue we've not yet triaged labels May 14, 2023
@blacar
Author

blacar commented May 15, 2023

Yeah ... if the refresh parameter is set, then I can understand that every delete request might be triggering an index refresh, which is very likely the reason for the overload. I don't know the relation between the index refresh operation and the fielddata cache, but that's on the Elasticsearch side.

I would try the maintained versions, but if the refresh param is still there I would expect the same behavior. I will stay on Feign for deletes until I am ready to switch to the current Elasticsearch client.

I will ping back if I find something more.

Thanks!

@spring-projects-issues spring-projects-issues added status: feedback-provided Feedback has been provided and removed status: waiting-for-feedback We need additional information before we can continue labels May 15, 2023