Uneven distribution of docs across shards, even with auto-generated ids #8041

EmilBode · 2024-02-16T12:14:01Z

Elasticsearch Version

7.17.15

Installed Plugins

No response

Java Version

bundled

OS Version

Ubuntu 20.04.6 LTS

Problem Description

We've been running several parallel processes that all send bulk-index requests to the same index. The documents from a single process now seem to be very unevenly distributed across our shards.
Looking at one part, I find that GET indexname/_count?preference=_shards: gives results ranging from 2215 to 143810 documents on a single shard.
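
For anyone wanting to repeat this check, here is a minimal sketch of how the per-shard counts could be gathered and summarized. It assumes the `_shards:<n>` form of the `preference` parameter with zero-based shard numbers. The helper names `per_shard_count_paths` and `skew_ratio` are hypothetical, and the sketch only builds request paths; it does not talk to a cluster.

```python
from urllib.parse import quote


def per_shard_count_paths(index: str, number_of_shards: int) -> list[str]:
    """Build one _count request path per shard, using preference=_shards:<n>."""
    return [
        f"/{quote(index)}/_count?preference=_shards:{n}"
        for n in range(number_of_shards)
    ]


def skew_ratio(counts: list[int]) -> float:
    """Return the largest per-shard count divided by the smallest one."""
    if not counts:
        raise ValueError("need at least one per-shard count")
    smallest = min(counts)
    if smallest == 0:
        # An empty shard means the skew is unbounded.
        return float("inf")
    return max(counts) / smallest


paths = per_shard_count_paths("indexname", 20)
# The two extremes reported above: 2215 and 143810 documents on a single shard.
reported_skew = skew_ratio([2215, 143810])
```

A ratio close to 1 would indicate an even spread; the two extremes quoted above give a ratio above 60.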

Steps to Reproduce

Index creation

PUT myindex
{
  "settings": {
    "number_of_shards": 20,
    "number_of_replicas": 1,
    "refresh_interval": "300s",
    "routing": {
      "allocation": {
        "include": {
          "_tier_preference": "data_warm,data_hot"
        }
      }
    }
  }
}

Bulk indexing

Spin up 6 different .NET projects that all use the NEST client to bulk-index documents:

ElasticClient = new ElasticClient(connectionSettings);
var results = ElasticClient.BulkAll(objects, b => b
    .Index(myindex)
    .BufferToBulk((descriptor, list) =>
    {
        // Queue one index operation per document; no explicit Id is set here.
        foreach (var obj in list)
        {
            descriptor.Index(i => i.Document(obj));
        }
    })
    .RefreshOnCompleted(false)
    .MaxDegreeOfParallelism(4)
    .Size(10));

Expected behavior

Even distribution of all documents, also meaning the documents from process 1 are evenly spread, docs from process 2 are evenly spread, etc.

Observed behavior

When looking at all documents together, the spread is reasonably even, but when looking only at documents from a single process, they disproportionately end up on a few shards.

Logs (if relevant)

No response

DaveCTurner · 2024-02-19T10:41:51Z

Hi @EmilBode and thanks for your interest.

I can't reproduce this myself with pure Elasticsearch: if you don't specify document IDs then Elasticsearch generates them automatically in a manner that distributes documents across shards evenly. Therefore it seems that either there's something about how you're using the client which is causing it to specify document IDs (with a very skewed distribution for some reason) or the client is itself generating those document IDs. I can't tell which without digging into the .NET client code which I'm not set up to do, so I've transferred this to the elasticsearch-net repository for the attention of the .NET client folks.
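
To illustrate the expectation described above, here is a small simulation sketch of hash-based routing. It is not Elasticsearch's routing code: MD5 from the Python standard library stands in for the real hash function, and `shard_for`, `random_id` and `shard_counts` are hypothetical helpers. The point is only that a well-mixed hash, taken modulo the shard count, should spread random-looking IDs evenly across 20 shards, while repeated routing values collapse onto few shards.

```python
import hashlib
import random
import string
from collections import Counter

NUMBER_OF_SHARDS = 20
ID_ALPHABET = string.ascii_letters + string.digits + "-_"


def shard_for(routing_value: str, number_of_shards: int = NUMBER_OF_SHARDS) -> int:
    """Map a routing value to a shard by hashing it (MD5 is a stand-in hash)."""
    digest = hashlib.md5(routing_value.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % number_of_shards


def random_id(rng: random.Random, length: int = 20) -> str:
    """Generate a random 20-character ID that resembles an auto-generated one."""
    return "".join(rng.choices(ID_ALPHABET, k=length))


def shard_counts(routing_values) -> list[int]:
    """Count how many routing values land on each shard."""
    counts = Counter(shard_for(value) for value in routing_values)
    return [counts.get(shard, 0) for shard in range(NUMBER_OF_SHARDS)]


rng = random.Random(8041)  # fixed seed so the run is repeatable
even_counts = shard_counts(random_id(rng) for _ in range(40_000))

# A single repeated routing value always lands on the same shard.
skewed_counts = shard_counts(["same-value"] * 1_000)
```

If the per-process counts look more like `skewed_counts` than `even_counts`, that would suggest the values being hashed are not as varied as auto-generated IDs should be.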

EmilBode · 2024-02-19T11:36:22Z

Thanks for looking into this.

For what it's worth, the IDs of the documents look like auto-generated IDs to me: 20 characters of a kind of base64-encoding (I'm seeing A-Z, a-z, 0-9, "-" and sometimes a leading underscore).
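
As a rough way to check a larger sample of IDs against that description, a minimal sketch follows. The pattern encodes only the shape described above (20 characters drawn from A-Z, a-z, 0-9, "-" and "_"); the name `looks_auto_generated` is hypothetical, and a match does not prove where an ID came from.

```python
import re

# Shape described above: exactly 20 characters from A-Z, a-z, 0-9, "-" and "_".
AUTO_ID_SHAPE = re.compile(r"[A-Za-z0-9_-]{20}")


def looks_auto_generated(doc_id: str) -> bool:
    """Return True if the ID has the 20-character URL-safe shape described above."""
    return AUTO_ID_SHAPE.fullmatch(doc_id) is not None


def fraction_matching(doc_ids: list[str]) -> float:
    """Return the share of IDs in a sample that match the shape."""
    if not doc_ids:
        raise ValueError("need at least one ID")
    return sum(looks_auto_generated(doc_id) for doc_id in doc_ids) / len(doc_ids)
```

A fraction well below 1.0 on a sample from one process would hint that some IDs are being set by something other than the automatic generator.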

DaveCTurner transferred this issue from elastic/elasticsearch Feb 19, 2024

flobernd added the 7.x (Relates to a 7.x client version) and question labels Feb 19, 2024

flobernd added the Category: Question label and removed the question label Apr 4, 2024
