Uneven distribution of docs across shards, even with auto-generated ids #8041

Open
EmilBode opened this issue Feb 16, 2024 · 2 comments
Labels
7.x Relates to a 7.x client version Category: Question

@EmilBode

Elasticsearch Version

7.17.15

Installed Plugins

No response

Java Version

bundled

OS Version

Ubuntu 20.04.6 LTS

Problem Description

We've been running several parallel processes that all send bulk index requests to a single index. The documents from any single process now seem to be very unevenly distributed across our shards.
Looking at one part, I find that GET indexname/_count?preference=_shards: gives results ranging from 2215 to 143810 documents on a single shard.

Steps to Reproduce

Index creation

PUT myindex
{
  "settings": {
    "number_of_shards": 20,
    "number_of_replicas": 1,
    "refresh_interval": "300s",
    "routing": {
      "allocation": {
        "include": {
          "_tier_preference": "data_warm,data_hot"
        }
      }
    }
  }
}

Bulk indexing

Spin up 6 different .NET projects, which all use the NEST client to bulk-index documents:

ElasticClient = new ElasticClient(connectionSettings);
var results = ElasticClient.BulkAll(objects, b => b
    .Index(myindex)
    .BufferToBulk((descriptor, list) =>
    {
        // Add one index operation per document; no explicit ID is set.
        foreach (var obj in list)
            descriptor.Index(i => i.Document(obj));
    })
    .RefreshOnCompleted(false)
    .MaxDegreeOfParallelism(4)
    .Size(10));

Expected behavior

Even distribution of all documents, also meaning the documents from process 1 are evenly spread, docs from process 2 are evenly spread, etc.

Observed behavior

When looking at all documents together, the spread is reasonably even, but when looking only at documents from a single process, they disproportionately end up on a few shards.

Logs (if relevant)

No response

@DaveCTurner transferred this issue from elastic/elasticsearch Feb 19, 2024
@DaveCTurner

Hi @EmilBode and thanks for your interest.

I can't reproduce this myself with pure Elasticsearch: if you don't specify document IDs then Elasticsearch generates them automatically in a manner that distributes documents across shards evenly. Therefore it seems that either there's something about how you're using the client which is causing it to specify document IDs (with a very skewed distribution for some reason) or the client is itself generating those document IDs. I can't tell which without digging into the .NET client code which I'm not set up to do, so I've transferred this to the elasticsearch-net repository for the attention of the .NET client folks.
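
To illustrate why evenly hashed IDs are expected to spread evenly, here is a minimal simulation sketch. It rests on several assumptions that are not from this thread: IDs are modelled as 20 random URL-safe characters (real auto-generated IDs are not purely random), MD5 stands in for the Murmur3 hash that Elasticsearch uses for routing, and the helper names `random_id`, `shard_for` and `shard_counts` are hypothetical. It only shows the general "hash the routing value, then take it modulo the shard count" idea, not Elasticsearch's exact implementation.

```python
import hashlib
import random
import string
from collections import Counter

NUM_SHARDS = 20  # matches number_of_shards in the index settings of this issue
ID_ALPHABET = string.ascii_letters + string.digits + "-_"


def random_id(rng, length=20):
    """Assumed stand-in for an auto-generated ID: 20 URL-safe characters."""
    return "".join(rng.choice(ID_ALPHABET) for _ in range(length))


def shard_for(doc_id, num_shards=NUM_SHARDS):
    """Hash the routing value (the ID by default) and reduce it modulo the shard count.

    MD5 is only a stand-in here; Elasticsearch itself uses Murmur3.
    """
    digest = hashlib.md5(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % num_shards


def shard_counts(ids):
    """Count how many IDs land on each simulated shard."""
    return Counter(shard_for(doc_id) for doc_id in ids)


rng = random.Random(42)  # fixed seed so the run is repeatable
counts = shard_counts(random_id(rng) for _ in range(100_000))
skew = max(counts.values()) / min(counts.values())
print(f"shards used: {len(counts)}, max/min ratio: {skew:.2f}")
```

Under these assumptions the busiest and quietest simulated shards differ only slightly, which is a very different picture from the 2215 versus 143810 documents reported above.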

@EmilBode
Author

Thanks for looking into this.

For what it's worth, the IDs of the documents look like auto-generated IDs to me: 20 characters of a kind of base64-encoding (I'm seeing A-Z, a-z, 0-9, "-" and sometimes a leading underscore).
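
The format check I did by eye could be sketched roughly like this. The function name `looks_auto_generated` and the sample values are hypothetical, and the pattern only encodes what is described above (20 characters from A-Z, a-z, 0-9, "-" and "_"). Matching the pattern does not prove an ID was generated by Elasticsearch; it only rules out IDs that clearly were not.

```python
import re

# 20 characters from the URL-safe base64 alphabet, as observed in the index.
AUTO_ID_PATTERN = re.compile(r"[A-Za-z0-9_-]{20}")


def looks_auto_generated(doc_id: str) -> bool:
    """Return True if the ID has the shape described above (format check only)."""
    return AUTO_ID_PATTERN.fullmatch(doc_id) is not None


# Hypothetical sample values, not taken from the real index.
samples = ["abcDEF123-_abcDEF123", "order-42", "abcDEF123-_abcDEF12!"]
for sample in samples:
    print(sample, looks_auto_generated(sample))
```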

@flobernd added the 7.x (Relates to a 7.x client version) and question labels Feb 19, 2024