Problem Description
We've been running several parallel processes that all send bulk-index requests to a single index. The documents from any one process now seem to be very unevenly distributed across our shards.
Looking at one part of the data, I find that
GET indexname/_count?preference=_shards:
gives results ranging from 2215 to 143810 documents on a single shard.
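The per-shard check above can be sketched as follows. This is only a rough illustration in Python, independent of any client: `shard_count_paths` and `skew_ratio` are hypothetical helper names, and the shard count passed in is an assumption, since the issue does not state it. Restricting the count to one process would additionally need a query on whatever field identifies that process, which the issue does not show.

```python
from urllib.parse import quote


def shard_count_paths(index: str, num_shards: int) -> list[str]:
    """Build one _count request path per shard, using preference=_shards:<n>."""
    return [
        f"/{quote(index)}/_count?preference=_shards:{n}"
        for n in range(num_shards)
    ]


def skew_ratio(counts: list[int]) -> float:
    """Ratio of the fullest shard to the emptiest one; 1.0 means perfectly even."""
    if not counts or min(counts) <= 0:
        raise ValueError("need at least one shard, each with a positive count")
    return max(counts) / min(counts)
```

Fed the two extremes reported above (2215 and 143810), `skew_ratio` comes out at roughly 65, which gives a single number to track while comparing runs.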
Steps to Reproduce
Spin up 6 different .NET projects, all using the NEST client, to bulk-index documents:
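ElasticClient = new ElasticClient(connectionSettings);
// Send the documents in batches of 10, with up to 4 bulk requests in flight at once
var results = ElasticClient.BulkAll(objects, b => b
    .Index(myindex)
    .BufferToBulk((descriptor, list) =>
    {
        // Add one index operation per document in the batch
        foreach (var obj in list) { descriptor.Index(i => i.Document(obj)); }
    })
    .RefreshOnCompleted(false)
    .MaxDegreeOfParallelism(4)
    .Size(10));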
Expected behavior
Even distribution of all documents, meaning that the documents from process 1 are evenly spread, the documents from process 2 are evenly spread, and so on.
Observed behavior
Looking at all documents together, the spread is reasonably even, but looking only at the documents from a single process, they end up disproportionately on a few shards.
Logs (if relevant)
No response
I can't reproduce this myself with pure Elasticsearch: if you don't specify document IDs then Elasticsearch generates them automatically in a manner that distributes documents across shards evenly. Therefore it seems that either there's something about how you're using the client which is causing it to specify document IDs (with a very skewed distribution for some reason) or the client is itself generating those document IDs. I can't tell which without digging into the .NET client code which I'm not set up to do, so I've transferred this to the elasticsearch-net repository for the attention of the .NET client folks.
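The expectation that well-mixed IDs spread evenly can be illustrated with a toy simulation. This is not Elasticsearch's routing code: it uses SHA-256 as a stand-in for the real hash (Elasticsearch uses Murmur3), random 20-character strings as a stand-in for auto-generated IDs, and a hypothetical shard count of 6. None of these choices come from the issue.

```python
import hashlib
import random
import string
from collections import Counter

NUM_SHARDS = 6  # hypothetical; the issue does not state the shard count


def toy_shard(doc_id: str, num_shards: int = NUM_SHARDS) -> int:
    """Stand-in for routing: hash the ID, then take it modulo the shard count."""
    digest = hashlib.sha256(doc_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def random_id(rng: random.Random, length: int = 20) -> str:
    """Random ID over the URL-safe base64 alphabet (not a real auto-generated ID)."""
    alphabet = string.ascii_letters + string.digits + "-_"
    return "".join(rng.choice(alphabet) for _ in range(length))


def shard_histogram(ids) -> list[int]:
    """Number of IDs landing on each toy shard, indexed by shard number."""
    counts = Counter(toy_shard(i) for i in ids)
    return [counts.get(s, 0) for s in range(NUM_SHARDS)]


rng = random.Random(0)  # fixed seed so the run is repeatable
even = shard_histogram(random_id(rng) for _ in range(60_000))
print(even)
```

With IDs like these, each toy shard should receive close to one sixth of the documents. A spread as lopsided as the one reported would therefore point at the IDs or routing values actually being sent, not at hashing as such.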
For what it's worth, the IDs of the documents look like auto-generated IDs to me: 20 characters of a kind of base64-encoding (I'm seeing A-Z, a-z, 0-9, "-" and sometimes a leading underscore).
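A quick way to check that observation across many documents is to test each ID against the shape described above. This is a minimal sketch: `looks_auto_generated` is a hypothetical helper, and a match only means the ID has that shape (20 characters from A-Z, a-z, 0-9, "-" and "_"). It does not prove which component generated it.

```python
import re

# 20 characters from the URL-safe base64 alphabet, no padding
AUTO_ID_SHAPE = re.compile(r"[A-Za-z0-9_-]{20}")


def looks_auto_generated(doc_id: str) -> bool:
    """True if the ID has the shape described above; says nothing about its origin."""
    return AUTO_ID_SHAPE.fullmatch(doc_id) is not None
```

Running this over a sample of `_id` values from a skewed shard would show whether any IDs deviate from that shape, which would hint at IDs being set explicitly somewhere.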
Elasticsearch Version
7.17.15
Installed Plugins
No response
Java Version
bundled
OS Version
Ubuntu 20.04.6 LTS