New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix Locking for ResignLeadership #20837
Conversation
arangod/Cluster/Maintenance.cpp
Outdated
// check is we have collected information about this prototype already | ||
auto it = prototypeShards.find(prototype); | ||
if (it == prototypeShards.end()) { | ||
auto protoCol = collections.get(prototype); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I had a hard time read protoCol
because I was intuitively thinking of protocol
(as in HTTP, HTTPS, etc.). I get that it means protoCollection
here, but naming it protoCol
is slightly ambiguous, so I suggest using protoCollection
instead.
auto col = vocbase->lookupCollection(s); | ||
if (col == nullptr) { | ||
std::stringstream error; | ||
error << "Failed to lookup local collection " << collection |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Probably we should include the shard name in the error message here as well, to have more detail should this error happen, e.g.
"Failed to lookup shard s of local collection collection"
…rship-in-correct-order
Co-authored-by: Jan <jsteemann@users.noreply.github.com>
Scope & Purpose
While streaming transactions and aql queries carefully lock multiple shards in a deterministic order, the maintenance did not care about such a order and just locked shards when resigning in any order. In fact the maintenance has no idea about distributeShardsLike and handles each shard of the shard group individually. A move shard requests locking of all shards in a shard group. This can lead to deadlocks with aql queries, since the locks are acquired in different orders.