HDDS-10715. Remove Decommission nodes on replication #6558
base: master
Conversation
My first thought on this is that the placement policy should exclude decommissioning nodes itself. E.g. we already have the entry point in SCMCommonPlacementPolicy, and it has access to NodeManager I think, so we could add to the exclude list right at the start:
This reverts commit 6a52fe6.
@sodonnel IMHO, expanding the excludedNodes list within the implementation of the chooseDatanodes method may indeed deviate from the original intent of the interface. Modifying these parameters in the implementation could confuse users of the interface, as they generally do not expect the parameters they pass in to be altered.
But it is wrong for the placement policy to return a decommissioning node. It does indeed filter out any decommissioning nodes it finds and retries, so it is "removing them", but in a sub-optimal way. The caller should not need to know all the illegal node types it must pass into the placement policy. What if there is another illegal node type in the future? Then we would have to modify all the callers, rather than a single place. You are probably correct that it is not good to modify the passed parameter list, but we can copy it into a new list that is used inside the placement policy and add to that copy.
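The copy-then-expand idea can be sketched as follows. This is a standalone illustration, not Ozone code: `OpState`, `Dn`, and `PlacementSketch` are simplified stand-ins for NodeOperationalState, DatanodeDetails, and the placement policy, and the selection loop omits the real space and topology checks.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Simplified stand-ins for Ozone's NodeOperationalState and DatanodeDetails.
enum OpState { IN_SERVICE, DECOMMISSIONING, DECOMMISSIONED,
               ENTERING_MAINTENANCE, IN_MAINTENANCE }

record Dn(String id, OpState state) { }

class PlacementSketch {
  private final List<Dn> allNodes;  // stands in for NodeManager's node table

  PlacementSketch(List<Dn> allNodes) {
    this.allNodes = allNodes;
  }

  // Copy the caller's exclude list instead of mutating it, then add every
  // node that is not IN_SERVICE so selection never picks one of them.
  List<Dn> chooseDatanodes(List<Dn> excludedNodes, int required) {
    Set<Dn> excludes = new HashSet<>(excludedNodes);  // defensive copy
    for (Dn n : allNodes) {
      if (n.state() != OpState.IN_SERVICE) {
        excludes.add(n);
      }
    }
    List<Dn> chosen = new ArrayList<>();
    for (Dn n : allNodes) {
      if (chosen.size() == required) {
        break;
      }
      if (!excludes.contains(n)) {
        chosen.add(n);  // a real policy would also check space, racks, etc.
      }
    }
    return chosen;
  }
}
```

Because the caller's list is copied, the interface contract the earlier comment worries about is preserved: the passed-in parameter is never altered.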
I think there are other places in the code where the placement policies are called too, e.g. pipeline creation, which could run into the same sort of problems. If we fix it inside the placement policy then it covers all existing scenarios.
@sodonnel Agreed. Updated the PR, PTAL.
      metadataSizeRequired, dataSizeRequired);
}

private List<DatanodeDetails> expandExcludes(List<DatanodeDetails> original) {
  Set<DatanodeDetails> expandedExcludes = new HashSet<>(original);
  List<DatanodeDetails> list1 = nodeManager.getNodes(NodeOperationalState.DECOMMISSIONING, null);
We probably need to include the two maintenance states too, which means 4 iterations across all nodes in the cluster to find them. As this is a somewhat hot path, especially for EC pipeline allocation, I wonder if this will hurt performance on larger clusters.
Partly this is due to the NodeManager interface - there is no method to find all nodes not in_service in a single iteration, which is really what we want.
At the end of the selection, we have to check if the nodes are still good, have enough space etc.
I wonder if we could do something smarter with the retry count to make it try more times on a larger cluster?
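One hypothetical way to act on the retry-count idea above (the method name and the formula here are my own illustration, not existing Ozone code): inflate the base retry budget in proportion to the fraction of nodes that are out of service, so a fully healthy cluster pays nothing extra.

```java
class RetrySketch {
  // Hypothetical heuristic: if a fraction p of nodes is out of service, a
  // random pick fails with probability roughly p, so inflate the base retry
  // budget by 1 / (1 - p). A fully healthy cluster keeps the base value.
  static int adaptiveRetryCount(int baseRetries, int totalNodes,
                                int outOfServiceNodes) {
    if (totalNodes <= 0 || outOfServiceNodes <= 0) {
      return baseRetries;  // common case: no penalty
    }
    double inService =
        (double) (totalNodes - outOfServiceNodes) / totalNodes;
    int scaled = (int) Math.ceil(baseRetries / Math.max(inService, 0.01));
    return Math.max(baseRetries, scaled);
  }
}
```

With a base of 3 retries, a cluster with half its nodes out of service would get 6 attempts, while a healthy cluster keeps the default and avoids any extra node-table scans.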
@sodonnel PTAL.
 * @param health - The health of the node
 * @return List of Datanodes that are Heartbeating SCM.
 */
default List<DatanodeDetails> getNodes(
This new method does not address the performance concern I had. It is basically calling the original getNodes() method for each of the 4 out of service states. Each of those calls has to iterate all nodes in the cluster and return a set of the ones which are out of service.
The nodes picked by the policy have to be checked again before they are returned. While I originally suggested this solution, I am not sure it is a good one. It may be better to look at the retry count, and allow more retries if the failure reason is that the node is not in service. Or have a larger retry count if the cluster is large etc. At least then, the common case, which is no nodes out of service, does not pay a performance penalty on every call.
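The single-iteration query that the comments above describe as missing from the NodeManager interface could look roughly like this. It is a standalone sketch under assumed types: `NodeState`, `NodeInfo`, and `NodeQuerySketch` are simplified stand-ins, not the real NodeManager API.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-ins for NodeOperationalState / DatanodeDetails.
enum NodeState { IN_SERVICE, DECOMMISSIONING, DECOMMISSIONED,
                 ENTERING_MAINTENANCE, IN_MAINTENANCE }

record NodeInfo(String id, NodeState state) { }

class NodeQuerySketch {
  // One pass over the node table collects every node whose operational state
  // is not IN_SERVICE, instead of one full scan per out-of-service state.
  static List<NodeInfo> getAllNotInService(List<NodeInfo> allNodes) {
    List<NodeInfo> result = new ArrayList<>();
    for (NodeInfo n : allNodes) {
      if (n.state() != NodeState.IN_SERVICE) {
        result.add(n);
      }
    }
    return result;
  }
}
```

A method of this shape would reduce the four scans (decommissioning, decommissioned, entering maintenance, in maintenance) to one, though it still pays one full iteration on every call, which is why the comment above prefers avoiding the scan entirely in the common healthy-cluster case.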
@sodonnel IMHO, the current change should be the long-term solution. Raising maxRetries is what we currently use in our cluster, but it is not a good approach as the cluster gets bigger. For the getNodes part, we should improve the performance for general usage.
@symious Thanks for working on this. I have a minor comment.
    "TotalNode = " + datanodeCount + " RequiredNode = " + nodesRequired +
    " ExcludedNode = " + excludedNodesCount +
    " UsedNode = " + usedNodesCount, null);
Set<DatanodeDetails> unavailableNodes = JavaUtils.unionOfCollections(usedNodes, excludedNodes);
unavailableNodes is not used, only its count, so we can compute unavailableCount = usedNodesCount + excludedNodesCount directly. The unionOfCollections call seems unnecessary in this case.
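The reviewer's suggestion can be shown with a small sketch. It assumes, as the surrounding code implies, that the used and excluded lists are disjoint; if they could overlap, the union size and the plain sum would differ, and the union would be needed after all.

```java
import java.util.List;

class CountSketch {
  // Assuming the two lists are disjoint, the number of unavailable nodes is
  // simply the sum of their sizes; no union set needs to be built.
  static int unavailableCount(List<String> usedNodes,
                              List<String> excludedNodes) {
    return usedNodes.size() + excludedNodes.size();
  }
}
```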
What changes were proposed in this pull request?
For containers with insufficient replicas, a new target will be chosen for replication.
Decommissioning nodes are excluded in the end, but each one wastes a retry (the default retry count is 3), so containers can fail to be replicated when a cluster has many nodes in a decommissioning state.
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-10715
How was this patch tested?
Existing tests.