
HDDS-10715. Remove Decommision nodes on replication #6558

Open · wants to merge 8 commits into base: master

Conversation

symious
Contributor

@symious symious commented Apr 19, 2024

What changes were proposed in this pull request?

For containers with insufficient replicas, a new target will be chosen for replication.

For decommissioned nodes, although they are eventually excluded, each one wastes a placement retry. The retry count defaults to 3, so containers can fail to be replicated when a cluster has many nodes in the decommissioning state.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-10715

How was this patch tested?

Existing tests.

@sodonnel
Contributor

My first thought on this: since the placement policy already excludes decommissioning nodes via the isValidateNode method at the end, could we simply add all decommissioning nodes (and we probably need to include maintenance nodes too) to the exclude list at the start of the placement policy? That way it would fix it for all callers in one place.

Eg we already have this as the entry point in SCMCommonPlacementPolicy and it has access to NodeManager I think, so we could add to the exclude list right at the start:

  @Override
  public final List<DatanodeDetails> chooseDatanodes(
          List<DatanodeDetails> usedNodes,
          List<DatanodeDetails> excludedNodes,
          List<DatanodeDetails> favoredNodes,
          int nodesRequired, long metadataSizeRequired, long dataSizeRequired)
          throws SCMException {
/*
  This method calls the chooseDatanodeInternal after fixing
  the excludeList to get the DatanodeDetails from the node manager.
  When the object of the Class DataNodeDetails is built from protobuf
  only UUID of the datanode is added which is used for the hashcode.
  Thus not passing any information about the topology. While excluding
  datanodes the object is built from protobuf @Link {ExcludeList.java}.
  NetworkTopology removes all nodes from the list which does not fall under
  the scope while selecting a random node. Default scope value is
  "/default-rack/" which won't match the required scope. Thus passing the proper
  object of DatanodeDetails(with Topology Information) while trying to get the
  random node from NetworkTopology should fix this. Check HDDS-7015
 */
    return chooseDatanodesInternal(validateDatanodes(usedNodes),
            validateDatanodes(excludedNodes), favoredNodes, nodesRequired,
            metadataSizeRequired, dataSizeRequired);
  }

@symious
Contributor Author

symious commented Apr 19, 2024

@sodonnel IMHO, expanding the excludedNodes list within the implementation of the chooseDatanodes method may indeed deviate from the original intent of the interface. Modifying these parameters in the implementation could confuse users of the interface, as they generally do not expect the parameters they pass to be altered.

@sodonnel
Contributor

But it is wrong for the placement policy to return a decommissioning node. It does indeed filter out any decommission nodes it finds and retries. So it is "removing them" but in a sub-optimal way.

The caller should not need to know all the illegal nodes it must pass into the placement policy. What if there is another illegal node type in the future? Then we would have to modify all the callers, rather than a single place.

You are probably correct that it is not good to modify the passed parameter list, but we can copy it into a new list that is used inside the placement policy and add to that copy.
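A defensive copy along these lines could look like the following simplified sketch. Node IDs are modeled as plain Strings instead of DatanodeDetails, and the out-of-service list is supplied directly rather than fetched from NodeManager; both are assumptions for illustration, not the actual Ozone API:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ExcludeListSketch {
  // Copy the caller's exclude list and add every out-of-service node to
  // the copy; the list the caller passed in is never mutated.
  static List<String> withOutOfServiceExcluded(List<String> callerExcludes,
                                               List<String> outOfServiceNodes) {
    Set<String> expanded = new HashSet<>(callerExcludes);
    expanded.addAll(outOfServiceNodes);
    return new ArrayList<>(expanded);
  }
}
```

The caller's list stays untouched, so the interface contract that parameters are not modified is preserved.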

@sodonnel
Contributor

I think there are other places in the code where the placement policies are called too. Eg pipeline creation, which could run into the same sort of problems I think. If we fix it inside the placement policy then it covers all existing scenarios.

@symious
Contributor Author

symious commented Apr 23, 2024

I think there are other places in the code where the placement policies are called too. Eg pipeline creation, which could run into the same sort of problems I think. If we fix it inside the placement policy then it covers all existing scenarios.

@sodonnel Agreed. Updated the PR, PTAL.

            metadataSizeRequired, dataSizeRequired);
  }

  private List<DatanodeDetails> expandExcludes(List<DatanodeDetails> original) {
    Set<DatanodeDetails> expandedExcludes = new HashSet<>(original);
    List<DatanodeDetails> list1 = nodeManager.getNodes(NodeOperationalState.DECOMMISSIONING, null);
Contributor

We probably need to include the two maintenance states too, which means 4 iterations across all nodes in the cluster to find them. As this is a somewhat hot path, especially for EC pipeline allocation, I wonder if this will hurt performance on larger clusters.

Partly this is due to the NodeManager interface - there is no method to find all nodes not in_service in a single iteration, which is really what we want.
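Such a single-pass query is not in the NodeManager interface today; a hypothetical sketch of what it could look like, with the operational states and node map as simplified stand-ins for the real Ozone types:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class NodeQuerySketch {
  // Simplified stand-in for HddsProtos.NodeOperationalState.
  enum OpState { IN_SERVICE, DECOMMISSIONING, DECOMMISSIONED,
                 ENTERING_MAINTENANCE, IN_MAINTENANCE }

  // One iteration over all nodes, collecting every node that is not
  // IN_SERVICE, instead of four separate getNodes() calls.
  static List<String> nodesNotInService(Map<String, OpState> nodeStates) {
    List<String> result = new ArrayList<>();
    for (Map.Entry<String, OpState> e : nodeStates.entrySet()) {
      if (e.getValue() != OpState.IN_SERVICE) {
        result.add(e.getKey());
      }
    }
    return result;
  }
}
```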

At the end of the selection, we have to check if the nodes are still good, have enough space etc.

I wonder if we could do something smarter with the retry count to make it try more times on a larger cluster?
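One way to make the retry count cluster-size aware might be the following sketch; the default of 3 matches the behavior described above, but the scaling factor is purely illustrative and not an existing Ozone config:

```java
class RetrySketch {
  static final int DEFAULT_RETRIES = 3;

  // Never go below the default, but grant one extra retry per 20 nodes so
  // large clusters with many out-of-service nodes still find valid targets.
  static int maxRetries(int totalNodes) {
    return Math.max(DEFAULT_RETRIES, totalNodes / 20);
  }
}
```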

@symious
Contributor Author

symious commented Apr 29, 2024

@sodonnel PTAL.

   * @param health - The health of the node
   * @return List of Datanodes that are Heartbeating SCM.
   */
  default List<DatanodeDetails> getNodes(
Contributor

This new method does not address the performance concern I had. It basically calls the original getNodes() method for each of the 4 out-of-service states, and each of those calls has to iterate all nodes in the cluster to return the ones which are out of service.

The nodes picked by the policy have to be checked again before they are returned. While I originally suggested this solution, I am not sure it is a good one. It may be better to look at the retry count and allow more retries if the failure reason is that the node is not in service, or use a larger retry count on a larger cluster. At least then the common case, with no nodes out of service, does not pay a performance penalty on every call.
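The alternative floated here, not charging the retry budget when a candidate is merely out of service, could be sketched roughly as follows. The predicates and the plain-String node list are hypothetical stand-ins for the real placement-policy checks:

```java
import java.util.List;
import java.util.Random;
import java.util.function.Predicate;

class SelectionSketch {
  // Pick a random node; only failures other than "out of service" consume
  // a retry, so clusters with no out-of-service nodes pay no extra cost.
  static String pickNode(List<String> nodes, Predicate<String> inService,
                         Predicate<String> otherwiseValid, int maxRetries) {
    Random rand = new Random(42); // fixed seed keeps the sketch deterministic
    int retries = 0;
    int attempts = 0;
    int attemptCap = nodes.size() * 10; // guard against spinning forever
    while (retries < maxRetries && attempts < attemptCap) {
      attempts++;
      String candidate = nodes.get(rand.nextInt(nodes.size()));
      if (!inService.test(candidate)) {
        continue; // out of service: skip without consuming a retry
      }
      if (otherwiseValid.test(candidate)) {
        return candidate;
      }
      retries++; // a real failure (e.g. not enough space) consumes a retry
    }
    return null;
  }
}
```

The attempt cap keeps the loop bounded even when most of the cluster is out of service.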

Contributor Author

@sodonnel IMHO the current change is the long-term solution. Raising maxRetries is what we currently do in our cluster, but it is not a good approach as the cluster grows.

As for the getNodes part, we should improve its performance for general usage.

Contributor

@sumitagrawl sumitagrawl left a comment

@symious Thanks for working on this; I have a minor comment.

            "TotalNode = " + datanodeCount + " RequiredNode = " + nodesRequired +
                " ExcludedNode = " + excludedNodesCount +
                " UsedNode = " + usedNodesCount, null);
    Set<DatanodeDetails> unavailableNodes = JavaUtils.unionOfCollections(usedNodes, excludedNodes);
Contributor

unavailableNodes itself is not used, only its count, so we could instead compute unavailableCount = usedNodesCount + excludedNodesCount. The unionOfCollections call seems unnecessary for this case.
