
HDDS-10715. Remove Decommision nodes on replication #6558

Open · wants to merge 8 commits into base: master

Conversation

symious
Contributor

@symious symious commented Apr 19, 2024

What changes were proposed in this pull request?

For containers with insufficient replicas, a new target will be chosen for replication.

For decommissioned nodes, although they are eventually excluded, each one wastes a placement retry. The retry count defaults to 3, so containers can fail to be replicated when a cluster has many nodes in the decommissioning state.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-10715

How was this patch tested?

Existing tests.

@sodonnel
Contributor

My first thought on this: since the placement policy already excludes decommissioning nodes via the isValidateNode method at the end, could we simply add all decommissioning nodes (and we probably need to include maintenance nodes too) to the exclude list at the start of the placement policy? That way it would fix it for all callers in one place.

Eg we already have this as the entry point in SCMCommonPlacementPolicy and it has access to NodeManager I think, so we could add to the exclude list right at the start:

  @Override
  public final List<DatanodeDetails> chooseDatanodes(
          List<DatanodeDetails> usedNodes,
          List<DatanodeDetails> excludedNodes,
          List<DatanodeDetails> favoredNodes,
          int nodesRequired, long metadataSizeRequired, long dataSizeRequired)
          throws SCMException {
/*
  This method calls the chooseDatanodeInternal after fixing
  the excludeList to get the DatanodeDetails from the node manager.
  When the object of the Class DataNodeDetails is built from protobuf
  only UUID of the datanode is added which is used for the hashcode.
  Thus not passing any information about the topology. While excluding
  datanodes the object is built from protobuf @Link {ExcludeList.java}.
  NetworkTopology removes all nodes from the list which does not fall under
  the scope while selecting a random node. Default scope value is
  "/default-rack/" which won't match the required scope. Thus passing the proper
  object of DatanodeDetails(with Topology Information) while trying to get the
  random node from NetworkTopology should fix this. Check HDDS-7015
 */
    return chooseDatanodesInternal(validateDatanodes(usedNodes),
            validateDatanodes(excludedNodes), favoredNodes, nodesRequired,
            metadataSizeRequired, dataSizeRequired);
  }

@symious
Contributor Author

symious commented Apr 19, 2024

@sodonnel IMHO, expanding the excludedNodes list within the implementation of the chooseDatanodes method may indeed deviate from the original intent of the interface. Modifying these parameters in the implementation could confuse users of the interface, as they generally do not expect the parameters they pass to be altered.

@sodonnel
Contributor

But it is wrong for the placement policy to return a decommissioning node. It does indeed filter out any decommission nodes it finds and retries. So it is "removing them" but in a sub-optimal way.

The caller should not need to know all the illegal nodes it must pass into the placement policy. What if there is another illegal node type in the future? Then we would have to modify all the callers, rather than a single place.

You are probably correct that it is not good to modify the passed parameter list, but we can copy it into a new list that is used inside the placement policy and add to that copy.
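A defensive copy along these lines could look like the following simplified sketch. Node IDs are modeled as plain Strings instead of DatanodeDetails, and the out-of-service list is supplied directly rather than fetched from NodeManager; both are assumptions for illustration, not the actual Ozone API:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ExcludeListSketch {
  // Copy the caller's exclude list and add every out-of-service node to
  // the copy; the list the caller passed in is never mutated.
  static List<String> withOutOfServiceExcluded(List<String> callerExcludes,
                                               List<String> outOfServiceNodes) {
    Set<String> expanded = new HashSet<>(callerExcludes);
    expanded.addAll(outOfServiceNodes);
    return new ArrayList<>(expanded);
  }
}
```

The caller's list stays untouched, so the interface contract that parameters are not modified is preserved.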

@sodonnel
Contributor

I think there are other places in the code where the placement policies are called too. Eg pipeline creation, which could run into the same sort of problems I think. If we fix it inside the placement policy then it covers all existing scenarios.

@symious
Contributor Author

symious commented Apr 23, 2024

I think there are other places in the code where the placement policies are called too. Eg pipeline creation, which could run into the same sort of problems I think. If we fix it inside the placement policy then it covers all existing scenarios.

@sodonnel Agreed. Updated the PR, PTAL.

            metadataSizeRequired, dataSizeRequired);
  }

  private List<DatanodeDetails> expandExcludes(List<DatanodeDetails> original) {
    Set<DatanodeDetails> expandedExcludes = new HashSet<>(original);
    List<DatanodeDetails> list1 = nodeManager.getNodes(NodeOperationalState.DECOMMISSIONING, null);
Contributor

We probably need to include the two maintenance states too, which means 4 iterations across all nodes in the cluster to find them. As this is a somewhat hot path, especially for EC pipeline allocation, I wonder if this will hurt performance on larger clusters.

Partly this is due to the NodeManager interface - there is no method to find all nodes not in_service in a single iteration, which is really what we want.
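Such a single-pass query is not in the NodeManager interface today; a hypothetical sketch of what it could look like, with the operational states and node map as simplified stand-ins for the real Ozone types:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class NodeQuerySketch {
  // Simplified stand-in for HddsProtos.NodeOperationalState.
  enum OpState { IN_SERVICE, DECOMMISSIONING, DECOMMISSIONED,
                 ENTERING_MAINTENANCE, IN_MAINTENANCE }

  // One iteration over all nodes, collecting every node that is not
  // IN_SERVICE, instead of four separate getNodes() calls.
  static List<String> nodesNotInService(Map<String, OpState> nodeStates) {
    List<String> result = new ArrayList<>();
    for (Map.Entry<String, OpState> e : nodeStates.entrySet()) {
      if (e.getValue() != OpState.IN_SERVICE) {
        result.add(e.getKey());
      }
    }
    return result;
  }
}
```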

At the end of the selection, we have to check if the nodes are still good, have enough space etc.

I wonder if we could do something smarter with the retry count to make it try more times on a larger cluster?
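One way to make the retry count cluster-size aware might be the following sketch; the default of 3 matches the behavior described above, but the scaling factor is purely illustrative and not an existing Ozone config:

```java
class RetrySketch {
  static final int DEFAULT_RETRIES = 3;

  // Never go below the default, but grant one extra retry per 20 nodes so
  // large clusters with many out-of-service nodes still find valid targets.
  static int maxRetries(int totalNodes) {
    return Math.max(DEFAULT_RETRIES, totalNodes / 20);
  }
}
```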

@symious
Contributor Author

symious commented Apr 29, 2024

@sodonnel PTAL.

   * @param health - The health of the node
   * @return List of Datanodes that are Heartbeating SCM.
   */
  default List<DatanodeDetails> getNodes(
Contributor

This new method does not address the performance concern I had. It basically calls the original getNodes() method for each of the 4 out-of-service states, and each of those calls has to iterate all nodes in the cluster to return the ones which are out of service.

The nodes picked by the policy have to be checked again before they are returned. While I originally suggested this solution, I am not sure it is a good one. It may be better to look at the retry count and allow more retries if the failure reason is that the node is not in service, or use a larger retry count on a larger cluster. At least then the common case, with no nodes out of service, does not pay a performance penalty on every call.
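The alternative floated here, not charging the retry budget when a candidate is merely out of service, could be sketched roughly as follows. The predicates and the plain-String node list are hypothetical stand-ins for the real placement-policy checks:

```java
import java.util.List;
import java.util.Random;
import java.util.function.Predicate;

class SelectionSketch {
  // Pick a random node; only failures other than "out of service" consume
  // a retry, so clusters with no out-of-service nodes pay no extra cost.
  static String pickNode(List<String> nodes, Predicate<String> inService,
                         Predicate<String> otherwiseValid, int maxRetries) {
    Random rand = new Random(42); // fixed seed keeps the sketch deterministic
    int retries = 0;
    int attempts = 0;
    int attemptCap = nodes.size() * 10; // guard against spinning forever
    while (retries < maxRetries && attempts < attemptCap) {
      attempts++;
      String candidate = nodes.get(rand.nextInt(nodes.size()));
      if (!inService.test(candidate)) {
        continue; // out of service: skip without consuming a retry
      }
      if (otherwiseValid.test(candidate)) {
        return candidate;
      }
      retries++; // a real failure (e.g. not enough space) consumes a retry
    }
    return null;
  }
}
```

The attempt cap keeps the loop bounded even when most of the cluster is out of service.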

Contributor Author

@sodonnel IMHO the current change is the long-term solution. Raising maxRetries is what we currently do in our cluster, but it is not a good approach as the cluster grows.

As for the getNodes part, we should improve its performance for general usage.

Contributor

@sumitagrawl sumitagrawl left a comment

@symious Thanks for working on this; I have a minor comment.

            "TotalNode = " + datanodeCount + " RequiredNode = " + nodesRequired +
                " ExcludedNode = " + excludedNodesCount +
                " UsedNode = " + usedNodesCount, null);
    Set<DatanodeDetails> unavailableNodes = JavaUtils.unionOfCollections(usedNodes, excludedNodes);
Contributor

unavailableNodes itself is not used, only its count, so we could instead compute unavailableCount = usedNodesCount + excludedNodesCount. The unionOfCollections call seems unnecessary for this case.
