
JAVA-3051: Memory leak #1743

Open · SiyaoIsHiding wants to merge 18 commits into 4.x

Conversation

SiyaoIsHiding (Contributor)

No description provided.

@SiyaoIsHiding marked this pull request as ready for review on November 2, 2023, 22:14
@SiyaoIsHiding (Contributor, Author)

We explored several alternatives for unit testing this memory leak fix but ultimately decided not to test it, as we could not find a reliable way to verify the garbage collector's behavior.

The unit test we wrote and considered was:

@Test
public void should_garbage_collect_without_strong_references() {
  // given that
  given(nodeDistanceEvaluator.evaluateDistance(weakNode1, null)).willReturn(NodeDistance.IGNORED);
  given(nodeDistanceEvaluator.evaluateDistance(weakNode2, null)).willReturn(NodeDistance.IGNORED);
  // weak references to probe the private WeakHashMap in LoadBalancingPolicyWrapper.distances
  WeakReference<DefaultNode> weakReference1 = new WeakReference<>(weakNode1);
  WeakReference<DefaultNode> weakReference2 = new WeakReference<>(weakNode2);
  wrapper.init();
  // remove all the strong references to weakNode2, including the ones held by Mockito
  weakNode2 = null;
  reset(metricsFactory);
  reset(distanceReporter);
  reset(nodeDistanceEvaluator);
  reset(metadata);
  // verify
  System.gc();
  assertThat(weakReference1.get()).isNotNull();
  await().atMost(10, TimeUnit.SECONDS)
      .until(() -> weakReference2.get() == null);
}

This test:

  1. creates two DefaultNode instances
  2. creates two WeakReferences pointing to the nodes, just to probe their existence later
  3. initializes the policy
  4. clears all the strong references to the second node
  5. requests garbage collection
  6. verifies that the second node is collected, while the first, still strongly referenced, is not

We checked the following:

  1. In my local environment (Zulu 8.72.0.17-CA-macos-aarch64), this test succeeds, and if I revert the WeakHashMap back to a strong HashMap, it fails.
  2. Before all the strong references are cleared, inspecting the heap shows that the objects referring to weakNode2 are:
    a. weakNode2 and weakReference2
    b. the InterceptedInvocations held by Mockito
    c. the HashMap of allNodes stored by when(metadata.getNodes()).thenReturn(allNodes);
    d. the wanted reference in the Equals statement of await().atMost(10, TimeUnit.SECONDS).until(() -> weakReference2.get() == null);
    These are all expected, and no reference is leaked.
  3. If evaluateDistance does not return IGNORED for a node, the node will be stored in BasicLoadBalancingPolicy.liveNodes. But nodes there can be removed later by onDown or onRemoved, which we assume is intended.

We considered the following:

According to this post, System.gc() is more of a request/hint that some JVMs will ignore, and there is no reliable way to force garbage collection. This means the test above may fail in other environments, and the last thing we want is a flaky test.

We think workarounds like generating a huge amount of garbage to trigger garbage collection may not be worth it, either.

Therefore, we concluded that having no test may be the best choice for now, and that the checks we performed above may be sufficient.

@hhughes (Contributor) left a comment

This change will allow entries to be dropped from LoadBalancingPolicyWrapper#distances, but I'm not entirely convinced there aren't other places where strong references to the Node object will remain.

In DefaultLoadBalancingPolicy there are responseTimes and upTimes maps which use Node as the key in a ConcurrentHashMap, and I don't see where entries are ever removed, so these will likely continue to hold references to the nodes (although with upTimes it doesn't look like items are ever added).

ControlConnection maintains two weak hash maps - lastDistanceEvents and lastStateEvents - where both value types, DistanceEvent and NodeStateEvent, hold a hard reference to the Node. Per the WeakHashMap docs, it looks like this will prevent the entries from being cleaned up:

The value objects in a WeakHashMap are held by ordinary strong references. Thus care should be taken to ensure that value objects do not strongly refer to their own keys, either directly or indirectly, since that will prevent the keys from being discarded

Likely there are more places too.
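
To make that caveat concrete, here is a minimal, self-contained sketch (hypothetical Event type, not driver code) showing that a WeakHashMap entry whose value strongly references its own key is effectively never collected:

    import java.lang.ref.WeakReference;
    import java.util.Map;
    import java.util.WeakHashMap;

    public class WeakHashMapCaveat {
      // Hypothetical value type that, like DistanceEvent/NodeStateEvent, keeps a hard
      // reference to the node it was created for.
      static final class Event {
        final Object node;
        Event(Object node) { this.node = node; }
      }

      public static void main(String[] args) throws InterruptedException {
        Map<Object, Event> lastEvents = new WeakHashMap<>();
        Object node = new Object();
        lastEvents.put(node, new Event(node)); // value strongly refers to its own key
        WeakReference<Object> probe = new WeakReference<>(node);
        node = null; // drop our only direct strong reference

        for (int i = 0; i < 10 && probe.get() != null; i++) {
          System.gc(); // only a hint, as discussed earlier in this PR
          Thread.sleep(100);
        }
        // The key typically survives: the map strongly holds the value, which holds the key.
        System.out.println("key collected? " + (probe.get() == null) + ", map size: " + lastEvents.size());
      }
    }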

I think it would be good to set up at least a one-off test which reproduces the leak from the original ticket and confirms that this change (and possibly the others mentioned above) successfully prevents the leak, before marking this one as completed.

@hhughes (Contributor) left a comment

Looks good! A couple of minor pieces of feedback; most importantly, I think we want to avoid logging at info level when ignoring events which have lost their node reference, as this might end up creating a lot of unnecessary log churn.

return true;
AtomicLongArray array = responseTimes.getIfPresent(node);
if (array == null) return true;
else if (array.length() == 2) {
Contributor:

nit: consider collapsing the null check into this conditional so there is only one irregular state return value (return true)
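
For illustration, a sketch of the collapsed form, reconstructed from the hunks quoted in this review (the method signature is assumed, not shown in the diff):

    // Sketch only: a single irregular-state early return instead of null/else branches.
    protected boolean isResponseRateInsufficient(Node node, long now) {
      AtomicLongArray array = responseTimes.getIfPresent(node);
      if (array == null || array.length() != 2) {
        return true; // no (or malformed) sample recorded yet
      }
      long threshold = now - RESPONSE_COUNT_RESET_INTERVAL_NANOS;
      long leastRecent = array.get(0);
      return leastRecent - threshold < 0;
    }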

policy.onUp(event.node);
DefaultNode node = event.node.get();
if (node == null) {
LOG.info("[{}] Node for this event was removed, ignoring: {}", logPrefix, event);
Contributor:

Info-level log might be a bit high for this notice, as there isn't really an action the user should take when this happens. Consider dropping to debug/trace.

if (event.newState == NodeState.UP) {
policy.onUp(event.node);
DefaultNode node = event.node.get();
if (node == null) {
Contributor:

nit: do we need to re-perform the null check for every policy? Is there a good reason not to pull this out of the loop?
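
One possible shape of that suggestion, sketched with hypothetical surrounding names (the loop over per-profile policies is assumed from context):

    // Sketch only: resolve the weak reference once, before iterating over the policies.
    DefaultNode node = event.node.get();
    if (node == null) {
      LOG.debug("[{}] Node for this event was removed, ignoring: {}", logPrefix, event);
      return;
    }
    for (LoadBalancingPolicy policy : policies.values()) {
      if (event.newState == NodeState.UP) {
        policy.onUp(node);
      } else if (event.newState == NodeState.DOWN) {
        policy.onDown(node);
      }
    }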

@@ -53,10 +54,10 @@ public static NodeStateEvent removed(DefaultNode node) {
*/
public final NodeState newState;

public final DefaultNode node;
public final WeakReference<DefaultNode> node;
Contributor:

Nit: Since we're changing the type here, I'm wondering if it might be cleaner to provide a @Nullable getter for DefaultNode, rather than exposing the weak reference directly. Same comment for DistanceEvent.node.
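
A sketch of what that getter might look like on NodeStateEvent (field visibility and annotation package are assumptions; the same shape would apply to DistanceEvent):

    private final WeakReference<DefaultNode> node;

    /** @return the node this event refers to, or null if it has already been garbage-collected. */
    @Nullable
    public DefaultNode getNode() {
      return node.get();
    }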

context.getNodeStateListener().onRemove(event.node);
DefaultNode node = event.node.get();
if (node == null) {
LOG.info(
Contributor:

Nit: I think info-level is too high here

@@ -119,15 +120,22 @@ public NodeMetricUpdater newNodeUpdater(Node node) {
}

protected void processNodeStateEvent(NodeStateEvent event) {
DefaultNode node = event.node.get();
if (node == null) {
LOG.info(
Contributor:

Nit: I think info-level is too high here

@@ -121,16 +122,22 @@ public NodeMetricUpdater newNodeUpdater(Node node) {
}

protected void processNodeStateEvent(NodeStateEvent event) {
DefaultNode node = event.node.get();
if (node == null) {
LOG.info(
Contributor:

Nit: I think info-level is too high here

@hhughes (Contributor) left a comment

LGTM

@aratno (Contributor) left a comment

Jira link for my own reference: https://datastax-oss.atlassian.net/browse/JAVA-3051

It would be helpful to have logs from the repro that the user submitted, specifically the logs from com.datastax.oss.driver.internal.core.metadata.NodeStateManager.SingleThreaded#setState that look like "[{}] Transitioning {} {}=>{} (because {})".

This seems like a bug in AWS Keyspaces, since each node includes itself in system.peers, which the driver does not expect, according to the user's report:

[s0] Control node has an entry for itself in system.peers: this entry will be ignored. This is likely due to a misconfiguration; please verify your rpc_address configuration in cassandra.yaml on all nodes in your cluster.

I left a few comments but otherwise these seem like generally positive changes. I agree that it's difficult to write unit tests for memory leaks like these, especially without any scaffolding around heapdump capture or parsing. I'm a bit concerned that there may be paths where a node event may not have any strong reference and is then garbage-collected and ignored by handlers, rather than surviving long enough to serve its purpose.

@@ -82,6 +82,7 @@ public void markReady() {
consumer.accept(event);
}
} finally {
recordedEvents.clear();
Contributor:

What's the reasoning for this change?

Contributor (Author):

ReplayingEventFilter works like a buffer. It holds a list (queue) of events from the time its state becomes STARTED, and consumes all of them at once when its state becomes READY. However, the list of events was never cleared, so it kept leaking strong references to the nodes.
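
A minimal self-contained sketch of that buffering behavior and of the fix (this is not the driver's actual ReplayingEventFilter, just the shape of the leak and of the clear() added in this PR):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.Consumer;

    final class ReplayingBuffer<T> {
      private final Consumer<T> consumer;
      private final List<T> recordedEvents = new ArrayList<>();
      private boolean ready;

      ReplayingBuffer(Consumer<T> consumer) {
        this.consumer = consumer;
      }

      synchronized void onEvent(T event) {
        if (ready) {
          consumer.accept(event);    // pass through once ready
        } else {
          recordedEvents.add(event); // buffer while started
        }
      }

      synchronized void markReady() {
        ready = true;
        try {
          for (T event : recordedEvents) {
            consumer.accept(event);  // replay everything buffered so far
          }
        } finally {
          recordedEvents.clear();    // the fix: stop pinning the buffered events (and their nodes)
        }
      }
    }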

Comment on lines +176 to +177
clearMetrics();
cancelMetricsExpirationTimeout();
Contributor:

What's the reasoning for this change?

Contributor (Author):

This is a lambda for the timeout. Even after the timeout has fired and the lambda has run, the Timer object is not collected, and it still holds a reference to this until it is canceled.
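
A sketch of the cancellation pattern being described, assuming a Netty io.netty.util.Timeout backs the expiration timeout (the field name below is hypothetical; cancelMetricsExpirationTimeout is the method added in this PR):

    private volatile Timeout metricsExpirationTimeout; // io.netty.util.Timeout, hypothetical field

    private void cancelMetricsExpirationTimeout() {
      Timeout timeout = this.metricsExpirationTimeout;
      if (timeout != null) {
        // Cancelling lets the timer drop its reference to the task, whose lambda captures `this`.
        timeout.cancel();
      }
    }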

long threshold = now - RESPONSE_COUNT_RESET_INTERVAL_NANOS;
long leastRecent = array.get(0);
return leastRecent - threshold < 0;
} else return true;
Contributor:

Style nit: Invert the condition and use an early-return if response rate is insufficient, so you don't have else return true

protected final Map<Node, Long> upTimes = new ConcurrentHashMap<>();
private final boolean avoidSlowReplicas;

public DefaultLoadBalancingPolicy(@NonNull DriverContext context, @NonNull String profileName) {
super(context, profileName);
this.avoidSlowReplicas =
profile.getBoolean(DefaultDriverOption.LOAD_BALANCING_POLICY_SLOW_AVOIDANCE, true);
CacheLoader<Node, AtomicLongArray> cacheLoader =
Contributor:

Style nit: use a separate class for the cache value here, rather than using AtomicLongArray as a generic container. Seems like it can be something like NodeResponseRateSample, with methods like boolean hasSufficientResponses. I see this was present in the previous implementation, so not a required change for this PR, just something I noticed.
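
A sketch of the suggested value class (names taken from this comment; the two-slot timestamp layout is inferred from the hunks quoted above, so treat this as an illustration only):

    final class NodeResponseRateSample {
      // Holds the two most recent response timestamps, oldest first, as in the current array.
      private final AtomicLongArray timestamps = new AtomicLongArray(2);

      void recordResponse(long nowNanos) {
        timestamps.set(0, timestamps.get(1));
        timestamps.set(1, nowNanos);
      }

      boolean hasSufficientResponses(long nowNanos, long windowNanos) {
        long leastRecent = timestamps.get(0);
        return leastRecent != 0 && nowNanos - leastRecent <= windowNanos;
      }
    }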

@absurdfarce (Contributor)

Very much agreed that the underlying issue here appears to be an issue with AWS Keyspaces @aratno; that's being addressed in a different ticket. The scope of this change is around preventing the (potentially indefinite) caching of Node instances within an LBP.

return array;
}
};
this.responseTimes = CacheBuilder.newBuilder().weakKeys().build(cacheLoader);
Contributor:

I think we should add a RemovalListener here.

If a GC happens and response times for a Node are purged, then we'll end up treating that as "insufficient responses" in isResponseRateInsufficient, which can lead us to mark a node as unhealthy. I recognize that this is a bit of a pathological example, but this behavior does depend on GC timing and would be a pain to track down, so adding logging could make someone's life easier down the line.

Contributor (Author):

Thank you for your review. Would you please explain more about this?
If GC collects a node, that means the node is gone. If the node is gone, why do we care about whether it's treated as healthy or not?
Anyway, for the RemovalListener, do you mean something like this?

    this.responseTimes = CacheBuilder.newBuilder().weakKeys().removalListener(
            (RemovalListener<Node, AtomicLongArray>) notification -> 
                    LOG.trace("[{}] Evicting response times for {}: {}", 
                            logPrefix, notification.getKey(), notification.getCause()))
            .build(cacheLoader);

Contributor (Author):

@aratno Hi Abe, thank you for your review. Is there any update?
