Fix for #1520: Add support for retry on transactions for resharding #2062

dusty9023 · 2022-04-03T18:56:01Z

A proposed solution for #1520. I ask that the proposed change be thoroughly reviewed by those who are experienced with the repo, and the Redis Transaction code in the library.

Investigation:
To investigate, I took one of the many transaction errors we saw when increasing/decreasing the number of shards in our cluster, and used the associated keys and nodes to replay the exact same transaction under the debugger.

The error for the transaction EXEC was the same as what we saw in our logs:

EXECABORT Transaction discarded because of previous errors.

Looking at the InnerOperations of the TransactionMessage, I was surprised to find that the ResultBox was null, and the result was already lost by the time the SetResult() for transaction message was called.

I eventually tracked the cause down to the base Message class's Complete() method. It removes the ResultBox to ensure it can't be completed twice, but that also means the results of the InnerOperations aren't available when the transaction result is being processed.

This required additional changes to ensure that the result of the InnerOperations were available to the TransactionMessage when it was processing the result, but that they were also cleaned up when the TransactionMessage itself was complete.
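
To illustrate the lifecycle idea described above, here is a minimal sketch with toy types. The `Toy*` classes and their members are hypothetical stand-ins, not the library's Message, QueuedMessage or ResultBox implementations; the only point is the ordering, where inner results stay readable until the owning transaction has finished.

```csharp
using System;
using System.Collections.Generic;

// Toy types for illustration only; not the library's real classes.
class ToyResultBox
{
    public string Error;
}

class ToyQueuedMessage
{
    public ToyResultBox ResultBox { get; private set; } = new ToyResultBox();

    // Called when the queued command's own reply arrives.
    // The box is deliberately kept so the transaction can still read it.
    public void Complete() { }

    // Called by the owning transaction once it has finished processing.
    public void TransactionComplete() => ResultBox = null;
}

class ToyTransactionMessage
{
    public List<ToyQueuedMessage> InnerOperations { get; } = new List<ToyQueuedMessage>();

    // Gather inner errors while the boxes are still attached.
    public string CollectInnerErrors()
    {
        var errors = new List<string>();
        foreach (var op in InnerOperations)
        {
            if (op.ResultBox?.Error != null) errors.Add(op.ResultBox.Error);
        }
        return string.Join("; ", errors);
    }

    // Release every inner box only after the transaction result is handled.
    public void Complete()
    {
        foreach (var op in InnerOperations) op.TransactionComplete();
    }
}
```

In this sketch, cleanup is owned by the transaction rather than by each inner message, which is the trade-off the change makes.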

Digging into the ResultProcessor also showed that it relies on the RawResult error message following a very particular format in order to detect, and thus retry, messages whose hashslot has been moved.

Based on these facts, the RawResult error message had to be changed in two ways:

Prefix "MOVED (HASHSLOT) (endpoint) " to the entire raw result error if we determine that the ONLY reason the transaction failed was a single hashslot being moved.
Aggregate all the InnerOperation error/fault messages and append them to the existing RawResult from the EXEC call.
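
As a rough illustration of why the prefix matters, the sketch below detects a leading "MOVED (HASHSLOT) (endpoint) " in an error string. `MovedPrefix.TryParse` is a hypothetical helper written for this example; it is not the ResultProcessor code, and it assumes the slot and endpoint are separated by single spaces.

```csharp
using System;

static class MovedPrefix
{
    // Hypothetical helper: returns true when the error text starts with
    // "MOVED <slot> <endpoint>", and extracts the slot and endpoint.
    public static bool TryParse(string error, out int slot, out string endpoint)
    {
        slot = -1;
        endpoint = null;
        if (error == null || !error.StartsWith("MOVED ", StringComparison.Ordinal))
        {
            return false;
        }

        // Split into at most 4 parts: "MOVED", slot, endpoint, remainder.
        string[] parts = error.Split(new[] { ' ' }, 4);
        if (parts.Length < 3 || parts[2].Length == 0 || !int.TryParse(parts[1], out slot))
        {
            slot = -1;
            return false;
        }

        endpoint = parts[2];
        return true;
    }
}
```

An error that begins with "EXECABORT" fails this check, which is why the MOVED information has to lead the message for a format-based retry check to see it.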

This required changes to the TransactionMessage and QueueMessage's lifecycle (Complete() method), relying heavily on the TransactionMessage Complete() to mark the QueueMessages actually completed.

Important Notes:

I attempted to write unit tests for these changes that didn't rely on a cluster, but most of the nested classes involved are private. Changing those to internal also revealed that many of the dependencies (ServerEndPoint, PhysicalBridge, PhysicalConnection, ...) are sealed and cannot be mocked with Moq.
Testing these changes requires a Redis cluster with at least two primaries: keys are written to one primary, and then the hashslot holding that key is moved to the other primary.
I've also added/updated a functional test to cover this case.
I noticed the "src/StackExchange.Redis/PublicAPI.Unshipped.txt", and "src/StackExchange.Redis/PublicAPI.Shipped.txt", but I'm not aware of what process you follow for moving an API from being considered Unshipped to Shipped.

Due to the challenges in testing, I resorted to the following for testing:

Taking a transaction that had failed in our pre-production cluster due to re-sharding, and then retrying it again using these changes.
After proving the solution worked on a smaller scale, we deployed the change to our pre-production environment, and scaled our cluster (resharding) from 25 to 30 shards, while having active traffic on the cluster.

During this time we saw no errors in our logs.


dusty9023 · 2022-04-03T22:31:01Z

src/StackExchange.Redis/RedisTransaction.cs

@@ -56,6 +58,12 @@ public Task<bool> ExecuteAsync(CommandFlags flags)
            return base.ExecuteAsync(msg, proc); // need base to avoid our local wrapping override
        }

+        internal bool ExecuteInternal(CommandFlags flags, ServerEndPoint endpoint = null)


Not a huge fan of this, but I also didn't want to change the public API of this for the sake of making a functional test work.

dusty9023 · 2022-04-03T22:42:12Z

tests/StackExchange.Redis.Tests/Cluster.cs

@@ -188,7 +188,69 @@ static string StringGet(IServer server, RedisKey key, CommandFlags flags = Comma
                    string e = StringGet(conn.GetServer(node.EndPoint), key);
                    Assert.Equal(value, e); // wrong replica, allow redirect

-                    var ex = Assert.Throws<RedisServerException>(() => StringGet(conn.GetServer(node.EndPoint), key, CommandFlags.NoRedirect));
+                    var ex = Assert.Throws<RedisHashslotMigratedAndNoRedirectException>(() => StringGet(conn.GetServer(node.EndPoint), key, CommandFlags.NoRedirect));


I'm not sure how much of an impact switching from RedisServerException to RedisHashslotMigratedAndNoRedirectException will have; the conditions that trigger this error are pretty specific.

One way to mitigate the impact would be to make RedisHashslotMigratedAndNoRedirectException a subclass of RedisServerException (this requires unsealing the class).
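
A minimal sketch of that mitigation, using stand-in types: `ToyServerException` and `ToyHashslotMigratedException` are hypothetical and only mimic the relationship being proposed. It shows that caller code catching the base type keeps working when the more specific type is thrown.

```csharp
using System;

// Stand-ins for illustration only; not the library's exception types.
class ToyServerException : Exception
{
    public ToyServerException(string message) : base(message) { }
}

class ToyHashslotMigratedException : ToyServerException
{
    public ToyHashslotMigratedException(string message) : base(message) { }
}

static class CatchDemo
{
    // Mimics existing caller code that only knows about the base type.
    public static string Classify(Action action)
    {
        try
        {
            action();
            return "ok";
        }
        catch (ToyServerException ex)
        {
            return "server error: " + ex.Message;
        }
    }
}
```

Note that xUnit's Assert.Throws<T> checks for the exact exception type, so a test asserting the base type would still need updating (or Assert.ThrowsAny<T>) even with subclassing.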

dusty9023 · 2022-04-03T22:45:55Z

src/StackExchange.Redis/RedisTransaction.cs

+                // still need to activate continuations for GetMessages(),
+                // which might be waiting for the last innerOperation to
+                // complete.
+                ResultBox?.ActivateContinuations();


If I understood the logic behind this correctly, this needs to be fired for the last section of the GetMessages() logic to work, where it waits on the result box of the last innerOperation.

This ultimately means that a QueuedMessage will fire/pulse twice in its lifetime: once when it's marked Complete() after it finishes its call to Redis, and again when TransactionComplete() is called as the TransactionMessage runs its own Complete() (which calls TransactionComplete() on all its innerOperations, each calling base.Complete()).

NickCraver · 2022-04-04T12:55:24Z

Hey @dusty9023 - I won't have time to get to this until tonight at the earliest for a really in-depth pass (this needs some thinking time), but wanted to say thanks a ton for a great write-up here. We'll evaluate and see if this works as-is or if we need some tweaks but your investigation and notes are a tremendous help and time save here - thank you!

NickCraver · 2022-04-05T01:59:02Z

Alrighty, trying to grok the issue/changes here (thanks a ton for descriptions and example code!). I understand what you're hitting, but would want to take a much simpler approach to solving it if possible - hope to get to it this week but depends what all else eats time (trying to get a few things in).

Dustin Durand added 6 commits April 3, 2022 12:05

Fix for StackExchange#1520: Add support for retry on transactions for…

9d35473

… resharding

Updating Test Case with new RedisHashslotMigratedAndNoRedirectException

e9525e0

Updating Test Case with new RedisHashslotMigratedAndNoRedirectException

f5f8190

Adding unit test for transactions run on the wrong primary

b76f6bf

- Tests the use case where a transaction is run against the wrong primary node, and it automatically redirects to the correct one - RedisHashslotMigratedAndNoRedirectException is thrown if a NoRedirect is thrown

Correcting issue with new Unit test IntentionalWrongServerForTransaction

23d9972

Further tweaks to IntentionalWrongServerForTransaction unit test

387490e

Attempting another method for forcing a transaction to run against a specific endpoint
