
Transient fault tolerance and implicit retry support discussion #399

Closed
Plasma opened this issue Apr 27, 2016 · 11 comments

@Plasma

Plasma commented Apr 27, 2016

The Problem

We're a heavy user of Azure Redis Cache, and the platform will sometimes (e.g. once a month) reboot the underlying host OS for platform updates, causing our primary Redis cache instance to go down. The secondary instance Azure runs takes over, but only after a brief window during which several commands fail.

When these events happen, sockets are disconnected, commands fail, timeouts momentarily occur, and SE.Redis rightfully throws the corresponding exceptions.

Here's an example of the exceptions we may see during this time:

[RedisConnectionException: SocketFailure on SMEMBERS]
...
Message: [IOException: Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host.]
System.Net.Security._SslStream.EndRead(IAsyncResult asyncResult):174
StackExchange.Redis.PhysicalConnection.EndReading(IAsyncResult result):17
Message: [SocketException: An existing connection was forcibly closed by the remote host]
System.Net.Sockets.NetworkStream.EndRead(IAsyncResult asyncResult):99

And:

Message: [RedisConnectionException: No connection is available to service this operation: EXEC]
StackExchange.Redis.ConnectionMultiplexer.ThrowFailed[T](TaskCompletionSource`1 source, Exception unthrownException):0
System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw():12
System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task):41
System.Runtime.CompilerServices.TaskAwaiter`1.GetResult():11
... app code ...

And:

Message: [TimeoutException: Timeout performing EXEC, inst: 1, mgr: Inactive, err: never, queue: 102, qu: 0, qs: 102, qc: 0, wr: 0, wq: 0, in: 0, ar: 0, IOCP: (Busy=0,Free=1000,Min=4,Max=1000), WORKER: (Busy=3,Free=32764,Min=4,Max=32767), clientName: RD0003FFAD174E]
StackExchange.Redis.ConnectionMultiplexer.ExecuteSyncImpl[T](Message message, ResultProcessor`1 processor, ServerEndPoint server):949
StackExchange.Redis.RedisBase.ExecuteSync[T](Message message, ResultProcessor`1 processor, ServerEndPoint server):34
StackExchange.Redis.RedisTransaction.Execute(CommandFlags flags):14

Solution proposal

Azure, Amazon, and other cloud platforms all take the position that clients need to handle transient faults. SQL Azure drivers handle this with built-in retry support on the app-code side, but the recommended Redis driver (SE.Redis) has no command retry support to shield app code from these transient faults.

It's not an easy problem to solve: not all commands are safe to retry, because non-idempotent operations (e.g. INCR) may or may not have already succeeded before the failure.

I am wondering what the thoughts are about how to best approach a solution to this problem.

Driver-level support in SE.Redis for a "command retry" option that blindly retries all commands (or optionally only idempotent commands like SET or GET) on any connection/timeout failure, for up to X retries, would be a pretty good initial solution.
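To make the idea concrete, a minimal application-level sketch of such a retry wrapper might look like the following; the helper name, retry count, and back-off values are illustrative, not anything SE.Redis provides today:

```csharp
using System;
using System.Threading.Tasks;
using StackExchange.Redis;

public static class RedisRetry
{
    // Retries an async Redis operation on connection/timeout failures with a small linear back-off.
    public static async Task<T> ExecuteAsync<T>(Func<Task<T>> operation, int maxRetries = 3)
    {
        for (var attempt = 0; ; attempt++)
        {
            try
            {
                return await operation();
            }
            catch (Exception ex) when (attempt < maxRetries &&
                                       (ex is RedisConnectionException || ex is TimeoutException))
            {
                // Give the failover a moment to complete before retrying.
                await Task.Delay(TimeSpan.FromMilliseconds(200 * (attempt + 1)));
            }
        }
    }
}

// Usage, assuming db is an IDatabase from ConnectionMultiplexer.GetDatabase():
// RedisValue[] members = await RedisRetry.ExecuteAsync(() => db.SetMembersAsync("some-set"));
```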

Ideally, there would be a Redis-level "this server is shutting down" notification that warns the driver to pause sending commands for a few moments while the underlying secondary takes over, but that's a more coordinated solution also involving the Redis team.

Thoughts?

@sajad-deyargaroo

sajad-deyargaroo commented May 3, 2016

There is a retry mechanism in SE.Redis; we can use the options below in the connection string:

abortConnect=false
connectRetry=3000
connectTimeout=600000
syncTimeout=600000

Below is a sample connection string:

"contoso.redis.cache.windows.net,abortConnect=false,connectRetry=3000,connectTimeout=600000,syncTimeout=600000,ssl=true,password=weweweweweZNw1L4bIo0DgPxD9ytdwewe="

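For reference, the same settings can also be supplied programmatically via ConfigurationOptions rather than a raw connection string; the values below mirror the string above and are illustrative only (note that connectRetry is a count of initial connection attempts, not a number of milliseconds):

```csharp
using StackExchange.Redis;

var options = new ConfigurationOptions
{
    EndPoints = { "contoso.redis.cache.windows.net" },
    AbortOnConnectFail = false, // abortConnect=false: don't fail hard if the initial connect fails
    ConnectRetry = 3,           // number of initial connect attempts (a count, not a time)
    ConnectTimeout = 600000,    // ms allowed for connect operations
    SyncTimeout = 600000,       // ms allowed for synchronous operations
    Ssl = true,
    Password = "..."            // omitted
};

var muxer = ConnectionMultiplexer.Connect(options);
```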
@coronag

coronag commented May 3, 2016

Using: StackExchange.Redis and ServiceStack.Redis

For documentation about the "connectRetry" parameter, I found this:
// RetryTimeout = 3000, (default 3000ms) // To improve the resilience of client connections, RedisClient will transparently retry failed Redis operations due to Socket and I/O Exceptions in an exponential backoff starting from 10ms up until the RetryTimeout of 3000ms. These defaults can be tweaked with: RedisConfig.DefaultRetryTimeout = 3000; RedisConfig.BackOffMultiplier = 10;

=> I have already hit the same issue at times, and that was with the default value of 3000ms.

This will not solve the problem described (it is not a transient-fault mechanism), but could it help a little?
I was wondering about the impact of changing that default value to a higher one (5 seconds, for instance). I don't want it to decrease performance if this is happening at a higher rate than expected.
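Tweaking those defaults as the quoted documentation describes would look roughly like this (these are ServiceStack.Redis settings; the values are examples):

```csharp
using ServiceStack.Redis;

// ServiceStack.Redis built-in retry tuning, per the documentation quoted above (example values):
RedisConfig.DefaultRetryTimeout = 5000; // keep retrying failed operations for up to 5 seconds
RedisConfig.BackOffMultiplier = 10;     // exponential back-off starting from 10 ms
```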

@sajad-deyargaroo

sajad-deyargaroo commented May 3, 2016

connectRetry specifies the number of connect attempts during the initial connect and is not a time value; abortConnect specifies whether connection retries should happen at all; connectTimeout and syncTimeout are the timeouts for connect and synchronous operations respectively.

Also, you are using Azure Redis Cache, and the recommended client for it is StackExchange.Redis, but the documentation section you pasted above appears to be from ServiceStack.Redis. Can you please confirm which Redis client you are using?

@iangriffin

The original discussion was about retrying operations, not connections. It is my understanding that connectRetry, connectTimeout and abortConnect relate to retrying the actual connection, not the operation. Retrying the get/set operations would be extremely helpful and I'm currently looking for solutions before building one from scratch.

I can't find anything about syncTimeout.

@ericoldre

bump for this question.
Looking for guidance on retrying SET or GET operations in the event there is a temporary network issue that causes a SocketException.

We are migrating from the ServiceStack client to the StackExchange client. In the code we are replacing, which used ServiceStack, we caught exceptions and retried operations after a short Thread.Sleep; on most occasions the retry would work.

If there is a network issue that causes a System.Net.SocketException such as "An established connection was aborted by the software in your host machine" or "An existing connection was forcibly closed by the remote host" does StackExchange.Redis automatically retry up until the syncTimeout time has elapsed?

If not, are there any suggested steps that should happen between the initial failure and a retry in our code? Such as:

  • recreating the multiplexer? (I'd guess not)
  • waiting a short amount of time?
  • calling Close() and Configure()?

Just for clarification, I am talking about network issues when attempting StringSet or StringGet. Not when trying to initially connect to the Redis server.
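For context, the catch-and-retry pattern described above (catch the transient exception, pause briefly, then retry the same operation against the existing multiplexer rather than recreating it) might look roughly like this with StackExchange.Redis; the delay and attempt count are arbitrary:

```csharp
using System;
using System.Threading;
using StackExchange.Redis;

static RedisValue StringGetWithRetry(IDatabase db, string key, int maxAttempts = 3)
{
    for (var attempt = 1; ; attempt++)
    {
        try
        {
            // Reuse the existing IDatabase/multiplexer; the multiplexer reconnects in the background.
            return db.StringGet(key);
        }
        catch (RedisConnectionException) when (attempt < maxAttempts)
        {
            Thread.Sleep(500); // brief pause before retrying
        }
    }
}
```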

@NickCraver
Collaborator

I know this is an old issue, and something we haven't gotten to, but @deepakverma is now working on approaches here. Expect some retry semantics to be configurable in an upcoming release.

@Plasma
Author

Plasma commented Jun 28, 2021

@NickCraver I believe the Azure Redis team has thought about retry handling and failover-notification support for drivers before; just mentioning it in case your team wants to reach out to them to discuss ideas and come up with a convention others can follow too, so that planned failovers go more smoothly.

@NickCraver
Collaborator

@Plasma Deepak's on that team ;) We are indeed syncing with them weekly to get more quality of life things in.

@dariusonsched

@NickCraver @deepakverma may we get an update of your progress on this issue? Thanks!

@NickCraver
Collaborator

@dariusonsched Marc and I have been slammed but are looking into two things here: 1) a backlog/retry policy (see #1912), and 2) a thread-stall issue related to that, which has us considering defaulting to the built-in thread pool for the socket manager in the 2.5.x release on .NET 6.0+ environments (which have some of the sync-over-async protections the dedicated socket manager was originally designed around).

@NickCraver
Collaborator

For anyone curious, this is happening in #1912 and will be available in the v2.5 release :)
