
Transient fault tolerance and implicit retry support discussion #399

Closed
Plasma opened this issue Apr 27, 2016 · 11 comments

@Plasma

Plasma commented Apr 27, 2016

The Problem

We're a heavy user of Azure Redis Cache, and the platform will sometimes (e.g. once a month) reboot the underlying host OS for platform updates, causing our primary Redis cache instance to go down. The secondary instance Azure runs takes over, but only after a brief window during which several commands fail.

When these events happen, sockets are disconnected, commands fail, timeouts momentarily occur, and SE.Redis rightfully throws the corresponding exceptions.

Here's an example of the exceptions we may see during this time:

[RedisConnectionException: SocketFailure on SMEMBERS]
...
Message: [IOException: Unable to read data from the transport connection: An existing connection was forcibly closed by the remote host.]
System.Net.Security._SslStream.EndRead(IAsyncResult asyncResult):174
StackExchange.Redis.PhysicalConnection.EndReading(IAsyncResult result):17
Message: [SocketException: An existing connection was forcibly closed by the remote host]
System.Net.Sockets.NetworkStream.EndRead(IAsyncResult asyncResult):99

And:

Message: [RedisConnectionException: No connection is available to service this operation: EXEC]
StackExchange.Redis.ConnectionMultiplexer.ThrowFailed[T](TaskCompletionSource`1 source, Exception unthrownException):0
System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw():12
System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task task):41
System.Runtime.CompilerServices.TaskAwaiter`1.GetResult():11
... app code ...

And:

Message: [TimeoutException: Timeout performing EXEC, inst: 1, mgr: Inactive, err: never, queue: 102, qu: 0, qs: 102, qc: 0, wr: 0, wq: 0, in: 0, ar: 0, IOCP: (Busy=0,Free=1000,Min=4,Max=1000), WORKER: (Busy=3,Free=32764,Min=4,Max=32767), clientName: RD0003FFAD174E]
StackExchange.Redis.ConnectionMultiplexer.ExecuteSyncImpl[T](Message message, ResultProcessor`1 processor, ServerEndPoint server):949
StackExchange.Redis.RedisBase.ExecuteSync[T](Message message, ResultProcessor`1 processor, ServerEndPoint server):34
StackExchange.Redis.RedisTransaction.Execute(CommandFlags flags):14

Solution proposal

Azure, Amazon, and other cloud platforms all take the position that clients need to handle transient faults. SQL Azure drivers handle this with built-in retry support on the app-code side, but the recommended Redis driver (SE.Redis) has no command retry support to shield app code from these transient faults.

It's not an easy problem to solve: not all commands are safe to retry, because non-idempotent operations (e.g. INCR) may or may not have already succeeded before the failure.

I am wondering what the thoughts are about how to best approach a solution to this problem.

Driver-level support in SE.Redis for a "command retry" option that blindly retries all commands (or optionally only idempotent commands like SET or GET) on any connection/timeout failure, for up to X retries, would be a pretty good initial solution.
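To make the idea concrete, a minimal application-level sketch of such a retry wrapper might look like the following; the helper name, retry count, and back-off values are illustrative, not anything SE.Redis provides today:

```csharp
using System;
using System.Threading.Tasks;
using StackExchange.Redis;

public static class RedisRetry
{
    // Retries an async Redis operation on connection/timeout failures with a small linear back-off.
    public static async Task<T> ExecuteAsync<T>(Func<Task<T>> operation, int maxRetries = 3)
    {
        for (var attempt = 0; ; attempt++)
        {
            try
            {
                return await operation();
            }
            catch (Exception ex) when (attempt < maxRetries &&
                                       (ex is RedisConnectionException || ex is TimeoutException))
            {
                // Give the failover a moment to complete before retrying.
                await Task.Delay(TimeSpan.FromMilliseconds(200 * (attempt + 1)));
            }
        }
    }
}

// Usage, assuming db is an IDatabase from ConnectionMultiplexer.GetDatabase():
// RedisValue[] members = await RedisRetry.ExecuteAsync(() => db.SetMembersAsync("some-set"));
```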

Ideally, there would be a Redis-level "this server is shutting down" notification that warns the driver to pause sending commands for a few moments while the underlying secondary takes over, but that's a more coordinated solution also involving the Redis team.

Thoughts?

@sajad-deyargaroo

sajad-deyargaroo commented May 3, 2016

There is a retry mechanism in SE.Redis; we can use the options below in the connection string:

abortConnect=false
connectRetry=3000
connectTimeout=600000
syncTimeout=600000

Below is a sample connection string:

"contoso.redis.cache.windows.net,abortConnect=false,connectRetry=3000,connectTimeout=600000,syncTimeout=600000,ssl=true,password=weweweweweZNw1L4bIo0DgPxD9ytdwewe="

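For reference, the same settings can also be supplied programmatically via ConfigurationOptions rather than a raw connection string; the values below mirror the string above and are illustrative only (note that connectRetry is a count of initial connection attempts, not a number of milliseconds):

```csharp
using StackExchange.Redis;

var options = new ConfigurationOptions
{
    EndPoints = { "contoso.redis.cache.windows.net" },
    AbortOnConnectFail = false, // abortConnect=false: don't fail hard if the initial connect fails
    ConnectRetry = 3,           // number of initial connect attempts (a count, not a time)
    ConnectTimeout = 600000,    // ms allowed for connect operations
    SyncTimeout = 600000,       // ms allowed for synchronous operations
    Ssl = true,
    Password = "..."            // omitted
};

var muxer = ConnectionMultiplexer.Connect(options);
```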
@coronag

coronag commented May 3, 2016

Using: StackExchange.Redis and ServiceStack.Redis

For documentation about the "connectRetry" parameter, I found this:
// RetryTimeout = 3000, (default 3000ms) // To improve the resilience of client connections, RedisClient will transparently retry failed Redis operations due to Socket and I/O Exceptions in an exponential backoff starting from 10ms up until the RetryTimeout of 3000ms. These defaults can be tweaked with: RedisConfig.DefaultRetryTimeout = 3000; RedisConfig.BackOffMultiplier = 10;

=> I have already hit the same issue at times, and that was with the default value of 3000ms.

This will not solve the problem described (it is not a transient-fault mechanism), but could it help a little?
I was wondering about the impact of changing that default value to a higher one (5 seconds, for instance). I don't want it to decrease performance if this is happening at a higher rate than expected.
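Tweaking those defaults as the quoted documentation describes would look roughly like this (these are ServiceStack.Redis settings; the values are examples):

```csharp
using ServiceStack.Redis;

// ServiceStack.Redis built-in retry tuning, per the documentation quoted above (example values):
RedisConfig.DefaultRetryTimeout = 5000; // keep retrying failed operations for up to 5 seconds
RedisConfig.BackOffMultiplier = 10;     // exponential back-off starting from 10 ms
```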

@sajad-deyargaroo

sajad-deyargaroo commented May 3, 2016

connectRetry specifies the number of connect attempts during the initial connect and is not a time value; abortConnect specifies whether connection retries should happen at all; connectTimeout and syncTimeout are the timeouts for connect and synchronous operations respectively.

Also, you are using Azure Redis Cache, and the recommended client for it is StackExchange.Redis, but the documentation section you pasted above appears to be from ServiceStack.Redis. Can you please confirm which Redis client you are using?

@iangriffin

The original discussion was about retrying operations, not connections. It is my understanding that connectRetry, connectTimeout and abortConnect relate to retrying the actual connection, not the operation. Retrying the get/set operations would be extremely helpful and I'm currently looking for solutions before building one from scratch.

I can't find anything about syncTimeout.

@ericoldre

bump for this question.
Looking for guidance on retrying SET or GET operations in the event there is a temporary network issue that causes a SocketException.

We are migrating from the ServiceStack client to the StackExchange client. In the code we are replacing, which used ServiceStack, we caught exceptions and retried operations after a short Thread.Sleep; on most occasions the retry would work.

If there is a network issue that causes a System.Net.SocketException such as "An established connection was aborted by the software in your host machine" or "An existing connection was forcibly closed by the remote host" does StackExchange.Redis automatically retry up until the syncTimeout time has elapsed?

If not, are there any suggested steps that should happen between the initial failure and a retry in our code? Such as:

  • recreating the multiplexer? (I'd guess not)
  • waiting a short amount of time?
  • calling Close() and Configure()?

Just for clarification, I am talking about network issues when attempting StringSet or StringGet. Not when trying to initially connect to the Redis server.
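For context, the catch-and-retry pattern described above (catch the transient exception, pause briefly, then retry the same operation against the existing multiplexer rather than recreating it) might look roughly like this with StackExchange.Redis; the delay and attempt count are arbitrary:

```csharp
using System;
using System.Threading;
using StackExchange.Redis;

static RedisValue StringGetWithRetry(IDatabase db, string key, int maxAttempts = 3)
{
    for (var attempt = 1; ; attempt++)
    {
        try
        {
            // Reuse the existing IDatabase/multiplexer; the multiplexer reconnects in the background.
            return db.StringGet(key);
        }
        catch (RedisConnectionException) when (attempt < maxAttempts)
        {
            Thread.Sleep(500); // brief pause before retrying
        }
    }
}
```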

@NickCraver
Collaborator

I know this is an old issue, and something we haven't gotten to, but @deepakverma is now working on approaches here. Expect some retry semantics to be configurable in an upcoming release.

@Plasma
Author

Plasma commented Jun 28, 2021

@NickCraver I believe the Azure Redis team has thought about retry handling and failover-notification support for drivers before; just mentioning it in case your team wants to reach out to them to discuss ideas and come up with a convention others can follow too, so that planned failovers go more smoothly.

@NickCraver
Collaborator

@Plasma Deepak's on that team ;) We are indeed syncing with them weekly to get more quality of life things in.

@dariusonsched

@NickCraver @deepakverma may we get an update of your progress on this issue? Thanks!

@NickCraver
Collaborator

@dariusonsched Marc and I have been slammed but are looking into two things here: 1) a backlog/retry policy (see #1912), and 2) a thread-stall issue related to that, which has us considering defaulting to the built-in thread pool for the socket manager in the 2.5.x release on .NET 6.0+ environments (which have some of the sync-over-async protections the dedicated socket manager was originally designed around).

@NickCraver
Collaborator

For anyone curious, this is happening in #1912 and will be available in the v2.5 release :)
