Retry on timeouts #551

Open
wants to merge 1 commit into master

Conversation

cdegroot
Contributor

We saw some aggregate execution timeouts that we still need to dig into. What is most likely happening is that a "big" aggregate is started and begins rehydration; the command then gets queued up and times out after five seconds because the rehydration is still going on.

Timeouts in this spot should be retryable, which is what this patch implements.

Also fixed `maybe_retry`, as it just returned the error instead of properly going through `Pipeline.respond`; this was, as far as I can tell, a latent bug.
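A minimal sketch of the idea, not the actual patch: the module, the zero-arity dispatch function, and the `:aggregate_execution_timeout` error term below are illustrative assumptions rather than verified Commanded internals.

```elixir
# Illustrative only: a bounded retry when aggregate execution times out,
# e.g. because the aggregate is still rehydrating a long event stream.
defmodule RetryOnTimeoutSketch do
  @max_retries 2

  def execute(dispatch_fun, retries \\ @max_retries) do
    case dispatch_fun.() do
      {:error, :aggregate_execution_timeout} when retries > 0 ->
        # The aggregate is probably still rehydrating; give it another chance.
        execute(dispatch_fun, retries - 1)

      other ->
        # Success, or a non-retryable error: pass the result through unchanged.
        other
    end
  end
end
```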

@jwilger
Contributor

jwilger commented Feb 17, 2024

Given that, as @slashdotdash said in the Slack conversation, this is functionally the same as just doubling the timeout requested when you dispatch, can you explain what the utility is of adding the retry?

Why not just set your execution timeout to 10s?

The dispatch will only take as long as it needs to, so the timeout specifies the maximum amount of time you are willing to wait for a reply. It seems like this change would just quietly double that maximum. Personally, I prefer to be explicit about it.
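For reference, being explicit at the call site might look like this; the application and command modules are hypothetical, and only the `timeout:` dispatch option (which defaults to five seconds) is the point:

```elixir
# Hypothetical application and command modules. Wait up to 10 seconds instead
# of the 5-second default, making the larger maximum explicit where the
# command is dispatched.
command = %BankApp.Commands.OpenAccount{account_number: "ACC-123"}

BankApp.dispatch(command, timeout: :timer.seconds(10))
```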

@cdegroot
Contributor Author

Given that command dispatching has taken on the responsibility for doing automatic retries in certain cases, my take is that this is exactly such a case where an automatic retry is warranted. If we don't think retries should happen at all, then all retries should be removed; but as long as they do, "this is what we retry on" should be a reasonably consistent set of retryable errors, and this one, being caused by slow rehydration, falls into that category.

Yes, I am aware that categorizing things is hard and always up for debate, but that's my take on it :)

(I still need to figure out why rehydration is that slow now and then, but that's what the telemetry PR should hopefully help us with. Both PRs are currently on our main branch.)

@drteeth
Contributor

drteeth commented Feb 18, 2024

At this point in the command dispatch we have either located the existing process for the aggregate or have just started one. This could be local or on another node in a cluster.

The three cases that I see are:

  1. Although we've located the process and sent it a message, it has since been stopped by its lifespan. We're retrying the execution so that a new instance of the aggregate process will start up and handle our command. We have high confidence that the retry will solve the issue, probably on the first retry.

  2. We've located the process on another node, but that node has since died, been removed from the cluster, or otherwise can't handle our dispatch. Retrying will spin up or find a new process on some healthy node that can dispatch for us. Again, we have high confidence that the retry will solve the issue.

  3. We've located the process and it timed out rehydrating itself, for a number of possible reasons:
    a) The network, database, CPU, or memory is congested, split, or otherwise unhappy, and we didn't get through the events this time. Retrying here at the right time will likely help; retrying at the wrong time may exacerbate the issue.
    b) The stream is too long or has too many large events, so we can't get through the events in time. Retrying here is unlikely to help, and if it does, the problem is likely to get worse over time.

I can see an argument for backing off and trying 3a again, perhaps only a low number of times. There's a chance retrying would result in success. If the issue is long-lived, there's nothing we can do except fail. I think there's also an argument to be made for this not being Commanded's problem.

Letting 3b fail and having the developer either address the long hydration time and/or band-aid it with a longer timeout is probably the right thing to do. Retrying here feels like papering over a problem. If a stop-gap is needed, the timeout can be extended as previously mentioned.

I'm still open to logging a warning and retrying a low number of times here. This would give the developer notice that a fix needs to be made while still honouring the request. At what point is this a) not Commanded's problem, b) overly defensive, or c) hiding a problem?

The trouble with 3) is that we can't know which of these we're dealing with, so we need to make a choice that covers both cases. I'm leaning towards failing sooner so problems don't hide and fester, but I could be convinced otherwise.
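If this ends up on the application side rather than inside Commanded, a sketch of the "warn and retry a low number of times, with backoff" option discussed above might look like the following; the wrapper module, the delay values, and the `{:error, :aggregate_execution_timeout}` return shape are assumptions, not Commanded's actual behaviour.

```elixir
defmodule DispatchWithBackoff do
  @moduledoc """
  Sketch only: retry a timed-out dispatch a small number of times with an
  increasing delay, logging a warning so the slow rehydration stays visible.
  """

  require Logger

  def dispatch(dispatch_fun, attempts \\ 3, delay_ms \\ 200) do
    case dispatch_fun.() do
      {:error, :aggregate_execution_timeout} when attempts > 1 ->
        Logger.warning("Aggregate execution timed out; retrying in #{delay_ms}ms")
        Process.sleep(delay_ms)
        dispatch(dispatch_fun, attempts - 1, delay_ms * 2)

      other ->
        other
    end
  end
end

# Usage with a hypothetical application and command:
#
#   DispatchWithBackoff.dispatch(fn ->
#     BankApp.dispatch(%BankApp.Commands.OpenAccount{account_number: "ACC-123"})
#   end)
```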

@cdegroot
Contributor Author

(btw, I'm fine with whichever way this goes. We should update the docs though; I think they currently specify only one reason to retry.)
