Retry on timeouts #551
base: master
Given that, as @slashdotdash said in the Slack conversation, this is functionally the same as just doubling the timeout requested when you dispatch, can you explain the utility of adding the retry? Why not just set your execution timeout to 10s? The dispatch will only take as long as it needs to, so the timeout specifies the maximum amount of time you are willing to wait for a reply. It seems like this change would just quietly double that maximum. Personally, I think I prefer to be explicit about it.
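The equivalence above can be made concrete: each attempt may block for the full timeout before failing, so every hidden retry adds one more timeout's worth of worst-case waiting. A minimal sketch (the function name and numbers are illustrative, not Commanded API):

```python
def effective_max_wait(timeout_s: float, retries: int) -> float:
    # Each attempt may block for the full timeout before giving up,
    # so every automatic retry adds one more timeout's worth of waiting.
    return timeout_s * (retries + 1)

# A caller dispatching with a 5 s timeout and one hidden retry may
# actually wait up to 10 s before seeing the timeout error.
print(effective_max_wait(5.0, 1))  # → 10.0
```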
So given that command dispatching has taken on the responsibility for doing automatic retries in certain cases, my take is that this is such a "certain case" where an automatic retry is warranted. If we don't think retries should happen at all, then they should all be removed; but if we do, then "what we retry on" should be a reasonably consistent set of retryable errors, and this one, caused by slow rehydration, falls into that category. Yes, I am aware that categorizing things is hard and always up for debate, but that's my take on it :) (I still need to figure out why rehydration is occasionally that slow, but that's what the telemetry PR should help us with; both PRs are currently on our main branch.)
At this point in the command dispatch we have either located the existing process for the aggregate or have just started one. This could be local or on another node in a cluster. The three cases that I see are:
I can see an argument for backing off and trying 3a again, perhaps only a low number of times; there's a chance a retry would succeed. If the issue is long-lived, there's nothing we can do except fail. There's also an argument to be made that this is not Commanded's problem: letting 3b fail and having the developer either address the long rehydration time and/or band-aid it with a longer timeout is probably the right thing to do. Retrying here feels like papering over a problem; if a stop-gap is needed, the timeout can be extended as previously mentioned. I'm still open to logging a warning and retrying a low number of times here, which would give the developer notice that a fix needs to be made while still honouring the request. At what point is this a) not Commanded's problem, b) overly defensive, c) hiding a problem? The trouble with 3) is that we can't know which case we're dealing with, so we need to make a choice that covers both. I'm leaning towards failing sooner rather than letting problems hide and fester, but could be convinced otherwise.
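The "log a warning and retry a low number of times" option could look something like the following. This is a hypothetical Python sketch, not Commanded's Elixir implementation; all names and defaults are made up. The point is that the timeout is still surfaced once retries are exhausted, and each retry is logged so the slow rehydration stays visible rather than being silently absorbed:

```python
import logging
import time

logger = logging.getLogger(__name__)

def dispatch_with_retry(dispatch, command, timeout_s=5.0, max_retries=2, backoff_s=0.1):
    """Retry a timed-out dispatch a bounded number of times, warning on each
    retry so the underlying slowness is not papered over silently."""
    for attempt in range(max_retries + 1):
        try:
            return dispatch(command, timeout_s)
        except TimeoutError:
            if attempt == max_retries:
                raise  # retries exhausted: surface the timeout to the caller
            logger.warning(
                "dispatch timed out after %.1fs (attempt %d of %d); retrying",
                timeout_s, attempt + 1, max_retries + 1,
            )
            time.sleep(backoff_s * (2 ** attempt))  # simple exponential backoff
```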
(btw, I'm fine with whichever way this goes. We should update the docs though; I think they currently specify only one reason to retry.)
We saw some aggregate execution timeouts which we still need to dig into, but what is most likely happening is that a "big" aggregate is started and begins rehydration, then the command gets queued up and times out after five seconds because the rehydration is still in progress.
Timeouts in this spot should be retryable, which is what this patch implements.
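Why a retry can succeed here: the first attempt's wait does not interrupt rehydration, so by the time the retry arrives the aggregate may already be ready. A toy simulation of that timeline (Python with made-up timings, not the actual aggregate code):

```python
import time

def make_slow_aggregate(rehydration_s: float):
    """Simulate an aggregate that cannot execute commands until its
    rehydration (started at creation time) has finished."""
    ready_at = time.monotonic() + rehydration_s

    def execute(command, timeout_s):
        remaining = ready_at - time.monotonic()
        if remaining > timeout_s:
            time.sleep(timeout_s)   # caller blocks for the full timeout...
            raise TimeoutError      # ...then gives up; rehydration continues
        time.sleep(max(remaining, 0.0))
        return f"handled {command}"

    return execute

# Rehydration takes 0.3 s but the dispatch timeout is 0.2 s: the first
# attempt times out, yet a single retry finds the aggregate ready.
```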
Also fixed `maybe_retry`, as it just returned the error instead of properly responding via `Pipeline.respond`; this was, as far as I can tell, a latent bug.
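The shape of that bug, in a hypothetical Python sketch (Commanded's actual code is Elixir, and the real `maybe_retry`/`Pipeline.respond` signatures differ): on the final attempt, the error still has to be turned into a proper pipeline response rather than being returned bare to the caller.

```python
def maybe_retry(context, error, respond, retry):
    """If retries remain, retry the dispatch; otherwise turn the error
    into a pipeline response. The latent bug was to `return error` here
    instead of responding through the pipeline."""
    if context["retries_left"] > 0:
        context["retries_left"] -= 1
        return retry(context)
    return respond(context, error)  # the fix: respond, don't return the raw error
```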