how to apply backpressure #1007

Open
petersilva opened this issue Apr 2, 2024 · 3 comments
Labels
Design — Problem hard to fix, affects design.
Developer — not a problem, more of a note to self for devs about work to do.
Discussion_Needed — developers should discuss this issue.
enhancement — New feature or request
NextRelease — Feature Targeted for Next Release
Priority 3 - Important — worrisome impact... should study carefully
Refactor — change implementation of existing functionality.
ReliabilityRecovery — improve behaviour in failure situations.
UserStory — interesting to read to consider improving
v3only — Only affects v3 branches.

Comments

@petersilva
Contributor

petersilva commented Apr 2, 2024

I was discussing with @reidsunderland a situation where a node in a cluster fails: we want the sender feeding that node to stop consuming, so that other consumers of the shared queue take over the entire load while this sender's downstream is broken.

https://en.wikipedia.org/wiki/Backpressure_routing

In v2, backpressure was applied naturally, since we processed one message at a time and simply kept trying to deliver or download the same item forever. If a delivery was failing, we would never loop back to consume more from the queue.

With sr3, we have both download_retry and post_retry queues, so that if individual transfers fail, we can keep going. The problem is that those retry queues can accumulate millions of files, and a failing node may even consume from the shared queue faster than healthy ones, because failures can be quicker to process than successful deliveries.

So... there needs to be some criterion for deciding when processing is going badly enough to stop consuming... that is, to apply backpressure to the upstream side.
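To illustrate why v2 applied backpressure for free, here is a sketch of a single-threaded, block-until-delivered loop; the function names are illustrative, not the actual v2 code:

```python
# Sketch of the v2-style behaviour described above: one message at a
# time, retrying the same item until it succeeds. While deliver() keeps
# failing, consume() is never called again, so unconsumed messages pile
# up on the broker and other consumers of the shared queue absorb them.
def v2_style_loop(consume, deliver):
    while True:
        msg = consume()          # only reached after the previous item succeeded
        if msg is None:
            break
        while not deliver(msg):  # stuck here as long as the destination is broken
            pass                 # (real code would sleep between retries)
```

sr3's retry queues deliberately break this coupling, which is what makes an explicit backpressure mechanism necessary.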

@petersilva petersilva changed the title how apply backpressure how to apply backpressure Apr 2, 2024
@petersilva petersilva added enhancement New feature or request Design Problem hard to fix, affects design. Developer not a problem, more of a note to self for devs about work to do. UserStory interesting to read to consider improving v3only Only affects v3 branches. Refactor change implementation of existing functionality. Discussion_Needed developers should discuss this issue. labels Apr 2, 2024
@petersilva
Contributor Author

ideas:

  • have a retry_max setting... if the retry module reports more entries in its queues than the maximum, then ask for back_pressure()
  • add a flowcb entry point back_pressure() -> bool; if it returns True, then don't call gather...
  • perhaps something to do with have_vip() also.
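A minimal sketch of how the first two ideas could fit together; the names retry_max and back_pressure(), the RetryQueues stand-in, and the loop shape are all assumptions for illustration, not the actual sr3 flowcb API:

```python
# Hypothetical sketch of a back_pressure() flow callback entry point
# combined with a retry_max threshold, as proposed above.

class RetryQueues:
    """Stand-in for sr3's download_retry and post_retry queues."""
    def __init__(self):
        self.download_retry = []
        self.post_retry = []

    def __len__(self):
        return len(self.download_retry) + len(self.post_retry)


class BackPressure:
    """Flow callback: request backpressure when the retry queues
    hold more entries than retry_max."""
    def __init__(self, retry_queues, retry_max=1000):
        self.retry_queues = retry_queues
        self.retry_max = retry_max

    def back_pressure(self) -> bool:
        # True means: too much queued locally, stop consuming upstream.
        return len(self.retry_queues) > self.retry_max


def gather_step(flowcbs, gather):
    """One iteration of a simplified flow loop: skip gather() if any
    callback asks for backpressure, leaving messages on the shared
    queue for other consumers."""
    if any(cb.back_pressure() for cb in flowcbs if hasattr(cb, 'back_pressure')):
        return None
    return gather()
```

Usage would look like: a flow with a BackPressure callback keeps gathering normally, then stops calling gather() once its retry backlog crosses retry_max, and resumes automatically as the backlog drains.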

@reidsunderland
Member

Those ideas sound good.

Something we might also want to think about is acking the messages. If we knew there was another node that could process the message/send the file, maybe it's better not to use the local retry queues at all, and simply not ack or nack messages when the transfer fails?

In that case, we might want to have sr3 automatically reduce messageRateMax and automatically reset it if the instance detects that the destination problem has been resolved.
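One way to sketch that automatic reduce-and-recover behaviour; the AIMD-style halve-on-failure/double-on-success policy and the class shape are my assumptions, not anything sr3 currently does with messageRateMax:

```python
# Hypothetical adaptive throttle for messageRateMax: halve the
# effective rate while transfers to the destination fail, and double
# it back toward the configured ceiling as transfers succeed again.

class AdaptiveRate:
    def __init__(self, configured_max=100.0, floor=1.0):
        self.configured_max = configured_max  # messageRateMax from config
        self.current = configured_max         # effective rate in force
        self.floor = floor                    # never throttle below this

    def on_transfer(self, ok: bool) -> float:
        if ok:
            # destination looks healthy: ramp back up, capped at config
            self.current = min(self.configured_max, self.current * 2)
        else:
            # destination failing: back off multiplicatively
            self.current = max(self.floor, self.current / 2)
        return self.current
```

The appeal of a multiplicative scheme is that a persistently broken destination is throttled to near zero quickly, while recovery back to the configured rate takes only a few successful transfers.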

@petersilva petersilva added NextRelease Feature Targeted for Next Release ReliabilityRecovery improve behaviour in failure situations. labels Apr 14, 2024
@petersilva petersilva added the Priority 3 - Important worrisome impact... should study carefully label May 31, 2024