Configurable retry delay and jitter #676

Draft
wants to merge 5 commits into
base: master


basil
Member

@basil basil commented Oct 11, 2023

Allow the retry delay to be configured rather than hard-coded to 10 seconds, and allow the fixed delay to be combined with jitter to help avoid thundering herds.

@basil basil added the enhancement and work-in-progress labels Oct 11, 2023
@basil
Member Author

basil commented Oct 11, 2023

@rahulsom I seem to recall you mentioning this problem before; would this remove your need for any custom wrappers?

@rahulsom

Thanks! This is nice. It should allow me to rely on this instead of custom logic.

@basil
Member Author

basil commented Oct 12, 2023

@rahulsom Are you interested in ripping out your custom logic in favor of the incremental build from this PR to validate that it works?

@rahulsom

I'm on a brief break but will test it out sometime next week. Is that cool?

@basil
Member Author

basil commented Oct 12, 2023

Sure, thanks!

src/main/java/hudson/remoting/jnlp/Main.java (3 review threads, outdated and resolved)
@basil
Member Author

basil commented Nov 7, 2023

@rahulsom Are you still interested in kicking the tires on this?

@rahulsom

rahulsom commented Nov 8, 2023

Sorry about the delay!
I tried a few things - I either ended up with a condition where the connection never crashes or it crashes so badly that the process terminates.
Do you have hints on how to cause this particular kind of reconnect?

@basil
Member Author

basil commented Nov 8, 2023

@rahulsom Reconnect can be triggered easily by restarting the controller while inbound agent(s) are connected. For example, create an agent in the UI with a Launch method of "Launch agent by connecting it to the controller", then download the incremental build of Remoting from this PR and run `java -jar /path/to/remoting.jar -url https://${JENKINS_URL} -secret ${SECRET} -name ${AGENT_NAME}`. When you restart the controller, you should see "Terminated" in the agent logs followed by "Performing onReconnect operation" 10 seconds later, since the default value of `-retryDelay` is 10 seconds.

When multiple agents are launched at the same time and the controller subsequently restarts, all agents should notice and start reconnecting at the same time, creating a thundering herd. The new jitter functionality being introduced in this PR can then be used to solve the thundering herd problem via the newly-introduced -retryJitter or -retryJitterFactor options. The idea would be that anyone who uses a custom wrapper to introduce jitter should be able to remove the custom wrapper, observe the thundering herd problem, and then observe that the problem goes away when the wrapper is replaced with the -retryJitter or -retryJitterFactor options from this PR.
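
To make the idea concrete, here is a minimal sketch of how a fixed retry delay could be combined with jitter so that agents spread out their reconnect attempts. It is not the code from this PR: the class name, method names, and both formulas (an additive random wait up to a maximum, and a random scaling of the fixed delay) are assumptions chosen for illustration, and the real -retryJitter and -retryJitterFactor options may compute the delay differently.

```java
import java.time.Duration;
import java.util.concurrent.ThreadLocalRandom;

// Illustrative only: names and formulas here are hypothetical, not Remoting's implementation.
final class RetryDelaySketch {

    /** Returns the fixed delay plus a random extra wait in [0, maxJitter]. */
    static Duration withJitter(Duration fixedDelay, Duration maxJitter) {
        long jitterMillis = maxJitter.toMillis();
        if (jitterMillis < 0) {
            throw new IllegalArgumentException("maxJitter must not be negative");
        }
        // nextLong(bound) is exclusive of the bound, so add 1 to include maxJitter itself.
        long extra = ThreadLocalRandom.current().nextLong(jitterMillis + 1);
        return fixedDelay.plusMillis(extra);
    }

    /** Returns the fixed delay scaled by a random value in [1 - factor, 1 + factor). */
    static Duration withJitterFactor(Duration fixedDelay, double factor) {
        if (factor < 0.0 || factor > 1.0) {
            throw new IllegalArgumentException("factor must be between 0 and 1");
        }
        // nextDouble() is in [0, 1), so (2 * r - 1) is in [-1, 1).
        double scale = 1.0 + factor * (2.0 * ThreadLocalRandom.current().nextDouble() - 1.0);
        return Duration.ofMillis(Math.round(fixedDelay.toMillis() * scale));
    }

    public static void main(String[] args) {
        Duration fixedDelay = Duration.ofSeconds(10); // the default delay mentioned above
        // Simulate several agents that all notice the controller restart at the same moment.
        for (int agent = 1; agent <= 5; agent++) {
            System.out.printf(
                    "agent %d: additive jitter -> %d ms, jitter factor -> %d ms%n",
                    agent,
                    withJitter(fixedDelay, Duration.ofSeconds(5)).toMillis(),
                    withJitterFactor(fixedDelay, 0.5).toMillis());
        }
    }
}
```

Without jitter, every simulated agent would wait exactly 10 seconds and reconnect together; with either variant the waits differ per agent, which is the effect a custom wrapper would otherwise have to provide.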

Labels
enhancement For changelog: An enhancement providing new capability. work-in-progress
3 participants