JENKINS-73096 Create config for check interval and use the same default #968


sfc-gh-mayang
Contributor

@sfc-gh-mayang sfc-gh-mayang commented May 8, 2024

This PR allowed us to mitigate an active incident. The previous behavior is as follows:

  • terminate node is called
  • updateComputerList calls updateComputer on every node
  • updateComputer calls SlaveComputer.check which runs describeInstances if time is after nextCheckAfter
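
The call chain above can be sketched as a throttled check. This is an illustrative model only, not the plugin's code: the class, field, and method names are hypothetical, and the remote describeInstances call is replaced by a counter.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the "check only if due" pattern described above.
final class ThrottledCheckSketch {

    static final class Node {
        long nextCheckAfterMillis = 0; // earliest time the next remote check may run
        int describeCalls = 0;         // stand-in counter for describeInstances calls

        // Runs the (simulated) remote call only when the node is due for a check.
        void check(long nowMillis, long intervalMillis) {
            if (nowMillis <= nextCheckAfterMillis) {
                return; // not due yet, so no remote call
            }
            describeCalls++;
            nextCheckAfterMillis = nowMillis + intervalMillis;
        }
    }

    // Models one pass over every node; returns how many remote calls the pass made.
    static int updateAll(List<Node> nodes, long nowMillis, long intervalMillis) {
        int calls = 0;
        for (Node node : nodes) {
            int before = node.describeCalls;
            node.check(nowMillis, intervalMillis);
            calls += node.describeCalls - before;
        }
        return calls;
    }

    public static void main(String[] args) {
        List<Node> nodes = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            nodes.add(new Node());
        }
        long interval = 60_000L; // 1 minute, the previously hard-coded value
        System.out.println("first pass calls:  " + updateAll(nodes, 1_000L, interval));
        System.out.println("second pass calls: " + updateAll(nodes, 2_000L, interval));
    }
}
```

In this model a pass is cheap as long as most nodes are not yet due. The cost comes when every node is due on every pass.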

This isn't an issue when fewer than 600 nodes are involved. But describeInstances takes around 100 ms per call, so when updateComputerList updates every node, a full pass over 600 or more nodes takes at least 600 × 100 ms = 60 s, which reaches the current hard-coded check interval of 1 minute. By the time one pass finishes, every node is due for another check, so every computer termination takes about 1 minute to finish. The Queue lock is held during this time, so no new work can be accepted.
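
The arithmetic behind the 600-node threshold can be checked with a small sketch. It assumes the checks run sequentially at a flat 100 ms each, which simplifies the figures given above. The class and method names are hypothetical, and the 2-minute interval is included only for comparison.

```java
import java.time.Duration;

// Back-of-the-envelope model of one pass over all nodes (names are hypothetical).
final class CheckPassEstimate {
    // Approximate cost of one describeInstances call, per the description above.
    static final Duration DESCRIBE_COST = Duration.ofMillis(100);

    // Time to run one remote check per node, one after another.
    static Duration passDuration(int nodes) {
        return DESCRIBE_COST.multipliedBy(nodes);
    }

    // True when a full pass takes at least as long as the check interval,
    // meaning every node is due again by the time the pass finishes.
    static boolean passSaturatesInterval(int nodes, Duration interval) {
        return passDuration(nodes).compareTo(interval) >= 0;
    }

    public static void main(String[] args) {
        Duration[] intervals = {Duration.ofMinutes(1), Duration.ofMinutes(2)};
        for (int nodes : new int[] {100, 600, 800}) {
            for (Duration interval : intervals) {
                System.out.println(nodes + " nodes, interval " + interval.toMinutes()
                        + " min: pass takes " + passDuration(nodes).getSeconds()
                        + " s, saturated=" + passSaturatesInterval(nodes, interval));
            }
        }
    }
}
```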

TESTING PLAN
Installed on our controllers, launched 800 nodes via minInstances, then set minInstances to 0. Instead of the controller locking up until it drops below 700 idle instances, which takes more than an hour, it cleans up all nodes within 5 minutes.

Submitter checklist

  1. Make sure you are opening from a topic/feature/bugfix branch (right side) and not your main branch!
  2. Ensure that the pull request title represents the desired changelog entry
  3. Please describe what you did
  4. Link to relevant issues in GitHub or Jira
  5. Link to relevant pull requests, esp. upstream and downstream changes
  6. Ensure you have provided tests - that demonstrates feature works or fixes the issue

@sfc-gh-mayang sfc-gh-mayang changed the title from "JENKINS-73096 Create config for check interval and default to 2 min" to "JENKINS-73096 Create config for check interval and use the same default" on May 8, 2024
@res0nance res0nance added the "enhancement" label (Feature additions or enhancements) on May 20, 2024
@res0nance res0nance merged commit 02b9b26 into jenkinsci:master on May 20, 2024
16 checks passed

mwebber commented May 20, 2024

@sfc-gh-mayang

Is it possible to document the new setting jenkins.ec2.checkIntervalMinutes somewhere, such as https://plugins.jenkins.io/ec2/ ?

You said

TESTING PLAN
Installed on our controllers, launched 800 nodes via minInstances, then set minInstances to 0. Instead of the controller locking up until it drops below 700 idle instances, which takes more than an hour, it cleans up all nodes within 5 minutes.

What value of jenkins.ec2.checkIntervalMinutes did you use when you did the test?

@sfc-gh-mayang
Contributor Author

We used a value of 2 for the test.
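
For readers looking for how a setting like jenkins.ec2.checkIntervalMinutes might be consumed, here is a minimal sketch. It assumes the setting is a JVM system property with a default of 1 minute (the previously hard-coded value), which this thread does not confirm. The class name, the fallback for non-positive values, and the helper methods are hypothetical and are not taken from the plugin.

```java
import java.util.concurrent.TimeUnit;

// Hypothetical reader for the check-interval setting; not the plugin's actual code.
final class CheckIntervalConfig {
    static final String PROPERTY = "jenkins.ec2.checkIntervalMinutes";
    // Assumption: the default matches the previously hard-coded 1 minute.
    static final int DEFAULT_MINUTES = 1;

    // Reads the property, falling back to the default when it is unset or unparseable.
    // Treating non-positive values as invalid is an assumption made for this sketch.
    static int checkIntervalMinutes() {
        int value = Integer.getInteger(PROPERTY, DEFAULT_MINUTES);
        return value > 0 ? value : DEFAULT_MINUTES;
    }

    static long checkIntervalMillis() {
        return TimeUnit.MINUTES.toMillis(checkIntervalMinutes());
    }

    public static void main(String[] args) {
        System.out.println("check interval: " + checkIntervalMinutes() + " min");
    }
}
```

Under that assumption, the value used in the test would be supplied as -Djenkins.ec2.checkIntervalMinutes=2 on the controller's JVM command line.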
