JENKINS-73096 Create config for check interval and use the same default #968
This PR allowed us to mitigate an active incident. The previous behavior is as follows:
This isn't an issue when we are terminating fewer than 600 nodes. However, each describeInstances call takes around 100 ms, so when UpdateComputerList is called to update every node, the total duration (600 × 100 ms = 60 s) reaches the current hard-coded value of 1 minute. As a result, every computer termination takes 1 minute to finish, and the Queue lock is held during that time, so no new work can be accepted.
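The arithmetic behind that threshold can be sketched as follows. This is a rough illustration only, not the plugin's code: the class, the method names, and the system property `example.checkIntervalSeconds` are all hypothetical, and the 100 ms per-call cost is the approximate figure quoted above.

```java
import java.util.concurrent.TimeUnit;

public final class RefreshEstimate {
    // Hypothetical property name, for illustration only; not the plugin's actual setting.
    public static final String INTERVAL_PROPERTY = "example.checkIntervalSeconds";
    // Mirrors the 1-minute value described above.
    public static final long DEFAULT_INTERVAL_SECONDS = 60;

    // Reads the check interval from a system property, falling back to the default.
    public static long checkIntervalMillis() {
        long seconds = Long.getLong(INTERVAL_PROPERTY, DEFAULT_INTERVAL_SECONDS);
        return TimeUnit.SECONDS.toMillis(seconds);
    }

    // Total time to refresh every node when each lookup costs perCallMillis.
    public static long totalRefreshMillis(int nodeCount, long perCallMillis) {
        return nodeCount * perCallMillis;
    }

    // True when a full refresh takes at least as long as the check interval.
    public static boolean refreshOutlastsInterval(int nodeCount, long perCallMillis, long intervalMillis) {
        return totalRefreshMillis(nodeCount, perCallMillis) >= intervalMillis;
    }

    public static void main(String[] args) {
        long interval = checkIntervalMillis();
        for (int nodes : new int[] {100, 600, 800}) {
            System.out.printf("%d nodes: refresh ~%d ms, interval %d ms, outlasts=%b%n",
                    nodes,
                    totalRefreshMillis(nodes, 100),
                    interval,
                    refreshOutlastsInterval(nodes, 100, interval));
        }
    }
}
```

Making the interval configurable, as in the hypothetical `checkIntervalMillis()` above, would let an operator with a large fleet raise it without a rebuild.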
TESTING PLAN
Installed on our controllers, launched 800 nodes via minInstances, then set minInstances to 0. Previously, the controller locked up until it dropped below 700 idle instances, which took more than an hour. With this change, it cleans up all nodes within 5 minutes.
Submitter checklist