Large number of offline build executors when there is no capacity #413

Open
koenvandesande opened this issue Sep 26, 2023 · 0 comments
Labels
bug Something isn't working

Jenkins and plugins versions report

Environment
google-compute-engine:4.3.14
google-oauth-plugin:1.0.8

What Operating System are you using (both controller, and any agents involved in the problem)?

Linux AMD64 - Ubuntu

Reproduction steps

  1. Create a template for a resource that is in short supply, e.g. a GPU node
  2. Select a region with no capacity left
  3. Try to run a job
  4. See up to 20 build executors created, all in the "offline" state

Started provisioning jenkins-worker-4-a100-gpu-b-q4vmv1 from gce-GCE with 1 executors. Remaining excess workload: 0

Sep 26, 2023 4:36:02 PM INFO com.google.jenkins.plugins.computeengine.ComputeEngineCloud lambda$getPlannedNodeFuture$0
Waiting 300000ms for node jenkins-worker-4-a100-gpu-b-q4vmv1 to connect

Sep 26, 2023 4:36:12 PM INFO com.google.jenkins.plugins.computeengine.ComputeEngineComputerLauncher launch
Launch failed while waiting for operation operation-1695738961191-60643fe69e3e8-bcfd7998-b4900d7d to complete. Operation error was The zone 'projects/data-fabric-272407/zones/europe-west4-b' does not have enough resources available to fulfill the request.  '(resource type:compute)'.

Sep 26, 2023 4:36:19 PM WARNING hudson.slaves.NodeProvisioner update
Unexpected exception encountered while provisioning agent jenkins-worker-4-a100-gpu-b-q4vmv1
java.io.IOException: Agent failed to connect, even though the launcher didn't report it. See the log output for details.
	at hudson.slaves.SlaveComputer.lambda$_connect$0(SlaveComputer.java:326)
Caused: java.util.concurrent.ExecutionException
	at java.base/java.util.concurrent.FutureTask.report(FutureTask.java:122)
	at java.base/java.util.concurrent.FutureTask.get(FutureTask.java:205)
	at com.google.jenkins.plugins.computeengine.ComputeEngineCloud.lambda$getPlannedNodeFuture$0(ComputeEngineCloud.java:315)
	at jenkins.util.ContextResettingExecutorService$2.call(ContextResettingExecutorService.java:46)
	at jenkins.security.ImpersonatingExecutorService$2.call(ImpersonatingExecutorService.java:80)
	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
	at java.base/java.lang.Thread.run(Thread.java:829)

Expected Results

The plugin should recognize that the launch of the node failed (no capacity), and not show it in the list of build executors at all.
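
To make the expected behavior concrete, here is a minimal standalone sketch in Java. It is not the plugin's code: `CapacityAwareProvisioner`, `LaunchResult`, `startProvisioning`, and `onLaunchFinished` are hypothetical names invented for illustration, and the capacity check simply matches a substring of the operation error shown in the log above. The idea is that a node becomes visible when provisioning starts, and is removed again as soon as the launch reports a failure, rather than staying in the list as an offline executor.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: none of these types come from Jenkins or the plugin.
class CapacityAwareProvisioner {

    // Substring copied from the operation error in the log above (an assumption
    // about how a capacity failure could be recognized, not the plugin's logic).
    static final String NO_CAPACITY_MARKER = "does not have enough resources available";

    // Outcome of a simulated launch; error is null when the launch succeeded.
    static final class LaunchResult {
        final String nodeName;
        final String error;

        LaunchResult(String nodeName, String error) {
            this.nodeName = nodeName;
            this.error = error;
        }

        boolean failed() {
            return error != null;
        }
    }

    private final List<String> executors = new ArrayList<>();

    // True when the launch failed with the "no capacity" message.
    static boolean isCapacityError(LaunchResult result) {
        return result.failed() && result.error.contains(NO_CAPACITY_MARKER);
    }

    // The node becomes visible in the executor list once provisioning starts.
    void startProvisioning(String nodeName) {
        executors.add(nodeName);
    }

    // On any launch failure, remove the node instead of leaving it offline.
    // Returns true if the node stays registered.
    boolean onLaunchFinished(LaunchResult result) {
        if (result.failed()) {
            executors.remove(result.nodeName);
            return false;
        }
        return true;
    }

    // Read-only snapshot of the currently listed executors.
    List<String> executors() {
        return Collections.unmodifiableList(new ArrayList<>(executors));
    }

    public static void main(String[] args) {
        CapacityAwareProvisioner provisioner = new CapacityAwareProvisioner();
        String error = "The zone 'example-zone' does not have enough resources "
                + "available to fulfill the request.";
        // Simulate 20 provisioning attempts that all fail for lack of capacity.
        for (int i = 0; i < 20; i++) {
            String name = "gpu-node-" + i;
            provisioner.startProvisioning(name);
            provisioner.onLaunchFinished(new LaunchResult(name, error));
        }
        System.out.println("Executors still listed: " + provisioner.executors().size());
    }
}
```

In a real fix the removal would have to go through whatever node lifecycle hooks Jenkins and the plugin actually provide; the sketch only shows the intended outcome, which is that failed launches leave nothing behind in the executor list.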

Actual Results

A list of 20 build executors, all offline.

Anything else?

No response

koenvandesande added the bug label Sep 26, 2023