Feature request: try different zones if the one specified in config does not have enough resources #335

Open
vlad-ivanov-name opened this issue Jun 10, 2022 · 4 comments
Labels
enhancement New feature or request

Comments

vlad-ivanov-name commented Jun 10, 2022

What feature do you want to see added?

Hello,

would it be possible to have an option to use multiple zones for launching agents? Right now, if a zone is exhausted, the plugin does not handle it well: it creates many agents (above the limit set in config) that fail to start, and while deleting them works, the deletion also triggers an exception within Jenkins. Here's a log snippet:

2022-06-10 10:50:47.296+0000 [id=4267]	INFO	c.g.j.p.c.ComputeEngineComputerLauncher#launch: Launch failed while waiting for operation operation-1654858230641-5e115b4fdb4d9-33d8d6e8-ec232c0f to complete. Operation error was The zone 'projects/censored/zones/us-west1-b' does not have enough resources available to fulfill the request.  '(resource type:compute)'.

It would be good to have an option to try different zones from a preconfigured set if one zone doesn't have enough resources.
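
The requested behaviour could look roughly like the following. This is a minimal sketch, not the plugin's actual code: it tries each zone from a preconfigured list in order and moves on only when the failure is a capacity error. `ZoneFallbackLauncher`, `ZoneLauncher`, and `ZoneExhaustedException` are hypothetical names introduced for illustration; none of them exist in the plugin or in the Compute Engine client.

```java
import java.util.List;
import java.util.Optional;

/** Hypothetical sketch: none of these types exist in the plugin. */
class ZoneFallbackLauncher {

    /** Hypothetical signal that a zone has no capacity left for the request. */
    static class ZoneExhaustedException extends Exception {
        ZoneExhaustedException(String zone) {
            super("Zone " + zone + " does not have enough resources");
        }
    }

    /** Hypothetical hook that creates an instance in one zone and returns its name. */
    interface ZoneLauncher {
        String launchIn(String zone) throws ZoneExhaustedException;
    }

    /**
     * Tries each configured zone in order and returns the name of the first
     * instance that launches. Returns empty when every zone is exhausted.
     */
    static Optional<String> launchWithFallback(List<String> zones, ZoneLauncher launcher) {
        for (String zone : zones) {
            try {
                return Optional.of(launcher.launchIn(zone));
            } catch (ZoneExhaustedException e) {
                // Capacity problem only: fall through to the next zone.
            }
        }
        return Optional.empty();
    }
}
```

Only the capacity error triggers the fallback here; any other failure would still surface from the first zone, so a misconfigured template is not retried everywhere.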

Exception when deleting an agent that failed to start
2022-06-10 10:50:15.593+0000 [id=3700]	WARNING	h.i.i.InstallUncaughtExceptionHandler#handleException: Caught unhandled exception with ID 95d9316f-d4ab-4107-a055-c25f7c2f40a8
com.google.api.client.googleapis.json.GoogleJsonResponseException: 404 Not Found
{
  "code" : 404,
  "errors" : [ {
    "domain" : "global",
    "message" : "The resource 'projects/censored/zones/us-west1-b/instances/jenkins-agent-dynamic-jtiut7' was not found",
    "reason" : "notFound"
  } ],
  "message" : "The resource 'projects/censored/zones/us-west1-b/instances/jenkins-agent-dynamic-jtiut7' was not found"
}
	at com.google.api.client.googleapis.json.GoogleJsonResponseException.from(GoogleJsonResponseException.java:150)
	at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:113)
	at com.google.api.client.googleapis.services.json.AbstractGoogleJsonClientRequest.newExceptionOnError(AbstractGoogleJsonClientRequest.java:40)
	at com.google.api.client.googleapis.services.AbstractGoogleClientRequest$1.interceptResponse(AbstractGoogleClientRequest.java:321)
	at com.google.api.client.http.HttpRequest.execute(HttpRequest.java:1056)
	at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:419)
	at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.executeUnparsed(AbstractGoogleClientRequest.java:352)
	at com.google.api.client.googleapis.services.AbstractGoogleClientRequest.execute(AbstractGoogleClientRequest.java:469)
	at com.google.cloud.graphite.platforms.plugin.client.ComputeWrapper.deleteInstance(ComputeWrapper.java:116)
	at com.google.cloud.graphite.platforms.plugin.client.ComputeClient.terminateInstanceAsync(ComputeClient.java:323)
	at com.google.jenkins.plugins.computeengine.ComputeEngineInstance._terminate(ComputeEngineInstance.java:136)
	at hudson.slaves.AbstractCloudSlave.terminate(AbstractCloudSlave.java:88)
	at com.google.jenkins.plugins.computeengine.ComputeEngineComputer.doDoDelete(ComputeEngineComputer.java:181)
	at java.base/java.lang.invoke.MethodHandle.invokeWithArguments(MethodHandle.java:710)
	at org.kohsuke.stapler.Function$MethodFunction.invoke(Function.java:398)
	at org.kohsuke.stapler.Function$InstanceFunction.invoke(Function.java:410)
	at org.kohsuke.stapler.interceptor.RequirePOST$Processor.invoke(RequirePOST.java:78)
	at org.kohsuke.stapler.PreInvokeInterceptedFunction.invoke(PreInvokeInterceptedFunction.java:26)
	at org.kohsuke.stapler.Function.bindAndInvoke(Function.java:208)
...
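
The trace shows the 404 from the delete call propagating out of `doDoDelete` as an unhandled exception. One possible mitigation, sketched here under assumptions, is to treat "not found" as "already gone" when terminating an instance. `TolerantTerminator`, `InstanceDeleter`, and `ApiException` are hypothetical stand-ins, not the plugin's or the Google API client's types.

```java
/** Hypothetical sketch: stand-in types, not the plugin's or the Google client's API. */
class TolerantTerminator {

    /** Hypothetical API error that carries an HTTP status code. */
    static class ApiException extends Exception {
        final int statusCode;

        ApiException(int statusCode, String message) {
            super(message);
            this.statusCode = statusCode;
        }
    }

    /** Hypothetical hook that issues the delete request for one instance. */
    interface InstanceDeleter {
        void delete(String instanceName) throws ApiException;
    }

    /**
     * Returns true if the delete request was accepted, or false if the
     * instance was already absent (HTTP 404). Any other error is rethrown.
     */
    static boolean terminate(String instanceName, InstanceDeleter deleter) throws ApiException {
        try {
            deleter.delete(instanceName);
            return true;
        } catch (ApiException e) {
            if (e.statusCode == 404) {
                // The instance never started or is already gone: nothing to clean up.
                return false;
            }
            throw e;
        }
    }
}
```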

Upstream changes

No response

@vlad-ivanov-name vlad-ivanov-name added the enhancement New feature or request label Jun 10, 2022
@yarinkos

@vlad-ivanov-name, is this another root cause besides the GCP compute bug? Could it be a quota issue?
It might be an ongoing issue (https://stackoverflow.com/questions/52684656/the-zone-does-not-have-enough-resources-available-to-fulfill-the-request-the-re).
I'm just thinking out loud.

@vlad-ivanov-name
Author

Yeah, I checked the quota; that's not it. I don't think it's a bug per se: the instance Jenkins was trying to spin up needs a GPU, and it's common to see GPU resources exhausted within a particular zone.

@craigwatson

We also saw similar issues in the London region at roughly the same time - this was indeed caused by resource exhaustion inside GCP, and not by any project quotas.

Ideally, the plug-in would catch this failure and keep retrying (potentially with some kind of exponential backoff?) until it succeeds.
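
As a rough illustration of the retry idea, here is a minimal sketch of retry with capped exponential backoff. It is not plugin code: `BackoffRetry` and its parameters are hypothetical, and it bounds the number of attempts rather than retrying forever.

```java
import java.util.concurrent.Callable;

/** Hypothetical sketch of retry with exponential backoff; not plugin code. */
class BackoffRetry {

    /** Delay before retry number `attempt` (0-based): base * 2^attempt, capped at maxMillis. */
    static long delayMillis(int attempt, long baseMillis, long maxMillis) {
        // Cap the shift so the multiplication stays in range for realistic base delays.
        long factor = 1L << Math.min(attempt, 30);
        return Math.min(maxMillis, baseMillis * factor);
    }

    /**
     * Runs the action until it succeeds or maxAttempts is reached, sleeping
     * between failures. Rethrows the last failure if every attempt fails.
     */
    static <T> T retry(Callable<T> action, int maxAttempts, long baseMillis, long maxMillis)
            throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        Exception last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return action.call();
            } catch (Exception e) {
                last = e;
                if (attempt + 1 < maxAttempts) {
                    // Wait longer after each consecutive failure.
                    Thread.sleep(delayMillis(attempt, baseMillis, maxMillis));
                }
            }
        }
        throw last;
    }
}
```

A real implementation would probably retry only on the capacity error and add random jitter so that many queued builds do not retry in lockstep.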

Alternatively, the plug-in could use Instance Groups to keep track of the pool of VMs, although I imagine that would involve a fair amount of work internally as the pool management logic would change.

@spiegelm

We would love to have this feature. Roughly every week, our builds get stuck with ZONE_RESOURCE_POOL_EXHAUSTED in some zone of europe-west1 and require manual intervention to unblock pull requests or releases.

I guess some cases could already be fixed by round-robin over all zones of this region. Our builds do not depend on a specific zone; any zone would be sufficient.
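
The round-robin idea could be sketched as below. This is only an illustration under assumptions: `RoundRobinZones` is a hypothetical helper, not something the plugin provides, and the zone list would have to come from configuration.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

/** Hypothetical sketch of round-robin zone selection; not plugin code. */
class RoundRobinZones {
    private final List<String> zones;
    private final AtomicInteger next = new AtomicInteger();

    RoundRobinZones(List<String> zones) {
        if (zones.isEmpty()) {
            throw new IllegalArgumentException("at least one zone is required");
        }
        this.zones = List.copyOf(zones);
    }

    /** Returns the next zone in rotation; safe to call from multiple threads. */
    String nextZone() {
        // floorMod keeps the index valid even after the counter overflows.
        int index = Math.floorMod(next.getAndIncrement(), zones.size());
        return zones.get(index);
    }
}
```

Each launch would call `nextZone()` once, so consecutive agents spread across the configured zones and a single exhausted zone affects only some of the launches.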
