@SasSwart SasSwart commented Oct 9, 2025

This PR adds troubleshooting steps to guide Coder operators when they suspect that prebuilds might have overwhelmed their deployments.

Closes #19490

@SasSwart SasSwart marked this pull request as ready for review October 9, 2025 08:32

@ssncferreira ssncferreira left a comment

Nice work 🚀 Should we also mention that users can tune the CODER_PREBUILDS_RECONCILIATION_INTERVAL to manage how frequently the prebuild reconciliation loop runs? That might help reduce the load from frequent reconciliations. Wdyt?

Comment on lines 254 to 255
1. **Organic overload**: Not enough provisioners to meet the deployment's needs
2. **Broken template**: A template that mistakenly requests too many prebuilt workspaces

Contributor

I think the issue here is actually a combination of these two factors: there aren’t enough resources to handle the high demand from prebuild-related provisioner jobs. This problem can be further amplified when those jobs take a long time to complete.

Additionally, it might be worth explaining another scenario: a user creates a new template version (a user-initiated job), and once that job is processed and the prebuild reconciliation loop runs, the loop adds even more load by scheduling new prebuild-related jobs. This means the queue could now include jobs for both template version 1 and version 2.

Contributor Author

I'll adjust the wording to indicate that these issues aren't mutually exclusive, but I don't quite understand what else to change in response to this feedback.

Can you perhaps elaborate or rephrase to help me understand exactly what you need me to consider changing?

Contributor

The root cause is that prebuild-related jobs are being scheduled faster than provisioner daemons can process them. Multiple scenarios can lead to this (too few provisioners, templates that request many prebuilds, or bursts from publishing new template versions). It might be clearer to frame it that way and then help readers identify when this is happening. Here’s a suggestion that rephrases the paragraph accordingly and adds some guidance for detection:

Prebuilt workspaces can overwhelm a Coder deployment when the number of prebuild‑related jobs exceeds the capacity of the provisioner daemons. This often occurs when one or more templates are published or configured to require more prebuilds than the system can sustain. The impact is amplified when prebuild jobs take a long time to complete, which keeps capacity occupied and increases queue depth. This can cause significant delays when users create new workspaces or when template administrators publish templates.

To identify if this is happening:
- Large or growing queue of prebuild‑related jobs
- User workspace creation is slow
- Publishing a new template version is not reflected in the UI because the associated template import job has not yet finished
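
The first signal in the list above can be checked from the CLI. The following is a minimal sketch, not a documented recipe: it assumes `coder provisioner jobs list` accepts `--output json` and prints a JSON array with one object per job. The `--status` and `--initiator` flags are the ones used elsewhere in this PR, and the function name is hypothetical.

```bash
# Hypothetical helper: print the number of pending prebuild jobs.
# Assumption: `--output json` yields a JSON array, one element per job.
count_pending_prebuild_jobs() {
  coder provisioner jobs list --status=pending --initiator=prebuilds --output json |
    jq 'length'
}
```

Calling `count_pending_prebuild_jobs` a few times, a minute or so apart, gives a rough sense of whether the queue is draining or growing.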


Human-initiated jobs have priority over pending prebuild jobs, but running prebuild jobs cannot be preempted. A long list of pending prebuild jobs increases the likelihood that all provisioners are already occupied when a user wants to create a workspace. This increases the likelihood that users will experience delays waiting for the next available provisioner.

To ensure that the next available provisioner will be given to a human-initiated job, run:

Contributor

I’m not sure this sentence is entirely accurate. Since human-initiated jobs already have priority over prebuild-related jobs, the next available provisioner will automatically be assigned a human-initiated job if there is one. The purpose of this behavior is to help clear the queue and prevent situations where all provisioner daemons are occupied with prebuild-related jobs, which could delay human-initiated ones.

Contributor Author

I've updated it.

To ensure that the next available provisioner will be given to a human-initiated job, run:

```bash
coder provisioner jobs list --status=pending --initiator=prebuilds | jq -r '.[].id' | xargs -n1 -P2 -I{} coder provisioner jobs cancel {}
```

Contributor

AFAIU, this command won’t actually print the list of jobs — it will pipe them directly into jq. I think it would be useful to show the list of jobs first, so users can review them before deciding to cancel. That way, they could choose to cancel only a subset of prebuilds if needed.

Wouldn’t it make more sense for coder provisioner jobs cancel to accept a list of job IDs?
Right now, we don’t support cancelling multiple jobs simultaneously (either through the CLI or the dashboard), so adding that capability would be a nice improvement.
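
One way to get the review step described here without changing the CLI is a small wrapper that prints the job IDs and asks for confirmation before cancelling. This is only a sketch under assumptions: it presumes `coder provisioner jobs list` supports `--output json` and that each job object has an `id` field (as the `jq` filter in the PR implies). The function name is hypothetical.

```bash
# Hypothetical helper: show pending prebuild jobs, ask for confirmation,
# then cancel them one at a time (the CLI cancels one job per call).
# Assumption: `--output json` yields a JSON array of objects with an `id` field.
review_and_cancel_pending_prebuilds() {
  local json ids answer
  json="$(coder provisioner jobs list --status=pending --initiator=prebuilds --output json)" || return 1
  ids="$(jq -r '.[].id' <<<"$json")"

  if [ -z "$ids" ]; then
    echo "No pending prebuild jobs." >&2
    return 0
  fi

  # Print the IDs so the operator can review them before anything is cancelled.
  printf 'Pending prebuild jobs:\n%s\n' "$ids" >&2
  read -r -p "Cancel all of these? [y/N] " answer
  [ "$answer" = "y" ] || return 0

  while IFS= read -r id; do
    coder provisioner jobs cancel "$id" </dev/null
  done <<<"$ids"
}
```

Answering anything other than `y` leaves the queue untouched, so the same function doubles as a read-only listing.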

Contributor Author

> Wouldn’t it make more sense for `coder provisioner jobs cancel` to accept a list of job IDs?

I would love to do this, but it's beyond the scope of this issue. We can definitely implement it and then come back to revise the documentation.

> Right now, we don’t support cancelling multiple jobs simultaneously

This is also something we can add in the future.

> I think it would be useful to show the list of jobs first

The command immediately before this one in the documentation shows the list. In the context of prebuilds specifically, there should be no reason to cancel only a subset. These jobs are transient and will be replaced once `coder prebuilds resume` is executed.
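
The sequence implied here (pause, clear the queue, resume later) could look roughly like the sketch below. It assumes `coder prebuilds pause` is the counterpart of the `coder prebuilds resume` command mentioned above and that `coder provisioner jobs list` supports `--output json`. The function name is hypothetical.

```bash
# Hypothetical helper: stop new prebuild jobs from being scheduled,
# then cancel the ones that are still pending.
# Resuming is deliberately left as a separate, manual step.
pause_and_clear_prebuilds() {
  coder prebuilds pause || return 1
  coder provisioner jobs list --status=pending --initiator=prebuilds --output json |
    jq -r '.[].id' |
    while IFS= read -r id; do
      coder provisioner jobs cancel "$id" </dev/null
    done
}
```

Once the underlying cause is fixed, `coder prebuilds resume` lets the reconciliation loop recreate whatever prebuilds are still needed.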

Contributor

> I would love to do this, but it's beyond the scope of this issue. We can definitely implement it and then come back to revise the documentation.

Yes, this is out of scope for this PR, but would definitely be a nice-to-have 👍

> The command before this one in the documentation shows the list. In the context of prebuilds specifically, there should be no reason to cancel only a subset. These jobs are transient and will be replaced once `coder prebuilds resume` is executed.

A possible reason I see for cancelling a subset is wanting to cancel only the in-progress prebuilds from a specific template and version. Another idea would be to add `--template` and `--version` flags to the list command. That is also out of scope for this PR; we can update this documentation if we change the `jobs list` and `jobs cancel` commands 👍

@david-fraley david-fraley self-requested a review October 9, 2025 13:10
@SasSwart SasSwart requested a review from ssncferreira October 13, 2025 12:40
If you include too many fields, Terraform might ignore changes that wouldn't otherwise cause drift.

- Learn more about `ignore_changes` in the [Terraform documentation](https://developer.hashicorp.com/terraform/language/meta-arguments/lifecycle#ignore_changes).
+ Learn more about `ignore_changes` in the [Terraform documentation](https://developer.hashicorp.com/terraform/language/v1.13.x/meta-arguments#lifecycle).

🚫 [linkspector] reported by reviewdog 🐶
Cannot reach https://developer.hashicorp.com/terraform/language/v1.13.x/meta-arguments#lifecycle Status: 429

Contributor

Why was this link changed? 🤔 The previous one was version-agnostic, which might be preferable so we don’t have to update it when Terraform releases new versions.

Contributor Author

The previous link pointed to a heading that doesn't exist. Over the last few versions, Terraform has also changed the content of the page such that the link wouldn't have made sense either way. I pinned it to a specific version so that we can be confident that what we link to doesn't change without us realizing it.

The trade-off is between currency and accuracy. If you'd prefer to keep the old link, I can do that.


@ssncferreira ssncferreira left a comment

LGTM 👍

@SasSwart SasSwart merged commit 06db587 into main Oct 14, 2025
30 checks passed
@SasSwart SasSwart deleted the jjs/coder-19490 branch October 14, 2025 11:20
@github-actions github-actions bot locked and limited conversation to collaborators Oct 14, 2025

Development

Successfully merging this pull request may close these issues.

Docs: Add troubleshooting section to prebuilds docs for handling provisioner job saturation

2 participants