docs: add troubleshooting steps for prebuilt workspaces #20231

Merged

SasSwart merged 11 commits into main from jjs/coder-19490

Oct 14, 2025

+78 −1

Contributor

SasSwart commented Oct 9, 2025 (edited)

This PR adds troubleshooting steps to guide Coder operators when they suspect that prebuilds might have overwhelmed their deployments.

Closes #19490


          docs: add troubleshooting steps for prebuilt workspaces

4b9bdfe

SasSwart requested a review from ssncferreira

October 9, 2025 08:02

github-actions bot assigned SasSwart

SasSwart added 3 commits

October 9, 2025 08:09


          make lint/markdown

846d724


          ask an LLM to review my documentation for grammar, style and tone

c4df5b3


          Make the linter happy

00b5a07

SasSwart marked this pull request as ready for review

October 9, 2025 08:32

ssncferreira reviewed

Contributor

ssncferreira left a comment

Nice work 🚀 Should we also mention that users can tune the CODER_PREBUILDS_RECONCILIATION_INTERVAL to manage how frequently the prebuild reconciliation loop runs? That might help reduce the load from frequent reconciliations. Wdyt?
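As a rough illustration of this suggestion, the interval could be raised through the environment variable named above before the server starts. The value `5m` is only an example chosen for this sketch, not a recommendation from this PR, and the right value depends on the deployment.

```shell
# Illustrative value only: run the prebuild reconciliation loop less often.
# 5m is an assumption for this sketch; pick a duration that suits your deployment.
export CODER_PREBUILDS_RECONCILIATION_INTERVAL=5m

# The variable is read when the server process starts, so it has to be set in
# the environment of that process (shell, systemd unit, container spec, etc.).
echo "reconciliation interval: ${CODER_PREBUILDS_RECONCILIATION_INTERVAL}"
```

A longer interval trades slower replacement of claimed prebuilds for fewer bursts of prebuild-related jobs.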

docs/admin/templates/extending-templates/prebuilt-workspaces.md Outdated

Comment on lines 254 to 255

    
              1. **Organic overload**: Not enough provisioners to meet the deployment's needs

              2. **Broken template**: A template that mistakenly requests too many prebuilt workspaces

Contributor

ssncferreira Oct 9, 2025

I think the issue here is actually a combination of these two factors: there aren’t enough resources to handle the high demand from prebuild-related provisioner jobs. This problem can be further amplified when those jobs take a long time to complete.

Additionally, it might be worth explaining one more scenario: when a user creates a new template version (a user-initiated job), the prebuild reconciliation loop adds even more load once that job is processed, because it schedules new prebuild-related jobs. This means the queue could now include jobs for both template version 1 and version 2.

Contributor Author

SasSwart Oct 13, 2025

I'll adjust the wording to indicate that these issues aren't mutually exclusive, but I don't quite understand what else to change in response to this feedback.

Can you perhaps elaborate or rephrase to help me understand exactly what you need me to consider changing?

Contributor

ssncferreira Oct 13, 2025

The root cause is that prebuild-related jobs are being scheduled faster than provisioner daemons can process them. Multiple scenarios can lead to this (too few provisioners, templates that request many prebuilds, or bursts from publishing new template versions). It might be clearer to frame it that way and then help readers identify when this is happening. Here’s a suggestion that rephrases the paragraph accordingly and adds some guidance for detection:

Prebuilt workspaces can overwhelm a Coder deployment when the number of prebuild‑related jobs exceeds the capacity of the provisioner daemons. This often occurs when one or more templates are published or configured to require more prebuilds than the system can sustain. The impact is amplified when prebuild jobs take a long time to complete, which keeps capacity occupied and increases queue depth. This can cause significant delays when users create new workspaces or when template administrators publish templates.

To identify if this is happening:
- Large or growing queue of prebuild‑related jobs
- User workspace creation is slow
- Publishing a new template version is not reflected in the UI because the associated template import job has not yet finished
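The first symptom in the list above can be checked with a small script. The following is a minimal sketch, not part of the Coder CLI: `check_prebuild_queue` is a hypothetical helper that reads one pending job ID per line and warns when the count exceeds a threshold. The `--output json` flag in the usage comment is an assumption about how to get JSON out of the list command.

```shell
# Hypothetical helper: reads pending job IDs on stdin (one per line) and
# reports whether the queue depth exceeds a threshold (default 20, arbitrary).
check_prebuild_queue() {
  threshold="${1:-20}"
  # grep -c prints the number of non-empty lines; it exits non-zero on 0 matches.
  count=$(grep -c . || true)
  if [ "$count" -gt "$threshold" ]; then
    echo "WARN: $count pending prebuild jobs (threshold $threshold)"
    return 1
  fi
  echo "OK: $count pending prebuild jobs"
}

# Possible usage (assumes JSON output is available via --output json):
#   coder provisioner jobs list --status=pending --initiator=prebuilds --output json \
#     | jq -r '.[].id' | check_prebuild_queue 20
```

Running this periodically and comparing counts over time would show whether the queue is growing rather than just large.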

docs/admin/templates/extending-templates/prebuilt-workspaces.md Outdated

    
              Human-initiated jobs have priority over pending prebuild jobs, but running prebuild jobs cannot be preempted. A long list of pending prebuild jobs increases the likelihood that all provisioners are already occupied when a user wants to create a workspace. This increases the likelihood that users will experience delays waiting for the next available provisioner.

              To ensure that the next available provisioner will be given to a human-initiated job, run:

Contributor

ssncferreira Oct 9, 2025

I’m not sure this sentence is entirely accurate. Since human-initiated jobs already have priority over prebuild-related jobs, the next available provisioner will automatically be assigned a human-initiated job if there is one. The purpose of this behavior is to help clear the queue and prevent situations where all provisioner daemons are occupied with prebuild-related jobs, which could delay human-initiated ones.

Contributor Author

SasSwart Oct 13, 2025

I've updated it.

docs/admin/templates/extending-templates/prebuilt-workspaces.md

    
              To ensure that the next available provisioner will be given to a human-initiated job, run:

              ```bash
              coder provisioner jobs list --status=pending --initiator=prebuilds | jq -r '.[].id' | xargs -n1 -P2 -I{} coder provisioner jobs cancel {}
              ```

Contributor

ssncferreira Oct 9, 2025

AFAIU, this command won’t actually print the list of jobs — it will pipe them directly into jq. I think it would be useful to show the list of jobs first, so users can review them before deciding to cancel. That way, they could choose to cancel only a subset of prebuilds if needed.

Wouldn’t it make more sense for coder provisioner jobs cancel to accept a list of job IDs?
Right now, we don’t support cancelling multiple jobs simultaneously (either through the CLI or the dashboard), so adding that capability would be a nice improvement.

Contributor Author

SasSwart Oct 13, 2025

Wouldn’t it make more sense for coder provisioner jobs cancel to accept a list of job IDs?

I would love to do this, but it's beyond scope for this issue. We can definitely implement it and then come back to revise the documentation.

Right now, we don’t support cancelling multiple jobs simultaneously

This is also something we can add in the future.

I think it would be useful to show the list of jobs first

The command before this one in the documentation shows the list. In the context of prebuilds specifically, there should be no reason to cancel only a subset. These jobs are transient and will be replaced once `coder prebuilds resume` is executed.

Contributor

ssncferreira Oct 13, 2025

I would love to do this, but it's beyond scope for this issue. We can definitely implement it and then come back to revise the documentation.

Yes, this is out of scope for this PR, but would definitely be a nice-to-have 👍

The command before this one in the documentation shows the list. In the context of prebuilds specifically, there should be no reason to cancel only a subset. These jobs are transient and will be replaced once `coder prebuilds resume` is executed.

A possible reason I see for canceling a subset is if you want to cancel only in-progress prebuilds from a specific template and version. Another idea would be to have the list command accept --template and --version flags. This is also out of scope for this PR; we can update this documentation if we change the jobs list and jobs cancel commands 👍
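Until the CLI accepts multiple job IDs, the "review first, then cancel" flow discussed in this thread could be approximated with a wrapper. This is only a sketch: `cancel_jobs` is a hypothetical helper, not a Coder command, and `--output json` in the usage comment is an assumed flag. It defaults to a dry run so the IDs can be inspected, or filtered down to a subset, before anything is canceled.

```shell
# Hypothetical helper: reads job IDs on stdin, one per line.
# With DRY_RUN=1 (the default) it only prints what it would cancel.
cancel_jobs() {
  while IFS= read -r job_id; do
    [ -n "$job_id" ] || continue   # skip blank lines
    if [ "${DRY_RUN:-1}" = "1" ]; then
      echo "would cancel: $job_id"
    else
      # </dev/null keeps the command from consuming the remaining IDs on stdin.
      coder provisioner jobs cancel "$job_id" </dev/null
    fi
  done
}

# Possible usage (assumes JSON output is available via --output json):
#   coder provisioner jobs list --status=pending --initiator=prebuilds --output json \
#     | jq -r '.[].id' | cancel_jobs              # review only
#   ... | jq -r '.[].id' | DRY_RUN=0 cancel_jobs  # actually cancel
```

Because the helper works on plain IDs, a subset can be selected with any line filter placed before it in the pipeline.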

david-fraley self-requested a review

October 9, 2025 13:10

SasSwart added 2 commits

October 13, 2025 12:29


          Improve formatting and accuracy

a3abf87


          Add a note about infrastructure housekeeping

f76f35e

SasSwart requested a review from ssncferreira

October 13, 2025 12:40


          Pin the version of a link to Terraform's documentation

a387d63

github-actions bot reviewed

docs/admin/templates/extending-templates/prebuilt-workspaces.md Outdated

    
              If you include too many fields, Terraform might ignore changes that wouldn't otherwise cause drift.

            - Learn more about `ignore_changes` in the [Terraform documentation](https://developer.hashicorp.com/terraform/language/meta-arguments/lifecycle#ignore_changes).

            + Learn more about `ignore_changes` in the [Terraform documentation](https://developer.hashicorp.com/terraform/language/v1.13.x/meta-arguments#lifecycle).

github-actions bot Oct 13, 2025

🚫 [linkspector] (reported by reviewdog 🐶)
Cannot reach https://developer.hashicorp.com/terraform/language/v1.13.x/meta-arguments#lifecycle (status: 429)

Contributor

ssncferreira Oct 13, 2025

Why was this link changed? 🤔 The previous one was version-agnostic, which might be preferable so we don’t have to update it when Terraform releases new versions.

Contributor Author

SasSwart Oct 14, 2025

The previous link pointed to a heading that doesn't exist.
Terraform has also changed the content of the page over the last few versions, so the link wouldn't have made sense either way. I pinned it to a specific version so that we can be confident that what we link to doesn't change without us realizing it.

The trade-off is between currency and accuracy. If you'd prefer to keep the old link, I can do that.

ssncferreira reviewed

SasSwart and others added 2 commits

October 14, 2025 09:05


          Update docs/admin/templates/extending-templates/prebuilt-workspaces.md

7d500db

Co-authored-by: Susana Ferreira <susana@coder.com>


          Elaborate on the symptoms for when prebuilds overwhelm provisioners

84674c5

ssncferreira approved these changes

Contributor

ssncferreira left a comment

LGTM 👍

SasSwart added 2 commits

October 14, 2025 10:46


          Elaborate on the symptoms for when prebuilds overwhelm provisioners

1f768b3


          Ignore linkspector flake

3e7b063

SasSwart merged commit 06db587 into main

30 checks passed

SasSwart deleted the jjs/coder-19490 branch

October 14, 2025 11:20

github-actions bot locked and limited conversation to collaborators
