docs: add troubleshooting steps for prebuilt workspaces #20231
Conversation
Nice work 🚀 Should we also mention that users can tune `CODER_PREBUILDS_RECONCILIATION_INTERVAL` to manage how frequently the prebuild reconciliation loop runs? That might help reduce the load from frequent reconciliations. Wdyt?
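A minimal sketch of what that tuning could look like, assuming the variable is set on the Coder server process; the interval value shown is illustrative, not a recommended default:

```bash
# Illustrative only: lengthen the interval so the reconciliation loop runs
# less often, reducing how frequently new prebuild jobs are scheduled.
# The 5m value is an example, not a recommendation or the default.
export CODER_PREBUILDS_RECONCILIATION_INTERVAL=5m
coder server
```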
1. **Organic overload**: Not enough provisioners to meet the deployment's needs
2. **Broken template**: A template that mistakenly requests too many prebuilt workspaces
I think the issue here is actually a combination of these two factors: there aren't enough resources to handle the high demand from prebuild-related provisioner jobs. This problem can be further amplified when those jobs take a long time to complete.
Additionally, it might be worth explaining an additional scenario: when a user creates a new template version (a user-initiated job), once this is processed and the prebuild reconciliation loop runs, it adds even more load by scheduling new prebuild-related jobs. This means the queue could now include jobs for both template version 1 and version 2.
I'll adjust the wording to indicate that these issues aren't mutually exclusive, but I don't quite understand what else to change in response to this feedback.
Can you perhaps elaborate or rephrase to help me understand exactly what you need me to consider changing?
The root cause is that prebuild-related jobs are being scheduled faster than provisioner daemons can process them. Multiple scenarios can lead to this (too few provisioners, templates that request many prebuilds, or bursts from publishing new template versions). It might be clearer to frame it that way and then help readers identify when this is happening. Here’s a suggestion that rephrases the paragraph accordingly and adds some guidance for detection:
Prebuilt workspaces can overwhelm a Coder deployment when the number of prebuild‑related jobs exceeds the capacity of the provisioner daemons. This often occurs when one or more templates are published or configured to require more prebuilds than the system can sustain. The impact is amplified when prebuild jobs take a long time to complete, which keeps capacity occupied and increases queue depth. This can cause significant delays when users create new workspaces or when template administrators publish templates.
To identify if this is happening:
- Large or growing queue of prebuild‑related jobs
- User workspace creation is slow
- Publishing a new template version is not reflected in the UI because the associated template import job has not yet finished
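To make the first check concrete, here is a sketch of how the pending prebuild queue could be inspected, reusing the `--status` and `--initiator` flags from the cancellation pipeline quoted later in this thread:

```bash
# List pending prebuild-initiated provisioner jobs; a large or growing
# result suggests scheduling is outpacing provisioner capacity.
coder provisioner jobs list --status=pending --initiator=prebuilds
```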
Human-initiated jobs have priority over pending prebuild jobs, but running prebuild jobs cannot be preempted. A long list of pending prebuild jobs increases the likelihood that all provisioners are already occupied when a user wants to create a workspace. This increases the likelihood that users will experience delays waiting for the next available provisioner.

To ensure that the next available provisioner will be given to a human-initiated job, run:
I’m not sure this sentence is entirely accurate. Since human-initiated jobs already have priority over prebuild-related jobs, the next available provisioner will automatically be assigned a human-initiated job if there is one. The purpose of this behavior is to help clear the queue and prevent situations where all provisioner daemons are occupied with prebuild-related jobs, which could delay human-initiated ones.
I've updated it.
To ensure that the next available provisioner will be given to a human-initiated job, run:

```bash
coder provisioner jobs list --status=pending --initiator=prebuilds | jq -r '.[].id' | xargs -n1 -P2 -I{} coder provisioner jobs cancel {}
```
AFAIU, this command won't actually print the list of jobs; it will pipe them directly into `jq`. I think it would be useful to show the list of jobs first, so users can review them before deciding to cancel. That way, they could choose to cancel only a subset of prebuilds if needed.
Wouldn't it make more sense for `coder provisioner jobs cancel` to accept a list of job IDs?
Right now, we don't support cancelling multiple jobs simultaneously (either through the CLI or the dashboard), so adding that capability would be a nice improvement.
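A sketch of that review-first flow, using the same flags as the quoted command; since batch cancellation isn't supported yet, each reviewed ID would be cancelled individually:

```bash
# Step 1: print the pending prebuild jobs so they can be reviewed first.
coder provisioner jobs list --status=pending --initiator=prebuilds

# Step 2: cancel only the jobs you have reviewed, one ID at a time.
coder provisioner jobs cancel <job-id>
```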
> Wouldn't it make more sense for `coder provisioner jobs cancel` to accept a list of job IDs?

I would love to do this, but it's beyond scope for this issue. We can definitely implement it and then come back to revise the documentation.

> Right now, we don't support cancelling multiple jobs simultaneously

This is also something we can add in the future.

> I think it would be useful to show the list of jobs first

The command before this one in the documentation shows the list. In the context of prebuilds specifically, there should be no reason to cancel only a subset. These jobs are transient and will be replaced once `coder prebuilds resume` is executed.
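To make that lifecycle concrete, here is a sketch of the pause, cancel, resume sequence being described; `coder prebuilds pause` is assumed here as the counterpart of the `coder prebuilds resume` command mentioned above:

```bash
# Assumed counterpart of `coder prebuilds resume`: stop the reconciliation
# loop so cancelled prebuild jobs aren't immediately rescheduled.
coder prebuilds pause

# Cancel all pending prebuild jobs (the pipeline quoted earlier in this thread).
coder provisioner jobs list --status=pending --initiator=prebuilds | jq -r '.[].id' | xargs -n1 -P2 -I{} coder provisioner jobs cancel {}

# Resume reconciliation; replacement prebuild jobs will be scheduled again.
coder prebuilds resume
```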
> I would love to do this, but it's beyond scope for this issue. We can definitely implement it and then come back to revise the documentation.

Yes, this is out of scope for this PR, but would definitely be a nice-to-have 👍

> The command before this one in the documentation shows the list. In the context of prebuilds specifically, there should be no reason to cancel only a subset. These jobs are transient and will be replaced once `coder prebuilds resume` is executed.

A possible reason I see for cancelling a subset is if you want to cancel in-progress prebuilds only from a specific template and version... Another idea would be to give the list command `--template` and `--version` flags. But this is also out of scope for this PR; we can update this documentation if we change the `jobs list` and `jobs cancel` commands 👍
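Until such flags exist, a subset cancellation could in principle be approximated with a `jq` filter. This is a hypothetical sketch: the `template_name` field and the JSON output shape are assumptions, not confirmed CLI behavior, so inspect the actual output first:

```bash
# Hypothetical: "template_name" is an assumed field name; check the real
# JSON output of `coder provisioner jobs list` before relying on it.
coder provisioner jobs list --status=pending --initiator=prebuilds \
  | jq -r '.[] | select(.template_name == "my-template") | .id' \
  | xargs -n1 -I{} coder provisioner jobs cancel {}
```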
If you include too many fields, Terraform might ignore changes that wouldn't otherwise cause drift.

- Learn more about `ignore_changes` in the [Terraform documentation](https://developer.hashicorp.com/terraform/language/meta-arguments/lifecycle#ignore_changes).
+ Learn more about `ignore_changes` in the [Terraform documentation](https://developer.hashicorp.com/terraform/language/v1.13.x/meta-arguments#lifecycle).
🚫 [linkspector] reported by reviewdog 🐶
Cannot reach https://developer.hashicorp.com/terraform/language/v1.13.x/meta-arguments#lifecycle Status: 429
Why was this link changed? 🤔 The previous one was version-agnostic, which might be preferable so we don’t have to update it when Terraform releases new versions.
The previous link pointed to a heading that doesn't exist.
Terraform has also, over the last few versions, changed the content of the page such that the link wouldn't have made sense either way. I pinned it to a specific version so that we have confidence that what we link to doesn't change without us realizing it.
The trade-off is between currency and accuracy. If you'd prefer to keep the old link, I can do that.
Co-authored-by: Susana Ferreira <susana@coder.com>
LGTM 👍
This PR adds troubleshooting steps to guide Coder operators when they suspect that prebuilds might have overwhelmed their deployments.
Closes #19490