
[batch] Remove blob storage read inside Worker job creation endpoint #14456

Open

daniel-goldstein opened this issue Apr 10, 2024 · 0 comments

What happened?

Below is a high-level overview of how the batch driver communicates scheduled jobs to worker nodes.

Scheduling loop on the driver:

  1. Select N ready jobs from the database to schedule on available workers
  2. Compute placement of a subset of the jobs in available slots in the worker pool
  3. Concurrently call /api/v1alpha/batches/jobs/create on available workers for each placed job. If/when the request completes successfully, the job is marked as scheduled.
  4. Once all requests complete, goto 1
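The loop above can be sketched in asyncio-style Python. This is an illustration only: the function names, the worker representation, and where the 2-second timeout lives are assumptions, not the driver's actual code.

```python
import asyncio

SCHEDULE_TIMEOUT = 2  # seconds: the per-request timeout described in this issue

async def post_job_create(worker, job):
    """Stand-in for the HTTP POST to /api/v1alpha/batches/jobs/create."""
    await asyncio.sleep(worker["latency"])

async def schedule_one(worker, job):
    # The timeout bounds how long one slow worker can stall this iteration.
    try:
        await asyncio.wait_for(post_job_create(worker, job), timeout=SCHEDULE_TIMEOUT)
        return (job, "Scheduled")
    except asyncio.TimeoutError:
        return (job, "Timed out")

async def scheduling_iteration(placements):
    # Requests are issued concurrently, but the loop still blocks until every
    # one completes or times out before the next iteration can start.
    return await asyncio.gather(*(schedule_one(w, j) for w, j in placements))

placements = [({"latency": 0.01}, "job-1"), ({"latency": 0.01}, "job-2")]
results = asyncio.run(scheduling_iteration(placements))
```

The `gather` at the end is the crux: a single slow or unreachable worker consumes its full timeout budget before the driver can select the next batch of ready jobs.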

On the worker, what happens inside /api/v1alpha/batches/jobs/create:

  1. Read metadata describing the job to schedule from the request body
  2. Using that information, load the full job spec from blob storage
  3. Spawn a task to run the job asynchronously
  4. Respond to the driver with a 200
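The current handler shape can be sketched as follows. `read_spec_from_blob_storage`, `run_job`, and the request-body layout are hypothetical stand-ins for illustration, not the worker's real functions.

```python
import asyncio

background_tasks = set()

async def read_spec_from_blob_storage(token):
    # Stand-in for the blob storage read whose latency sits on the
    # scheduling path in the current design.
    await asyncio.sleep(0.01)
    return {"job": token, "command": ["true"]}

async def run_job(spec):
    await asyncio.sleep(0)  # placeholder for actually running the job

async def create_job_handler(request_body):
    token = request_body["token"]                    # 1. metadata from the body
    spec = await read_spec_from_blob_storage(token)  # 2. blob read BEFORE the ack
    task = asyncio.create_task(run_job(spec))        # 3. run asynchronously
    background_tasks.add(task)
    task.add_done_callback(background_tasks.discard)
    return 200                                       # 4. ack the driver
```

Note that step 2 is awaited inside the handler, so every millisecond of blob storage latency is also a millisecond the driver spends blocked in its scheduling iteration.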

The key point for this issue is that the driver must wait for every request in an iteration to complete before it can start the next iteration of the scheduler. This leaves the scheduler vulnerable to misbehaving workers, or to workers that happen to be preempted mid-scheduling, so the driver sets a 2 second timeout on the call to /api/v1alpha/batches/jobs/create. This design also means that after a request timeout or transient error, Batch cannot guarantee that there is at most one concurrently running attempt for a given job. In practice this is a fine (and intentional) concession, because the idempotent design of preemptible jobs covers this scenario, but it is nonetheless wasted compute and wasted cost to users.

Nevertheless, we strive to minimize cases where we might halt the scheduling loop or double-schedule work, and one way to do that in the current design is to minimize the variance in the latency of /api/v1alpha/batches/jobs/create. The largest source of this latency is the request to blob storage. While GCS and ABS are relatively fast and highly available, Batch in Azure Terra must first obtain SAS tokens from the Terra control plane, which can introduce much higher and more variable latency. There have also been past occurrences of corrupted or deleted specs, which introduce unexpected failure modes: they should error the job, but instead they disrupt the scheduling loop.

Many of these problems would be mitigated by moving the object storage read out of the /api/v1alpha/batches/jobs/create endpoint. The endpoint should push this read into the asynchronous task that ultimately runs the job, and therefore return its acknowledgement to the driver faster. If the worker later encounters errors while reading the spec, those should error the job instead of raising a 500 in the scheduling request.
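The proposed shape can be sketched like so. Again this is a hedged illustration under assumed names (`fetch_spec_and_run_job`, the `job_states` dict), not Batch's real bookkeeping.

```python
import asyncio

job_states = {}
background_tasks = set()

async def read_spec_from_blob_storage(token):
    await asyncio.sleep(0.01)
    if token == "corrupted":
        raise ValueError("spec is corrupted or was deleted")
    return {"job": token}

async def fetch_spec_and_run_job(token):
    # The blob read now happens inside the background task,
    # off the scheduling path.
    try:
        spec = await read_spec_from_blob_storage(token)
    except Exception:
        # A bad spec errors the job instead of 500-ing the create request.
        job_states[token] = "Error"
        return
    job_states[token] = "Success"  # placeholder for actually running `spec`

async def create_job_handler(request_body):
    token = request_body["token"]
    task = asyncio.create_task(fetch_spec_and_run_job(token))
    background_tasks.add(task)
    task.add_done_callback(background_tasks.discard)
    return 200  # ack immediately; no blob storage latency in this handler
```

With this shape, the handler's latency is bounded by parsing the request body and spawning a task, and a corrupted or deleted spec surfaces as a job error rather than a failed scheduling request.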

Version

0.2.129

Relevant log output

No response

daniel-goldstein added the enhancement, batch, and needs-triage labels on Apr 10, 2024
daniel-goldstein removed the needs-triage label on Apr 18, 2024