
tilt ci does not recreate failed jobs #6283

Open · dnephin opened this issue Dec 9, 2023 · 1 comment
Labels: enhancement (New feature or request)

dnephin (Contributor) commented Dec 9, 2023

Expected Behavior

When tilt ci runs, if it finds a Job that already exists (and the pod-template-hash matches) but has never completed (all pods exited with an error), it should recreate the Job so it runs again.

I'm not sure if this is a bug or a feature request, because I can't find any docs that say it should work this way. Maybe I'm assuming something based on how tilt up behaves?

Current Behavior

tilt seems to attach to an arbitrary pod that has already terminated (it does not appear to be the most recent or the oldest pod). The output in the logs is:

Attaching to existing pod (db-init-cnbxf). Only new logs will be streamed.

Then tilt ci exits immediately with error: Error: Pod "db-init-cnbxf" failed.

Steps to Reproduce

  1. Given these files:

    script.sh

    #!/usr/bin/env sh 
    date
    echo some output
    exit 1

    Tiltfile

    load('ext://deployment', 'job_create')
    docker_build('db-init', '.', dockerfile_contents="""
    FROM busybox
    COPY script.sh script.sh
    ENTRYPOINT /script.sh
    """)
    job_create('db-init')
  2. Run tilt ci once, and the output from this job is printed with the date.

  3. Run tilt ci again any number of times; the pod never runs again (the Job controller will occasionally recreate the pod if the spec allows it). tilt ci says it's attaching to the terminated pod, then exits with the error above. (One way to confirm the stale state is sketched just below these steps.)
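Between runs, kubectl shows the Job stuck with zero completions while its pods stay terminated (a hypothetical check, not part of the repro above; the Job name db-init matches the pod name in the log, and the output shape is approximate):

$ kubectl get job db-init
NAME      COMPLETIONS   DURATION   AGE
db-init   0/1           5m         5m

$ kubectl get pods -l job-name=db-init
NAME            READY   STATUS   RESTARTS   AGE
db-init-cnbxf   0/1     Error    0          5m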

Context

About Your Use Case

We create environments for CI ahead of time using tilt ci. When one of those fails due to a flaky test or an infrastructure problem, we attempt to retry with tilt ci. We've noticed those retries usually don't work because of this behaviour.

@dnephin dnephin added the "bug: Something isn't working" label Dec 9, 2023
@dnephin dnephin changed the title from "tilt ci does not always re-run failed jobs when resuming" to "tilt ci does not recreate failed jobs" Dec 9, 2023
nicks (Member) commented Dec 11, 2023

oof, this is tricky. The short version is that this is currently working as designed.

re: "I can't find any docs that say it should work this way" - here's a good doc on tilt's execution model - https://docs.tilt.dev/controlloop. Basically, you can think of it as docker build && kubectl apply && kubectl wait. tilt ci mainly adds exit conditions.

The fundamental problem is that if you kubectl apply a Job, and the spec of the Job hasn't changed, then (from Kubernetes' perspective) there's no reason to re-run it. From the apiserver's perspective, the whole contract of apply is that if the spec of an object hasn't changed, the system should do nothing.

Tilt inherits this behavior -- if the Job hasn't changed, then the Job shouldn't be re-run.
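To make that concrete (same hypothetical job.yaml as above), re-applying an unchanged Job is a no-op; a Job's pod template is also immutable after creation, so there's no in-place edit that would trigger a re-run either:

kubectl apply -f job.yaml    # first time: Job created, controller starts a pod
kubectl apply -f job.yaml    # spec unchanged: apiserver reports "unchanged", nothing runs
kubectl delete job db-init   # the only way to get a fresh run...
kubectl apply -f job.yaml    # ...is to delete and recreate the Job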

There have been discussions of this over the years (e.g., kubernetes/kubernetes#77396), but lots of stuff relies on this behavior.

I guess the simple workaround right now is to add something like this to your Tiltfile:

if config.tilt_subcommand == 'ci':
  local('./clean-up-old-jobs.sh')
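For the repro above, clean-up-old-jobs.sh could be as small as this (a guess at what such a script might contain; db-init is the Job name from the example):

#!/usr/bin/env sh
# Delete the stale Job (if any) so the next apply creates a fresh one.
kubectl delete job db-init --ignore-not-found=true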

though i agree that's unsatisfying :(

@nicks nicks added the "enhancement: New feature or request" label and removed the "bug: Something isn't working" label Dec 11, 2023