
tilt ci does not recreate failed jobs #6283

Open · dnephin opened this issue Dec 9, 2023 · 1 comment
Labels: enhancement (New feature or request)

dnephin (Contributor) commented Dec 9, 2023

Expected Behavior

When tilt ci runs, if it finds a Job that already exists (and the pod-template-hash matches) but has never completed (all pods exited with an error), it should recreate the Job so it runs again.

I'm not sure if this is a bug or a feature request, because I can't find any docs that say it should work this way. Maybe I'm assuming something based on how tilt up behaves?

Current Behavior

tilt seems to attach to an arbitrary pod that has already terminated (it does not appear to be the most recent or the oldest pod). The output in the logs is:

Attaching to existing pod (db-init-cnbxf). Only new logs will be streamed.

Then tilt ci exits immediately with error: Error: Pod "db-init-cnbxf" failed.

Steps to Reproduce

  1. Given these files:

    script.sh

    #!/usr/bin/env sh 
    date
    echo some output
    exit 1

    Tiltfile

    load('ext://deployment', 'job_create')
    docker_build('db-init', '.', dockerfile_contents="""
    FROM busybox
    COPY script.sh script.sh
    ENTRYPOINT /script.sh
    """)
    job_create('db-init')
  2. Run tilt ci once, and the output from this job is printed with the date.

  3. Run tilt ci again any number of times; the pod never runs again (the Job controller will occasionally recreate the pod if the spec allows it). tilt ci says it's attaching to the terminated pod, then exits with the error above. (One way to confirm the stale state is sketched just below these steps.)
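Between runs, kubectl shows the Job stuck with zero completions while its pods stay terminated (a hypothetical check, not part of the repro above; the Job name db-init matches the pod name in the log, and the output shape is approximate):

$ kubectl get job db-init
NAME      COMPLETIONS   DURATION   AGE
db-init   0/1           5m         5m

$ kubectl get pods -l job-name=db-init
NAME            READY   STATUS   RESTARTS   AGE
db-init-cnbxf   0/1     Error    0          5m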

Context

About Your Use Case

We create environments for CI ahead of time using tilt ci. When one of those fails due to a flaky test or an infrastructure problem, we attempt to retry with tilt ci. We've noticed those retries usually don't work because of this behaviour.

@dnephin dnephin added the "bug: Something isn't working" label Dec 9, 2023
@dnephin dnephin changed the title from "tilt ci does not always re-run failed jobs when resuming" to "tilt ci does not recreate failed jobs" Dec 9, 2023
nicks (Member) commented Dec 11, 2023

oof, this is tricky. The short version is that this is currently working as designed.

re: "I can't find any docs that say it should work this way" - here's a good doc on tilt's execution model - https://docs.tilt.dev/controlloop. Basically, you can think of it as docker build && kubectl apply && kubectl wait. tilt ci mainly adds exit conditions.

The fundamental problem is that if you kubectl apply a Job, and the spec of the Job hasn't changed, then (from Kubernetes' perspective) there's no reason to re-run it. From the apiserver's perspective, the whole contract of apply is that if the spec of an object hasn't changed, the system should do nothing.

Tilt inherits this behavior -- if the Job hasn't changed, then the Job shouldn't be re-run.
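To make that concrete (same hypothetical job.yaml as above), re-applying an unchanged Job is a no-op; a Job's pod template is also immutable after creation, so there's no in-place edit that would trigger a re-run either:

kubectl apply -f job.yaml    # first time: Job created, controller starts a pod
kubectl apply -f job.yaml    # spec unchanged: apiserver reports "unchanged", nothing runs
kubectl delete job db-init   # the only way to get a fresh run...
kubectl apply -f job.yaml    # ...is to delete and recreate the Job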

There have been discussions of this over the years (e.g., kubernetes/kubernetes#77396), but lots of stuff relies on this behavior.

I guess the simple workaround right now is to add something like this to your Tiltfile:

if config.tilt_subcommand == 'ci':
  local('./clean-up-old-jobs.sh')
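For the repro above, clean-up-old-jobs.sh could be as small as this (a guess at what such a script might contain; db-init is the Job name from the example):

#!/usr/bin/env sh
# Delete the stale Job (if any) so the next apply creates a fresh one.
kubectl delete job db-init --ignore-not-found=true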

though i agree that's unsatisfying :(

@nicks nicks added the "enhancement: New feature or request" label and removed the "bug: Something isn't working" label Dec 11, 2023