Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Jobs on preempted VMs hang indefinitely until manually cancelled #407

Open
rluckom-guardian opened this issue Jul 5, 2023 · 2 comments
Open
Labels
bug Something isn't working

Comments

@rluckom-guardian
Copy link

Jenkins and plugins versions report

Environment
Paste the output here

What Operating System are you using (both controller, and any agents involved in the problem)?

Ubuntu

Reproduction steps

  1. Create a pipeline where steps are run on preemptible instances
  2. run the pipeline until an instance gets preempted while in the middle of a step

Expected Results

When a node is preempted, jobs / steps on the node should auto-cancel and be restarted elsewhere.

Actual Results

  1. the node is marked as "offline"
  2. the console output shows the preemption handler getting the preemption event;
  3. The step is still treated as "in progress"--it shows with the progress display in the sidebar, the build time clock keeps ticking, etc, despite there being no hope of ever completing

Anything else?

No response

@rluckom-guardian rluckom-guardian added the bug Something isn't working label Jul 5, 2023
@Fabiosilvero
Copy link

Fabiosilvero commented Oct 23, 2023

I also have the same issue with the following implementation :

  1. Jenkins controller
  • OS : Debian 10
  • Java 11
  • Jenkins 2.375.4 (official LTS Docker image + some plugins)
  • google-compute-engine:4.3.16
  • google-metadata-plugin:0.4
  • google-oauth-plugin:1.0.11
  1. Jenkins image template agent on GCE
  • OS : Ubuntu 22.04 LTS
  • Java 11
  • Remoting : managed with SSH (but it is 3077.vd69cf116da_6f)

Jenkinsfile used :

pipeline {
    agent {
        label 'preemptible'
    }
    options {
        timestamps()
    }
    stages {
        stage('Test 1') {
            steps {
                echo 'Hello World'
                sh 'hostname'
                sh '''#!/bin/bash
                    COUNT=0
                    while [ $COUNT -le 100 ] 
                    do
                        echo "Test # $COUNT"
                        sleep 1
                        COUNT=$(( $COUNT + 1 ))
                    done
                '''
            }
        }
        stage('Test 2') {
            steps {
                echo 'Hello Buddy'
                sh 'uname -a'
                sleep time: 5, unit: 'MINUTES'
            }
        }        
    }
}

I simulated a preemption with the following command in the middle of the bash count loop :

gcloud compute instances simulate-maintenance-event --zone europe-west1-b --project my_project the_agent_gce_instance_name

Agent log and build log attached.

Steps to reproduce :

  • Create a pipeline job with the pipeline declarative code above
  • When the job is in the middle of a step (like the bash script), simulate a maintenance event with the above command.
  • The agent is getting the preemption event (as seen in the agent log) but the build is never failed or aborted and can last for days without a manual abort.
  • After the manual abort, the job is relaunched on another agent.

build.log
jenkins_agent.log

@Fabiosilvero
Copy link

Fabiosilvero commented Dec 26, 2023

Hello, any updates on this ? This is quite annoying because jobs hangs and are not restarted elsewhere. Do you need additional logs or troubleshooting ?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants