Pods are still running even after receiving SIGTERM (Terminating subprocesses) #39096
After debugging for a while, I also found that one of our schedulers failed its liveness probe and restarted at the same time. So I'm guessing the scheduler restart caused the SIGTERM kill, and the task wasn't adopted by any scheduler.
Found this log in the airflow scheduler:
I believe I have identified the cause of the issue: we are using AWS Spot EC2 instances for the workloads in Airflow. When a spot instance is terminated, the pod stays in a terminating state for around two minutes. During the second retry the pod is rescheduled, and the find_pod method, which retrieves the pod based on its labels, results in the following error:
At this point we have a pod in a terminating state and a new pod created by the second retry. When the cleanup method is called, it encounters another error, because find_pod did not return anything due to the exception:
After every retry a new pod is created and never cleaned up, which loops forever.
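The failure mode described above can be illustrated with a small, self-contained simulation. This is not the provider's actual code: the dict-based pod store, the label set, and this `find_pod` are simplified stand-ins for the behaviour reported here (a terminating spot-instance pod and a fresh retry pod sharing the same labels):

```python
# Simplified simulation of the reported failure mode; NOT the actual
# apache-airflow-providers-cncf-kubernetes code. Pods are plain dicts and
# find_pod is a stand-in for the label-based lookup described above.

class PodReconciliationError(Exception):
    """Raised when more than one pod matches the task's labels."""

def find_pod(pods, labels):
    """Return the single pod matching `labels`, or None; raise if several match."""
    matches = [p for p in pods
               if all(p["labels"].get(k) == v for k, v in labels.items())]
    if len(matches) > 1:
        raise PodReconciliationError(f"More than one pod matched labels {labels}")
    return matches[0] if matches else None

def run_retry(pods, labels, attempt):
    """One retry attempt: look up the old pod, try to clean up, launch a new one."""
    try:
        pod = find_pod(pods, labels)
    except PodReconciliationError:
        pod = None          # lookup failed, so cleanup has nothing to act on
    if pod is None:
        pass                # cleanup is skipped -> the old pods are never deleted
    pods.append({"name": f"task-pod-attempt-{attempt}", "labels": dict(labels)})

labels = {"dag_id": "demo", "task_id": "work"}
pods = [{"name": "task-pod-attempt-1", "labels": dict(labels)},   # terminating spot pod
        {"name": "task-pod-attempt-2", "labels": dict(labels)}]   # pod from 2nd retry

for attempt in range(3, 6):
    run_retry(pods, labels, attempt)

print(len(pods))  # -> 5: the pod list grows on every retry; nothing is cleaned up
```

Once two pods share the labels, every subsequent lookup raises, cleanup is skipped, and the pod count only grows, matching the looping behaviour reported above.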
I hope this is solved in the latest version by this PR: https://github.com/apache/airflow/pull/37671/files
@paramjeet01
Apache Airflow version
Other Airflow 2 version (please specify below)
If "Other Airflow 2 version" selected, which one?
2.8.3
What happened?
Airflow terminated the task, scheduling it for a retry. However, during the subsequent retry attempt, an error occurred indicating that a pod with identical labels still existed. Upon inspection, I found the pods from the initial attempt were still active.
First attempt error log:
Second attempt error log:
What you think should happen instead?
Once SIGTERM (Terminating subprocesses) is issued to the task, Airflow should properly delete the pod.
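As a hedged sketch of the expected behaviour (not Airflow's implementation), a SIGTERM handler would delete the task's pod before the process exits, so the next retry never finds a leftover pod with the same labels. The `FakePodClient` and pod name below are illustrative assumptions:

```python
import signal

class FakePodClient:
    """Illustrative in-memory stand-in for a Kubernetes client; not the real API."""
    def __init__(self):
        self.pods = {"task-pod-attempt-1"}
    def delete_pod(self, name):
        self.pods.discard(name)

client = FakePodClient()
POD_NAME = "task-pod-attempt-1"  # hypothetical pod name for this sketch

def handle_sigterm(signum, frame):
    # Expected behaviour: delete the pod before the task process exits,
    # so the next retry never sees a leftover pod with the same labels.
    client.delete_pod(POD_NAME)

signal.signal(signal.SIGTERM, handle_sigterm)

# Simulate delivery of SIGTERM by invoking the handler directly.
handle_sigterm(signal.SIGTERM, None)
print(client.pods)  # set(): the pod is gone, so a retry can start cleanly
```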
How to reproduce
Let Airflow kill your task with SIGTERM; on the next retry you'll see a "pod already exists with the same labels" error.
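To mimic the trigger without a full deployment, one can send SIGTERM to a stand-in task process. The sleeping child below is only a placeholder for the Airflow task runner, not part of the actual reproduction environment:

```python
import signal
import subprocess
import sys

# Placeholder for the Airflow task process: a child that just sleeps.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])

# Deliver SIGTERM, as Airflow does when it decides to kill a task.
child.send_signal(signal.SIGTERM)
returncode = child.wait()
print(returncode)  # -15 on POSIX: the child was terminated by SIGTERM
```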
Operating System
Amazon Linux 2
Versions of Apache Airflow Providers
pytest>=6.2.5
docker>=5.0.0
crypto>=1.4.1
cryptography>=3.4.7
pyOpenSSL>=20.0.1
ndg-httpsclient>=0.5.1
boto3>=1.34.0
sqlalchemy
redis>=3.5.3
requests>=2.26.0
pysftp>=0.2.9
werkzeug>=1.0.1
apache-airflow-providers-cncf-kubernetes==8.0.0
apache-airflow-providers-amazon>=8.13.0
psycopg2>=2.8.5
grpcio>=1.37.1
grpcio-tools>=1.37.1
protobuf>=3.15.8,<=3.21
python-dateutil>=2.8.2
jira>=3.1.1
confluent_kafka>=1.7.0
pyarrow>=10.0.1,<10.1.0
Deployment
Official Apache Airflow Helm Chart
Deployment details
Official helm chart deployment
Anything else?
No response
Are you willing to submit PR?
Code of Conduct