Pods are still running even after receiving SIGTERM (Terminating subprocesses) #39096
After debugging for a while, I also found that one of our schedulers failed its liveness probe and restarted at the same time. So I'm guessing the scheduler restart caused the SIGTERM kill, and the task wasn't adopted by any scheduler.
Found this log in the airflow scheduler:
I believe I have identified the cause of the issue: we are using AWS Spot EC2 instances for the workloads in Airflow. When a spot instance is terminated, the pod stays in a terminating state for around two minutes. During the second retry the pod is rescheduled, and the find_pod method, which retrieves the pod based on its labels, results in the following error:
At this point we have a pod in a terminating state and a new pod created by the second retry. When the cleanup method is called, it encounters another error, because find_pod did not return anything due to the exception:
After every retry a new pod is created and never cleaned up, which loops forever.
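The failure mode described above can be illustrated with a small, self-contained simulation. This is not the provider's actual code: the dict-based pod store, the label set, and this `find_pod` are simplified stand-ins for the behaviour reported here (a terminating spot-instance pod and a fresh retry pod sharing the same labels):

```python
# Simplified simulation of the reported failure mode; NOT the actual
# apache-airflow-providers-cncf-kubernetes code. Pods are plain dicts and
# find_pod is a stand-in for the label-based lookup described above.

class PodReconciliationError(Exception):
    """Raised when more than one pod matches the task's labels."""

def find_pod(pods, labels):
    """Return the single pod matching `labels`, or None; raise if several match."""
    matches = [p for p in pods
               if all(p["labels"].get(k) == v for k, v in labels.items())]
    if len(matches) > 1:
        raise PodReconciliationError(f"More than one pod matched labels {labels}")
    return matches[0] if matches else None

def run_retry(pods, labels, attempt):
    """One retry attempt: look up the old pod, try to clean up, launch a new one."""
    try:
        pod = find_pod(pods, labels)
    except PodReconciliationError:
        pod = None          # lookup failed, so cleanup has nothing to act on
    if pod is None:
        pass                # cleanup is skipped -> the old pods are never deleted
    pods.append({"name": f"task-pod-attempt-{attempt}", "labels": dict(labels)})

labels = {"dag_id": "demo", "task_id": "work"}
pods = [{"name": "task-pod-attempt-1", "labels": dict(labels)},   # terminating spot pod
        {"name": "task-pod-attempt-2", "labels": dict(labels)}]   # pod from 2nd retry

for attempt in range(3, 6):
    run_retry(pods, labels, attempt)

print(len(pods))  # -> 5: the pod list grows on every retry; nothing is cleaned up
```

Once two pods share the labels, every subsequent lookup raises, cleanup is skipped, and the pod count only grows, matching the looping behaviour reported above.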
I hope this is solved in the latest version by this PR: https://github.com/apache/airflow/pull/37671/files
@paramjeet01
Apache Airflow version
Other Airflow 2 version (please specify below)
If "Other Airflow 2 version" selected, which one?
2.8.3
What happened?
Airflow terminated the task, scheduling it for a retry. However, during the subsequent retry attempt, an error occurred indicating that a pod with identical labels still existed. Upon inspection, I found the pods from the initial attempt were still active.
First attempt error log:
Second attempt error log:
What you think should happen instead?
Once SIGTERM (Terminating subprocesses) is issued to the task, Airflow should properly delete the pod.
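As a hedged sketch of the expected behaviour (not Airflow's implementation), a SIGTERM handler would delete the task's pod before the process exits, so the next retry never finds a leftover pod with the same labels. The `FakePodClient` and pod name below are illustrative assumptions:

```python
import signal

class FakePodClient:
    """Illustrative in-memory stand-in for a Kubernetes client; not the real API."""
    def __init__(self):
        self.pods = {"task-pod-attempt-1"}
    def delete_pod(self, name):
        self.pods.discard(name)

client = FakePodClient()
POD_NAME = "task-pod-attempt-1"  # hypothetical pod name for this sketch

def handle_sigterm(signum, frame):
    # Expected behaviour: delete the pod before the task process exits,
    # so the next retry never sees a leftover pod with the same labels.
    client.delete_pod(POD_NAME)

signal.signal(signal.SIGTERM, handle_sigterm)

# Simulate delivery of SIGTERM by invoking the handler directly.
handle_sigterm(signal.SIGTERM, None)
print(client.pods)  # set(): the pod is gone, so a retry can start cleanly
```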
How to reproduce
Let Airflow kill your task with SIGTERM; on the next retry you'll see a "pod already exists with the same labels" error.
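To mimic the trigger without a full deployment, one can send SIGTERM to a stand-in task process. The sleeping child below is only a placeholder for the Airflow task runner, not part of the actual reproduction environment:

```python
import signal
import subprocess
import sys

# Placeholder for the Airflow task process: a child that just sleeps.
child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])

# Deliver SIGTERM, as Airflow does when it decides to kill a task.
child.send_signal(signal.SIGTERM)
returncode = child.wait()
print(returncode)  # -15 on POSIX: the child was terminated by SIGTERM
```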
Operating System
Amazon Linux 2
Versions of Apache Airflow Providers
pytest>=6.2.5
docker>=5.0.0
crypto>=1.4.1
cryptography>=3.4.7
pyOpenSSL>=20.0.1
ndg-httpsclient>=0.5.1
boto3>=1.34.0
sqlalchemy
redis>=3.5.3
requests>=2.26.0
pysftp>=0.2.9
werkzeug>=1.0.1
apache-airflow-providers-cncf-kubernetes==8.0.0
apache-airflow-providers-amazon>=8.13.0
psycopg2>=2.8.5
grpcio>=1.37.1
grpcio-tools>=1.37.1
protobuf>=3.15.8,<=3.21
python-dateutil>=2.8.2
jira>=3.1.1
confluent_kafka>=1.7.0
pyarrow>=10.0.1,<10.1.0
Deployment
Official Apache Airflow Helm Chart
Deployment details
Official helm chart deployment
Anything else?
No response
Are you willing to submit PR?
Code of Conduct