
Pods are still running even after receiving SIGTERM Terminating subprocesses #39096

Open
1 of 2 tasks
paramjeet01 opened this issue Apr 17, 2024 · 5 comments · May be fixed by #39702
Labels
area:providers, kind:bug, pending-response, provider:cncf-kubernetes

Comments

@paramjeet01

Apache Airflow version

Other Airflow 2 version (please specify below)

If "Other Airflow 2 version" selected, which one?

2.8.3

What happened?

Airflow terminated the task and scheduled it for a retry. However, during the retry attempt, an error occurred indicating that a pod with identical labels still existed. Upon inspection, I found the pods from the initial attempt were still active.

First attempt error log:

[2024-04-18, 01:30:40 IST] {local_task_job_runner.py:121} ERROR - Received SIGTERM. Terminating subprocesses
[2024-04-18, 01:30:40 IST] {process_utils.py:131} INFO - Sending 15 to group 125. PIDs of all processes in the group: [125]
[2024-04-18, 01:30:40 IST] {process_utils.py:86} INFO - Sending the signal 15 to group 125
[2024-04-18, 01:30:40 IST] {taskinstance.py:2483} ERROR - Received SIGTERM. Terminating subprocesses.
[2024-04-18, 01:30:40 IST] {taskinstance.py:2731} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/opt/airflow/plugins/operators/kubernetes_pod_operator.py", line 157, in execute
    self.pod_manager.fetch_requested_container_logs(
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/utils/pod_manager.py", line 580, in fetch_requested_container_logs
    status = self.fetch_container_logs(pod=pod, container_name=c, follow=follow_logs)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/utils/pod_manager.py", line 503, in fetch_container_logs
    last_log_time, exc = consume_logs(since_time=last_log_time)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/utils/pod_manager.py", line 450, in consume_logs
    for raw_line in logs:
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/utils/pod_manager.py", line 240, in __iter__
    for data_chunk in self.response.stream(amt=None, decode_content=True):
  File "/home/airflow/.local/lib/python3.10/site-packages/urllib3/response.py", line 933, in stream
    yield from self.read_chunked(amt, decode_content=decode_content)
  File "/home/airflow/.local/lib/python3.10/site-packages/urllib3/response.py", line 1073, in read_chunked
    self._update_chunk_length()
  File "/home/airflow/.local/lib/python3.10/site-packages/urllib3/response.py", line 1001, in _update_chunk_length
    line = self._fp.fp.readline()  # type: ignore[union-attr]
  File "/usr/local/lib/python3.10/socket.py", line 705, in readinto
    return self._sock.recv_into(b)
  File "/usr/local/lib/python3.10/ssl.py", line 1307, in recv_into
    return self.read(nbytes, buffer)
  File "/usr/local/lib/python3.10/ssl.py", line 1163, in read
    return self._sslobj.read(len, buffer)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/models/taskinstance.py", line 2485, in signal_handler
    raise AirflowException("Task received SIGTERM signal")
airflow.exceptions.AirflowException: Task received SIGTERM signal
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/models/taskinstance.py", line 439, in _execute_task
    result = _execute_callable(context=context, **execute_callable_kwargs)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/models/taskinstance.py", line 414, in _execute_callable
    return execute_callable(context=context, **execute_callable_kwargs)
  File "/opt/airflow/plugins/operators/kubernetes_pod_operator.py", line 184, in execute
    raise AirflowException(f'Pod Launching failed: {ex}')
airflow.exceptions.AirflowException: Pod Launching failed: Task received SIGTERM signal
[2024-04-18, 01:30:40 IST] {taskinstance.py:527} DEBUG - Task Duration set to 10.25225
[2024-04-18, 01:30:40 IST] {taskinstance.py:549} DEBUG - Clearing next_method and next_kwargs.
[2024-04-18, 01:30:40 IST] {taskinstance.py:1149} INFO - Marking task as UP_FOR_RETRY. 

Second attempt error log:

[2024-04-18, 01:32:20 IST] {pod.py:1109} ERROR - 'NoneType' object has no attribute 'metadata'
Traceback (most recent call last):
  File "/opt/airflow/plugins/operators/kubernetes_pod_operator.py", line 153, in execute
    self.remote_pod = self.find_pod(self.pod.metadata.namespace, context=context)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/operators/pod.py", line 523, in find_pod
    raise AirflowException(f"More than one pod running with labels {label_selector}")
airflow.exceptions.AirflowException: More than one pod running with labels {**** our labels *****}
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/opt/airflow/plugins/operators/kubernetes_pod_operator.py", line 184, in execute
    raise AirflowException(f'Pod Launching failed: {ex}')
airflow.exceptions.AirflowException: Pod Launching failed: More than one pod running with labels {**** our labels *****}
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/operators/pod.py", line 937, in patch_already_checked
    name=pod.metadata.name,
AttributeError: 'NoneType' object has no attribute 'metadata'
[2024-04-18, 01:32:20 IST] {taskinstance.py:2731} ERROR - Task failed with exception
Traceback (most recent call last):
  File "/opt/airflow/plugins/operators/kubernetes_pod_operator.py", line 153, in execute
    self.remote_pod = self.find_pod(self.pod.metadata.namespace, context=context)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/operators/pod.py", line 523, in find_pod
    raise AirflowException(f"More than one pod running with labels {label_selector}")
airflow.exceptions.AirflowException: More than one pod running with labels {**** our labels *****}
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/opt/airflow/plugins/operators/kubernetes_pod_operator.py", line 184, in execute
    raise AirflowException(f'Pod Launching failed: {ex}')
airflow.exceptions.AirflowException: Pod Launching failed: {**** our labels *****}
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/models/taskinstance.py", line 439, in _execute_task
    result = _execute_callable(context=context, **execute_callable_kwargs)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/models/taskinstance.py", line 414, in _execute_callable
    return execute_callable(context=context, **execute_callable_kwargs)
  File "/opt/airflow/plugins/operators/kubernetes_pod_operator.py", line 186, in execute
    self.cleanup(
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/operators/pod.py", line 839, in cleanup
    raise AirflowException(
airflow.exceptions.AirflowException: Pod { *** our pod name ***} returned a failure.
remote_pod: None
[2024-04-18, 01:32:20 IST] {taskinstance.py:527} DEBUG - Task Duration set to 1.162314
[2024-04-18, 01:32:20 IST] {taskinstance.py:549} DEBUG - Clearing next_method and next_kwargs.
[2024-04-18, 01:32:20 IST] {taskinstance.py:1149} INFO - Marking task as UP_FOR_RETRY. 
[2024-04-18, 01:32:20 IST] {plugins.py:178} INFO - Getting Refined Message
[2024-04-18, 01:32:20 IST] {plugins.py:180} INFO - Message Payload Not Provided
[2024-04-18, 01:32:20 IST] {logging_mixin.py:188} INFO - {'Airflow Exception': 'Pod Launching failed'}
[2024-04-18, 01:32:20 IST] {plugins.py:183} INFO - Task message: {'Airflow Exception': 'Pod Launching failed'}

What you think should happen instead?

Once SIGTERM is issued to the task and its subprocesses are terminated, the operator should properly delete the pod.
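
For illustration only, a minimal sketch (not the provider's actual code; the helper name and in-cluster setup are assumptions) of what best-effort pod deletion on SIGTERM could look like with the official Kubernetes Python client:

```python
# Hypothetical sketch, not KubernetesPodOperator internals: best-effort
# deletion that a SIGTERM/on_kill handler could perform so the retry
# does not find the leftover pod.
from kubernetes import client, config
from kubernetes.client.rest import ApiException


def delete_pod_best_effort(pod_name: str, namespace: str) -> None:
    config.load_incluster_config()  # assumes the worker runs in-cluster
    v1 = client.CoreV1Api()
    try:
        v1.delete_namespaced_pod(
            name=pod_name,
            namespace=namespace,
            body=client.V1DeleteOptions(grace_period_seconds=0),
        )
    except ApiException as exc:
        if exc.status != 404:  # a pod that is already gone is fine
            raise
```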

How to reproduce

Let Airflow kill your task with SIGTERM; on the next retry you'll hit the "More than one pod running with labels" error because the pod from the first attempt still exists.
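
A minimal reproduction sketch, assuming a stock KubernetesPodOperator setup (the DAG id, namespace, and image below are placeholders, not from this report): run a long-lived pod task, send SIGTERM to the local task process mid-run (e.g. by restarting the scheduler), then watch the retry collide with the leftover pod:

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.pod import KubernetesPodOperator

with DAG(
    dag_id="sigterm_orphan_pod_repro",  # placeholder name
    start_date=datetime(2024, 1, 1),
    schedule=None,
    catchup=False,
):
    KubernetesPodOperator(
        task_id="long_task",
        name="long-task",
        namespace="default",
        image="busybox",
        cmds=["sh", "-c", "sleep 600"],  # long enough to kill mid-run
        retries=2,
        on_finish_action="delete_pod",
    )
```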

Operating System

Amazon Linux 2

Versions of Apache Airflow Providers

pytest>=6.2.5
docker>=5.0.0
crypto>=1.4.1
cryptography>=3.4.7
pyOpenSSL>=20.0.1
ndg-httpsclient>=0.5.1
boto3>=1.34.0
sqlalchemy
redis>=3.5.3
requests>=2.26.0
pysftp>=0.2.9
werkzeug>=1.0.1
apache-airflow-providers-cncf-kubernetes==8.0.0
apache-airflow-providers-amazon>=8.13.0
psycopg2>=2.8.5
grpcio>=1.37.1
grpcio-tools>=1.37.1
protobuf>=3.15.8,<=3.21
python-dateutil>=2.8.2
jira>=3.1.1
confluent_kafka>=1.7.0
pyarrow>=10.0.1,<10.1.0

Deployment

Official Apache Airflow Helm Chart

Deployment details

Official helm chart deployment

Anything else?

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!

@paramjeet01 added the area:core, kind:bug, and needs-triage labels Apr 17, 2024
@RNHTTR removed the needs-triage label Apr 18, 2024
@paramjeet01
Author

After debugging for a while, I also found that one of our schedulers failed its liveness probe and was restarted at the same time. So I'm guessing that the scheduler restart caused the SIGTERM kill, and the task wasn't adopted by any scheduler.

@paramjeet01
Author

Found these lines in the Airflow scheduler log:

[2024-05-03T12:09:06.476+0000] {process_utils.py:100} INFO - Sending the signal Signals.SIGTERM to process 75 as process group is missing.
[2024-05-03T12:09:05.937+0000] {process_utils.py:131} INFO - Sending Signals.SIGTERM to group 75. PIDs of all processes in the group: [42184, 42242, 42243, 42244, 42245, 42265, 75]

@paramjeet01
Author

I believe I have identified the cause of the issue:

We are using AWS EC2 Spot instances for the Airflow workloads. When a Spot instance is reclaimed, its pod stays in a Terminating state for around 2 minutes. During the retry, a new pod is scheduled, and the find_pod method retrieves pods by their labels, which results in the following error:

[2024-04-18, 01:32:20 IST] {pod.py:1109} ERROR - 'NoneType' object has no attribute 'metadata'
Traceback (most recent call last):
  File "/opt/airflow/plugins/operators/kubernetes_pod_operator.py", line 153, in execute
    self.remote_pod = self.find_pod(self.pod.metadata.namespace, context=context)
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/operators/pod.py", line 523, in find_pod
    raise AirflowException(f"More than one pod running with labels {label_selector}")
airflow.exceptions.AirflowException: More than one pod running with labels {**** our labels *****}

At this point, we have a pod in a terminating state and a new pod created by the second retry. When the cleanup method is called, it encounters another error because the find_pod method did not return anything due to the exception:

During handling of the above exception, another exception occurred:
Traceback (most recent call last):
  File "/home/airflow/.local/lib/python3.10/site-packages/airflow/providers/cncf/kubernetes/operators/pod.py", line 937, in patch_already_checked
    name=pod.metadata.name,
AttributeError: 'NoneType' object has no attribute 'metadata'

After every retry a new pod is created and never cleaned up, so the failure loops forever.
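
For illustration, a hedged sketch of one possible mitigation (not the provider's actual fix; find_single_running_pod is a hypothetical helper): ignore pods that are already terminating, i.e. have metadata.deletion_timestamp set, when matching by label, and make callers handle a None result:

```python
# Hypothetical helper, not the provider's find_pod: skip pods that are
# already terminating (deletion_timestamp set), e.g. a pod on a reclaimed
# Spot instance, so a retry does not count the dying pod.
from kubernetes import client


def find_single_running_pod(
    v1: client.CoreV1Api, namespace: str, label_selector: str
) -> client.V1Pod | None:
    pods = v1.list_namespaced_pod(namespace, label_selector=label_selector).items
    alive = [p for p in pods if p.metadata.deletion_timestamp is None]
    if len(alive) > 1:
        raise RuntimeError(f"More than one pod running with labels {label_selector}")
    return alive[0] if alive else None  # callers (e.g. cleanup) must handle None
```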

@paramjeet01
Copy link
Author

I hope this is solved in the latest version by this PR: https://github.com/apache/airflow/pull/37671/files
@dirrao, can you confirm whether the above PR solves this issue?

@dirrao
Collaborator

dirrao commented May 19, 2024

@paramjeet01
The Kubernetes executor and pod operator have gone through multiple improvements in the past. I would request that you try the latest versions and re-confirm the issue.
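
(For reference, and as a suggestion rather than part of the original comment: in a pip-based image the provider upgrade is typically `pip install --upgrade apache-airflow-providers-cncf-kubernetes`; for a Helm deployment, bump the provider pin in the image requirements and redeploy.)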
