Executor reports task instance (...) finished (failed) although the task says it's queued #39717

andreyvital · 2024-05-20T12:17:49Z

Apache Airflow version

2.9.1

If "Other Airflow 2 version" selected, which one?

No response

What happened?

[2024-05-20T12:03:24.184+0000] {task_context_logger.py:91} ERROR - Executor reports task instance
<TaskInstance: (...) scheduled__2024-05-20T11:00:00+00:00 map_index=15 [queued]> 
finished (failed) although the task says it's queued. (Info: None) Was the task killed externally?

What you think should happen instead?

No response

How to reproduce

I am not sure, unfortunately. But every day I see my tasks being killed seemingly at random, with no clear indication of why they were killed or failed.

Operating System

Ubuntu 22.04.4 LTS

Versions of Apache Airflow Providers

apache-airflow==2.9.1
apache-airflow-providers-amazon==8.20.0
apache-airflow-providers-celery==3.6.2
apache-airflow-providers-cncf-kubernetes==8.1.1
apache-airflow-providers-common-io==1.3.1
apache-airflow-providers-common-sql==1.12.0
apache-airflow-providers-docker==3.10.0
apache-airflow-providers-elasticsearch==5.3.4
apache-airflow-providers-fab==1.0.4
apache-airflow-providers-ftp==3.8.0
apache-airflow-providers-google==10.17.0
apache-airflow-providers-grpc==3.4.1
apache-airflow-providers-hashicorp==3.6.4
apache-airflow-providers-http==4.10.1
apache-airflow-providers-imap==3.5.0
apache-airflow-providers-microsoft-azure==10.0.0
apache-airflow-providers-mongo==4.0.0
apache-airflow-providers-mysql==5.5.4
apache-airflow-providers-odbc==4.5.0
apache-airflow-providers-openlineage==1.7.0
apache-airflow-providers-postgres==5.10.2
apache-airflow-providers-redis==3.6.1
apache-airflow-providers-sendgrid==3.4.0
apache-airflow-providers-sftp==4.9.1
apache-airflow-providers-slack==8.6.2
apache-airflow-providers-smtp==1.6.1
apache-airflow-providers-snowflake==5.4.0
apache-airflow-providers-sqlite==3.7.1
apache-airflow-providers-ssh==3.10.1

Deployment

Docker-Compose

Deployment details

Client: Docker Engine - Community
 Version:    26.1.3
 Context:    default
 Debug Mode: false
 Plugins:
  buildx: Docker Buildx (Docker Inc.)
    Version:  v0.14.0
    Path:     /usr/libexec/docker/cli-plugins/docker-buildx
  compose: Docker Compose (Docker Inc.)
    Version:  v2.27.0
    Path:     /usr/libexec/docker/cli-plugins/docker-compose
  scan: Docker Scan (Docker Inc.)
    Version:  v0.23.0
    Path:     /usr/libexec/docker/cli-plugins/docker-scan

Server:
 Containers: 30
  Running: 25
  Paused: 0
  Stopped: 5
 Images: 36
 Server Version: 26.1.3
 Storage Driver: overlay2
  Backing Filesystem: btrfs
  Supports d_type: true
  Using metacopy: false
  Native Overlay Diff: true
  userxattr: false
 Logging Driver: json-file
 Cgroup Driver: systemd
 Cgroup Version: 2
 Plugins:
  Volume: local
  Network: bridge host ipvlan macvlan null overlay
  Log: awslogs fluentd gcplogs gelf journald json-file local splunk syslog
 Swarm: inactive
 Runtimes: io.containerd.runc.v2 runc
 Default Runtime: runc
 Init Binary: docker-init
 containerd version: e377cd56a71523140ca6ae87e30244719194a521
 runc version: v1.1.12-0-g51d5e94
 init version: de40ad0
 Security Options:
  apparmor
  seccomp
   Profile: builtin
  cgroupns
 Kernel Version: 5.15.0-107-generic
 Operating System: Ubuntu 22.04.4 LTS
 OSType: linux
 Architecture: x86_64
 CPUs: 80
 Total Memory: 62.33GiB
 Name: troy
 ID: UFMO:HODB:7MRE:7O2C:FLWN:HE4Y:EZDF:ZGNF:OZRW:BUTZ:DBQK:MFR2
 Docker Root Dir: /var/lib/docker
 Debug Mode: false
 Experimental: false
 Insecure Registries:
  127.0.0.0/8
 Live Restore Enabled: false

OS: Ubuntu 22.04.4 LTS x86_64
Kernel: 5.15.0-107-generic
Uptime: 1 day, 23 hours, 12 mins
Packages: 847 (dpkg), 4 (snap)
Shell: fish 3.7.1
Resolution: 1024x768
Terminal: /dev/pts/0
CPU: Intel Xeon Silver 4316 (80) @ 3.400GHz
GPU: 03:00.0 Matrox Electronics Systems Ltd. Integrated
Memory: 24497MiB / 63830MiB

Anything else?

No response

Are you willing to submit PR?

Yes I am willing to submit a PR!

Code of Conduct

I agree to follow this project's Code of Conduct

nathadfield · 2024-05-21T13:34:12Z

I'm not sure there's an Airflow issue here.

My initial thought is that you are experiencing issues related to your workers; perhaps they are falling over due to resource constraints such as disk or RAM?

I can see that you are using dynamic task mapping which, depending on what you are asking the workers to do, how many parallel tasks and the number of workers you have, could be overloading your capacity.

andreyvital · 2024-05-21T15:56:18Z

Not sure... it seems related to Redis? I have seen other people report similar issues.

Also, a lot of DAGs are failing for the same reason, so it isn't tied to task mapping specifically. Some tasks fail very early. This server also has a lot of RAM, of which I've granted ~12 GB to each worker, and the tasks are very simple, just HTTP requests; all of them run in less than 2 minutes when they don't fail.

RNHTTR · 2024-05-21T19:43:54Z

I think the log you shared (source) erroneously replaced the "stuck in queued" log somehow. Can you check your scheduler logs for "stuck in queued"?
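For anyone who wants to run the same check on their own scheduler logs, here is a minimal sketch that counts lines containing either message. The two substrings are taken from the text quoted in this thread; the function name and dictionary keys are hypothetical, and the exact wording of the "stuck in queued" message may differ between Airflow versions.

```python
from collections import Counter
from typing import Iterable

# Substrings taken from the messages quoted in this thread (matched case-insensitively).
PATTERNS = {
    "executor_mismatch": "executor reports task instance",
    "stuck_in_queued": "stuck in queued",
}


def count_messages(lines: Iterable[str]) -> Counter:
    """Count how many log lines contain each substring of interest."""
    counts = Counter({name: 0 for name in PATTERNS})
    for line in lines:
        lowered = line.lower()
        for name, needle in PATTERNS.items():
            if needle in lowered:
                counts[name] += 1
    return counts
```

An open file object can be passed directly, for example `count_messages(open("scheduler.log", errors="replace"))`, where the file name is a placeholder for wherever your scheduler log lives.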

andreyvital · 2024-05-21T23:29:15Z

@RNHTTR there's nothing stating "stuck in queued" on scheduler logs.

nghilethanh-atherlabs · 2024-05-27T04:10:20Z

same issue here

mikolololoay · 2024-05-27T11:22:51Z

I had the same issue when running hundreds of sensors in reschedule mode: a lot of the time they got stuck in the queued state and raised the same error that you posted. It turned out that the Redis pod used by Celery restarted quite often and lost the information about queued tasks. Adding persistence to Redis seems to have helped. Do you have persistence enabled?

nghilethanh-atherlabs · 2024-05-27T11:25:12Z

Can you help me add this persistence?

andreyvital · 2024-05-27T13:53:24Z

Hi @nghilethanh-atherlabs I've been experimenting with those configs as well:

# airflow.cfg


[celery]
# https://airflow.apache.org/docs/apache-airflow-providers-celery/stable/configurations-ref.html#task-acks-late
# https://github.com/apache/airflow/issues/16163#issuecomment-1563704852
task_acks_late = False
# https://github.com/apache/airflow/blob/2b6f8ffc69b5f34a1c4ab7463418b91becc61957/airflow/providers/celery/executors/default_celery.py#L52
# https://github.com/celery/celery/discussions/7276#discussioncomment-8720263
# https://github.com/celery/celery/issues/4627#issuecomment-396907957
[celery_broker_transport_options]
visibility_timeout = 300
max_retries = 120
interval_start = 0
interval_step = 0.2
interval_max = 0.5
# sentinel_kwargs = {}
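
If you configure Airflow through environment variables instead of airflow.cfg (common with Docker Compose), the same options can be expressed using Airflow's `AIRFLOW__{SECTION}__{KEY}` naming. Below is a rough sketch that prints the equivalent variable names. The `CFG` string is a hypothetical copy of the settings above, and placing `task_acks_late` under a `[celery]` header is my assumption based on the linked configuration reference; `to_env_vars` is just an illustrative helper.

```python
import configparser

# Hypothetical copy of the settings above. The [celery] header is an assumption:
# configparser (and Airflow) need every option to live in a section.
CFG = """
[celery]
task_acks_late = False

[celery_broker_transport_options]
visibility_timeout = 300
max_retries = 120
interval_start = 0
interval_step = 0.2
interval_max = 0.5
"""


def to_env_vars(cfg_text: str) -> dict[str, str]:
    """Map each section/key pair to its AIRFLOW__SECTION__KEY environment variable."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    env = {}
    for section in parser.sections():
        for key, value in parser.items(section):
            env[f"AIRFLOW__{section.upper()}__{key.upper()}"] = value
    return env


for name, value in to_env_vars(CFG).items():
    print(f"{name}={value}")
```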

For Redis persistence, you can refer to the Redis config file below to enable it. I'm not sure it will sort things out, but let's keep trying, folks.

# redis.conf
bind 0.0.0.0

protected-mode no

requirepass REDACTED

maxmemory 6gb
# https://redis.io/docs/manual/eviction/
maxmemory-policy noeviction

port 6379

tcp-backlog 511

timeout 0

tcp-keepalive 300

daemonize no
supervised no

pidfile /var/run/redis.pid

loglevel notice

logfile ""

databases 16

always-show-logo no

save 900 1
save 300 10
save 60 10000

stop-writes-on-bgsave-error yes

rdbcompression yes
rdbchecksum yes

dbfilename dump.rdb

dir /bitnami/redis/data

appendonly no
appendfilename "appendonly.aof"
appendfsync everysec
# appendfsync no
no-appendfsync-on-rewrite no
auto-aof-rewrite-percentage 100
auto-aof-rewrite-min-size 64mb
aof-load-truncated yes
aof-use-rdb-preamble no
aof-rewrite-incremental-fsync yes

lua-time-limit 5000

slowlog-log-slower-than 10000
slowlog-max-len 128

latency-monitor-threshold 0
notify-keyspace-events ""

hash-max-ziplist-entries 512
hash-max-ziplist-value 64

list-max-ziplist-size -2
list-compress-depth 0

set-max-intset-entries 512

zset-max-ziplist-entries 128
zset-max-ziplist-value 64

hll-sparse-max-bytes 3000

activerehashing yes

client-output-buffer-limit normal 0 0 0
client-output-buffer-limit slave 256mb 64mb 60
client-output-buffer-limit pubsub 32mb 8mb 60

hz 10

# docker-compose.yml
redis:
  image: bitnami/redis:7.2.5
  container_name: redis
  environment:
    - REDIS_DISABLE_COMMANDS=CONFIG
    # The password will come from the config file, but we need to bypass the validation
    - ALLOW_EMPTY_PASSWORD=yes
  ports:
    - 6379:6379
  # command: /opt/bitnami/scripts/redis/run.sh --maxmemory 2gb
  command: /opt/bitnami/scripts/redis/run.sh
  volumes:
    - ./redis/redis.conf:/opt/bitnami/redis/mounted-etc/redis.conf
    - redis:/bitnami/redis/data
  restart: always
  healthcheck:
    test:
      - CMD
      - redis-cli
      - ping
    interval: 5s
    timeout: 30s
    retries: 10
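
Since `CONFIG` is disabled in the compose file above, one way to double-check what a redis.conf asks for is to read the file itself. The following is a small sketch, not an official tool: `persistence_settings` is a hypothetical helper that only looks at explicit `save` and `appendonly` directives, so it does not account for the defaults Redis applies when a directive is absent.

```python
def persistence_settings(conf_text: str) -> dict:
    """Summarise the explicit RDB/AOF directives found in redis.conf-style text."""
    save_rules = []
    aof_enabled = False
    for raw in conf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        parts = line.split()
        directive, args = parts[0].lower(), parts[1:]
        if directive == "save":
            if args == ['""']:
                save_rules = []  # `save ""` turns RDB snapshots off
            else:
                # arguments come in <seconds> <changes> pairs
                for seconds, changes in zip(args[0::2], args[1::2]):
                    save_rules.append((int(seconds), int(changes)))
        elif directive == "appendonly" and args:
            aof_enabled = args[0].lower() == "yes"
    return {
        "rdb_save_rules": save_rules,
        "aof_enabled": aof_enabled,
        "persistent": bool(save_rules) or aof_enabled,
    }
```

With the config posted above (three `save` lines and `appendonly no`), this reports RDB snapshots as configured and AOF as off. Whether the snapshot actually survives a restart still depends on the data directory being on a mounted volume, as in the compose file.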

seanmuth · 2024-05-31T18:19:11Z

Seeing this issue on 2.9.1 as well, also only with sensors.

We've found that the DAG is timing out while trying to fill the DagBag on the worker. Even with debug logs enabled, I don't have a hint about where in the import it's hanging.

[2024-05-31 18:00:01,335: INFO/ForkPoolWorker-63] Filling up the DagBag from <redacted dag file path>
[2024-05-31 18:00:01,350: DEBUG/ForkPoolWorker-63] Importing <redacted dag file path>
[2024-05-31 18:00:31,415: ERROR/ForkPoolWorker-63] Process timed out, PID: 314

On the scheduler the DAG imports in less than a second.

Not all the tasks from this DAG fail to import; many import just fine, at the same time, on the same Celery worker. Below is the same DAG file as above, importing fine:

[2024-05-31 18:01:52,911: INFO/ForkPoolWorker-3] Filling up the DagBag from <redacted dag file path>
[2024-05-31 18:01:52,913: DEBUG/ForkPoolWorker-3] Importing <redacted dag file path>
[2024-05-31 18:01:54,232: WARNING/ForkPoolWorker-3] /usr/local/lib/python3.11/site-packages/airflow/models/baseoperator.py:484: RemovedInAirflow3Warning: The 'task_concurrency' parameter is deprecated. Please use 'max_active_tis_per_dag'.
  result = func(self, **kwargs, default_args=default_args)

[2024-05-31 18:01:54,272: DEBUG/ForkPoolWorker-3] Loaded DAG <DAG: redacted dag>

One caveat: it looks like the second run/retry of each sensor is the one that runs just fine.

We've also confirmed this behavior was not present on Airflow 2.7.3, and only started occurring since upgrading to 2.9.1.

nghilethanh-atherlabs · 2024-06-01T06:05:15Z

@andreyvital thank you so much for your response. I have set it up and it works really well :)

petervanko · 2024-06-01T20:48:54Z

I was working on this with @seanmuth, and increasing the DAG import timeout solved it for us.
It does not fix the root cause, but as a workaround it can save your night...

AIRFLOW__CORE__DAGBAG_IMPORT_TIMEOUT = 120
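
To get a feel for how close a DAG file is to the timeout, a plain import can be timed outside Airflow. This is only an approximation under my own assumptions: `time_import` is a hypothetical helper, it does not reproduce how the worker fills the DagBag, and the file still needs its dependencies (including Airflow) importable in the environment where you run it. The 30-second constant is Airflow's documented default for `dagbag_import_timeout`.

```python
import importlib.util
import time
from pathlib import Path

DEFAULT_DAGBAG_IMPORT_TIMEOUT = 30.0  # Airflow's documented default, in seconds


def time_import(path: str) -> float:
    """Import a Python file as a throwaway module and return the elapsed seconds."""
    spec = importlib.util.spec_from_file_location(f"_timed_{Path(path).stem}", path)
    module = importlib.util.module_from_spec(spec)
    start = time.perf_counter()
    spec.loader.exec_module(module)  # runs the file's top-level code
    return time.perf_counter() - start
```

Running this several times on the worker host, ideally while the worker is busy, may show whether the import time is stable or occasionally spikes.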

Lee-W · 2024-06-05T01:33:55Z

Hello everyone,

I'm currently investigating this issue, but I haven't been able to replicate it yet. Could you please try setting AIRFLOW__CORE__EXECUTE_TASKS_NEW_PYTHON_INTERPRETER=True [1] to see if we can generate more error logs? It seems that _execute_in_subprocess generates more error logs compared to _execute_in_fork, which might provide us with some additional clues.

airflow/airflow/providers/celery/executors/celery_executor_utils.py

Lines 187 to 188 in 2d53c10

        log.exception("[%s] execute_command encountered a CalledProcessError", celery_task_id)
        log.error(e.output)

[1] https://airflow.apache.org/docs/apache-airflow/stable/configurations-ref.html#execute-tasks-new-python-interpreter
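
As a rough illustration of why the subprocess path can surface more detail: when a child process is run with its output captured, a failure raises `CalledProcessError` and the captured output is available on the exception, which is what the two quoted lines log. The sketch below is not Airflow's implementation; `run_and_capture` is a hypothetical stand-in that uses only the standard library.

```python
import logging
import subprocess
import sys

log = logging.getLogger(__name__)


def run_and_capture(command: list[str]) -> str:
    """Run a command; on failure, log its combined stdout/stderr before re-raising."""
    try:
        return subprocess.check_output(command, stderr=subprocess.STDOUT, text=True)
    except subprocess.CalledProcessError as e:
        log.exception("command failed with exit code %s", e.returncode)
        log.error(e.output)  # everything the child printed before it exited
        raise


# A child that prints something and then exits non-zero.
failing = [sys.executable, "-c", "import sys; print('boom'); sys.exit(3)"]
try:
    run_and_capture(failing)
except subprocess.CalledProcessError as e:
    print("captured:", e.output.strip())
```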

niegowic · 2024-06-07T08:43:16Z

Spotted the same problem with Airflow 2.9.1. The problem didn't occur earlier, so it appears to be strictly related to this version. It happens randomly, on random task executions. Restarting the scheduler and triggerer helps, but this is only our temporary workaround.

Lee-W · 2024-06-11T09:45:03Z

We've released apache-airflow-providers-celery 3.7.2 with enhanced logging. Could you please update the provider version and check the debug log for any clues? Additionally, what I mentioned in #39717 (comment) might give us some clues as well. Thanks!

andreyvital added the area:core, kind:bug, and needs-triage labels May 20, 2024

nathadfield added the pending-response label May 21, 2024

RNHTTR removed the needs-triage label May 21, 2024

RNHTTR removed the pending-response label Jun 1, 2024

eladkal added the area:Scheduler and affected_version:2.9 labels Jun 8, 2024

eladkal assigned Lee-W Jun 8, 2024