Executor reports task instance (...) finished (failed) although the task says it's queued #39717
I'm not sure there's an Airflow issue here. My initial thought is that you are experiencing issues related to your workers, and perhaps they are falling over due to resource constraints, e.g. disk or RAM. I can see that you are using dynamic task mapping which, depending on what you are asking the workers to do, the number of parallel tasks, and the number of workers you have, could be overloading your capacity.
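(If overload is the suspicion, one thing to try is capping concurrency. A minimal airflow.cfg sketch using the stock option names; the values below are illustrative, not recommendations:
# airflow.cfg
[core]
# cap on task instances running at once within a single DAG (mapped tasks count individually)
max_active_tasks_per_dag = 16
[celery]
# task slots each Celery worker offers; lower this if workers run out of RAM
worker_concurrency = 8
)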
Not sure... it seems related to Redis? I have seen other people report similar issues:
Also, a lot of DAGs are failing with the same reason, so it's not tied to task mapping at all. Some tasks fail very early. Also, this server has a lot of RAM, of which I've granted ~12 GB to each worker, and the tasks are very simple, just HTTP requests; all of them run in less than 2 minutes when they don't fail.
I think the log you shared (source) erroneously replaced the "stuck in queued" log somehow. Can you check your scheduler logs for "stuck in queued"?
@RNHTTR there's nothing stating "stuck in queued" in the scheduler logs.
Same issue here.
I had the same issue when running hundreds of sensors in reschedule mode - a lot of the time they got stuck in the queued status and raised the same error that you posted. It turned out that the Redis pod used by Celery restarted quite often and lost the info about queued tasks. Adding persistence to Redis seems to have helped. Do you have persistence enabled?
Can you show me how to add this persistence?
Hi @nghilethanh-atherlabs, I've been experimenting with these configs as well:
# airflow.cfg
# https://airflow.apache.org/docs/apache-airflow-providers-celery/stable/configurations-ref.html#task-acks-late
# https://github.com/apache/airflow/issues/16163#issuecomment-1563704852
[celery]
task_acks_late = False
# https://github.com/apache/airflow/blob/2b6f8ffc69b5f34a1c4ab7463418b91becc61957/airflow/providers/celery/executors/default_celery.py#L52
# https://github.com/celery/celery/discussions/7276#discussioncomment-8720263
# https://github.com/celery/celery/issues/4627#issuecomment-396907957
[celery_broker_transport_options]
visibility_timeout = 300
max_retries = 120
interval_start = 0
interval_step = 0.2
interval_max = 0.5
# sentinel_kwargs = {}
For Redis persistence, you can refer to their config file to enable it. Not sure it will sort things out, but let's keep trying, folks.
# docker-compose.yml
redis:
  image: bitnami/redis:7.2.5
  container_name: redis
  environment:
    - REDIS_DISABLE_COMMANDS=CONFIG
    # The password will come from the config file, but we need to bypass the validation
    - ALLOW_EMPTY_PASSWORD=yes
  ports:
    - 6379:6379
  # command: /opt/bitnami/scripts/redis/run.sh --maxmemory 2gb
  command: /opt/bitnami/scripts/redis/run.sh
  volumes:
    - ./redis/redis.conf:/opt/bitnami/redis/mounted-etc/redis.conf
    - redis:/bitnami/redis/data
  restart: always
  healthcheck:
    test:
      - CMD
      - redis-cli
      - ping
    interval: 5s
    timeout: 30s
    retries: 10
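(For the Redis persistence itself, a minimal redis.conf sketch using standard Redis directives; the data dir matches the volume mounted above, and the values are illustrative:
# redis/redis.conf
# append-only file: log every write so queued Celery tasks survive a restart
appendonly yes
appendfsync everysec
# RDB snapshot fallback: dump after 60s if at least 1000 keys changed
save 60 1000
# must match the mounted data volume
dir /bitnami/redis/data
)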
Seeing this issue on 2.9.1 as well, also only with sensors. We've found that the DAG is timing out while trying to fill the DagBag on the worker. Even with debug logs enabled I don't have a hint about where in the import it's hanging.
On the scheduler the DAG imports in less than a second. And not all the tasks from this DAG fail to import; many import just fine, at the same time, on the same Celery worker. Below is the same DAG file as above, importing fine:
One caveat/note: it looks like the 2nd run/retry of each sensor is what runs just fine. We've also confirmed this behavior was not present on Airflow 2.7.3 and only started occurring after upgrading to 2.9.1.
@andreyvital thank you so much for your response. I have set it up and it works really great :)
I was working on the issue with @seanmuth, and increasing the DAG parsing timeout solved the issue.
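(For anyone wanting to try the same fix, the parsing timeouts live in airflow.cfg. A sketch with illustrative values; the stock defaults are 30.0 and 50 seconds respectively:
# airflow.cfg
[core]
# seconds a worker may spend importing one DAG file into the DagBag
dagbag_import_timeout = 120.0
# seconds the DAG file processor may spend on a single file
dag_file_processor_timeout = 180
)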
Hello everyone, I'm currently investigating this issue, but I haven't been able to replicate it yet. Could you please try the setting shown at airflow/airflow/providers/celery/executors/celery_executor_utils.py, lines 187 to 188 (commit 2d53c10)?
Spotted the same problem with Airflow 2.9.1; it didn't occur earlier, so it's strictly related to this version. It happens randomly, on random task executions. Restarting the scheduler and triggerer helps, but that is only our temporary workaround.
We've released apache-airflow-providers-celery 3.7.2 with enhanced logging. Could you please update the provider version and check the debug log for any clues? Additionally, what I mentioned in #39717 (comment) might give us some clue as well. Thanks!
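(In case debug logging isn't already enabled, the standard knob is in airflow.cfg; a sketch, noting that DEBUG is verbose, so revert once you have the logs:
# airflow.cfg
[logging]
logging_level = DEBUG
)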
Apache Airflow version
2.9.1
If "Other Airflow 2 version" selected, which one?
No response
What happened?
What you think should happen instead?
No response
How to reproduce
I am not sure, unfortunately. But every day I see my tasks being killed randomly, without a good reason why they got killed/failed.
Operating System
Ubuntu 22.04.4 LTS
Versions of Apache Airflow Providers
Deployment
Docker-Compose
Deployment details
Anything else?
No response
Are you willing to submit PR?
Code of Conduct