Not recognizing worker in celery-exporter logs. #262

Open
mksmkzk opened this issue Aug 3, 2023 · 10 comments

mksmkzk commented Aug 3, 2023

Hello, I am building out a monitoring stack for a django project. I am using celery, redis, and flower. Everything is running in docker containers.

The worker is recognized through flower, but when I check the logs for the celery exporter, I get this message

2023-08-03 15:54:22 2023-08-03 22:54:22.160 | INFO | src.exporter:track_timed_out_workers:188 - Have not seen cc2ae99afb58 for 25201.57 seconds. Removing from metrics

The tasks are being sent and recorded correctly in Flower, but when I query the celery_task_received_total metric, none of them register.

I have events enabled and have the worker sending out a heartbeat every 10 seconds. celery-exporter connects to the broker. I am not sure what is going on or how to approach this anymore. Hoping someone could guide me in the right direction.

danihodovic (Owner) commented:

@adinhodovic

adinhodovic (Collaborator) commented Aug 6, 2023

The cleanup of metrics happens when workers time out; it helps to lower label cardinality, but it can be disabled. By default it happens after a timeout of 5 minutes. Do you not see any metrics in your Prometheus TSDB?

mksmkzk (Author) commented Aug 6, 2023

The timeout happens for me on startup and going forward. I have access to some metrics but not all of them. This is what I see from the metrics endpoint. Even the metrics that are being sent do not include the complete list of tasks that should be firing.

# HELP celery_task_sent_total Sent when a task message is published.
# TYPE celery_task_sent_total counter
celery_task_sent_total{hostname="76a4d3cbc22a",name="scheduler.tasks.send_unresponded_notifications",queue_name="celery"} 390.0
# HELP celery_task_sent_created Sent when a task message is published.
# TYPE celery_task_sent_created gauge
celery_task_sent_created{hostname="76a4d3cbc22a",name="scheduler.tasks.send_unresponded_notifications",queue_name="celery"} 1.691126270735308e+09
# HELP celery_task_received_total Sent when the worker receives a task.
# TYPE celery_task_received_total counter
# HELP celery_task_started_total Sent just before the worker executes the task.
# TYPE celery_task_started_total counter
# HELP celery_task_succeeded_total Sent if the task executed successfully.
# TYPE celery_task_succeeded_total counter
# HELP celery_task_failed_total Sent if the execution of the task failed.
# TYPE celery_task_failed_total counter
# HELP celery_task_rejected_total The task was rejected by the worker, possibly to be re-queued or moved to a dead letter queue.
# TYPE celery_task_rejected_total counter
# HELP celery_task_revoked_total Sent if the task has been revoked.
# TYPE celery_task_revoked_total counter
# HELP celery_task_retried_total Sent if the task failed, but will be retried in the future.
# TYPE celery_task_retried_total counter
# HELP celery_worker_up Indicates if a worker has recently sent a heartbeat.
# TYPE celery_worker_up gauge
celery_worker_up{hostname="6a2aba40f48b"} 1.0
# HELP celery_worker_tasks_active The number of tasks the worker is currently processing
# TYPE celery_worker_tasks_active gauge
celery_worker_tasks_active{hostname="6a2aba40f48b"} 0.0
# HELP celery_task_runtime Histogram of task runtime measurements.
# TYPE celery_task_runtime histogram
# HELP celery_queue_length The number of message in broker queue.
# TYPE celery_queue_length gauge
celery_queue_length{queue_name="celery"} 0.0
# HELP celery_active_consumer_count The number of active consumer in broker queue.
# TYPE celery_active_consumer_count gauge
celery_active_consumer_count{queue_name="celery"} 1.0
# HELP celery_active_worker_count The number of active workers in broker queue.
# TYPE celery_active_worker_count gauge
celery_active_worker_count{queue_name="celery"} 1.0
# HELP celery_active_process_count The number of active processes in broker queue.
# TYPE celery_active_process_count gauge
celery_active_process_count{queue_name="celery"} 4.0

adinhodovic (Collaborator) commented Aug 6, 2023

Are your workers picking up the tasks and executing them? The metrics above only indicate that the client has sent tasks to the queue (note celery_task_sent_total); none of the metrics that indicate work is being done are there, e.g. celery_task_received_total (the worker has received the task).

Have you enabled task events for the workers?

# In celeryconfig.py
worker_send_task_events = True
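
For completeness, a slightly fuller sketch of the event-related settings (standard Celery settings; task_send_sent_event only matters for celery_task_sent_total, which you already have, so the key one here is worker_send_task_events):

    # In celeryconfig.py (illustrative sketch)
    worker_send_task_events = True   # worker emits task-received/started/succeeded/... events
    task_send_sent_event = True      # publishing client emits task-sent events

    # Alternatively, worker events can be enabled on the command line with -E,
    # e.g. `celery -A proj worker -E` (proj is a placeholder for your app).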

mksmkzk (Author) commented Aug 6, 2023

Yeah, I have events enabled for the worker and the tasks are being executed. The tasks are being picked up correctly through Flower as well.

adinhodovic (Collaborator) commented:

Try running it with debug logging.

mksmkzk (Author) commented Aug 9, 2023

@adinhodovic

These are the logs. It registers the heartbeat but still times out the worker. I have the workers sending a heartbeat every 10 seconds.

2023-08-08 16:48:25 2023-08-08 23:48:25.805 | DEBUG | src.exporter:track_worker_heartbeat:316 - Received event='worker-heartbeat' for worker='7e7bf8bb1670'
2023-08-08 16:48:25 2023-08-08 23:48:25.806 | DEBUG | src.exporter:track_worker_heartbeat:324 - Updated gauge='celery_worker_tasks_active' value='0'
2023-08-08 16:48:25 2023-08-08 23:48:25.806 | DEBUG | src.exporter:track_worker_heartbeat:327 - Updated gauge='celery_worker_up' value='1'
2023-08-08 16:48:26 2023-08-08 23:48:26.368 | INFO | src.exporter:track_timed_out_workers:188 - Have not seen 7e7bf8bb1670 for 25200.56 seconds. Removing from metrics
2023-08-08 16:48:26 2023-08-08 23:48:26.368 | DEBUG | src.exporter:forget_worker:147 - Updated gauge='celery_worker_tasks_active' value='0'
2023-08-08 16:48:26 2023-08-08 23:48:26.368 | DEBUG | src.exporter:forget_worker:150 - Updated gauge='celery_worker_up' value='0'

The individual tasks also end up getting registered in the statistics, but the metrics aren't showing up correctly.

2023-08-08 16:48:47 2023-08-08 23:48:47.249 | DEBUG | src.exporter:track_task_event:257 - Received event='task-started' for task='scheduler.tasks.send_unresponded_notifications'
2023-08-08 16:48:47 2023-08-08 23:48:47.277 | DEBUG | src.exporter:track_task_event:257 - Received event='task-sent' for task='scheduler.tasks.send_notification'

I did notice that every time it times out the worker, the elapsed time is always 25000+ seconds. I'm curious if that might be the issue.

adinhodovic (Collaborator) commented Aug 9, 2023

> These are the logs. It registers the heartbeat but still times out the worker. [...] I did notice that every time it times out the worker, the elapsed time is always 25000+ seconds. I'm curious if that might be the issue.

It seems to be working fine; events are received and metrics seem to be generated. Why your worker times out, I'm not sure, but it's nothing to do with the exporter (the exporter just marks it as not up and cleans up its metrics). I'd still expect metrics in your Prometheus TSDB if you are scraping the exporter.

pedroits commented Oct 9, 2023

I run everything in my environment inside containers, and for me it turned out to be a timezone issue.
In my case all workers had been "not seen" for 14400 seconds, which is, not coincidentally, exactly my timezone offset from GMT.

src.exporter.track_timed_out_workers:

    def track_timed_out_workers(self):
        now = time.time()
        # Make a copy of the last seen dict so we can delete from the dict with no issues
        for hostname, worker_status in list(self.worker_last_seen.items()):
            since = now - worker_status["ts"]
            if since > self.worker_timeout_seconds and not worker_status["forgotten"]:
                logger.info(
                    f"Have not seen {hostname} for {since:0.2f} seconds. "
                    "Removing from metrics"
                )
                self.forget_worker(hostname)

            if self.purge_offline_worker_metrics_after_seconds > 0:
                if since > self.purge_offline_worker_metrics_after_seconds:
                    logger.info(
                        f"Have not seen {hostname} for {since:0.2f} seconds. "
                        "Purging worker metrics"
                    )
                    self.purge_worker_metrics(hostname)

So if the exporter's notion of the current time (time.time()) differs from the timestamp reported by Celery (worker_status["ts"]), for example because the containers use different timezones, you may have a problem.
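
A minimal sketch of that failure mode (hypothetical numbers; assuming the worker's event timestamp ends up skewed by the local UTC offset while the exporter compares it against its own time.time()):

    import time

    WORKER_TIMEOUT_SECONDS = 300   # hypothetical exporter timeout
    UTC_OFFSET = 4 * 3600          # e.g. GMT-4 for America/Detroit in summer

    now = time.time()
    worker_ts = now - UTC_OFFSET   # heartbeat timestamp skewed by the offset
    since = now - worker_ts        # ~14400 s, no matter how fresh the heartbeat is

    if since > WORKER_TIMEOUT_SECONDS:
        print(f"Have not seen worker for {since:0.2f} seconds")  # always fires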

I fixed it by setting the exporter's timezone environment variable to match mine (TZ=America/Detroit).

lbellomo commented Feb 8, 2024

Something similar happened to me.
The metrics did not show any tasks but did show some information about the workers:

$ curl localhost:9808/metrics
# HELP celery_task_sent_total Sent when a task message is published.
# TYPE celery_task_sent_total counter
# HELP celery_task_received_total Sent when the worker receives a task.
# TYPE celery_task_received_total counter
# HELP celery_task_started_total Sent just before the worker executes the task.
# TYPE celery_task_started_total counter
# HELP celery_task_succeeded_total Sent if the task executed successfully.
# TYPE celery_task_succeeded_total counter
# HELP celery_task_failed_total Sent if the execution of the task failed.
# TYPE celery_task_failed_total counter
# HELP celery_task_rejected_total The task was rejected by the worker, possibly to be re-queued or moved to a dead letter queue.
# TYPE celery_task_rejected_total counter
# HELP celery_task_revoked_total Sent if the task has been revoked.
# TYPE celery_task_revoked_total counter
# HELP celery_task_retried_total Sent if the task failed, but will be retried in the future.
# TYPE celery_task_retried_total counter
# HELP celery_worker_up Indicates if a worker has recently sent a heartbeat.
# TYPE celery_worker_up gauge
celery_worker_up{hostname="worker-host"} 1.0
# HELP celery_worker_tasks_active The number of tasks the worker is currently processing
# TYPE celery_worker_tasks_active gauge
celery_worker_tasks_active{hostname="worker-host"} 0.0
# HELP celery_task_runtime Histogram of task runtime measurements.
# TYPE celery_task_runtime histogram
# HELP celery_queue_length The number of message in broker queue.
# TYPE celery_queue_length gauge
celery_queue_length{queue_name="celery"} 0.0
celery_queue_length{queue_name="trigger"} 0.0
# HELP celery_active_consumer_count The number of active consumer in broker queue.
# TYPE celery_active_consumer_count gauge
# HELP celery_active_worker_count The number of active workers in broker queue.
# TYPE celery_active_worker_count gauge
celery_active_worker_count{queue_name="celery"} 0.0
celery_active_worker_count{queue_name="trigger"} 1.0
# HELP celery_active_process_count The number of active processes in broker queue.
# TYPE celery_active_process_count gauge
celery_active_process_count{queue_name="celery"} 0.0
celery_active_process_count{queue_name="trigger"} 12.0

And in the logs I saw:

2024-02-08 13:19:39.093 | INFO     | src.http_server:start_http_server:66 - Started celery-exporter at host='0.0.0.0' on port='9808'
2024-02-08 13:19:43.771 | INFO     | src.exporter:track_timed_out_workers:189 - Have not seen worker-host for 10801.08 seconds. Removing from metrics
2024-02-08 13:19:43.771 | INFO     | src.exporter:track_timed_out_workers:197 - Have not seen worker-host for 10801.08 seconds. Purging worker metrics

where 10800 seconds is a 3-hour difference (I'm in UTC−03:00). So it seems the metrics are being cleared simply because of this difference in timestamps, caused by the timezone mismatch.

Adding a TZ environment variable with the correct timezone fixed the problem.
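
If anyone else hits this, here is a quick sanity-check sketch you can run inside the exporter container: if the offset it prints matches the "Have not seen ... seconds" value in the logs, a TZ mismatch is the likely culprit.

    # check_tz.py - sanity-check sketch, run inside the exporter container
    import time

    print("offset from UTC (s):", time.timezone)  # seconds west of UTC, ignoring DST
    print("timezone names:", time.tzname)         # e.g. ('UTC', 'UTC') or ('EST', 'EDT')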
