Not recognizing worker in celery-exporter logs. #262

Open
mksmkzk opened this issue Aug 3, 2023 · 10 comments

mksmkzk commented Aug 3, 2023

Hello, I am building out a monitoring stack for a django project. I am using celery, redis, and flower. Everything is running in docker containers.

The worker is recognized through flower, but when I check the logs for the celery exporter, I get this message

2023-08-03 15:54:22 2023-08-03 22:54:22.160 | INFO | src.exporter:track_timed_out_workers:188 - Have not seen cc2ae99afb58 for 25201.57 seconds. Removing from metrics

The tasks are being sent and recorded correctly in Flower, but when I query the celery_task_received_total metric, none of them register.

I have events enabled and have the worker sending out a heartbeat every 10 seconds. celery-exporter connects to the broker. I am not sure what is going on or how to approach this anymore. Hoping someone could guide me in the right direction.

danihodovic (Owner) commented:

@adinhodovic

adinhodovic (Collaborator) commented Aug 6, 2023

The cleanup of metrics happens when workers time out; it helps to lower label cardinality, but it can be disabled. By default it happens after a timeout of 5 minutes. Do you not see any metrics in your Prometheus TSDB?

mksmkzk (Author) commented Aug 6, 2023

The timeout happens for me on startup and going forward. I have access to some metrics but not all of them. This is what I see from the metrics endpoint. Even the metrics that are being sent do not include the complete list of tasks that should be firing.

# HELP celery_task_sent_total Sent when a task message is published.
# TYPE celery_task_sent_total counter
celery_task_sent_total{hostname="76a4d3cbc22a",name="scheduler.tasks.send_unresponded_notifications",queue_name="celery"} 390.0
# HELP celery_task_sent_created Sent when a task message is published.
# TYPE celery_task_sent_created gauge
celery_task_sent_created{hostname="76a4d3cbc22a",name="scheduler.tasks.send_unresponded_notifications",queue_name="celery"} 1.691126270735308e+09
# HELP celery_task_received_total Sent when the worker receives a task.
# TYPE celery_task_received_total counter
# HELP celery_task_started_total Sent just before the worker executes the task.
# TYPE celery_task_started_total counter
# HELP celery_task_succeeded_total Sent if the task executed successfully.
# TYPE celery_task_succeeded_total counter
# HELP celery_task_failed_total Sent if the execution of the task failed.
# TYPE celery_task_failed_total counter
# HELP celery_task_rejected_total The task was rejected by the worker, possibly to be re-queued or moved to a dead letter queue.
# TYPE celery_task_rejected_total counter
# HELP celery_task_revoked_total Sent if the task has been revoked.
# TYPE celery_task_revoked_total counter
# HELP celery_task_retried_total Sent if the task failed, but will be retried in the future.
# TYPE celery_task_retried_total counter
# HELP celery_worker_up Indicates if a worker has recently sent a heartbeat.
# TYPE celery_worker_up gauge
celery_worker_up{hostname="6a2aba40f48b"} 1.0
# HELP celery_worker_tasks_active The number of tasks the worker is currently processing
# TYPE celery_worker_tasks_active gauge
celery_worker_tasks_active{hostname="6a2aba40f48b"} 0.0
# HELP celery_task_runtime Histogram of task runtime measurements.
# TYPE celery_task_runtime histogram
# HELP celery_queue_length The number of message in broker queue.
# TYPE celery_queue_length gauge
celery_queue_length{queue_name="celery"} 0.0
# HELP celery_active_consumer_count The number of active consumer in broker queue.
# TYPE celery_active_consumer_count gauge
celery_active_consumer_count{queue_name="celery"} 1.0
# HELP celery_active_worker_count The number of active workers in broker queue.
# TYPE celery_active_worker_count gauge
celery_active_worker_count{queue_name="celery"} 1.0
# HELP celery_active_process_count The number of active processes in broker queue.
# TYPE celery_active_process_count gauge
celery_active_process_count{queue_name="celery"} 4.0

adinhodovic (Collaborator) commented Aug 6, 2023

Are your workers picking up the tasks and executing them? The metrics above only indicate that the client has sent tasks to the queue (note celery_task_sent_total); none of the metrics that indicate work is being done are there, e.g. celery_task_received_total (the worker has received the task).

Have you enabled task events for the workers?

# In celeryconfig.py
worker_send_task_events = True
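
For completeness, a slightly fuller sketch of the event-related settings (standard Celery settings; task_send_sent_event only matters for celery_task_sent_total, which you already have, so the key one here is worker_send_task_events):

    # In celeryconfig.py (illustrative sketch)
    worker_send_task_events = True   # worker emits task-received/started/succeeded/... events
    task_send_sent_event = True      # publishing client emits task-sent events

    # Alternatively, worker events can be enabled on the command line with -E,
    # e.g. `celery -A proj worker -E` (proj is a placeholder for your app).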

mksmkzk (Author) commented Aug 6, 2023

Yeah, I have events enabled for the worker and the tasks are being executed. The tasks are being picked up correctly through Flower as well.

adinhodovic (Collaborator) commented:

Try running it with debug logging.

mksmkzk (Author) commented Aug 9, 2023

@adinhodovic

These are the logs. It registers the heartbeat but still times out the worker. I have the workers sending a heartbeat every 10 seconds.

2023-08-08 16:48:25 2023-08-08 23:48:25.805 | DEBUG | src.exporter:track_worker_heartbeat:316 - Received event='worker-heartbeat' for worker='7e7bf8bb1670'
2023-08-08 16:48:25 2023-08-08 23:48:25.806 | DEBUG | src.exporter:track_worker_heartbeat:324 - Updated gauge='celery_worker_tasks_active' value='0'
2023-08-08 16:48:25 2023-08-08 23:48:25.806 | DEBUG | src.exporter:track_worker_heartbeat:327 - Updated gauge='celery_worker_up' value='1'
2023-08-08 16:48:26 2023-08-08 23:48:26.368 | INFO | src.exporter:track_timed_out_workers:188 - Have not seen 7e7bf8bb1670 for 25200.56 seconds. Removing from metrics
2023-08-08 16:48:26 2023-08-08 23:48:26.368 | DEBUG | src.exporter:forget_worker:147 - Updated gauge='celery_worker_tasks_active' value='0'
2023-08-08 16:48:26 2023-08-08 23:48:26.368 | DEBUG | src.exporter:forget_worker:150 - Updated gauge='celery_worker_up' value='0'

The individual tasks also end up getting registered in the statistics, but the metrics aren't showing up correctly.

2023-08-08 16:48:47 2023-08-08 23:48:47.249 | DEBUG | src.exporter:track_task_event:257 - Received event='task-started' for task='scheduler.tasks.send_unresponded_notifications'
2023-08-08 16:48:47 2023-08-08 23:48:47.277 | DEBUG | src.exporter:track_task_event:257 - Received event='task-sent' for task='scheduler.tasks.send_notification'

I did notice that every time it times out the worker, the elapsed time is always 25000+ seconds. I'm curious if that might be the issue.

adinhodovic (Collaborator) commented Aug 9, 2023

> These are the logs. It registers the heartbeat but still times out the worker. [...] I did notice that every time it times out the worker, the elapsed time is always 25000+ seconds. I'm curious if that might be the issue.

It seems to be working fine; events are received and metrics seem to be generated. Why your worker times out, I'm not sure, but it's nothing to do with the exporter (the exporter just marks it as not up and cleans up its metrics). I'd still expect metrics in your Prometheus TSDB if you are scraping the exporter.

pedroits commented Oct 9, 2023

I run everything in my environment inside containers, and for me it turned out to be a timezone issue.
In my case all workers had been "not seen" for 14400 seconds, which is, not coincidentally, exactly my timezone offset from GMT.

src.exporter.track_timed_out_workers:

    def track_timed_out_workers(self):
        now = time.time()
        # Make a copy of the last seen dict so we can delete from the dict with no issues
        for hostname, worker_status in list(self.worker_last_seen.items()):
            since = now - worker_status["ts"]
            if since > self.worker_timeout_seconds and not worker_status["forgotten"]:
                logger.info(
                    f"Have not seen {hostname} for {since:0.2f} seconds. "
                    "Removing from metrics"
                )
                self.forget_worker(hostname)

            if self.purge_offline_worker_metrics_after_seconds > 0:
                if since > self.purge_offline_worker_metrics_after_seconds:
                    logger.info(
                        f"Have not seen {hostname} for {since:0.2f} seconds. "
                        "Purging worker metrics"
                    )
                    self.purge_worker_metrics(hostname)

So if the exporter's notion of the current time (time.time()) differs from the timestamp reported by Celery (worker_status["ts"]), for example because the containers use different timezones, you may have a problem.
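
A minimal sketch of that failure mode (hypothetical numbers; assuming the worker's event timestamp ends up skewed by the local UTC offset while the exporter compares it against its own time.time()):

    import time

    WORKER_TIMEOUT_SECONDS = 300   # hypothetical exporter timeout
    UTC_OFFSET = 4 * 3600          # e.g. GMT-4 for America/Detroit in summer

    now = time.time()
    worker_ts = now - UTC_OFFSET   # heartbeat timestamp skewed by the offset
    since = now - worker_ts        # ~14400 s, no matter how fresh the heartbeat is

    if since > WORKER_TIMEOUT_SECONDS:
        print(f"Have not seen worker for {since:0.2f} seconds")  # always fires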

I fixed it by setting the exporter's timezone environment variable to match mine (TZ=America/Detroit).

lbellomo commented Feb 8, 2024

Something similar happened to me.
The metrics did not show any tasks but did show some information about the workers:

$ curl localhost:9808/metrics
# HELP celery_task_sent_total Sent when a task message is published.
# TYPE celery_task_sent_total counter
# HELP celery_task_received_total Sent when the worker receives a task.
# TYPE celery_task_received_total counter
# HELP celery_task_started_total Sent just before the worker executes the task.
# TYPE celery_task_started_total counter
# HELP celery_task_succeeded_total Sent if the task executed successfully.
# TYPE celery_task_succeeded_total counter
# HELP celery_task_failed_total Sent if the execution of the task failed.
# TYPE celery_task_failed_total counter
# HELP celery_task_rejected_total The task was rejected by the worker, possibly to be re-queued or moved to a dead letter queue.
# TYPE celery_task_rejected_total counter
# HELP celery_task_revoked_total Sent if the task has been revoked.
# TYPE celery_task_revoked_total counter
# HELP celery_task_retried_total Sent if the task failed, but will be retried in the future.
# TYPE celery_task_retried_total counter
# HELP celery_worker_up Indicates if a worker has recently sent a heartbeat.
# TYPE celery_worker_up gauge
celery_worker_up{hostname="worker-host"} 1.0
# HELP celery_worker_tasks_active The number of tasks the worker is currently processing
# TYPE celery_worker_tasks_active gauge
celery_worker_tasks_active{hostname="worker-host"} 0.0
# HELP celery_task_runtime Histogram of task runtime measurements.
# TYPE celery_task_runtime histogram
# HELP celery_queue_length The number of message in broker queue.
# TYPE celery_queue_length gauge
celery_queue_length{queue_name="celery"} 0.0
celery_queue_length{queue_name="trigger"} 0.0
# HELP celery_active_consumer_count The number of active consumer in broker queue.
# TYPE celery_active_consumer_count gauge
# HELP celery_active_worker_count The number of active workers in broker queue.
# TYPE celery_active_worker_count gauge
celery_active_worker_count{queue_name="celery"} 0.0
celery_active_worker_count{queue_name="trigger"} 1.0
# HELP celery_active_process_count The number of active processes in broker queue.
# TYPE celery_active_process_count gauge
celery_active_process_count{queue_name="celery"} 0.0
celery_active_process_count{queue_name="trigger"} 12.0

And in the logs I saw:

2024-02-08 13:19:39.093 | INFO     | src.http_server:start_http_server:66 - Started celery-exporter at host='0.0.0.0' on port='9808'
2024-02-08 13:19:43.771 | INFO     | src.exporter:track_timed_out_workers:189 - Have not seen worker-host for 10801.08 seconds. Removing from metrics
2024-02-08 13:19:43.771 | INFO     | src.exporter:track_timed_out_workers:197 - Have not seen worker-host for 10801.08 seconds. Purging worker metrics

where 10800 seconds is a 3-hour difference (I'm in UTC−03:00). So it seems the metrics are being cleared simply because of this difference in timestamps, caused by the timezone mismatch.

Adding a TZ environment variable with the correct timezone fixed the problem.
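
If anyone else hits this, here is a quick sanity-check sketch you can run inside the exporter container: if the offset it prints matches the "Have not seen ... seconds" value in the logs, a TZ mismatch is the likely culprit.

    # check_tz.py - sanity-check sketch, run inside the exporter container
    import time

    print("offset from UTC (s):", time.timezone)  # seconds west of UTC, ignoring DST
    print("timezone names:", time.tzname)         # e.g. ('UTC', 'UTC') or ('EST', 'EDT')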
