Add Worker metrics. #5234

decko · 2024-04-08T22:38:04Z

Close #3821

pulpcore/app/models/telemetry.py

pulpcore/tasking/worker.py

pulpcore/tests/functional/__init__.py

pulpcore/tests/functional/api/test_tasking.py

pulpcore/tests/functional/assets/otel_server.py

pulpcore/tasking/worker.py

pulpcore/constants.py

pulpcore/tasking/worker.py

pulpcore/tests/functional/assets/otel_server.py

pulpcore/tasking/worker.py

pulpcore/exceptions/__init__.py

pulpcore/tasking/worker.py

pulpcore/tests/functional/api/test_tasking.py

mdellweg · 2024-05-03T23:57:52Z

pulpcore/tests/functional/api/test_tasking.py

+            "description": "Number of unblocked tasks waiting in the queue.",
+            "unit": "tasks",
+        }
+    )


What about the value of the metric?

As we discussed before, this involves some changes to the test machinery, and even then we probably won't be able to catch the right value for the metric during the tests.
Talking with @lubosmj and @dkliban, we understood that switching to the OpenTelemetry Collector and testing the metrics as they are exported for consumption by Prometheus could give better results. Still, it's a considerable effort that is out of scope for this task.

lubosmj · 2024-05-06T09:06:32Z

pulpcore/tasking/worker.py

+                    unblocked_tasks_stats["longest_unblocked_waiting_time"].seconds
+                )
+
+                self.cursor.execute(f"NOTIFY pulp_worker_metrics_heartbeat, '{str(now)}'")


I would think we do not need to re-notify the worker, do we?

We notify all workers that the work was done, so they hold off on doing it for another cooldown period.

But since you asked, a comment to that end may help.

Thanks man. In the end I removed it, since it doesn't make sense to use it anymore.

Sorry @lubosmj. @mdellweg reminded me why we're using this. #5234 (review)
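
For readers following this thread, here is a minimal sketch of the cooldown idea described above: whichever worker records the metric tells every worker when it did so, and each worker skips the work until the cooldown has passed. The `Worker` class, the `bus` list, and the 5-second `METRICS_COOLDOWN` are hypothetical stand-ins, not pulpcore's code; the in-memory loop only imitates what `NOTIFY pulp_worker_metrics_heartbeat` does across processes.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical value; the real cooldown interval is not shown in this thread.
METRICS_COOLDOWN = timedelta(seconds=5)


class Worker:
    """Toy stand-in for a worker process (an assumption, not pulpcore's class)."""

    def __init__(self, name, bus):
        self.name = name
        self.bus = bus  # list of all workers; imitates the NOTIFY channel
        self.last_metric_heartbeat = datetime.min.replace(tzinfo=timezone.utc)
        self.recorded = 0
        bus.append(self)

    def on_notify(self, payload):
        # Every listener, including the sender, remembers when the metric
        # was last recorded.
        self.last_metric_heartbeat = datetime.fromisoformat(payload)

    def maybe_record_metric(self, now):
        # Skip the work if any worker did it within the cooldown window.
        if now - self.last_metric_heartbeat < METRICS_COOLDOWN:
            return False
        self.recorded += 1  # the expensive query would run here
        for worker in self.bus:
            worker.on_notify(now.isoformat())
        return True
```

With two workers on the same bus, the second one stays quiet until the cooldown has elapsed after the first one's heartbeat, which is why the re-notify is kept.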

lubosmj · 2024-05-06T09:06:36Z

pulpcore/tasking/worker.py

@@ -392,6 +419,42 @@ def handle_available_tasks(self):
                keep_looping = True
                self.supervise_task(task)

+    def record_unblocked_waiting_tasks_metric(self):


Is there any safeguard that prohibits the execution of this method if the telemetry is disabled?

Nope. It simply will not be emitted if the agent is not used.

The question is valid. In the current state, we still run the query even if we send nothing.

Don't we have an otel_enabled setting we can look at?
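
A minimal sketch of the kind of safeguard being asked about, assuming a boolean flag is available to the worker. The function signature, the `run_query` and `emit` callables, and the returned boolean are all hypothetical and chosen for illustration; this is not the code that was merged.

```python
def record_unblocked_waiting_tasks_metric(otel_enabled, run_query, emit):
    """Record the metric only when telemetry is enabled.

    All three parameters are illustrative assumptions:
    otel_enabled -- boolean read from some setting
    run_query    -- callable performing the costly database query
    emit         -- callable handing the result to the telemetry layer
    """
    if not otel_enabled:
        # Return before touching the database, so systems without
        # telemetry never pay for the query.
        return False
    stats = run_query()
    emit(stats)
    return True
```

The point of returning early is that the query itself is skipped, not just the emission of its result.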

CHANGES/3821.feature

lubosmj

At this stage of the implementation, I think we can squash all the commits into a single one. What do you think?

pulpcore/tasking/worker.py

pulpcore/app/models/task.py

pulpcore/tests/functional/assets/otel_server.py

pulpcore/tests/functional/api/test_tasking.py

mdellweg · 2024-05-13T09:53:49Z

pulpcore/tasking/worker.py

@@ -392,6 +419,42 @@ def handle_available_tasks(self):
                keep_looping = True
                self.supervise_task(task)

+    def record_unblocked_waiting_tasks_metric(self):


The question is valid. In the current state, we still run the query even if we send nothing.

Closes pulp#3821

Co-authored-by: Matthias Dellweg <2500@gmx.de>
Co-authored-by: Ľuboš Mjachky <lmjachky@redhat.com>
Co-authored-by: Grant Gainey <ggainey@users.noreply.github.com>
Co-authored-by: Ina Panova <ipanova@redhat.com>

mdellweg

There's one last question remaining: how to avoid the costly DB queries on systems that do not run with OTel.

mdellweg · 2024-05-13T14:20:13Z

pulpcore/tests/functional/api/test_tasking.py

+        pytest.skip("Need PULP_OTEL_ENABLED to run this test.")
+
+    # Checking online workers ready to get a task
+    workers_online = pulpcore_bindings.WorkersApi.list(online="true").count


decko force-pushed the worker_metrics branch from 0909c11 to cfcaffc · April 8, 2024 22:40

github-actions bot added the multi-commit label Apr 10, 2024