Expose stats #279

neob91-close · 2023-04-19T16:51:50Z

This allows hooking into the stats by defining a STATS_CALLBACK which will receive task runtimes.
This is useful for measuring the utilisation of workers.

thomasst

Would like to know some more details on how this will be used exactly before approving.

thomasst · 2023-04-20T10:06:48Z

tasktiger/stats.py

+    def __init__(self, stats: Stats) -> None:
+        super().__init__()
+
+        self.tiger = stats.tiger


Should this ideally just say self._stats_interval = stats.tiger.config["STATS_INTERVAL"] rather than looking it up every time?
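
A minimal sketch of that suggestion, assuming the interval does not need to change while the thread is running. `FakeTiger` and `FakeStats` are hypothetical stand-ins that expose only the attributes used here; they are not TaskTiger classes, and the real `StatsThread` in this PR may look different.

```python
import threading


class FakeTiger:
    """Hypothetical stand-in: exposes only a config dict."""

    def __init__(self, config):
        self.config = config


class FakeStats:
    """Hypothetical stand-in for the Stats object passed to the thread."""

    def __init__(self, tiger):
        self.tiger = tiger


class StatsThread(threading.Thread):
    def __init__(self, stats):
        super().__init__()
        self.tiger = stats.tiger
        # Read the interval once here instead of on every loop iteration.
        self._stats_interval = stats.tiger.config["STATS_INTERVAL"]


tiger = FakeTiger({"STATS_INTERVAL": 30})
thread = StatsThread(FakeStats(tiger))

# A later config change is not picked up by the cached value.
tiger.config["STATS_INTERVAL"] = 60
```

The trade-off is that the cached value stays fixed for the lifetime of the thread, which is only a problem if the config is expected to change at runtime.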

thomasst · 2023-04-20T10:22:03Z

tasktiger/tasktiger.py

+            # For example, the worker's utilisation over the last 30 minutes
+            # can be obtained by dividing the sum of task durations reported
+            # over the last 30 minutes by 30 minutes.
+            "STATS_CALLBACK": None,


Since the point of the stats thread is to print out metrics periodically, I would have also expected this to be called periodically rather than at the end of the task, and report the same metrics that we log (time total, time busy, utilization). Otherwise this could be solved via something like CHILD_CONTEXT_MANAGERS, although that runs in the child process. Should this be called TASK_END_STATS_CALLBACK then?

> the worker's utilisation over the last 30 minutes can be obtained by dividing the sum of task durations reported over the last 30 minutes by 30 minutes.

This is not actually correct. Example: If a task runs from 0m to 5m, and another task runs from 29m59s to 34m59s, and the second task's stats callback looks at any reports in the past 30 minutes (i.e. 4m59s to 34m59s), it would show 10m in task durations over 30m (~33% utilization), rather than the actual 5m1s (~17% utilization).
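
To make the arithmetic concrete, here is a small sketch of both calculations for the 30-minute window ending at 34m59s. The function names are made up for illustration, and times are plain seconds rather than anything TaskTiger reports.

```python
def naive_utilization(tasks, now, window):
    """Sum the full duration of every task whose end report falls in the window."""
    start = now - window
    busy = sum(end - begin for begin, end in tasks if start <= end <= now)
    return busy / window


def actual_utilization(tasks, now, window):
    """Count only the part of each task that overlaps the window."""
    start = now - window
    busy = sum(
        max(0, min(end, now) - max(begin, start)) for begin, end in tasks
    )
    return busy / window


# (start, end) in seconds: 0m to 5m, and 29m59s to 34m59s.
tasks = [(0, 300), (1799, 2099)]
now = 2099     # the second task's end report, at 34m59s
window = 1800  # 30 minutes

naive = naive_utilization(tasks, now, window)    # 600 / 1800
actual = actual_utilization(tasks, now, window)  # 301 / 1800
```

The naive version counts all five minutes of the first task even though only its last second lies inside the window.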

vtclose · 2023-04-20

We should definitely have both start and end callbacks to measure and report utilization accurately while the task is still running. With our configuration, we often collect metrics on sub-minute intervals, and a single task can run for more than a few minutes.

neob91-close · 2023-04-21

> Since the point of the stats thread is to print out metrics periodically, I would have also expected this to be called periodically rather than at the end of the task, and report the same metrics that we log (time total, time busy, utilization).

Yep, that sounds better than what I did.

> This is not actually correct. Example: If a task runs from 0m to 5m, and another task runs from 29m59s to 34m59s, and the second task's stats callback looks at any reports in the past 30 minutes (i.e. 4m59s to 34m59s), it would show 10m in task durations over 30m (~33% utilization), rather than the actual 5m1s (~17% utilization).

Good point.

> We should definitely have both start and end callbacks to measure and report utilization accurately while the task is still running. With our configuration, we often collect metrics on sub-minute intervals, and a single task can run for more than a few minutes.

Couldn't we achieve that by simply moving the callback into the log method, so that it is called at an interval?
I don't think we need to be notified of the start and stop of every task.

vtclose · 2023-04-25

> get notified of the start and stop of every task.

Having that interface allows for reporting much more than what we do now. We could report utilization grouped per some other metric, for example task duration bucket, task name, or success/failure condition.

The interval of metric collection should be owned by the metric collection setup, not by TaskTiger. TaskTiger does provide simple log-only "monitoring" through periodic prints, but in production we could use much more sophisticated tooling such as OpenTelemetry.

Reporting metrics on an interval is also not great if your metric collection is pull-based and periodic too, and the periods don't align well. For example, say a task runs between 1s and 16s. At 30s TaskTiger reports utilization of 50% (15s busy out of the last 30s), and the worker then does nothing for the next minute. At 59s (just before the next log call at 60s) monitoring pulls the last reported value of 50%. The monitoring shows utilization of 50% at 59s, while in reality it was 0%. The metric is lagging and can be very confusing when debugging performance issues. In contrast, with start/end notifications we can compute the utilization metric right at collection time or, better yet, report the up-to-date (not lagging) totals of busy/idle times and let the monitoring compute utilization.
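
A rough sketch of what I mean, assuming a worker that runs one task at a time. `BusyTimeTracker` and its method names are hypothetical and not part of TaskTiger or this PR; timestamps are passed in explicitly to keep the example deterministic.

```python
class BusyTimeTracker:
    """Hypothetical collector fed by task start/end notifications."""

    def __init__(self):
        self._busy_total = 0.0   # busy seconds of completed tasks
        self._started_at = None  # start time of the running task, if any

    def on_task_start(self, now):
        self._started_at = now

    def on_task_end(self, now):
        self._busy_total += now - self._started_at
        self._started_at = None

    def busy_seconds(self, now):
        """Up-to-date busy total, including a task that is still running."""
        running = 0.0
        if self._started_at is not None:
            running = now - self._started_at
        return self._busy_total + running


def utilization(busy_before, busy_now, elapsed):
    """Utilization between two pulls, computed by the monitoring side."""
    return (busy_now - busy_before) / elapsed


# The example above: one task running from 1s to 16s.
tracker = BusyTimeTracker()
tracker.on_task_start(1)
mid_task = tracker.busy_seconds(10)  # pulled while the task is still running
tracker.on_task_end(16)

# Monitoring pulls the busy total at 29s and 59s and derives utilization itself.
at_29 = tracker.busy_seconds(29)
at_59 = tracker.busy_seconds(59)
last_30s = utilization(at_29, at_59, 30)
```

Because the monitoring side pulls a monotonically increasing busy total, it can derive utilization over whatever window it likes, and the value for 29s to 59s comes out as 0 rather than a stale 50%.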

neob91-close force-pushed the statsthread-export branch from 5fc01f6 to b63a2e9 Compare April 20, 2023 07:43

Expose stats

54ce432

neob91-close force-pushed the statsthread-export branch from b63a2e9 to 54ce432 Compare April 20, 2023 07:44

neob91-close requested review from thomasst and tsx April 20, 2023 08:24

neob91-close marked this pull request as ready for review April 20, 2023 08:24

neob91-close requested a review from nsaje April 20, 2023 08:24

thomasst reviewed Apr 20, 2023

vtclose removed the request for review from tsx May 11, 2023 13:02

nsaje removed their request for review August 31, 2023 15:30

neob91-close commented Apr 19, 2023

thomasst left a comment

thomasst Apr 20, 2023

thomasst Apr 20, 2023 •

edited

vtclose Apr 20, 2023

neob91-close Apr 21, 2023

vtclose Apr 25, 2023
