
Add metrics about task CPU and memory usage #39650

Merged
11 commits merged into apache:main on May 22, 2024

Conversation

@vincbeck (Contributor)

These metrics report CPU and memory usage for each task. They are emitted as gauges every second.
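For illustration, the sampling loop works along the lines of the sketch below (the metric names, helper name, and exact gauge calls are assumptions made for this sketch, not necessarily the code merged here):

```python
# Illustrative sketch only -- not the exact implementation in this PR.
# Assumes the task runner knows the PID of the running task process and emits
# through Airflow's StatsD-compatible `Stats` facade; metric names are made up.
import time

import psutil

from airflow.stats import Stats


def read_task_utilization(pid: int, dag_id: str, task_id: str, interval: float = 1.0) -> None:
    """Poll the task process every `interval` seconds and emit gauges."""
    proc = psutil.Process(pid)
    while proc.is_running():
        with proc.oneshot():  # batch the underlying /proc reads
            mem_pct = proc.memory_percent()            # % of total system RAM
            cpu_pct = proc.cpu_percent(interval=None)  # % of one CPU since last call
        Stats.gauge(f"task.mem_usage.{dag_id}.{task_id}", mem_pct)
        Stats.gauge(f"task.cpu_usage.{dag_id}.{task_id}", cpu_pct)
        time.sleep(interval)
```

In this PR the polling lives in the standard task runner (see the review threads on airflow/task/task_runner/standard_task_runner.py below), so one pair of gauges is emitted per running task every second.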



@vincbeck requested a review from potiuk as a code owner May 15, 2024 18:52
@boring-cyborg bot added the area:Scheduler and kind:documentation labels May 15, 2024
@o-nikolas (Contributor) left a comment


Very cool! Left some comments.

Also is it possible to unit test this?

Two review threads on airflow/task/task_runner/standard_task_runner.py (outdated, resolved)
@dirrao (Collaborator) left a comment


This is good. However, isn't it a good idea to capture the utilization metrics of the entire pod (including sidecar containers) instead of just the base container?

@vincbeck (Contributor, Author)

This is good. However, isn't it a good idea to capture the utilization metrics of the entire pod (including sidecar containers) instead of just the base container?

It seems very related to Kubernetes? I am trying to come up with a solution compatible across all executor environments. If it is possible to have such a solution that is also compatible with other executors, I am all ears, but I don't have enough experience with Kubernetes to come up with one. Or maybe as a follow-up PR if you want to add that?

@vincbeck (Contributor, Author)

Very cool! Left some comments.

Also is it possible to unit test this?

The only way I could find to unit test it is to check that we call the function _read_task_utilization, but I could not find a way to actually test _read_task_utilization itself.

@vincbeck (Contributor, Author)

Very cool! Left some comments.
Also is it possible to unit test this?

The only way I could find to unit test it is to check that we call the function _read_task_utilization, but I could not find a way to actually test _read_task_utilization itself.

Nevermind! I found a solution!
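For the record, a check along these lines can be written as a pytest-style sketch; the helper below stands in for the real _read_task_utilization (its name, signature, and the metric names are assumptions here, the mocking pattern is the point):

```python
# Hypothetical test sketch -- `sample_utilization_once` mirrors the sampling idea
# above and is not the PR's actual _read_task_utilization.
from unittest import mock

import psutil

from airflow.stats import Stats


def sample_utilization_once(proc: psutil.Process, dag_id: str, task_id: str) -> None:
    # One polling iteration: read the percentages and emit two gauges.
    Stats.gauge(f"task.mem_usage.{dag_id}.{task_id}", proc.memory_percent())
    Stats.gauge(f"task.cpu_usage.{dag_id}.{task_id}", proc.cpu_percent(interval=None))


def test_sample_utilization_once_emits_both_gauges():
    # Patch the gauge call so the test asserts on emissions without a StatsD backend.
    with mock.patch.object(Stats, "gauge") as gauge:
        sample_utilization_once(psutil.Process(), dag_id="dag", task_id="task")
    assert gauge.call_count == 2  # one memory gauge, one CPU gauge
```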

@vincbeck (Contributor, Author)

Any more concerns/comments?

@o-nikolas (Contributor) left a comment


Left a couple non-blocking comments/suggestions. LGTM otherwise!

Review thread on airflow/task/task_runner/standard_task_runner.py (outdated, resolved)
@vincbeck (Contributor, Author)

@Taragolis

@vincbeck merged commit 9139b22 into apache:main May 22, 2024
42 checks passed
@vincbeck deleted the vincbeck/metric_cpu_mem_usage_task branch May 22, 2024 17:27
@ashb (Member) commented May 22, 2024

Holy cardinality batman!

@BasPH (Contributor) commented May 30, 2024

Question about this PR: Memory and CPU are reported as a percentage of the available memory/CPU on the system, so to understand actual memory/CPU consumption (expressed in bytes/# of cores) you additionally need metrics on how much memory/CPU is available to the system.

However... even if I have such metrics on available resources, since this PR only reports consumption on a DAG and task level (not task instance/mapped task instance), I'm unsure how useful it is to link those up. Additionally, with tasks that can run on different hardware, we could see different percentages while multiple instances of a task could consume the same amount of resources.

Wouldn't it be more useful to report on psutil.virtual_memory().total * psutil.memory_percent() to get consumption in bytes/# of cores? That way we can compare apples with apples.
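For example, absolute figures could be derived roughly like this (a sketch of the idea, not what the PR emits; psutil percentages are on a 0-100 scale, hence the division by 100):

```python
# Sketch: turning psutil readings for a task process into absolute numbers.
import os

import psutil

proc = psutil.Process(os.getpid())  # in the runner this would be the task's PID

# Memory: either scale the percentage by total RAM, or read the RSS directly.
mem_bytes_from_pct = psutil.virtual_memory().total * proc.memory_percent() / 100.0
mem_bytes_rss = proc.memory_info().rss

# CPU: cpu_percent() is relative to a single core, so 100.0 == one full core.
cores_used = proc.cpu_percent(interval=1.0) / 100.0
```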

@vincbeck (Contributor, Author)

Question about this PR: Memory and CPU are reported as a percentage of the available memory/CPU on the system, so to understand actual memory/CPU consumption (expressed in bytes/# of cores) you additionally need metrics on how much memory/CPU is available to the system.

However... even if I have such metrics on available resources, since this PR only reports consumption on a DAG and task level (not task instance/mapped task instance), I'm unsure how useful it is to link those up. Additionally, with tasks that can run on different hardware, we could see different percentages while multiple instances of a task could consume the same amount of resources.

Wouldn't it be more useful to report on psutil.virtual_memory().total * psutil.memory_percent() to get consumption in bytes/# of cores? That way we can compare apples with apples.

If that's really a need, I would say let's report both metrics (percentage and actual number). I am pretty sure some folks would rather have percentage metrics than actual numbers, because they will make the opposite argument (knowing that a task consumes X memory is not really useful unless I know how much memory I have).

@potiuk (Member) commented Jun 1, 2024

I think (@howardyoo - @ferruzzi can you confirm?) the addition of traces should make all the resource information automatically available if you enable it via OpenTelemetry (and traces will link the metrics about resources to tasks/dags automatically). From what I know, OTEL has a way to enable all the "system"/"python" etc. metrics out-of-the-box, and the "traces" addition should (IMHO) label such metrics with appropriate labels for Airflow "logical" tags - i.e. dags/tasks etc.

See #37948

But maybe I am too optimistic there :) ?

Labels
area:Scheduler, kind:documentation

8 participants