
Add metrics about task CPU and memory usage #39650

Merged
11 commits merged into apache:main on May 22, 2024

Conversation

@vincbeck (Contributor)

These metrics report CPU and memory usage for each task. They are emitted as gauges every second.
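For illustration, the sampling loop works along the lines of the sketch below (the metric names, helper name, and exact gauge calls are assumptions made for this sketch, not necessarily the code merged here):

```python
# Illustrative sketch only -- not the exact implementation in this PR.
# Assumes the task runner knows the PID of the running task process and emits
# through Airflow's StatsD-compatible `Stats` facade; metric names are made up.
import time

import psutil

from airflow.stats import Stats


def read_task_utilization(pid: int, dag_id: str, task_id: str, interval: float = 1.0) -> None:
    """Poll the task process every `interval` seconds and emit gauges."""
    proc = psutil.Process(pid)
    while proc.is_running():
        with proc.oneshot():  # batch the underlying /proc reads
            mem_pct = proc.memory_percent()            # % of total system RAM
            cpu_pct = proc.cpu_percent(interval=None)  # % of one CPU since last call
        Stats.gauge(f"task.mem_usage.{dag_id}.{task_id}", mem_pct)
        Stats.gauge(f"task.cpu_usage.{dag_id}.{task_id}", cpu_pct)
        time.sleep(interval)
```

In this PR the polling lives in the standard task runner (see the review threads on airflow/task/task_runner/standard_task_runner.py below), so one pair of gauges is emitted per running task every second.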



@vincbeck requested a review from potiuk as a code owner May 15, 2024 18:52
@boring-cyborg bot added the area:Scheduler and kind:documentation labels May 15, 2024
@o-nikolas (Contributor) left a comment


Very cool! Left some comments.

Also is it possible to unit test this?

Two review threads on airflow/task/task_runner/standard_task_runner.py (outdated, resolved)
@dirrao (Collaborator) left a comment


This is good. However, isn't it a good idea to capture the utilization metrics of the entire pod (including sidecar containers) instead of just the base container?

@vincbeck (Contributor, Author)

This is good. However, isn't it a good idea to capture the utilization metrics of the entire pod (including sidecar containers) instead of just the base container?

It seems very related to Kubernetes? I am trying to come up with a solution compatible across all executor environments. If it is possible to have such a solution that is also compatible with other executors, I am all ears, but I don't have enough experience with Kubernetes to come up with one. Or maybe as a follow-up PR if you want to add that?

@vincbeck (Contributor, Author)

Very cool! Left some comments.

Also is it possible to unit test this?

The only way I could find to unit test it is to check that we call the function _read_task_utilization, but I could not find a way to actually test _read_task_utilization itself.

@vincbeck (Contributor, Author)

Very cool! Left some comments.
Also is it possible to unit test this?

The only way I could find to unit test it is to check that we call the function _read_task_utilization, but I could not find a way to actually test _read_task_utilization itself.

Nevermind! I found a solution!
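For the record, a check along these lines can be written as a pytest-style sketch; the helper below stands in for the real _read_task_utilization (its name, signature, and the metric names are assumptions here, the mocking pattern is the point):

```python
# Hypothetical test sketch -- `sample_utilization_once` mirrors the sampling idea
# above and is not the PR's actual _read_task_utilization.
from unittest import mock

import psutil

from airflow.stats import Stats


def sample_utilization_once(proc: psutil.Process, dag_id: str, task_id: str) -> None:
    # One polling iteration: read the percentages and emit two gauges.
    Stats.gauge(f"task.mem_usage.{dag_id}.{task_id}", proc.memory_percent())
    Stats.gauge(f"task.cpu_usage.{dag_id}.{task_id}", proc.cpu_percent(interval=None))


def test_sample_utilization_once_emits_both_gauges():
    # Patch the gauge call so the test asserts on emissions without a StatsD backend.
    with mock.patch.object(Stats, "gauge") as gauge:
        sample_utilization_once(psutil.Process(), dag_id="dag", task_id="task")
    assert gauge.call_count == 2  # one memory gauge, one CPU gauge
```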

@vincbeck (Contributor, Author)

Any more concerns/comments?

@o-nikolas (Contributor) left a comment


Left a couple non-blocking comments/suggestions. LGTM otherwise!

Review thread on airflow/task/task_runner/standard_task_runner.py (outdated, resolved)
@vincbeck (Contributor, Author)

@Taragolis

@vincbeck merged commit 9139b22 into apache:main May 22, 2024
42 checks passed
@vincbeck deleted the vincbeck/metric_cpu_mem_usage_task branch May 22, 2024 17:27
@ashb (Member) commented May 22, 2024

Holy cardinality batman!

@BasPH (Contributor) commented May 30, 2024

Question about this PR: Memory and CPU are reported as a percentage of the available memory/CPU on the system, so to understand actual memory/CPU consumption (expressed in bytes/# of cores) you additionally need metrics on how much memory/CPU is available to the system.

However... even if I have such metrics on available resources, since this PR only reports consumption on a DAG and task level (not task instance/mapped task instance), I'm unsure how useful it is to link those up. Additionally, with tasks that can run on different hardware, we could see different percentages while multiple instances of a task could consume the same amount of resources.

Wouldn't it be more useful to report on psutil.virtual_memory().total * psutil.memory_percent() to get consumption in bytes/# of cores? That way we can compare apples with apples.
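For example, absolute figures could be derived roughly like this (a sketch of the idea, not what the PR emits; psutil percentages are on a 0-100 scale, hence the division by 100):

```python
# Sketch: turning psutil readings for a task process into absolute numbers.
import os

import psutil

proc = psutil.Process(os.getpid())  # in the runner this would be the task's PID

# Memory: either scale the percentage by total RAM, or read the RSS directly.
mem_bytes_from_pct = psutil.virtual_memory().total * proc.memory_percent() / 100.0
mem_bytes_rss = proc.memory_info().rss

# CPU: cpu_percent() is relative to a single core, so 100.0 == one full core.
cores_used = proc.cpu_percent(interval=1.0) / 100.0
```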

@vincbeck (Contributor, Author)

Question about this PR: Memory and CPU are reported as a percentage of the available memory/CPU on the system, so to understand actual memory/CPU consumption (expressed in bytes/# of cores) you additionally need metrics on how much memory/CPU is available to the system.

However... even if I have such metrics on available resources, since this PR only reports consumption on a DAG and task level (not task instance/mapped task instance), I'm unsure how useful it is to link those up. Additionally, with tasks that can run on different hardware, we could see different percentages while multiple instances of a task could consume the same amount of resources.

Wouldn't it be more useful to report on psutil.virtual_memory().total * psutil.memory_percent() to get consumption in bytes/# of cores? That way we can compare apples with apples.

If that's really a need, I would say let's report both metrics (percentage and actual number). I am pretty sure some folks would rather have percentage metrics than actual numbers, because they will make the opposite argument (knowing that a task consumes X memory is not really useful unless I know how much memory I have).

@potiuk (Member) commented Jun 1, 2024

I think (@howardyoo - @ferruzzi can you confirm?) the addition of traces should make all the resource information automatically available if you enable it via OpenTelemetry (and traces will link the metrics about resources to tasks/dags automatically). From what I know, OTEL has a way to enable all the "system"/"python" etc. metrics out-of-the-box, and the "traces" addition should (IMHO) label such metrics with appropriate labels for Airflow "logical" tags - i.e. dags/tasks etc.

See #37948

But maybe I am too optimistic there :) ?

Labels
area:Scheduler, kind:documentation

8 participants