Add metrics about task CPU and memory usage #39650
Conversation
Very cool! Left some comments.
Also, is it possible to unit test this?
docs/apache-airflow/administration-and-deployment/logging-monitoring/metrics.rst
This is good. However, isn't it a good idea to capture the utilization metrics of the entire pod (including sidecar containers) instead of just the base container?
It seems very related to Kubernetes? I am trying to come up with a solution compatible across all executor environments. If it is possible to have such a solution that is also compatible with other executors, I am all ears, but I don't have enough experience with Kubernetes to come up with one. Or maybe as a follow-up PR if you want to add that?
The only way I could find to unit test it is to check we are calling the function.
Never mind! I found a solution!
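One way to unit test this (as suggested above, by checking that the emitting function is called) is to hand the code a mocked stats client and assert on its calls. This is a minimal sketch: `emit_task_resource_metrics` and the metric names are hypothetical stand-ins, not the PR's actual helper.

```python
import unittest
from unittest import mock


def emit_task_resource_metrics(stats, task_id, cpu_percent, mem_percent):
    # Hypothetical helper: forwards one usage sample to a Stats-like client.
    stats.gauge(f"task.cpu_usage.{task_id}", cpu_percent)
    stats.gauge(f"task.mem_usage.{task_id}", mem_percent)


class TestTaskResourceMetrics(unittest.TestCase):
    def test_gauges_are_emitted(self):
        stats = mock.Mock()
        emit_task_resource_metrics(stats, "my_task", 12.5, 40.0)
        # Assert the gauges were sent, without needing a real statsd server.
        stats.gauge.assert_any_call("task.cpu_usage.my_task", 12.5)
        stats.gauge.assert_any_call("task.mem_usage.my_task", 40.0)
```

The same pattern works against Airflow's `Stats` facade by patching it with `mock.patch` inside the test.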
Co-authored-by: Andrey Anshin <Andrey.Anshin@taragol.is>
Any more concerns/comments?
Left a couple non-blocking comments/suggestions. LGTM otherwise!
Holy cardinality, Batman!
Question about this PR: memory and CPU are reported as a percentage of the available memory/CPU on the system, so to understand actual memory/CPU consumption (expressed in bytes / number of cores) you additionally need metrics on how much memory/CPU is available to the system. However, even if I have such metrics on available resources, since this PR only reports consumption at the DAG and task level (not task instance/mapped task instance), I'm unsure how useful it is to link those up. Additionally, with tasks that can run on different hardware, we could see different percentages while multiple instances of a task consume the same amount of resources. Wouldn't it be more useful to report on …
If that's really a need, I would say let's report both metrics (percentage and actual number). I am pretty sure some folks would rather have percentage metrics than actual numbers, because they will have the opposite argument (knowing that a task consumes X memory is not really useful unless I know how much memory I've got).
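Reporting both a percentage and an absolute number from the same sample is straightforward, since one is derivable from the other given the system total. This is an illustrative stdlib-only sketch, not the PR's implementation (which may well use a library such as psutil); the function name is made up, and the caller must supply total RAM because the standard library does not expose it portably.

```python
import os
import resource


def sample_self_usage(total_ram_bytes):
    """Return this process's memory usage as bytes and as a percentage.

    ``total_ram_bytes`` must be supplied by the caller (e.g. read from
    /proc/meminfo on Linux); the stdlib alone has no portable way to get it.
    """
    usage = resource.getrusage(resource.RUSAGE_SELF)
    # ru_maxrss is reported in kilobytes on Linux but in bytes on macOS.
    factor = 1 if os.uname().sysname == "Darwin" else 1024
    rss_bytes = usage.ru_maxrss * factor
    return {
        "rss_bytes": rss_bytes,
        "rss_percent": 100.0 * rss_bytes / total_ram_bytes,
    }
```

Emitting both values costs one extra gauge per sample, so the cardinality concern above applies mainly to the tag/name space, not to reporting two units.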
I think (@howardyoo, @ferruzzi can you confirm?) the addition of traces should make all the resource information automatically available if you enable it via OpenTelemetry (and traces will link the metrics about resources to tasks/DAGs automatically). From what I know, OTEL has a way to enable all the "system"/"python" etc. metrics out-of-the-box, and the "traces" addition should (IMHO) label such metrics with appropriate labels for Airflow "logical" tags, i.e. dags/tasks etc. See #37948. But maybe I am too optimistic there :)?
These metrics send CPU and memory usage for each task. They are sent as a gauge every second.
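The "gauge every second" pattern described above can be sketched as a small background loop. This is a hedged illustration, not the PR's code: `start_gauge_loop` and its parameters are invented names, and `stats` is any object with a `gauge(name, value)` method (such as Airflow's `Stats` facade or a statsd client).

```python
import threading
import time


def start_gauge_loop(stats, read_usage, name, interval=1.0):
    """Emit ``gauge(name, read_usage())`` every ``interval`` seconds.

    Returns a ``threading.Event``; set it to stop the loop. ``stats`` is
    any object with a ``gauge(name, value)`` method and ``read_usage`` is
    a zero-argument callable returning the current reading.
    """
    stop = threading.Event()

    def loop():
        while not stop.is_set():
            stats.gauge(name, read_usage())
            stop.wait(interval)  # sleeps, but wakes early if stopped

    threading.Thread(target=loop, daemon=True).start()
    return stop
```

The daemon thread and the early-wake `stop.wait()` keep task shutdown from being delayed by the sampling interval.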
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards-incompatible changes, please leave a note in a newsfragment file named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.