Set parallelism log messages to warning level for better visibility #39298

Open · BasPH wants to merge 2 commits into main

Conversation

@BasPH (Contributor) commented Apr 28, 2024

This PR raises several log messages related to parallelism to warning level. IMO it's currently not evident when Airflow reaches a parallelism limit, so the purpose of this PR is to make it more obvious that a scalability limit has been reached.
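
For context, a minimal sketch of the kind of change proposed here (illustrative only, not the actual Airflow executor code; the function and variable names are placeholders): a message that previously surfaced only at debug level when no parallelism slots are free would be emitted at warning level instead.

```python
# Illustrative sketch only -- not the actual Airflow source. It shows the shape
# of the change: when the executor runs out of parallelism slots, log at
# WARNING instead of DEBUG so the condition is visible with default log levels.
import logging

log = logging.getLogger(__name__)


def report_parallelism_limit(open_slots: int, queued_tasks: int) -> None:
    if open_slots <= 0 and queued_tasks > 0:
        # Before this change the equivalent message was only visible at DEBUG level.
        log.warning(
            "Parallelism limit reached (core.parallelism); %d task(s) still queued",
            queued_tasks,
        )
```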


^ Add meaningful description above
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.

@boring-cyborg bot added the area:Executors-core and area:Scheduler labels Apr 28, 2024
@hussein-awala (Member) left a comment

-0.9 for this change; reaching the parallelism limits is a normal and expected thing, and we should not warn the user when it happens.

@BasPH (Contributor, Author) commented Apr 28, 2024

Why should we not warn the user?

@hussein-awala (Member) commented

I think the question is: why should we warn the user when the parallelism limits are reached?

@BasPH (Contributor, Author) commented Apr 28, 2024

To inform the user that a software limit has been reached. For example, unless you enable debug logging there's no way to tell from the logs that AIRFLOW__CORE__PARALLELISM was reached. The default limits are rather low in Airflow so this is a common situation and it's currently not self-evident.
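
For illustration, the effective limit can be inspected at runtime through Airflow's configuration API (a sketch assuming a stock Airflow 2.x install; the default of 32 is an assumption about that default setup):

```python
# Sketch: inspect the effective core parallelism limit. The environment
# variable AIRFLOW__CORE__PARALLELISM overrides [core] parallelism in
# airflow.cfg; a stock install is assumed to default to 32.
from airflow.configuration import conf

parallelism = conf.getint("core", "parallelism")
print(f"The scheduler will not run more than {parallelism} task instances at once")
```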

@eladkal (Contributor) commented Apr 28, 2024

To inform the user that a software limit has been reached. For example, unless you enable debug logging there's no way to tell from the logs that AIRFLOW__CORE__PARALLELISM was reached. The default limits are rather low in Airflow so this is a common situation and it's currently not self-evident.

I think it is more suitable for the cluster activity dashboard to expose this in an actionable manner. Logs are a bit problematic because what you get is a notification that parallelism is reached, but you have no information on whether the user set this specific value by design or not. If the user chose the specific value, then logging warnings is not very user friendly and in fact not actionable. This can result in further user requests to silence the messages. From my point of view, if we can find a way to present the information in the cluster activity dashboard, that would be preferred.

@RNHTTR (Collaborator) commented Apr 29, 2024

-0.9 for this change; reaching the parallelism limits is a normal and expected thing, and we should not warn the user when it happens.

This is only true for users whose Airflow environment is somewhat optimized. There are a lot of configs to consider (parallelism, worker concurrency, pools) in order to limit task throughput -- surfacing more info to the user could only be helpful IMO.

I think it is more suitable for the cluster activity dashboard to expose this in an actionable manner. Logs are a bit problematic because what you get is a notification that parallelism is reached, but you have no information on whether the user set this specific value by design or not. If the user chose the specific value, then logging warnings is not very user friendly and in fact not actionable. This can result in further user requests to silence the messages. From my point of view, if we can find a way to present the information in the cluster activity dashboard, that would be preferred.

Why not both? I personally spend a lot more time debugging Airflow logs than I do using the cluster activity dashboard. If parallelism is reached unexpectedly such that it causes a problem, I'd definitely find it faster in the logs.

@jedcunningham (Member) commented

My 2c, I think info is appropriate.

@hussein-awala (Member) commented

This is only true for users whose Airflow environment is somewhat optimized. There are a lot of configs to consider (parallelism, worker concurrency, pools) in order to limit task throughput -- surfacing more info to the user could only be helpful IMO.

Exactly, in this case:

  1. It would be useless to warn the user when the pool limit or DAG/run/task max concurrency is reached, because these limits are usually set according to the capacity of an external system.
  2. Some users (including me) will find some of these warnings annoying, especially since we can use the OTEL or statsd metrics to monitor concurrency and create alerts (see the sketch below).
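
As a rough illustration of the metrics-based approach: a minimal sketch of an alert condition built on the executor gauges Airflow emits when statsd metrics are enabled ([metrics] statsd_on = True, default "airflow" prefix); the fetch_gauge helper is hypothetical and stands in for whatever your metrics backend (Prometheus, Datadog, etc.) exposes.

```python
# Minimal sketch of an alert condition on top of Airflow's executor gauges
# (emitted when statsd/OTEL metrics are enabled). Metric names assume the
# default "airflow" statsd prefix; fetch_gauge() is a hypothetical stand-in
# for a query against your metrics backend.

def fetch_gauge(name: str) -> float:
    """Hypothetical helper: return the latest value of a gauge from the backend."""
    raise NotImplementedError


def parallelism_saturated() -> bool:
    open_slots = fetch_gauge("airflow.executor.open_slots")
    queued = fetch_gauge("airflow.executor.queued_tasks")
    # Alert when the executor has no free slots while work is still queued.
    return open_slots <= 0 and queued > 0
```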

@RNHTTR (Collaborator) commented Apr 30, 2024

This is only true for users whose Airflow environment is somewhat optimized. There are a lot of configs to consider (parallelism, worker concurrency, pools) in order to limit task throughput -- surfacing more info to the user could only be helpful IMO.

Exactly, in this case:

  1. It would be useless to warn the user when the pool limit or DAG/run/task max concurrency is reached, because these limits are usually set according to the capacity of an external system.
  2. Some users (including me) will find some of these warnings annoying, especially since we can use the OTEL or statsd metrics to monitor concurrency and create alerts.

Yeah that's fair. Do you agree that at least an info log is reasonable?

@BasPH (Contributor, Author) commented May 2, 2024

Okay, I reverted a few logging statements; that leaves an info log when AIRFLOW__CORE__PARALLELISM is reached, which should help users understand why Airflow isn't running any more tasks.

@hussein-awala Regarding:

It would be useless to warn the user when the pool limit or DAG/run/task max concurrency is reached, because these limits are usually set according to the capacity of an external system.

Disagree with this statement. Airflow has lots of knobs to tweak for parallelism which can be a blessing and a curse. You might have a perfect setup with OTEL integration, dashboards, and alerts, but I'm sure that knowledge and tooling wasn't developed overnight. My experience is that Airflow's default settings quite often limit the user before they start reaching hardware limits. This is a frustrating problem because as of today visibility (via UI/logs) on such limits is low. I'll investigate alternative options such as dashboards, but stating that it's "useless to warn the user" is not my experience since 9 out of 10 times I see people hitting a default limit instead of a user-defined limit.

Sidenote: we currently have one warning-level log regarding pool limits; should we downgrade that to INFO level? https://github.com/apache/airflow/blob/main/airflow/jobs/scheduler_job_runner.py#L450

Labels: area:Executors-core (LocalExecutor & SequentialExecutor), area:Scheduler (Scheduler or dag parsing Issues)

5 participants