
Add retry_from_failure parameter to DbtCloudRunJobOperator #38868

Open · wants to merge 8 commits into main from feat/dbt-retry-from-failure
Conversation

@boraberke (Contributor) commented Apr 9, 2024

This PR adds a new retry_from_failure parameter to the DbtCloudRunJobOperator to retry a failed run of a dbt Cloud job from the point of failure. The implementation uses the new rerun endpoint of the dbt Cloud API, which itself looks up the last run for a given job and decides whether or not to start a new run of the job.

The new endpoint is only used when retry_from_failure is True and the task's try_number is greater than 1. It also cannot be used in conjunction with steps_override, schema_override, or additional_run_config.
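For illustration, a minimal usage sketch of the new parameter (the DAG context and job_id are hypothetical, and the parameter name follows this PR rather than any final merged API):

```python
from airflow.providers.dbt.cloud.operators.dbt import DbtCloudRunJobOperator

# Hypothetical task definition: on an Airflow retry (try_number > 1), the
# operator would call the rerun endpoint instead of starting a fresh run.
trigger_dbt_job = DbtCloudRunJobOperator(
    task_id="trigger_dbt_cloud_job",
    job_id=48617,               # hypothetical dbt Cloud job id
    retry_from_failure=True,    # rerun a failed job from its point of failure
    retries=3,                  # Airflow-level retries drive the rerun path
    wait_for_termination=True,  # poll until the run reaches a terminal state
)
```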

Closes: #35772
See also: #38001

boring-cyborg bot commented Apr 9, 2024

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contributors' Guide (https://github.com/apache/airflow/blob/main/contributing-docs/README.rst)
Here are some useful points:

  • Pay attention to the quality of your code (ruff, mypy and type annotations). Our pre-commits will help you with that.
  • In case of a new feature, add useful documentation (in docstrings or in the docs/ directory). Adding a new operator? Check this short guide and consider adding an example DAG that shows how users should use it.
  • Consider using the Breeze environment for testing locally; it's a heavy Docker setup, but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or the final approval from Committers.
  • Please follow the ASF Code of Conduct for all communication, including (but not limited to) comments on Pull Requests, the mailing list, and Slack.
  • Be sure to read the Airflow Coding style.
  • Always keep your Pull Requests rebased, otherwise your build might fail due to changes not related to your commits.
    Apache Airflow is a community-driven project and together we are making it better 🚀.
    In case of doubts contact the developers at:
    Mailing List: dev@airflow.apache.org
    Slack: https://s.apache.org/airflow-slack

@boraberke (Contributor, Author)

Hey @josh-fell, I've created this PR based on your comments here. I would really appreciate it if you could take a look at it!

@boraberke force-pushed the feat/dbt-retry-from-failure branch from 0696015 to 2ad17e3 (April 9, 2024 17:46)
@boraberke force-pushed the feat/dbt-retry-from-failure branch from 2ad17e3 to ca504ff (April 9, 2024 17:47)
@boraberke (Contributor, Author)

Hi @josh-fell, did you have a chance to look at this PR? Would appreciate your comments!

@eladkal requested a review from Lee-W (April 26, 2024 07:12)
@Lee-W (Member) commented Apr 26, 2024

The rerun endpoint does not accept a body, which means parameters like steps_override, schema_override, threads_override, and cause cannot be passed. The current implementation always uses the rerun endpoint if retry_from_failure is set to True. To overcome this issue, the rerun endpoint can be used only if the task is retried (i.e. ti.try_number != 1).

Maybe we can check whether steps_override, schema_override, or threads_override is provided together with retry_from_failure and raise an error if so. Also, did you implement the ti.try_number part already?

@boraberke (Contributor, Author) commented Apr 28, 2024

> The rerun endpoint does not accept a body, which means parameters like steps_override, schema_override, threads_override, and cause cannot be passed. The current implementation always uses the rerun endpoint if retry_from_failure is set to True. To overcome this issue, the rerun endpoint can be used only if the task is retried (i.e. ti.try_number != 1).
>
> Maybe we can check whether steps_override, schema_override, or threads_override is provided together with retry_from_failure and raise an error if so. Also, did you implement the ti.try_number part already?

Hi @Lee-W,

I have also implemented the try_number part; could you please take a look at it as well?

And also, if steps_override, schema_override, or threads_override is provided with retry_from_failure, should it be a warning or an error? Displaying a warning and discarding the override values might also be an option.
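For context, the try_number branch described here might look roughly like the sketch below (a hedged sketch only; retry_failed_job_run is a hypothetical hook method name, not necessarily what this PR implements):

```python
def execute(self, context):
    if self.retry_from_failure and context["ti"].try_number > 1:
        # Airflow is retrying the task: delegate to the rerun endpoint,
        # which lets dbt Cloud decide how to restart the previous run.
        response = self.hook.retry_failed_job_run(job_id=self.job_id)  # hypothetical
    else:
        # First attempt: trigger a normal run with any configured overrides.
        response = self.hook.trigger_job_run(
            job_id=self.job_id,
            cause=self.trigger_reason,
            steps_override=self.steps_override,
            schema_override=self.schema_override,
            additional_run_config=self.additional_run_config,
        )
```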

@Lee-W self-requested a review (April 29, 2024 01:51)
@Lee-W (Member) commented Apr 29, 2024

> I have also implemented the try_number part; could you please take a look at it as well?

Yep, just found it

> And also, if steps_override, schema_override, or threads_override is provided with retry_from_failure, should it be a warning or an error? Displaying a warning and discarding the override values might also be an option.

I feel like an error might make more sense 🤔 I don't personally use dbt that much, but I guess steps_override, schema_override, and threads_override could significantly change the behavior somehow. If that's the case, it might be better if we raise an error. But please correct me if I'm wrong 🙂 Thanks!

@boraberke (Contributor, Author)

> I feel like an error might make more sense 🤔 I don't personally use dbt that much, but I guess steps_override, schema_override, and threads_override could significantly change the behavior somehow. If that's the case, it might be better if we raise an error. But please correct me if I'm wrong 🙂 Thanks!

Yes, those parameters change the behavior significantly. My only concern is that, with the try_number > 1 check, these parameters can still take effect on the first run, i.e. try_number = 1. We can either:

  1. Not allow users to use steps_override, schema_override, or additional_run_config when retry_from_failure is set to True (raise an error).

  2. Keep it as it is and only show a warning when try_number > 1. In this case, users will be able to use those overrides on the first run, and the rerun would then do the same on the dbt Cloud side by simply rerunning the previous run, as explained in the docs.

For me, the second approach feels more suitable, as we do not limit the users; but it all hinges on try_number, which can make the behavior more complicated to understand.

Let me know what you think :)

@Lee-W (Member) commented Apr 29, 2024

@boraberke Just want to confirm the second point you mentioned: because the first run already uses steps_override, schema_override, and additional_run_config, when we rerun, it'll use the configuration from the last run, which contains those overrides. If that's the case, I would say method 2 is better.

@Lee-W (Member) left a comment

Let's update this PR based on our discussion. 🚀 Thanks!

@boraberke (Contributor, Author)

@Lee-W, I will double-check whether rerun uses the overrides, i.e. runs exactly the way the first job ran, and then add the warning or error accordingly!

Thanks for your comments!

@Lee-W (Member) commented Apr 30, 2024

> @Lee-W, I will double-check whether rerun uses the overrides, i.e. runs exactly the way the first job ran, and then add the warning or error accordingly!
>
> Thanks for your comments!

Many thanks! Not urgent. Just update the state of this PR so that everyone has a better understanding of its status 🙂

@boraberke (Contributor, Author)

@Lee-W,

I have tested it, and here is how it works:

When the first run uses steps_override, schema_override, and additional_run_config and that run fails, rerunning it will use the same config, including those overrides.

However, if the first run with steps_override, schema_override, and additional_run_config finishes successfully, rerunning it does not include any of those. Thus, it does not reuse the config of the previous run.

This means that in any case where the operator failed but the dbt job succeeded, a retry with try_number > 1 would run without the overrides.

I think it is best to raise an error if any of steps_override, schema_override, or additional_run_config is provided when retry_from_failure is True. I have updated this PR accordingly.

Let me know what you think!
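A hedged sketch of the validation this describes (the exact message and placement in the PR may differ):

```python
from airflow.exceptions import AirflowException

if self.retry_from_failure and (
    self.steps_override or self.schema_override or self.additional_run_config
):
    raise AirflowException(
        "retry_from_failure cannot be used together with steps_override, "
        "schema_override, or additional_run_config."
    )
```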

@Lee-W (Member) commented May 2, 2024

> I think it is best to raise an error if any of steps_override, schema_override, or additional_run_config is provided when retry_from_failure is True. I have updated this PR accordingly.

Indeed, I think this is what we should do. Thanks for the testing and update!

@Lee-W (Member) left a comment

left a few nitpicks, but overall it looks great!

Review threads (resolved): airflow/providers/dbt/cloud/hooks/dbt.py · tests/providers/dbt/cloud/hooks/test_dbt.py
@boraberke (Contributor, Author)

@Lee-W, thanks for the review! Fixed upon your latest comments as well :)

@Lee-W (Member) left a comment

Looks good to me. Thanks @boraberke !

@josh-fell (Contributor) left a comment

@boraberke Thanks for the contribution, killer stuff! This will be a great addition to the provider.

I'm trying to understand what happens if a user sets retry_from_failure=True on the operator, provides either steps_override, schema_override, or additional_run_config initially, and the task is naturally retried in Airflow. It seems like, with the most recent changes, the task would fail because those args were supplied originally once retry_from_failure() is called in the DbtCloudHook. Can you clarify that for me?

If yes, maybe it's worth adding the try_number check to the DbtCloudHook.trigger_job_run() method using get_current_context() too, and then raising a warning instead of an error? We wouldn't want users to set up a task correctly only for it to fail because the task retried. Although it might seem redundant, adding the check again would help keep the same functionality you propose, but applicable to users calling DbtCloudHook.trigger_job_run() directly without using the operator.
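A hedged sketch of that suggestion (the helper name is hypothetical; get_current_context() raises when called outside a running task, so the lookup is guarded):

```python
import warnings

from airflow.exceptions import AirflowException
from airflow.operators.python import get_current_context


def _is_airflow_retry() -> bool:
    """Best-effort check of whether the surrounding task is on a retry attempt."""
    try:
        context = get_current_context()
    except AirflowException:
        return False  # hook used outside a task; no retry context available
    return context["ti"].try_number > 1


# Inside DbtCloudHook.trigger_job_run(), the error could then soften to a
# warning on genuine retries (illustrative, not the PR's actual code):
# if retry_from_failure and _is_airflow_retry() and steps_override:
#     warnings.warn("Overrides are ignored when retrying from failure.")
```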

Another scenario I'm thinking about, albeit presumably a rare one, relative to the try_number check: let's say the same task, configured with retry_from_failure and an override, previously succeeded, but a user wants to clear the task so the dbt Cloud job runs again because of some upstream/downstream issue in their pipelines. I would suspect the user would think the task isn't being "retried" in a failure context; they just want to run it again. Yet the overrides wouldn't be used (assuming the logic is updated to a warning as above).

Maybe to alleviate both scenarios, when retry_from_failure=True, the trigger_job_run() method could actually retrieve the job's status from dbt Cloud and assess whether or not to call the retry endpoint based on success/failure? This would completely remove the use of Airflow internals to control how the job triggering behaves.
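A hedged sketch of that status-based approach, using existing DbtCloudHook methods (the job id, ordering parameter, and decision logic are illustrative):

```python
from airflow.providers.dbt.cloud.hooks.dbt import DbtCloudHook, DbtCloudJobRunStatus

hook = DbtCloudHook(dbt_cloud_conn_id="dbt_cloud_default")

# Fetch runs for this job, newest first, and inspect the latest status.
runs = hook.get_job_runs(
    job_definition_id=48617,  # hypothetical dbt Cloud job id
    order_by="-id",           # assumed dbt Cloud API ordering parameter
).json()["data"]

if runs and runs[0]["status"] == DbtCloudJobRunStatus.ERROR.value:
    # The latest run failed: call the retry/rerun endpoint.
    ...
else:
    # No prior failure: trigger a normal run (overrides allowed).
    ...
```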

Merging this pull request may close issue #35772: dbt Cloud provider to support retry from point of failure.