
Add retry_from_failure parameter to DbtCloudRunJobOperator #38868

Open · wants to merge 8 commits into main from feat/dbt-retry-from-failure
Conversation

@boraberke (Contributor) commented Apr 9, 2024

This PR adds a new retry_from_failure parameter to the DbtCloudRunJobOperator to retry a failed run of a dbt Cloud job from the point of failure. The implementation uses the new rerun endpoint of the dbt Cloud API, which itself looks up the last run for a given job and decides whether or not to start a new run of the job.

The new endpoint is only used when retry_from_failure is True and the task's try_number is greater than 1. It also cannot be used in conjunction with steps_override, schema_override, or additional_run_config.
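For illustration, a minimal usage sketch of the new parameter (the DAG context and job_id are hypothetical, and the parameter name follows this PR rather than any final merged API):

```python
from airflow.providers.dbt.cloud.operators.dbt import DbtCloudRunJobOperator

# Hypothetical task definition: on an Airflow retry (try_number > 1), the
# operator would call the rerun endpoint instead of starting a fresh run.
trigger_dbt_job = DbtCloudRunJobOperator(
    task_id="trigger_dbt_cloud_job",
    job_id=48617,               # hypothetical dbt Cloud job id
    retry_from_failure=True,    # rerun a failed job from its point of failure
    retries=3,                  # Airflow-level retries drive the rerun path
    wait_for_termination=True,  # poll until the run reaches a terminal state
)
```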

Closes: #35772
See also: #38001

boring-cyborg bot commented Apr 9, 2024

Congratulations on your first Pull Request and welcome to the Apache Airflow community! If you have any issues or are unsure about anything, please check our Contributors' Guide (https://github.com/apache/airflow/blob/main/contributing-docs/README.rst)
Here are some useful points:

  • Pay attention to the quality of your code (ruff, mypy and type annotations). Our pre-commits will help you with that.
  • In case of a new feature, add useful documentation (in docstrings or in the docs/ directory). Adding a new operator? Check this short guide and consider adding an example DAG that shows how users should use it.
  • Consider using the Breeze environment for testing locally; it's a heavy Docker setup, but it ships with a working Airflow and a lot of integrations.
  • Be patient and persistent. It might take some time to get a review or the final approval from Committers.
  • Please follow the ASF Code of Conduct for all communication, including (but not limited to) comments on Pull Requests, the mailing list, and Slack.
  • Be sure to read the Airflow Coding style.
  • Always keep your Pull Requests rebased, otherwise your build might fail due to changes not related to your commits.
    Apache Airflow is a community-driven project and together we are making it better 🚀.
    In case of doubts contact the developers at:
    Mailing List: dev@airflow.apache.org
    Slack: https://s.apache.org/airflow-slack

@boraberke (Contributor, Author)

Hey @josh-fell, I've created this PR based on your comments here. I would really appreciate it if you could take a look at it!

@boraberke force-pushed the feat/dbt-retry-from-failure branch from 0696015 to 2ad17e3 (April 9, 2024 17:46)
@boraberke force-pushed the feat/dbt-retry-from-failure branch from 2ad17e3 to ca504ff (April 9, 2024 17:47)
@boraberke (Contributor, Author)

Hi @josh-fell, did you have a chance to look at this PR? Would appreciate your comments!

@eladkal requested a review from Lee-W (April 26, 2024 07:12)
@Lee-W (Member) commented Apr 26, 2024

The rerun endpoint does not accept a body, which means parameters like steps_override, schema_override, threads_override, and cause cannot be passed. The current implementation always uses the rerun endpoint if retry_from_failure is set to True. To overcome this issue, the rerun endpoint can be used only if the task is retried (i.e. ti.try_number != 1).

Maybe we can check whether steps_override, schema_override, or threads_override is provided together with retry_from_failure and raise an error if so. Also, did you implement the ti.try_number part already?

@boraberke (Contributor, Author) commented Apr 28, 2024

> The rerun endpoint does not accept a body, which means parameters like steps_override, schema_override, threads_override, and cause cannot be passed. The current implementation always uses the rerun endpoint if retry_from_failure is set to True. To overcome this issue, the rerun endpoint can be used only if the task is retried (i.e. ti.try_number != 1).
>
> Maybe we can check whether steps_override, schema_override, or threads_override is provided together with retry_from_failure and raise an error if so. Also, did you implement the ti.try_number part already?

Hi @Lee-W,

I have also implemented the try_number part; could you please take a look at it as well?

And also, if steps_override, schema_override, or threads_override is provided with retry_from_failure, should it be a warning or an error? Displaying a warning and discarding the override values might also be an option.
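For context, the try_number branch described here might look roughly like the sketch below (a hedged sketch only; retry_failed_job_run is a hypothetical hook method name, not necessarily what this PR implements):

```python
def execute(self, context):
    if self.retry_from_failure and context["ti"].try_number > 1:
        # Airflow is retrying the task: delegate to the rerun endpoint,
        # which lets dbt Cloud decide how to restart the previous run.
        response = self.hook.retry_failed_job_run(job_id=self.job_id)  # hypothetical
    else:
        # First attempt: trigger a normal run with any configured overrides.
        response = self.hook.trigger_job_run(
            job_id=self.job_id,
            cause=self.trigger_reason,
            steps_override=self.steps_override,
            schema_override=self.schema_override,
            additional_run_config=self.additional_run_config,
        )
```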

@Lee-W self-requested a review (April 29, 2024 01:51)
@Lee-W (Member) commented Apr 29, 2024

> I have also implemented the try_number part; could you please take a look at it as well?

Yep, just found it

> And also, if steps_override, schema_override, or threads_override is provided with retry_from_failure, should it be a warning or an error? Displaying a warning and discarding the override values might also be an option.

I feel like an error might make more sense 🤔 I don't personally use dbt that much, but I guess steps_override, schema_override, and threads_override could significantly change the behavior somehow. If that's the case, it might be better if we raise an error. But please correct me if I'm wrong 🙂 Thanks!

@boraberke (Contributor, Author)

> I feel like an error might make more sense 🤔 I don't personally use dbt that much, but I guess steps_override, schema_override, and threads_override could significantly change the behavior somehow. If that's the case, it might be better if we raise an error. But please correct me if I'm wrong 🙂 Thanks!

Yes, those parameters change the behavior significantly. My only concern is that, with the try_number > 1 check, these parameters can still take effect on the first run, i.e. try_number = 1. We can either:

  1. Not allow users to use steps_override, schema_override, or additional_run_config when retry_from_failure is set to True (raise an error).

  2. Keep it as it is and only show a warning when try_number > 1. In this case, users will be able to use those overrides on the first run, and the rerun would then do the same on the dbt Cloud side by simply rerunning the previous run, as explained in the docs.

For me, the second approach feels more suitable, as we do not limit the users; but it all hinges on try_number, which can make the behavior more complicated to understand.

Let me know what you think :)

@Lee-W (Member) commented Apr 29, 2024

@boraberke Just want to confirm the second point you mentioned: because the first run already uses steps_override, schema_override, and additional_run_config, when we rerun, it'll use the configuration from the last run, which contains those overrides. If that's the case, I would say method 2 is better.

@Lee-W (Member) left a comment

Let's update this PR based on our discussion. 🚀 Thanks!

@boraberke (Contributor, Author)

@Lee-W, I will double-check whether rerun uses the overrides, i.e. runs exactly the way the first job ran, and then add the warning or error accordingly!

Thanks for your comments!

@Lee-W (Member) commented Apr 30, 2024

> @Lee-W, I will double-check whether rerun uses the overrides, i.e. runs exactly the way the first job ran, and then add the warning or error accordingly!
>
> Thanks for your comments!

Many thanks! Not urgent. Just update the state of this PR so that everyone has a better understanding of its status 🙂

@boraberke (Contributor, Author)

@Lee-W,

I have tested it, and here is how it works:

When the first run uses steps_override, schema_override, and additional_run_config and that run fails, rerunning it will use the same config, including those overrides.

However, if the first run with steps_override, schema_override, and additional_run_config finishes successfully, rerunning it does not include any of those. Thus, it does not reuse the config of the previous run.

This means that in any case where the operator failed but the dbt job succeeded, a retry with try_number > 1 would run without the overrides.

I think it is best to raise an error if any of steps_override, schema_override, or additional_run_config is provided when retry_from_failure is True. I have updated this PR accordingly.

Let me know what you think!
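A hedged sketch of the validation this describes (the exact message and placement in the PR may differ):

```python
from airflow.exceptions import AirflowException

if self.retry_from_failure and (
    self.steps_override or self.schema_override or self.additional_run_config
):
    raise AirflowException(
        "retry_from_failure cannot be used together with steps_override, "
        "schema_override, or additional_run_config."
    )
```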

@Lee-W (Member) commented May 2, 2024

> I think it is best to raise an error if any of steps_override, schema_override, or additional_run_config is provided when retry_from_failure is True. I have updated this PR accordingly.

Indeed, I think this is what we should do. Thanks for the testing and update!

@Lee-W (Member) left a comment

left a few nitpicks, but overall it looks great!

Review threads (resolved): airflow/providers/dbt/cloud/hooks/dbt.py · tests/providers/dbt/cloud/hooks/test_dbt.py
@boraberke (Contributor, Author)

@Lee-W, thanks for the review! Fixed upon your latest comments as well :)

@Lee-W (Member) left a comment

Looks good to me. Thanks @boraberke !

@josh-fell (Contributor) left a comment

@boraberke Thanks for the contribution, killer stuff! This will be a great addition to the provider.

I'm trying to understand what happens if a user sets retry_from_failure=True on the operator, provides either steps_override, schema_override, or additional_run_config initially, and the task is naturally retried in Airflow. It seems like, with the most recent changes, the task would fail because those args were supplied originally once retry_from_failure() is called in the DbtCloudHook. Can you clarify that for me?

If yes, maybe it's worth adding the try_number check to the DbtCloudHook.trigger_job_run() method using get_current_context() too, and then raising a warning instead of an error? We wouldn't want users to set up a task correctly only for it to fail because the task retried. Although it might seem redundant, adding the check again would help keep the same functionality you propose, but applicable to users calling DbtCloudHook.trigger_job_run() directly without using the operator.
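A hedged sketch of that suggestion (the helper name is hypothetical; get_current_context() raises when called outside a running task, so the lookup is guarded):

```python
import warnings

from airflow.exceptions import AirflowException
from airflow.operators.python import get_current_context


def _is_airflow_retry() -> bool:
    """Best-effort check of whether the surrounding task is on a retry attempt."""
    try:
        context = get_current_context()
    except AirflowException:
        return False  # hook used outside a task; no retry context available
    return context["ti"].try_number > 1


# Inside DbtCloudHook.trigger_job_run(), the error could then soften to a
# warning on genuine retries (illustrative, not the PR's actual code):
# if retry_from_failure and _is_airflow_retry() and steps_override:
#     warnings.warn("Overrides are ignored when retrying from failure.")
```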

Another scenario I'm thinking about, albeit presumably a rare one, relative to the try_number check: let's say the same task, configured with retry_from_failure and an override, previously succeeded, but a user wants to clear the task so the dbt Cloud job runs again because of some upstream/downstream issue in their pipelines. I would suspect the user would think the task isn't being "retried" in a failure context; they just want to run it again. Yet the overrides wouldn't be used (assuming the logic is updated to a warning as above).

Maybe to alleviate both scenarios, when retry_from_failure=True, the trigger_job_run() method could actually retrieve the job's status from dbt Cloud and assess whether or not to call the retry endpoint based on success/failure? This would completely remove the use of Airflow internals to control how the job triggering behaves.
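A hedged sketch of that status-based approach, using existing DbtCloudHook methods (the job id, ordering parameter, and decision logic are illustrative):

```python
from airflow.providers.dbt.cloud.hooks.dbt import DbtCloudHook, DbtCloudJobRunStatus

hook = DbtCloudHook(dbt_cloud_conn_id="dbt_cloud_default")

# Fetch runs for this job, newest first, and inspect the latest status.
runs = hook.get_job_runs(
    job_definition_id=48617,  # hypothetical dbt Cloud job id
    order_by="-id",           # assumed dbt Cloud API ordering parameter
).json()["data"]

if runs and runs[0]["status"] == DbtCloudJobRunStatus.ERROR.value:
    # The latest run failed: call the retry/rerun endpoint.
    ...
else:
    # No prior failure: trigger a normal run (overrides allowed).
    ...
```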

Merging this pull request may close issue #35772: dbt Cloud provider to support retry from point of failure.