
Add support for virtual env directory flag #611

Open · wants to merge 18 commits into main

Conversation

LennartKloppenburg

Description

Added virtualenv_dir as an option to ExecutionConfig, which is then propagated downstream to DbtVirtualenvBaseOperator.

The following now happens:

  • If the flag is set, the operator will attempt to locate the venv's python binary under the provided virtualenv_dir.
    • If found, it concludes that the venv exists and continues without creating a new one.
    • If not, it creates a new one at virtualenv_dir.
  • If the flag is not set, the operator simply continues using the temporary-directory solution that was already in place (see the sketch below).
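
A minimal sketch of that lookup, written as a standalone function for illustration (the PR itself implements this as the operator method `_get_or_create_venv_py_interpreter` discussed below; `prepare_virtualenv` is Airflow's helper from `airflow.utils.python_virtualenv`, and the unset-flag fallback to a temporary directory is not shown):

```python
from __future__ import annotations

from pathlib import Path

from airflow.utils.python_virtualenv import prepare_virtualenv


def get_or_create_venv_py_interpreter(
    virtualenv_dir: Path,
    py_requirements: list[str],
    py_system_site_packages: bool,
) -> str:
    """Return the venv's python binary under virtualenv_dir, creating the venv if it is missing."""
    py_interpreter_path = virtualenv_dir / "bin" / "python"
    if py_interpreter_path.is_file():
        # The cached venv already exists: reuse it and skip provisioning entirely.
        return str(py_interpreter_path)
    # No interpreter found: create the venv at the configured directory.
    return prepare_virtualenv(
        venv_directory=str(virtualenv_dir),
        python_bin="python",
        system_site_packages=py_system_site_packages,
        requirements=py_requirements,
    )
```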

Impact

A very basic test using a local Docker Compose set-up (as per the contribution guide) and the example_virtualenv DAG saw the DAG's runtime go down from 2m31s to just 32s. I'd expect this improvement to be even more noticeable with more complex graphs and more Python requirements.

Related Issue(s)

Implements #610

Breaking Change?

None, the flag is optional and is ignored (with a warning) when used outside of VirtualEnv execution mode.
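
For illustration, opting in within the supported execution mode might look like this (a sketch: the paths, dag_id, and requirements are placeholders, and `virtualenv_dir` is assumed to accept a `pathlib.Path` as added in this PR):

```python
from pathlib import Path

from cosmos import DbtDag, ExecutionConfig, ExecutionMode, ProfileConfig, ProjectConfig

example_virtualenv_cached = DbtDag(
    project_config=ProjectConfig("/usr/local/airflow/dbt/jaffle_shop"),
    profile_config=ProfileConfig(
        profile_name="default",
        target_name="dev",
        profiles_yml_filepath="/usr/local/airflow/dbt/jaffle_shop/profiles.yml",
    ),
    execution_config=ExecutionConfig(
        execution_mode=ExecutionMode.VIRTUALENV,
        # New flag: reuse the virtualenv at this path instead of rebuilding it per task.
        virtualenv_dir=Path("/usr/local/airflow/dbt-venv"),
    ),
    operator_args={"py_requirements": ["dbt-postgres"]},
    dag_id="example_virtualenv_cached",
)
```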

Checklist

  • I have made corresponding changes to the documentation (if required)
  • I have added tests that prove my fix is effective or that my feature works

@LennartKloppenburg LennartKloppenburg marked this pull request as ready for review October 18, 2023 14:13

codecov bot commented Oct 18, 2023

Codecov Report

Attention: Patch coverage is 83.67347%, with 8 lines in your changes missing coverage. Please review.

Project coverage is 93.06%. Comparing base (090116e) to head (87c5da0).
Report is 109 commits behind head on main.

Current head 87c5da0 differs from the pull request's most recent head be0de1a.

Please upload reports for the commit be0de1a to get more accurate results.

Files Patch % Lines
cosmos/operators/virtualenv.py 81.39% 8 Missing ⚠️
Additional details and impacted files
```
@@            Coverage Diff             @@
##             main     #611      +/-   ##
==========================================
- Coverage   93.28%   93.06%   -0.23%
==========================================
  Files          55       54       -1
  Lines        2502     2163     -339
==========================================
- Hits         2334     2013     -321
+ Misses        168      150      -18
```


tatiana (Collaborator) left a comment:

Thank you very much for creating this PR so quickly, @LennartKloppenburg ! This is looking very good.

I added some comments in line, and I have a gut feeling we may need to add some additional tests to cover the possible behaviours in _get_or_create_venv_py_interpreter.

We can aim to release this change as part of 1.2.1 (if we consider it a bugfix) or 1.3 (if we consider it a new feature) 🎉

cosmos/config.py (review thread, resolved)
cosmos/converter.py (review thread, resolved)
cosmos/converter.py (review thread, outdated, resolved)
self.log.info(f"Checking for venv interpreter: {py_interpreter_path} : {py_interpreter_path.is_file()}")
if py_interpreter_path.is_file():
self.log.info(f"Found Python interpreter in cached virtualenv: `{str(py_interpreter_path)}`")
return str(py_interpreter_path)
tatiana (Collaborator) commented:

In this case, does it still make sense to install any potential dependencies/update them - if there were requirement changes?

LennartKloppenburg (Author) replied:

Yeah, ideally we'd be able to clean up the virtual env after the DAG run, but for the reasons you mentioned before, this can be tricky.
One way to "perhaps" invalidate the virtualenv is to check when it was created and, after say 24 or 48 hours, have this operator clean it up and recreate it?

tatiana (Collaborator) replied:

The time-based approach could lead to some strange scenarios and be tricky to troubleshoot. How feasible would it be for us to run a pip install in an existing virtualenv? It should be very quick if the venv was previously set up, and it would make the operator reliable.

Regarding the cleanup - I know - ideally we'd be able to set the venv only once during the DAG setup and delete during tear down. Unfortunately - to my knowledge - even the latest Airflow (2.7) does not allow us to have a setup/tear down per worker node during the lifecycle of a DAG. But this can be an improvement for the future - in a separate PR!

LennartKloppenburg (Author) replied:

Sorry for the late response here!

So the underlying prepare_virtualenv call that we are "avoiding" after determining the venv is already there is imported from Airflow core (airflow.utils.python_virtualenv). That little helper also takes the Python requirements into account, so if we bypass it, we can't install the requirements unless we repeat its logic here:

    ...
    pip_cmd = None
    if requirements is not None and len(requirements) != 0:
        pip_cmd = _generate_pip_install_cmd_from_list(venv_directory, requirements, pip_install_options)
    if requirements_file_path is not None and requirements_file_path:
        pip_cmd = _generate_pip_install_cmd_from_file(
            venv_directory, requirements_file_path, pip_install_options
        )

    if pip_cmd:
        execute_in_subprocess(pip_cmd)
    ...

What do you think?

tatiana (Collaborator) replied:

In our case, we probably would only need to do part of the logic:

    if requirements is not None and len(requirements) != 0:
        pip_cmd = _generate_pip_install_cmd_from_list(venv_directory, requirements, pip_install_options)

Since we don't support requirements_file_path.

If we don't want to add this call unnecessarily, we'd probably need a pip freeze call - to confirm if the desired dependencies are already installed, which may be more work.

We probably need one or both of these. Otherwise, we risk an Airflow worker having partial/outdated dependencies that are incompatible with the dependencies the user requested.

I'm in favour of us caching for performance reasons, but we should still aim to keep the task idempotent.
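
A rough sketch of what that reinstall-on-reuse step could look like (hedged: it leans on Airflow's `_generate_pip_install_cmd_from_list` and `execute_in_subprocess` helpers quoted above, which are internal utilities rather than a stable API, and the function name here is hypothetical):

```python
from __future__ import annotations

from airflow.utils.process_utils import execute_in_subprocess
from airflow.utils.python_virtualenv import _generate_pip_install_cmd_from_list


def ensure_venv_requirements(
    venv_directory: str,
    requirements: list[str],
    pip_install_options: list[str] | None = None,
) -> None:
    """Re-run pip install inside an existing virtualenv so a reused venv stays up to date."""
    if not requirements:
        return
    pip_cmd = _generate_pip_install_cmd_from_list(
        venv_directory, requirements, pip_install_options or []
    )
    # pip skips requirements that are already satisfied, so this keeps the task
    # idempotent at the cost of one quick pip invocation per run.
    execute_in_subprocess(pip_cmd)
```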

dev/dags/example_virtualenv.py (review thread, outdated, resolved)
```python
    **kwargs: Any,
) -> None:
    self.py_requirements = py_requirements or []
    self.py_system_site_packages = py_system_site_packages
    super().__init__(**kwargs)
    self._venv_dir = virtualenv_dir
    self._venv_tmp_dir: None | TemporaryDirectory[str] = None

@cached_property
```
tatiana (Collaborator) commented:

A general thought: do we still want to cache this property? Is there any risk that we could end up caching the incorrect path?

LennartKloppenburg (Author) replied:

How is this property cached? If people are debugging or want to pass in more dynamically configured directories, I don't know how this decorator behaves :) Is it per task_id per dagrun_id or is it more persistent?

tatiana (Collaborator) replied:

The property is cached while the Python process is alive.
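
For context, this is just Python's `functools.cached_property`: the value is computed once per operator instance and stored on that instance, so it persists only for the lifetime of the instance inside the worker process. A tiny illustration (the `VenvOperator` class here is a toy stand-in, not Cosmos code):

```python
from functools import cached_property


class VenvOperator:  # toy stand-in for the real operator
    def __init__(self, venv_dir: str) -> None:
        self._venv_dir = venv_dir

    @cached_property
    def venv_dbt_path(self) -> str:
        print("resolving dbt path...")  # runs only on first access
        return f"{self._venv_dir}/bin/dbt"


op = VenvOperator("/tmp/my-venv")
print(op.venv_dbt_path)  # prints "resolving dbt path..." then the path
print(op.venv_dbt_path)  # cached: no recomputation within the same process
```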

LennartKloppenburg (Author) commented:

@tatiana I've updated the PR with some changes you've requested :)

One lingering issue:
When there's no virtual env handy and multiple tasks are executed concurrently, they will all attempt the same checks (checking for the directory, for the virtual env, installing packages, etc.), which leads to all kinds of issues because they don't automatically "wait" for the venv to be created by a single operator. This occurs in environments that schedule multiple DBT tasks simultaneously -- which is obviously very common.

The issue can then be resolved by retrying these tasks with some retry_delay, which will achieve the same result as "waiting" for the virtual env to be provisioned would have done.
What do you think of this?
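
For reference, that retry workaround could be wired up through standard Airflow operator arguments, which Cosmos forwards via `operator_args` (a sketch; the values are illustrative):

```python
from datetime import timedelta

# Passed to DbtDag / DbtTaskGroup via `operator_args`: if several tasks race to
# provision the shared virtualenv, the losing tasks simply retry after a delay,
# by which time the winner has finished creating it.
operator_args = {
    "py_requirements": ["dbt-postgres"],
    "retries": 3,
    "retry_delay": timedelta(seconds=60),
}
```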

pre-commit-ci bot and others added 9 commits December 17, 2023 18:54
… Rendering and Execution (astronomer#634)

This MR finishes the work that was started in astronomer#605 to add full support
for ProjectConfig.dbt_project_path = None, and implements astronomer#568.

Within this PR, several things have been updated:
1 - Added project_path fields to RenderConfig and ExecutionConfig
2 - Simplified the consumption of RenderConfig in the DbtGraph class
3 - Added an option to configure different dbt executables for Rendering vs Execution.

Closes: astronomer#568
If execution_config was reused, Cosmos 1.2.2 would raise:

```
astronomer-cosmos/dags/basic_cosmos_task_group.py
Traceback (most recent call last):
  File "/Users/tati/Code/cosmos-clean/astronomer-cosmos/venv-38/lib/python3.8/site-packages/airflow/models/dagbag.py", line 343, in parse
    loader.exec_module(new_module)
  File "<frozen importlib._bootstrap_external>", line 848, in exec_module
  File "<frozen importlib._bootstrap>", line 219, in _call_with_frames_removed
  File "/Users/tati/Code/cosmos-clean/astronomer-cosmos/dags/basic_cosmos_task_group.py", line 74, in <module>
    basic_cosmos_task_group()
  File "/Users/tati/Code/cosmos-clean/astronomer-cosmos/venv-38/lib/python3.8/site-packages/airflow/models/dag.py", line 3817, in factory
    f(**f_kwargs)
  File "/Users/tati/Code/cosmos-clean/astronomer-cosmos/dags/basic_cosmos_task_group.py", line 54, in basic_cosmos_task_group
    orders = DbtTaskGroup(
  File "/Users/tati/Code/cosmos-clean/astronomer-cosmos/cosmos/airflow/task_group.py", line 26, in __init__
    DbtToAirflowConverter.__init__(self, *args, **specific_kwargs(**kwargs))
  File "/Users/tati/Code/cosmos-clean/astronomer-cosmos/cosmos/converter.py", line 113, in __init__
    raise CosmosValueError(
cosmos.exceptions.CosmosValueError: ProjectConfig.dbt_project_path is mutually exclusive with RenderConfig.dbt_project_path and ExecutionConfig.dbt_project_path.If using RenderConfig.dbt_project_path or ExecutionConfig.dbt_project_path, ProjectConfig.dbt_project_path should be None
```

This has been raised by an Astro customer and our field engineer, who
tried to run: https://github.com/astronomer/cosmos-demo
@dosubot dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Dec 17, 2023
LennartKloppenburg (Author) commented:

@tatiana Just completed the rebase, saw some artifacts that trip up the tests, will look at those tomorrow :) !


LennartKloppenburg commented Dec 20, 2023

@tatiana
I could use two more eyes on this (from the CI/CD tests):

FAILED tests/dbt/test_graph.py::test_load_via_dbt_ls_project_config_env_vars - cosmos.dbt.graph.CosmosLoadDbtException: Unable to find the dbt executable: dbt
FAILED tests/dbt/test_graph.py::test_load_via_dbt_ls_project_config_dbt_vars - cosmos.dbt.graph.CosmosLoadDbtException: Unable to find the dbt executable: dbt
FAILED tests/dbt/test_graph.py::test_load_via_dbt_ls_render_config_selector_arg_is_used - cosmos.dbt.graph.CosmosLoadDbtException: Unable to find the dbt executable: dbt

When I run the tests locally they pass, maybe I missed something while rebasing? I rebased so much that I no longer know where it was introduced :D
Thanks in advance!!

@tatiana tatiana added status:awaiting-reviewer The issue/PR is awaiting for a reviewer input and removed status:awaiting-author Issue/PR is under discussion and waiting for author's input labels Jan 9, 2024
@tatiana tatiana modified the milestones: 1.4.0, 1.5.0 Apr 25, 2024
@tatiana tatiana self-assigned this May 10, 2024

tatiana commented May 10, 2024

Hi @LennartKloppenburg! I'm sorry for the massive delay; I've been working on other projects and it has been hard to keep up with everything. I'm planning to get back to this PR next week, so we can try to release it as part of Cosmos 1.5.

@tatiana tatiana added triage-needed Items need to be reviewed / assigned to milestone epic-assigned and removed triage-needed Items need to be reviewed / assigned to milestone labels May 17, 2024
@tatiana tatiana mentioned this pull request May 17, 2024
Labels: epic-assigned, execution:virtualenv (Related to Virtualenv execution environment), size:L (This PR changes 100-499 lines, ignoring generated files), status:awaiting-reviewer (The issue/PR is awaiting for a reviewer input)