Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Change how the profiles.yml is created when using mocked profiles #924

Closed
4 tasks
tatiana opened this issue Apr 29, 2024 · 1 comment
Closed
4 tasks

Change how the profiles.yml is created when using mocked profiles #924

tatiana opened this issue Apr 29, 2024 · 1 comment
Assignees
Labels
area:performance Related to performance, like memory usage, CPU usage, speed, etc area:profile Related to ProfileConfig, like Athena, BigQuery, Clickhouse, Spark, Trino, etc dbt:list Primarily related to dbt list command or functionality epic-assigned good first issue Good for newcomers parsing:dbt_ls Issues, questions, or features related to dbt_ls parsing profile:postgres Related to Postgres ProfileConfig
Milestone

Comments

@tatiana
Copy link
Collaborator

tatiana commented Apr 29, 2024

Context

As of Cosmos 1.3, when users use Cosmos ProfileMapping we create the profiles.yml file in two different ways:

The mocked approach was introduced so users could validate their DAGs in their CI without creating Airflow connections and without exposing database credentials.

The challenge with this approach is that, when using partial parsing, dbt expects the profiles.yml to remain unchanged, as observed during #904. On the positive side, dbt is happy to use partial parse if only the content of the environment variable changes (validated with dbt 1.7.10).

This PR #904 worked around the issue by allowing users to leverage partial parsing to avoid creating mocked profiles (using RenderConfig(enable_mock_profile=False)). However, users must ensure the CI has the desired connection (potentially with fake values). The goal of this ticket is to overcome this issue.

Acceptance criteria

  • Cosmos should create the dbt profiles.yml in the same way, referencing environment variables
  • When using mock, Cosmos should use mocked values in the environment variables
  • Users should still be able to parse the Cosmos DAG that uses ProfileMapping by running airflow dags list when no connection was created
  • Cosmos users should be able to use partial parsing during DAG building with dbt ls even when not usingRenderConfig(enable_mock_profile=False).

To validate the last criteria, the developer can change the example DAG dev/dags/basic_cosmos_task_group.py, by removing enable_mock_profile=False, run it twice with a command similar to:

AIRFLOW__COSMOS__CACHE_DIR=/tmp/
AIRFLOW__COSMOS__ENABLE_CACHE=1 AIRFLOW__LOGGING__LOGGING_LEVEL=DEBUG AIRFLOW__COSMOS__PROPAGATE_LOGS=0 AIRFLOW_CONN_AIRFLOW_DB="postgres://postgres:postgres@0.0.0.0:5432/postgres" PYTHONPATH=`pwd` AIRFLOW_HOME=`pwd` AIRFLOW__CORE__DAGBAG_IMPORT_TIMEOUT=20000 AIRFLOW__CORE__DAG_FILE_PROCESSOR_TIMEOUT=20000 airflow dags test basic_cosmos_task_group  `date -Iseconds`

And confirm that the second time, it does not print "Unable to do partial parsing"

@tatiana tatiana added good first issue Good for newcomers area:performance Related to performance, like memory usage, CPU usage, speed, etc area:profile Related to ProfileConfig, like Athena, BigQuery, Clickhouse, Spark, Trino, etc labels Apr 29, 2024
@tatiana tatiana changed the title Change mocked profiles to be mocked Change how the profiles.yml is created when using mocked profiles Apr 29, 2024
@dosubot dosubot bot added dbt:list Primarily related to dbt list command or functionality parsing:dbt_ls Issues, questions, or features related to dbt_ls parsing profile:postgres Related to Postgres ProfileConfig labels Apr 29, 2024
@tatiana tatiana added this to the 1.5.0 milestone Apr 30, 2024
tatiana added a commit that referenced this issue May 1, 2024
…ct (#904)

Improve the performance to run the benchmark DAG with 100 tasks by 34%
and the benchmark DAG with 10 tasks by 22%, by persisting the dbt
partial parse artifact in Airflow nodes. This performance can be even
higher in the case of dbt projects that take more time to be parsed.

With the introduction of #800, Cosmos supports using dbt partial parsing
files. This feature has led to a substantial performance improvement,
particularly for large dbt projects, both during Airflow DAG parsing
(using LoadMode.DBT_LS) and also Airflow task execution (when using
`ExecutionMode.LOCAL` and `ExecutionMode.VIRTUALENV`).

There were two limitations with the initial support to partial parsing,
which the current PR aims to address:

1. DAGs using Cosmos `ProfileMapping` classes could not leverage this
feature. This is because the partial parsing relies on profile files not
changing, and by default, Cosmos would mock the dbt profile in several
parts of the code. The consequence is that users trying Cosmos 1.4.0a1
will see the following message:
```
13:33:16  Unable to do partial parsing because profile has changed
13:33:16  Unable to do partial parsing because env vars used in profiles.yml have changed
```

2. The user had to explicitly provide a `partial_parse.msgpack` file in
the original project folder for their Airflow deployment - and if, for
any reason, this became outdated, the user would not leverage the
partial parsing feature. Since Cosmos runs dbt tasks from within a
temporary directory, the partial parse would be stale for some users, it
would be updated in the temporary directory, but the next time the task
was run, Cosmos/dbt would not leverage the recently updated
`partial_parse.msgpack` file.

The current PR addresses these two issues respectfully by:

1. Allowing users that want to leverage Cosmos `ProfileMapping` and
partial parsing to use `RenderConfig(enable_mock_profile=False)`

2. Introducing a Cosmos cache directory where we are persisting partial
parsing files. This feature is enabled by default, but users can opt out
by setting the Airflow configuration `[cosmos][enable_cache] = False`
(exporting the environment variable `AIRFLOW__COSMOS__ENABLE_CACHE=0`).
Users can also define the temporary directory used to store these files
using the `[cosmos][cache_dir]` Airflow configuration. By default,
Cosmos will create and use a folder `cosmos` inside the system's
temporary directory:
https://docs.python.org/3/library/tempfile.html#tempfile.gettempdir .

This PR affects both DAG parsing and task execution. Although it does
not introduce an optimisation per se, it makes the partial parse feature
implemented #800 available to more users.

Closes: #722

I updated the documentation in the PR: #898

Some future steps related to optimization associated to caching to be
addressed in separate PRs:
i. Change how we create mocked profiles, to create the file itself in
the same way, referencing an environment variable with the same name -
and only changing the value of the environment variable (#924)
ii. Extend caching to the `profiles.yml` created by Cosmos in the newly
introduced `tmp/cosmos` without the need to recreate it every time
(#925).
iii. Extend caching to the Airflow DAG/Task group as a pickle file -
this approach is more generic and would work for every type of DAG
parsing and executor. (#926)
iv. Support persisting/fetching the cache from remote storage so we
don't have to replicate it for every Airflow scheduler and worker node.
(#927)
v. Cache dbt deps lock file/avoid installing dbt steps every time. We
can leverage `package-lock.yml` introduced in dbt t 1.7
(https://docs.getdbt.com/reference/commands/deps#predictable-package-installs),
but ideally, we'd have a strategy to support older versions of dbt as
well. (#930)
vi. Support caching `partial_parse.msgpack` even when vars change:
https://medium.com/@sebastian.daum89/how-to-speed-up-single-dbt-invocations-when-using-changing-dbt-variables-b9d91ce3fb0d
vii. Support partial parsing in Docker and Kubernetes Cosmos executors
(#929)
viii. Centralise all the Airflow-based config into Cosmos settings.py &
create a dedicated docs page containing information about these (#928)

**How to validate this change**

Run the performance benchmark against this and the `main` branch,
checking the value of `/tmp/performance_results.txt`.

Example of commands run locally:

```
# Setup
AIRFLOW_HOME=`pwd` AIRFLOW_CONN_AIRFLOW_DB="postgres://postgres:postgres@0.0.0.0:5432/postgres" PYTHONPATH=`pwd` AIRFLOW_HOME=`pwd` AIRFLOW__CORE__DAGBAG_IMPORT_TIMEOUT=20000 AIRFLOW__CORE__DAG_FILE_PROCESSOR_TIMEOUT=20000 hatch run tests.py3.11-2.7:test-performance-setup

# Run test for 100 dbt models per DAG:
MODEL_COUNT=100 AIRFLOW_HOME=`pwd` AIRFLOW_CONN_AIRFLOW_DB="postgres://postgres:postgres@0.0.0.0:5432/postgres" PYTHONPATH=`pwd` AIRFLOW_HOME=`pwd` AIRFLOW__CORE__DAGBAG_IMPORT_TIMEOUT=20000 AIRFLOW__CORE__DAG_FILE_PROCESSOR_TIMEOUT=20000 hatch run tests.py3.11-2.7:test-performance
```

An example of output when running 100 with the main branch:
```
NUM_MODELS=100
TIME=114.18614888191223
MODELS_PER_SECOND=0.8757629623135543
DBT_VERSION=1.7.13
```

And with the current PR:
```
NUM_MODELS=100
TIME=75.17766404151917
MODELS_PER_SECOND=1.33018232576064
DBT_VERSION=1.7.13
```
@tatiana tatiana added triage-needed Items need to be reviewed / assigned to milestone and removed triage-needed Items need to be reviewed / assigned to milestone labels May 17, 2024
@pankajastro
Copy link
Contributor

pankajastro commented May 29, 2024

We tested it, but unfortunately, the partial parsing does not work even when the environment variable value changes. In the example below, I change the POSTGRES_HOST environment variable value, but partial parsing still does not work.

(devenv) pankaj@Pankajs-MacBook-Pro astronomer-cosmos % cat .env 
POSTGRES_HOST=0.0.0.0
POSTGRES_USER=postgres
POSTGRES_PORT=5432
POSTGRES_DB=postgres
POSTGRES_SCHEMA=public

(devenv) pankaj@Pankajs-MacBook-Pro astronomer-cosmos % source .env 

(devenv) pankaj@Pankajs-MacBook-Pro astronomer-cosmos % dbt ls --target-path dev --profiles-dir dev/dags/dbt/perf/ --project-dir dev/dags/dbt/perf/
20:12:55  Running with dbt=1.8.1
20:12:55  Registered adapter: postgres=1.8.0
Manifestpath:  dev/dags/dbt/perf/dev/partial_parse.msgpack
20:12:56  Unable to do partial parsing because profile has changed
20:12:56  Found 413 macros
20:12:56  No nodes selected!

(devenv) pankaj@Pankajs-MacBook-Pro astronomer-cosmos % dbt ls --target-path dev --profiles-dir dev/dags/dbt/perf/ --project-dir dev/dags/dbt/perf/
20:13:01  Running with dbt=1.8.1
20:13:01  Registered adapter: postgres=1.8.0
Manifestpath:  dev/dags/dbt/perf/dev/partial_parse.msgpack
20:13:01  Found 413 macros
20:13:01  No nodes selected!

(devenv) pankaj@Pankajs-MacBook-Pro astronomer-cosmos % source .env                                                                                

(devenv) pankaj@Pankajs-MacBook-Pro astronomer-cosmos % cat .env                                                                                   
POSTGRES_HOST=localhost
POSTGRES_USER=postgres
POSTGRES_PORT=5432
POSTGRES_DB=postgres
POSTGRES_SCHEMA=public

(devenv) pankaj@Pankajs-MacBook-Pro astronomer-cosmos % dbt ls --target-path dev --profiles-dir dev/dags/dbt/perf/ --project-dir dev/dags/dbt/perf/
20:13:30  Running with dbt=1.8.1
20:13:30  Registered adapter: postgres=1.8.0
Manifestpath:  dev/dags/dbt/perf/dev/partial_parse.msgpack
20:13:31  Unable to do partial parsing because profile has changed
20:13:31  Found 413 macros
20:13:31  No nodes selected!

(devenv) pankaj@Pankajs-MacBook-Pro astronomer-cosmos % cat dev/dags/dbt/perf/profiles.yml 
default:
  target: dev
  outputs:
    dev:
      type: postgres
      host: "{{ env_var('POSTGRES_HOST') }}"
      user: "{{ env_var('POSTGRES_USER') }}"
      password: "{{ env_var('POSTGRES_PASSWORD') }}"
      port: "{{ env_var('POSTGRES_PORT') | int }}"
      dbname: "{{ env_var('POSTGRES_DB') }}"
      schema: "{{ env_var('POSTGRES_SCHEMA') }}"
      threads: 4
(devenv) pankaj@Pankajs-MacBook-Pro astronomer-cosmos % 

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:performance Related to performance, like memory usage, CPU usage, speed, etc area:profile Related to ProfileConfig, like Athena, BigQuery, Clickhouse, Spark, Trino, etc dbt:list Primarily related to dbt list command or functionality epic-assigned good first issue Good for newcomers parsing:dbt_ls Issues, questions, or features related to dbt_ls parsing profile:postgres Related to Postgres ProfileConfig
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants