
[dagster-dbt] Refactor row count collection code #21943

Merged · 7 commits merged into master on May 22, 2024

Conversation

@benpankow (Member) commented May 17, 2024

Summary

In preparation for supporting other post-dbt-materialization logic, this PR moves the bulk of the "map an operation across dbt events" logic into a util. A stacked PR will move on to implementing an even more generic "map"-style operation on DbtEventIterator.

Test Plan

Existing unit tests.

@benpankow changed the title from "refactor dbt mapping logic" to "[dagster-dbt] Refactor row count collection code" May 17, 2024
@benpankow requested review from sryza and rexledesma and removed the request for sryza May 17, 2024 18:52
@benpankow marked this pull request as ready for review May 17, 2024 18:52
@benpankow force-pushed the benpankow/generify-dbt-2 branch 2 times, most recently from 02be352 to abb8b7e, May 20, 2024 23:22
@rexledesma (Member) left a comment

See comments

```diff
@@ -697,7 +697,9 @@ def stream(
         def my_dbt_assets(context, dbt: DbtCliResource):
             yield from dbt.cli(["run"], context=context).stream()
         """
-        return DbtEventIterator(self._stream_asset_events(), self)
+        return DbtEventIterator(
+            self._stream_asset_events(), self, ThreadPoolExecutor(STREAM_EVENTS_THREADPOOL_SIZE)
+        )
```
@rexledesma (Member) commented:

  • We should ensure that the threadpool's context is properly handled.
  • STREAM_EVENTS_THREADPOOL_SIZE should potentially be modifiable by the user. Perhaps it should be set similar to termination_timeout_seconds, so that it can be overridden if needed.

@benpankow (Member, Author) replied May 21, 2024:

Updated to a num_threads param on each call which gives users the ability to control how much fan-out they want at each step & avoids us having to thread through the threadpool to each iterator, where managing lifecycle can be tricky (since no single iterator "owns" the pool). Since the number of chained calls is small, the overhead from opening a new pool a few more times should be minimal.
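The per-call pattern described here can be sketched roughly as follows; `fan_out` and the names below are illustrative stand-ins, not the actual dagster-dbt API. Each chained step opens its own short-lived pool, so no single iterator has to own a shared executor's lifecycle:

```python
from concurrent.futures import ThreadPoolExecutor

DEFAULT_NUM_THREADS = 4  # stand-in for the default discussed above

def fan_out(events, fn, num_threads=DEFAULT_NUM_THREADS):
    # Each call opens (and deterministically closes) its own pool via the
    # with-block, sidestepping the shared-pool ownership problem.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(fn, events))

# Chained calls can each pick their own fan-out.
step1 = fan_out(range(8), lambda e: e * 2, num_threads=2)
step2 = fan_out(step1, lambda e: e + 1, num_threads=4)
```

Since `pool.map` preserves input order, chaining these calls behaves like a plain sequential pipeline, just with bounded parallelism at each step.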

```python
# as the one dbt uses, is open.
try:
    from dbt.adapters.duckdb import DuckDBAdapter

with pushd(str(self._dbt_cli_invocation.project_dir)):
```
@rexledesma (Member) commented:

I'm afraid that this folder management scheme might bite us in the future...

In a separate PR, could we consider moving this pushd behavior so it happens explicitly when the .adapter property of DbtCliInvocation is accessed?
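A rough sketch of that suggestion, using an illustrative stand-in class rather than the real DbtCliInvocation, and a minimal pushd helper in place of the one referenced in the diff:

```python
import os
from contextlib import contextmanager

@contextmanager
def pushd(path):
    # Minimal stand-in for the pushd helper referenced in the diff.
    prev = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        os.chdir(prev)

class FakeCliInvocation:
    # Illustrative stand-in for DbtCliInvocation; not the real class.
    def __init__(self, project_dir):
        self.project_dir = project_dir

    @property
    def adapter(self):
        # The directory change is scoped to adapter construction, so callers
        # never observe a mutated working directory.
        with pushd(str(self.project_dir)):
            return self._build_adapter()

    def _build_adapter(self):
        return os.getcwd()  # placeholder for real adapter construction

invocation = FakeCliInvocation(os.path.sep)
cwd_before = os.getcwd()
adapter_dir = invocation.adapter  # built inside the project dir
assert os.getcwd() == cwd_before  # caller's cwd is untouched
```

Scoping the pushd to the property access keeps the folder juggling in one place instead of spreading it across every call site.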


```diff
 @public
 @experimental
 def fetch_row_counts(
-    self,
+    self, *, num_threads=DEFAULT_EVENT_POSTPROCESSING_THREADPOOL_SIZE
```
@rexledesma (Member) commented:

This feels off, since this implies that num_threads would also be an argument for other methods that emit metadata for the DbtEventIterator (e.g. .fetch_column_schema).

Could we instead just have it as a property on DbtCliInvocation?

@benpankow (Member, Author) replied:

I was planning to have it as a kwarg param on each of the chained/builder methods, so users could specify a different threadpool size for each.

The previous shared threadpool approach is tricky because of its lifecycle - there's no clear "owner" or with-context scope we can use. Could move the configuration option to DbtCliInvocation but still have separate threadpools?

@rexledesma (Member) replied:

> Could move the configuration option to DbtCliInvocation but still have separate threadpools?

Yeah I think this is preferable.

@rexledesma (Member) left a comment

I don't think we need to expose num_threads -- it should just be a property on DbtCliInvocation for now.


```python
):
    with ThreadPoolExecutor(
        max_workers=self._dbt_cli_invocation.postprocessing_threadpool_num_threads,
        thread_name_prefix="fetch_row_counts_",
```
@rexledesma (Member) commented:

I believe there's already a separator added by the executor itself, so there's no need to add one manually.

Suggested change:

```diff
-    thread_name_prefix="fetch_row_counts_",
+    thread_name_prefix="fetch_row_counts",
```
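For what it's worth, CPython's ThreadPoolExecutor does insert the underscore itself when naming worker threads, which is easy to verify:

```python
from concurrent.futures import ThreadPoolExecutor
import threading

# ThreadPoolExecutor names workers "<prefix>_<n>", adding the underscore
# itself, so a trailing "_" in the prefix would yield a doubled separator
# like "fetch_row_counts__0".
with ThreadPoolExecutor(max_workers=1, thread_name_prefix="fetch_row_counts") as pool:
    worker_name = pool.submit(lambda: threading.current_thread().name).result()

print(worker_name)  # e.g. "fetch_row_counts_0"
```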

@benpankow benpankow merged commit b3cdcb2 into master May 22, 2024
1 check passed
@benpankow benpankow deleted the benpankow/generify-dbt-2 branch May 22, 2024 18:18