
refactoring-external-materialization #332

Draft: wants to merge 17 commits into master

Conversation

milicevica23 (Contributor)

Hi @jwills,
I wanted to open a pull request to ask questions and potentially propose the refactoring.

Over the next few days I will try to list all the requirements and showcase the current external materialization implementation here: https://github.com/l-mds/demo-dbt-duckdb-external-materialization

I would also like to go through that list and implement the missing tests before I start to refactor, to ensure that the API stays the same. I will also have some questions and need support there.

jwills (Collaborator) commented Feb 8, 2024

Ah these are all good questions, will inline some comments shortly

@@ -27,6 +30,8 @@
{%- set backup_relation = make_backup_relation(target_relation, backup_relation_type) -%}
-- as above, the backup_relation should not already exist
{%- set preexisting_backup_relation = load_cached_relation(backup_relation) -%}

-- What are grants here?
jwills (Collaborator):

GRANTS in the context of e.g. Snowflake or another cloud DWH. DuckDB doesn't have an equivalent concept yet, though it wouldn't surprise me to see it in the context of MotherDuck or Iceberg/Delta tables.
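
For reference, this is roughly what a grants config looks like on the dbt side (standard dbt model config; the role name is illustrative, and with DuckDB there is currently no GRANT statement for the adapter to run):

{{ config(materialized='table', grants={'select': ['reporting_role']}) }}
select 1 as id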

milicevica23 (Contributor, Author):

I will skip it for now to simplify, but the access grants here are an exciting topic. We want to export from the current context/DuckDB into external Parquet, Delta, or Iceberg files, and register a view over that location so the table can be referenced in the next steps.
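
A rough sketch of that intent in plain DuckDB SQL (paths and relation names are illustrative, not the actual macro output):

COPY (SELECT * FROM my_model__dbt_tmp) TO 's3://my-bucket/my_model.parquet' (FORMAT PARQUET);
CREATE OR REPLACE VIEW my_model AS SELECT * FROM read_parquet('s3://my-bucket/my_model.parquet');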

dbt/include/duckdb/macros/materializations/external.sql (outdated review thread, resolved)
dbt/include/duckdb/macros/materializations/external.sql (outdated review thread, resolved)
@@ -76,9 +82,10 @@
{{ drop_relation_if_exists(temp_relation) }}

-- register table into glue
-- I don't know Glue, so can you explain it a bit?
jwills (Collaborator):

There's a lot of history here-- the ability to register an external table with the AWS Glue catalog (and thus make it accessible to e.g. AWS Athena) actually predates the concept of dbt-duckdb plugins (in the BasePlugin sense.) The original idea of plugins for external relations was to generalize this capability to other cataloging systems.

milicevica23 (Contributor, Author):

Nice, I see; I will need your support from the requirements point of view because I am not sure I can picture all the use cases implemented so far. I don't have much experience in the AWS world (more Azure and Google Cloud), but I am happy to learn about Glue and Athena.
I will skip this one until we agree on a structure for the code; then we can talk about the details and each use case.

dbt/include/duckdb/macros/materializations/external.sql (outdated review thread, resolved)
dbt/include/duckdb/macros/materializations/external.sql (outdated review thread, resolved)
{%- set language = model['language'] -%}

{%- set target_relation = this.incorporate(type='view') %}

-- Continue as normal materialization
-- Why do we have temp and intermediate relations?
jwills (Collaborator):

The idea is to not destroy an existing relation that has the same name as this one (if it exists in the catalog) in case something fails during the construction of the relation defined by this model.
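
A rough sketch of that sequence, with made-up relation names (not the actual macro code):

CREATE TABLE my_model__dbt_tmp AS SELECT 42 AS answer;  -- build the new relation under an intermediate name
ALTER TABLE my_model RENAME TO my_model__dbt_backup;    -- move the existing relation aside
ALTER TABLE my_model__dbt_tmp RENAME TO my_model;       -- promote the new relation to the final name
DROP TABLE my_model__dbt_backup;                        -- drop the old data only after the swap succeeded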

milicevica23 (Contributor, Author):

I understand; we prepare everything and do all the transformations, and at the very end we exchange the names.

I understand how it fits into the context, but I am not sure yet what it means for me right now in the implementation, so please make sure that I don't overlook it there.

{%- set language = model['language'] -%}

{%- set target_relation = this.incorporate(type='view') %}

-- Continue as normal materialization
-- Why do we have temp and intermediate relations?
-- What does load_cached_relation do?
jwills (Collaborator):

Built-in dbt method; dbt will do the equivalent of caching the DB catalog metadata it loaded during the run so as to minimize the number of round-trips it does to the database (less relevant for DuckDB, more relevant for e.g. MotherDuck.)
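
For illustration, a minimal sketch of how such a lookup appears inside a materialization macro (the log message is made up):

{%- set existing_relation = load_cached_relation(this) -%}
{%- if existing_relation is not none -%}
    {{ log("found relation in dbt's cache: " ~ existing_relation, info=True) }}
{%- endif -%}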

milicevica23 (Contributor, Author):

I will have to take a look at MotherDuck here because I still don't understand how this fits in and where the places are that I have to be careful not to overlook functionality. So please, if you see something that goes against MotherDuck logic, make me aware of it.

milicevica23 (Contributor, Author)

Hi @jwills,
Thank you. I will try to add my comments and further questions and, in the next few days, put together a first PoC of how I think this should work. Please be critical.

The main goal, for now, is to showcase the code structure, then we can go into details.

To simplify a bit, I will move the following out of scope for now:

  • glue registration -> I don't have enough experience with Glue, so I have to check it out and will come back to it afterward (it should work before merging)
  • arrow batch streaming -> the first version will use an Arrow Table, which is a full materialization, rather than exporting in batches; see https://twitter.com/milicevica23/status/1754802477575078041 (important for Delta)
  • assume that a model is not referenced in a different run, so the functionality from "register_upstream_external_models" is not needed for now (it should work before merging)

milicevica23 (Contributor, Author) commented Feb 10, 2024

So, I added the structure and will start with a simple implementation of the native plugin, which will hopefully show whether this makes sense.

This is still a work in progress, but if you have comments or questions, I would be happy to answer them.

I have a general question: is there some concept of a release candidate / nightly build that people can use to try this before they start to rely on it? :D I am not sure I can get the API 1-to-1 in the first run.

jwills (Collaborator) commented Feb 13, 2024

> So, I added the structure and will start with a simple implementation of the native plugin, which will hopefully show whether this makes sense.
>
> This is still a work in progress, but if you have comments or questions, I would be happy to answer them.
>
> I have a general question: is there some concept of a release candidate / nightly build that people can use to try this before they start to rely on it? :D I am not sure I can get the API 1-to-1 in the first run.

There isn't a nightly or some such thing as of yet; if we add it, we will add it as a net-new thing that is disabled by default (which is how we've done a bunch of other things, like e.g. the fast CSV loading for dbt seed)

milicevica23 (Contributor, Author)

Hi @jwills, can you take a look? I did a first draft of the native functionality and a new test which should cover some of the general cases.

milicevica23 marked this pull request as draft on February 24, 2024 at 14:22
milicevica23 (Contributor, Author) commented Feb 25, 2024

Hi @jwills,
I think I finished the first round of refactoring, so maybe you could take a look and see if this makes sense to you.
What I did:

  • refactored the external materialization so that the creation of the external tables is handled by the plugin
  • the native plugin is a new plugin which handles parquet, json, and csv
  • adapted the Excel and sqlalchemy plugins to use the new code flow
  • tried to keep all the tests that were in place running without big changes
  • adapted the upstream macro to register the needed tables and data frames

The new code flow provides a nice feature: the exported file can be referenced directly at its target destination. That sounds trivial for the native formats, but it is especially important for future implementations of Iceberg, Delta (BigQuery, Snowflake?) adapters, which can now be implemented within external materializations.
It also provides a very convenient way to register a data frame in the upstream macro.
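
As a sketch of the user-facing side, a model using the external materialization might look like this (the location and format values are illustrative):

-- models/my_external_model.sql
{{ config(materialized='external', location='output/my_external_model.parquet', format='parquet') }}
select * from {{ ref('some_upstream_model') }}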

What is there still to do:
missing:

  • Glue integration -> I skipped it for now but will take a look; I may need your help with this because I have never used it
  • python part of the external materialization -> I don't know much about this and didn't find tests for it, so I am not sure how to go about it, but I will take a look
  • buenavista integration -> what is the status there? I have to see what should be supported and adapt to that
  • other general stuff that I overlooked?

has to do:

  • the native plugin has to be initialized as the default on its own, and I have to find a place to do that
  • find a way for the native plugin load to avoid going through the Arrow format and work directly instead
  • add more tests for native, excel, sqlalchemy and others -> we have to take use cases from the community in order to cover all cases (I need your help here to motivate people :D)
  • look into the upstream macro, because I think that with the old implementation not all cases were handled correctly due to a problem with the rendering -> I will try to explain in the future, but https://stackoverflow.com/questions/22655031/jinja2-how-to-put-a-block-in-an-if-statement shows how the rendering works and I think we didn't use it right
  • showcase the delta plugin and write tests

nice to have:

  • maybe add some tests with azurite and minio to simulate object storage

I hope this gives you a bit of an overview of what I did; I am happy to answer any questions.

I am still fighting with mypy and the formatter in the pre-commit hooks; if you have suggestions on how to go about this, I would appreciate them a lot.

jwills (Collaborator) commented Feb 27, 2024

@milicevica23 ah that is very cool, thanks for the update! I have some stuff going on this week but I will dive into this on Thursday!

jwills (Collaborator) commented Mar 3, 2024

@milicevica23 when you get a chance could you redo the formatting according to the pre-commit run --all-files setup? All of the formatting diffs make this hard to review

jwills (Collaborator) left a review comment

Some notes/comments/Qs -- general comment that there is a lot of "coursor" vs. "cursor" stuff in the codebase now; it would be good to do a find/replace on "coursor" -> "cursor"

Biggest Q is the one on the write perf impact of DuckDB doing an extra level of indirection via arrow before it writes the results of a query to a file

dbt/adapters/duckdb/plugins/__init__.py (outdated review thread, resolved)
dbt/adapters/duckdb/plugins/__init__.py (outdated review thread, resolved)
def store(self, df: DuckDBPyRelation, target_config: TargetConfig, cursor=None):
    location = target_config.location.path
    options = external_write_options(location, target_config.config.get("options", {}))
    cursor.sql(f"COPY (SELECT * FROM df) to '{location}' ({options})")
jwills (Collaborator):

Yeah so this is my core question: does this extra level of indirection here (query to df, then a query of the df that writes to an external file) introduce overhead or a perf hit that is going to cause issues for existing external users? Or is the DuckDB optimizer smart enough to optimize this away (for e.g. arrow record batches or some such thing)?

milicevica23 (Contributor, Author):

Hi @jwills, a very good question.
I am unsure whether it actually converts to Arrow here, but I will try something empirically today to show whether that is the case. I intentionally hand over a DuckDBPyRelation so the plugin's store function can choose the conversion it needs.
Independent of the result, we would have the target_config.config.model.compiled_code variable, which we can embed into the string; this would be the same as the old path.
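
To illustrate, the statement the plugin could issue if the compiled model SQL were embedded directly (the location and inner query here are made up):

COPY (
    select id, name from upstream_model  -- target_config.config.model.compiled_code
) TO 'output/my_model.parquet' (FORMAT PARQUET);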

milicevica23 (Contributor, Author):

Hi @jwills, in my unscientific empirical experiment we can see that the new flow/method is faster than the previous one, where we first create a table and then export it.
I also added the conversion via an Arrow table and a pandas data frame to make it better comparable.

[benchmark results chart]

The code is here: https://github.com/milicevica23/wtf-memory/blob/main/002-basics-duckdb/export-parquet-benchmark.py

Please be aware that there can be some mistakes in terms of caching and memory swapping, but I think it is enough to show that there is no significant difference in how DuckDB retrieves the data.
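
For reference, the two paths being compared look roughly like this in plain DuckDB SQL (the query and file names are illustrative; the actual measurements come from the linked script):

-- previous path: materialize a table first, then export it
CREATE TABLE my_model AS SELECT * FROM range(100000000);
COPY my_model TO 'my_model_old.parquet' (FORMAT PARQUET);

-- new path: COPY the query result straight to the file, with no intermediate table
COPY (SELECT * FROM range(100000000)) TO 'my_model_new.parquet' (FORMAT PARQUET);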

milicevica23 (Contributor, Author):

I also did a bit of experimenting with memory usage to see if I can visualize it; the new c) variant is better because there is no intermediate materialization.

version b): [memory usage chart]

version c): [memory usage chart]

For the other variants, see https://github.com/milicevica23/wtf-memory/tree/main/002-basics-duckdb/parquet-export

dbt/adapters/duckdb/plugins/postgres.py (review thread, resolved)
dbt/adapters/duckdb/plugins/sqlalchemy.py (review thread, resolved)