
refactoring-external-materialization #332

Draft: wants to merge 17 commits into master

Conversation

milicevica23 (Contributor)

Hi @jwills,
I wanted to open a pull request to ask questions and potentially propose the refactoring.

Over the next few days I will try to list all the requirements and showcase the current external materialization implementation here: https://github.com/l-mds/demo-dbt-duckdb-external-materialization

I would also like to go through that list and implement the missing tests before I start to refactor, to ensure that the API stays the same. I will also have some questions and need support there.

jwills (Collaborator) commented Feb 8, 2024

Ah these are all good questions, will inline some comments shortly

@@ -27,6 +30,8 @@
{%- set backup_relation = make_backup_relation(target_relation, backup_relation_type) -%}
-- as above, the backup_relation should not already exist
{%- set preexisting_backup_relation = load_cached_relation(backup_relation) -%}

-- What are grants here?
jwills (Collaborator):

GRANTS in the context of e.g. Snowflake or another cloud DWH. DuckDB doesn't have an equivalent concept yet, though it wouldn't surprise me to see it in the context of MotherDuck or Iceberg/Delta tables.
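
For reference, this is roughly what a grants config looks like on the dbt side (standard dbt model config; the role name is illustrative, and with DuckDB there is currently no GRANT statement for the adapter to run):

{{ config(materialized='table', grants={'select': ['reporting_role']}) }}
select 1 as id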

milicevica23 (Contributor, Author):

I will skip it for now to simplify, but the access grants here are an exciting topic. We want to export from the current context/DuckDB into external Parquet, Delta, or Iceberg files, and register a view over that location so the table can be referenced in the next steps.
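
A rough sketch of that intent in plain DuckDB SQL (paths and relation names are illustrative, not the actual macro output):

COPY (SELECT * FROM my_model__dbt_tmp) TO 's3://my-bucket/my_model.parquet' (FORMAT PARQUET);
CREATE OR REPLACE VIEW my_model AS SELECT * FROM read_parquet('s3://my-bucket/my_model.parquet');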

dbt/include/duckdb/macros/materializations/external.sql (outdated review thread, resolved)
dbt/include/duckdb/macros/materializations/external.sql (outdated review thread, resolved)
@@ -76,9 +82,10 @@
{{ drop_relation_if_exists(temp_relation) }}

-- register table into glue
-- I don't know Glue, so can you explain it a bit?
jwills (Collaborator):

There's a lot of history here-- the ability to register an external table with the AWS Glue catalog (and thus make it accessible to e.g. AWS Athena) actually predates the concept of dbt-duckdb plugins (in the BasePlugin sense.) The original idea of plugins for external relations was to generalize this capability to other cataloging systems.

milicevica23 (Contributor, Author):

Nice, I see; I will need your support from the requirements point of view because I am not sure I can picture all the use cases implemented so far. I don't have much experience in the AWS world (more Azure and Google Cloud), but I am happy to learn about Glue and Athena.
I will skip this one until we agree on a structure for the code; then we can talk about the details and each use case.

dbt/include/duckdb/macros/materializations/external.sql (outdated review thread, resolved)
dbt/include/duckdb/macros/materializations/external.sql (outdated review thread, resolved)
{%- set language = model['language'] -%}

{%- set target_relation = this.incorporate(type='view') %}

-- Continue as normal materialization
-- Why do we have temp and intermediate relations?
jwills (Collaborator):

The idea is to not destroy an existing relation that has the same name as this one (if it exists in the catalog) in case something fails during the construction of the relation defined by this model.
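
A rough sketch of that sequence, with made-up relation names (not the actual macro code):

CREATE TABLE my_model__dbt_tmp AS SELECT 42 AS answer;  -- build the new relation under an intermediate name
ALTER TABLE my_model RENAME TO my_model__dbt_backup;    -- move the existing relation aside
ALTER TABLE my_model__dbt_tmp RENAME TO my_model;       -- promote the new relation to the final name
DROP TABLE my_model__dbt_backup;                        -- drop the old data only after the swap succeeded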

milicevica23 (Contributor, Author):

I understand; we prepare everything and do all the transformations, and at the very end we exchange the names.

I understand how it fits into the context, but I am not sure yet what it means for me right now in the implementation, so please make sure that I don't overlook it there.

{%- set language = model['language'] -%}

{%- set target_relation = this.incorporate(type='view') %}

-- Continue as normal materialization
-- Why do we have temp and intermediate relations?
-- What does load_cached_relation do?
jwills (Collaborator):

Built-in dbt method; dbt will do the equivalent of caching the DB catalog metadata it loaded during the run so as to minimize the number of round-trips it does to the database (less relevant for DuckDB, more relevant for e.g. MotherDuck.)
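
For illustration, a minimal sketch of how such a lookup appears inside a materialization macro (the log message is made up):

{%- set existing_relation = load_cached_relation(this) -%}
{%- if existing_relation is not none -%}
    {{ log("found relation in dbt's cache: " ~ existing_relation, info=True) }}
{%- endif -%}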

milicevica23 (Contributor, Author):

I will have to take a look at MotherDuck here because I still don't understand how this fits in and where the places are that I have to be careful not to overlook functionality. So please, if you see something that goes against MotherDuck logic, make me aware of it.

milicevica23 (Contributor, Author)

Hi @jwills,
Thank you. I will try to add my comments and further questions and, in the next few days, put together a first PoC of how I think this should work. Please be critical.

The main goal, for now, is to showcase the code structure, then we can go into details.

To simplify a bit, I will move the following out of scope for now:

  • glue registration -> I don't have enough experience with Glue, so I have to check it out and will come back to it afterward (it should work before merging)
  • arrow batch streaming -> the first version will use an Arrow Table, which is a full materialization, rather than exporting in batches; see https://twitter.com/milicevica23/status/1754802477575078041 (important for Delta)
  • assume that a model is not referenced in a different run, so the functionality from "register_upstream_external_models" is not needed for now (it should work before merging)

milicevica23 (Contributor, Author) commented Feb 10, 2024

So, I added the structure and will start with a simple implementation of the native plugin, which will hopefully show whether this makes sense.

This is still a work in progress, but if you have comments or questions, I would be happy to answer them.

I have a general question: is there some concept of a release candidate / nightly build that people can use to try this before they start to rely on it? :D I am not sure I can get the API 1-to-1 in the first run.

jwills (Collaborator) commented Feb 13, 2024

> So, I added the structure and will start with a simple implementation of the native plugin, which will hopefully show whether this makes sense.
>
> This is still a work in progress, but if you have comments or questions, I would be happy to answer them.
>
> I have a general question: is there some concept of a release candidate / nightly build that people can use to try this before they start to rely on it? :D I am not sure I can get the API 1-to-1 in the first run.

There isn't a nightly or some such thing as of yet; if we add it, we will add it as a net-new thing that is disabled by default (which is how we've done a bunch of other things, like e.g. the fast CSV loading for dbt seed)

milicevica23 (Contributor, Author)

Hi @jwills, can you take a look? I did a first draft of the native functionality and a new test which should cover some of the general cases.

milicevica23 marked this pull request as draft on February 24, 2024 at 14:22
milicevica23 (Contributor, Author) commented Feb 25, 2024

Hi @jwills,
I think I finished the first round of refactoring, so maybe you could take a look and see if this makes sense to you.
What I did:

  • refactored the external materialization so that the creation of the external tables is handled by the plugin
  • the native plugin is a new plugin which handles parquet, json, and csv
  • adapted the Excel and sqlalchemy plugins to use the new code flow
  • tried to keep all the tests that were in place running without big changes
  • adapted the upstream macro to register the needed tables and data frames

The new code flow provides a nice feature: the exported file can be referenced directly at its target destination. That sounds trivial for the native formats, but it is especially important for future implementations of Iceberg, Delta (BigQuery, Snowflake?) adapters, which can now be implemented within external materializations.
It also provides a very convenient way to register a data frame in the upstream macro.
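
As a sketch of the user-facing side, a model using the external materialization might look like this (the location and format values are illustrative):

-- models/my_external_model.sql
{{ config(materialized='external', location='output/my_external_model.parquet', format='parquet') }}
select * from {{ ref('some_upstream_model') }}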

What is there still to do:
missing:

  • Glue integration -> I skipped it for now but will take a look; I may need your help with this because I have never used it
  • python part of the external materialization -> I don't know much about this and didn't find tests for it, so I am not sure how to go about it, but I will take a look
  • buenavista integration -> what is the status there? I have to see what should be supported and adapt to that
  • other general stuff that I overlooked?

has to do:

  • the native plugin has to be initialized as the default on its own, and I have to find a place to do that
  • find a way for the native plugin load to avoid going through the Arrow format and work directly instead
  • add more tests for native, excel, sqlalchemy and others -> we have to take use cases from the community in order to cover all cases (I need your help here to motivate people :D)
  • look into the upstream macro, because I think that with the old implementation not all cases were handled correctly due to a problem with the rendering -> I will try to explain in the future, but https://stackoverflow.com/questions/22655031/jinja2-how-to-put-a-block-in-an-if-statement shows how the rendering works and I think we didn't use it right
  • showcase the delta plugin and write tests

nice to have:

  • maybe add some tests with azurite and minio to simulate object storage

I hope this gives you a bit of an overview of what I did; I am happy to answer any questions.

I am still fighting with mypy and the formatter in the pre-commit hooks; if you have suggestions on how to go about this, I would appreciate them a lot.

jwills (Collaborator) commented Feb 27, 2024

@milicevica23 ah that is very cool, thanks for the update! I have some stuff going on this week but I will dive into this on Thursday!

jwills (Collaborator) commented Mar 3, 2024

@milicevica23 when you get a chance could you redo the formatting according to the pre-commit run --all-files setup? All of the formatting diffs make this hard to review

jwills (Collaborator) left a review comment

Some notes/comments/Qs -- general comment that there is a lot of "coursor" vs. "cursor" stuff in the codebase now; it would be good to do a find/replace on "coursor" -> "cursor"

Biggest Q is the one on the write perf impact of DuckDB doing an extra level of indirection via arrow before it writes the results of a query to a file

dbt/adapters/duckdb/plugins/__init__.py (outdated review thread, resolved)
dbt/adapters/duckdb/plugins/__init__.py (outdated review thread, resolved)
def store(self, df: DuckDBPyRelation, target_config: TargetConfig, cursor=None):
    location = target_config.location.path
    options = external_write_options(location, target_config.config.get("options", {}))
    cursor.sql(f"COPY (SELECT * FROM df) to '{location}' ({options})")
jwills (Collaborator):

Yeah so this is my core question: does this extra level of indirection here (query to df, then a query of the df that writes to an external file) introduce overhead or a perf hit that is going to cause issues for existing external users? Or is the DuckDB optimizer smart enough to optimize this away (for e.g. arrow record batches or some such thing)?

milicevica23 (Contributor, Author):

Hi @jwills, a very good question.
I am unsure whether it actually converts to Arrow here, but I will try something empirically today to show whether that is the case. I intentionally hand over a DuckDBPyRelation so the plugin's store function can choose the conversion it needs.
Independent of the result, we would have the target_config.config.model.compiled_code variable, which we can embed into the string; this would be the same as the old path.
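
To illustrate, the statement the plugin could issue if the compiled model SQL were embedded directly (the location and inner query here are made up):

COPY (
    select id, name from upstream_model  -- target_config.config.model.compiled_code
) TO 'output/my_model.parquet' (FORMAT PARQUET);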

milicevica23 (Contributor, Author):

Hi @jwills, in my unscientific empirical experiment we can see that the new flow/method is faster than the previous one, where we first create a table and then export it.
I also added the conversion via an Arrow table and a pandas data frame to make it better comparable.

[benchmark results chart]

The code is here: https://github.com/milicevica23/wtf-memory/blob/main/002-basics-duckdb/export-parquet-benchmark.py

Please be aware that there can be some mistakes in terms of caching and memory swapping, but I think it is enough to show that there is no significant difference in how DuckDB retrieves the data.
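
For reference, the two paths being compared look roughly like this in plain DuckDB SQL (the query and file names are illustrative; the actual measurements come from the linked script):

-- previous path: materialize a table first, then export it
CREATE TABLE my_model AS SELECT * FROM range(100000000);
COPY my_model TO 'my_model_old.parquet' (FORMAT PARQUET);

-- new path: COPY the query result straight to the file, with no intermediate table
COPY (SELECT * FROM range(100000000)) TO 'my_model_new.parquet' (FORMAT PARQUET);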

milicevica23 (Contributor, Author):

I also did a bit of experimenting with memory usage to see if I can visualize it; the new c) variant is better because there is no intermediate materialization.

version b): [memory usage chart]

version c): [memory usage chart]

For the other variants, see https://github.com/milicevica23/wtf-memory/tree/main/002-basics-duckdb/parquet-export

dbt/adapters/duckdb/plugins/postgres.py (review thread, resolved)
dbt/adapters/duckdb/plugins/sqlalchemy.py (review thread, resolved)