Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handling views vs. tables #1

Open
ian-whitestone opened this issue Jul 20, 2022 · 2 comments
Open

Handling views vs. tables #1

ian-whitestone opened this issue Jul 20, 2022 · 2 comments

Comments

@ian-whitestone
Copy link

ian-whitestone commented Jul 20, 2022

First off, thanks for putting this together! Great stuff.

In stg_snowflake_query_tables.sql you are getting all the tables accessed by a query by looking at the direct_objects_accessed column in snowflake.account_usage.access_history. I was reading through these docs to better understand the difference between direct_objects_accessed and base_objects_accessed.

Let's say you have the following query:

SELECT * FROM my_view

And my_view is defined as CREATE VIEW my_view AS ( SELECT * FROM my_table WHERE foo = 'bar'). Based on this example in the docs, it sounds like:

  • my_view will appear in direct objects accessed
  • my_table will appear in base objects accessed

Note, probably worth validating this is actually the case. Let's assume it is for now.

As a result, querying stg_snowflake_query_tables could lead some users to think that my_table is not being used, since the only thing getting queried is my_view.

Could be worth throwing a caveat in the README.md (again assuming this is true). As a fix, you could look at modifying the code to check both base_objects and direct_objects, then do a union and distinct across all those (since in most cases, the same table will appear in both direct and base), or keep both around and add some boolean like is_directly_accessed and is_indirectly_accessed -- in base but not in direct.

@jaysobel
Copy link
Owner

Thanks for raising this issue. I originally made the decision to use direct and ignore base because I wanted to align the notion of "object" here with dbt's notion of "model". In the context of dbt I generally don't distinguish between views and tables.

To get at indirect usage, I have a model (not in this repo, yet) derived from the dbt manifest called node_descendents (expanded from the child_map), and I left join onto that, and aggregate both 'direct usage' and 'downstream usage'.

Is there another case where knowing about underlying views (inside or outside of dbt) would be valuable? I've considered making the table more polymorphic- like the modified_objects could be rows (though there seems to always be exactly 1) with an action label: modified, accessed, indirectly_accessed.

@ian-whitestone
Copy link
Author

👍

Is there another case where knowing about underlying views (inside or outside of dbt) would be valuable?

my current use case is trying to find tables I can remove to save storage costs. for that, i need to know what underlying tables (the things that incur the storage costs) are being used, either directly or indirectly through a view.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants