
fix(ingest/snowflake): add additional fallback logic for very large schemas #10440

Open · wants to merge 35 commits into master
Conversation

shcd-garjo3

Checklist

#9698

  • [x] The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • [x] Links to related issues (if applicable)
  • [ ] Tests for the changes have been added/updated (if applicable)
  • [ ] Docs related to the changes have been added/updated (if applicable). If a new feature has been added, a Usage Guide has been added for it.
  • [ ] For any breaking change/potential downtime/deprecation/big changes, an entry has been made in Updating DataHub

github-actions bot added the ingestion and community-contribution labels on May 6, 2024
@shcd-garjo3 (Author)

Added additional fallback logic for the case where a SHOW VIEWS command fails on a schema with more than 10,000 objects. The fallback calls a new method that fetches the views in groups keyed by the first 5 characters of the view name, so each result set should contain fewer than 10,000 rows.

views: List[SnowflakeView] = []
# Get a grouping of view names by substring first
cur = self.query(SnowflakeQuery.get_views_by_name_substr(schema_name, db_name))
for row in cur.fetchall():
    ...  # build a SnowflakeView from each row (body elided in the diff)
Contributor

I think it would be better to use cur.fetchmany instead of fetchall. Then you don't need all the starts_with tricks, and it only requires a minimal code change: https://docs.sqlalchemy.org/en/20/core/connections.html#sqlalchemy.engine.CursorResult.fetchmany

Would you like to/can you update the PR, or would you prefer us to fix it?

Author

Hi Tamas, wouldn't we still have to deal with the fact that Snowflake will fail on schemas with more than 10,000 items?

Author

But if this is really a Python-related improvement, I think it would be great. I am not sure.

Contributor

Fetchall tries to fetch all the records at once, while fetchmany uses a cursor and paginates over the result.
I think fetchmany should not be affected by the 10k limit.
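
A minimal sketch of that batched-fetch pattern, assuming self.query returns a DB-API/SQLAlchemy-style cursor (the helper name and batch size are illustrative; whether this avoids Snowflake's server-side 10k cap on SHOW commands is exactly the open question above):

from typing import Iterator

FETCH_BATCH_SIZE = 1000  # illustrative; well under the server cap

def iter_rows(cur) -> Iterator[tuple]:
    # fetchmany pulls rows in batches through the cursor instead of
    # materializing the entire result set in memory at once
    while True:
        rows = cur.fetchmany(FETCH_BATCH_SIZE)
        if not rows:
            break
        yield from rows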

Author

Okay, that sounds great. I am totally fine if you make the change. Thanks!

Author

Sorry, now that I think about it, I know you may be very busy, so I can make the change if you want me to. I could test it with the ingestion I am running in my dev environment anyway. Please let me know.

db_clause = f'"{db_name}".' if db_name is not None else ""
return f"""show views in schema {db_clause}"{schema_name}";"""
starts_with_clause = f' starts with "{starts_with}"' if starts_with is not None else ""
return f"""show views in schema {db_clause}"{schema_name}" {starts_with_clause};"""
@hsheth2 (Collaborator) · May 7, 2024

The queries we send should be show views limit 10000, and then show views limit 10000 from 'previous_last_entry' for every subsequent call.
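
A sketch of the query shape this suggests, in the style of the snippet above (names are illustrative; Snowflake's SHOW grammar places the pagination cursor in a limit <rows> from '<name_string>' clause):

from typing import Optional

def show_views_query(db_name: str, schema_name: str, from_marker: Optional[str] = None) -> str:
    # from '<name_string>' resumes the listing lexicographically at the
    # marker, so repeated 10k-row calls can walk the full schema
    from_clause = f" from '{from_marker}'" if from_marker is not None else ""
    return f"""show views in schema "{db_name}"."{schema_name}" limit 10000{from_clause};"""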

Author

I've updated the fallback logic and am now sending show views ... from '<pagination_marker>', where the pagination_marker is a truncated name of the lowest-bound view for each 10,000-row page.
For example: if that view is named FOO_BAR_BAZ, I truncate it to FOO_BAR_BA to guarantee that this "view marker" has a lower lexicographic value.
All subsequent show views statements then use the next truncated view marker.
Should I create a new PR?
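
A minimal sketch of the marker loop described above, reusing the hypothetical show_views_query helper sketched earlier (the row shape, the 10,000 page size, and the dedup strategy are assumptions):

from typing import Dict, List, Optional, Set

PAGE_SIZE = 10000  # Snowflake's SHOW VIEWS row cap

views: List[Dict] = []
seen: Set[str] = set()
marker: Optional[str] = None
while True:
    cur = self.query(show_views_query(db_name, schema_name, marker))
    batch = cur.fetchall()
    for row in batch:
        if row["name"] not in seen:  # truncated markers re-fetch the boundary row
            seen.add(row["name"])
            views.append(row)
    if len(batch) < PAGE_SIZE:
        break  # a short page means the listing is exhausted
    # drop the last character so the marker sorts strictly before the
    # boundary view's name and that view cannot be skipped
    marker = batch[-1]["name"][:-1]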

Labels
community-contribution: PR or Issue raised by member(s) of DataHub Community
ingestion: PR or Issue related to the ingestion of metadata
Projects
None yet
Development
Successfully merging this pull request may close these issues: None yet

3 participants