-
Notifications
You must be signed in to change notification settings - Fork 2.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix(ingest/snowflake): add additional fallback logic for very large schemas #10440
base: master
Are you sure you want to change the base?
fix(ingest/snowflake): add additional fallback logic for very large schemas #10440
Conversation
…th 10K+ views. Added additional fallback logic
Added additional fallback logic in the event a show_views command fails on a schema with more than 10,000 objects. The fallback logic will call a new method that returns a list of schemas broken down by the first 5 characters of the name. This should produce a result set that is less than 10,000 rows. |
views: List[SnowflakeView] = [] | ||
# Get a grouping of schema names by substring first | ||
cur = self.query(SnowflakeQuery.get_views_by_name_substr(schema_name, db_name)) | ||
for row in cur.fetchall(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it would be better to use cur. fetchmany
instead of fetchall
. Then, you don't need all the start_with tricks, and it only requires a minimal code change. -> https://docs.sqlalchemy.org/en/20/core/connections.html#sqlalchemy.engine.CursorResult.fetchmany
Would you like to/can you update the PR, or would you prefer us to fix it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi Tamas, wouldn't we still have to deal with the fact that Snowflake will fail on schemas with more than 10,000 items?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But, if this is really a python related improvement, I think it would be great. I am not sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fetchall tries to fetch all the records at once, while fetchmany uses a cursor and paginates over the result.
I think fetchmany should not be affected by the 10k limit.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, that sounds great. I am totally fine if you make the change. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, now that I think about it, I know you may be very busy. I can make the change if you want me to.
I could test with the ingestion I am running in my dev environment anyway. Please let me know.
db_clause = f'"{db_name}".' if db_name is not None else "" | ||
return f"""show views in schema {db_clause}"{schema_name}";""" | ||
starts_with_clause = f' starts with "{starts_with}"' if starts_with is not None else "" | ||
return f"""show views in schema {db_clause}"{schema_name}" {starts_with_clause};""" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the queries we send should be show views limit 10000
and then show views from 'previous_last_entry' limit 10000
for every subsequent call
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've updated the fallback logic and am now sending a show views from <pagination_marker> wherein this pagination_marker is a truncated view name of the lowest bound view for a 10,000 row limit.
For example: if the first view is named FOO_BAR_BAZ, I will truncate to FOO_BAR_BA so as to guarantee that this "View Marker" has a lower lexicographic value.
Then all subsequent show view statements will use the next truncated view marker.
Should I create a new PR?
Checklist
#9698