
fix(ingest/snowflake): add additional fallback logic for very large schemas #10440

Open · wants to merge 35 commits into master
Conversation

shcd-garjo3

Checklist

#9698

  • [x] The PR conforms to DataHub's Contributing Guideline (particularly Commit Message Format)
  • [x] Links to related issues (if applicable)
  • [ ] Tests for the changes have been added/updated (if applicable)
  • [ ] Docs related to the changes have been added/updated (if applicable). If a new feature has been added, a Usage Guide has been added for it.
  • [ ] For any breaking change/potential downtime/deprecation/big changes, an entry has been made in Updating DataHub

github-actions bot added the ingestion and community-contribution labels on May 6, 2024
@shcd-garjo3 (Author)

Added additional fallback logic for the case where a SHOW VIEWS command fails on a schema with more than 10,000 objects. The fallback calls a new method that fetches the views in groups keyed by the first 5 characters of the view name, so each result set should contain fewer than 10,000 rows.

views: List[SnowflakeView] = []
# Get a grouping of view names by substring first
cur = self.query(SnowflakeQuery.get_views_by_name_substr(schema_name, db_name))
for row in cur.fetchall():
    ...  # build a SnowflakeView from each row (body elided in the diff)
Contributor

I think it would be better to use cur.fetchmany instead of fetchall. Then you don't need all the starts_with tricks, and it only requires a minimal code change: https://docs.sqlalchemy.org/en/20/core/connections.html#sqlalchemy.engine.CursorResult.fetchmany

Would you like to/can you update the PR, or would you prefer us to fix it?

Author

Hi Tamas, wouldn't we still have to deal with the fact that Snowflake will fail on schemas with more than 10,000 items?

Author

But if this is really a Python-related improvement, I think it would be great. I am not sure.

Contributor

Fetchall tries to fetch all the records at once, while fetchmany uses a cursor and paginates over the result.
I think fetchmany should not be affected by the 10k limit.
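
A minimal sketch of that batched-fetch pattern, assuming self.query returns a DB-API/SQLAlchemy-style cursor (the helper name and batch size are illustrative; whether this avoids Snowflake's server-side 10k cap on SHOW commands is exactly the open question above):

from typing import Iterator

FETCH_BATCH_SIZE = 1000  # illustrative; well under the server cap

def iter_rows(cur) -> Iterator[tuple]:
    # fetchmany pulls rows in batches through the cursor instead of
    # materializing the entire result set in memory at once
    while True:
        rows = cur.fetchmany(FETCH_BATCH_SIZE)
        if not rows:
            break
        yield from rows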

Author

Okay, that sounds great. I am totally fine if you make the change. Thanks!

Author

Sorry, now that I think about it, I know you may be very busy, so I can make the change if you want me to. I could test it with the ingestion I am running in my dev environment anyway. Please let me know.

db_clause = f'"{db_name}".' if db_name is not None else ""
return f"""show views in schema {db_clause}"{schema_name}";"""
starts_with_clause = f' starts with "{starts_with}"' if starts_with is not None else ""
return f"""show views in schema {db_clause}"{schema_name}" {starts_with_clause};"""
@hsheth2 (Collaborator) · May 7, 2024

The queries we send should be show views limit 10000, and then show views limit 10000 from 'previous_last_entry' for every subsequent call.
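
A sketch of the query shape this suggests, in the style of the snippet above (names are illustrative; Snowflake's SHOW grammar places the pagination cursor in a limit <rows> from '<name_string>' clause):

from typing import Optional

def show_views_query(db_name: str, schema_name: str, from_marker: Optional[str] = None) -> str:
    # from '<name_string>' resumes the listing lexicographically at the
    # marker, so repeated 10k-row calls can walk the full schema
    from_clause = f" from '{from_marker}'" if from_marker is not None else ""
    return f"""show views in schema "{db_name}"."{schema_name}" limit 10000{from_clause};"""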

Author

I've updated the fallback logic and am now sending show views ... from '<pagination_marker>', where the pagination_marker is a truncated name of the lowest-bound view for each 10,000-row page.
For example: if that view is named FOO_BAR_BAZ, I truncate it to FOO_BAR_BA to guarantee that this "view marker" has a lower lexicographic value.
All subsequent show views statements then use the next truncated view marker.
Should I create a new PR?
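
A minimal sketch of the marker loop described above, reusing the hypothetical show_views_query helper sketched earlier (the row shape, the 10,000 page size, and the dedup strategy are assumptions):

from typing import Dict, List, Optional, Set

PAGE_SIZE = 10000  # Snowflake's SHOW VIEWS row cap

views: List[Dict] = []
seen: Set[str] = set()
marker: Optional[str] = None
while True:
    cur = self.query(show_views_query(db_name, schema_name, marker))
    batch = cur.fetchall()
    for row in batch:
        if row["name"] not in seen:  # truncated markers re-fetch the boundary row
            seen.add(row["name"])
            views.append(row)
    if len(batch) < PAGE_SIZE:
        break  # a short page means the listing is exhausted
    # drop the last character so the marker sorts strictly before the
    # boundary view's name and that view cannot be skipped
    marker = batch[-1]["name"][:-1]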

Labels
community-contribution: PR or Issue raised by member(s) of DataHub Community
ingestion: PR or Issue related to the ingestion of metadata
Projects
None yet
Development
Successfully merging this pull request may close these issues: None yet

3 participants