Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: show schemas difference when throwing InvalidSchema exception #350

Closed
wants to merge 1 commit into from
Closed

ENH: show schemas difference when throwing InvalidSchema exception #350

wants to merge 1 commit into from

Conversation

artemrys
Copy link

After merging this PR - to_gbq method will show a more detailed description of an error in case of append mode.

In particular, it will show at most 3 differences in schemas (both missing field and the same field but a different type) and will indicate how many more differences left (if any).

In #349 there was a suggestion to use set to compare schemas, but the elements have dict type, so they are not hashable. So I decided to go with a more straightforward solution and compare dataframe and BigQuery fields one by one. And I am doing this only if the dataframe schema is not a subset of a BigQuery schema to leave the number of successful path operations the same.

Do you want me to add more system-level tests to verify different exception messages or is it enough to have them covered by unit tests?

@artemrys artemrys marked this pull request as ready for review December 23, 2020 15:05
Copy link
Collaborator

@tswast tswast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the contribution! I left some feedback in the review. I'm concerned about the case where column orders don't line up.

else:
diffs_left = len(schema_difference) - 3
schema_difference = schema_difference[:3]
if diffs_left != 0:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't this always evaluate to True? Might be throwing off our branch test-coverage numbers if so.

Comment on lines +91 to +98
if len(schema_difference) < 4:
diff_to_show = "\n".join(schema_difference)
else:
diffs_left = len(schema_difference) - 3
schema_difference = schema_difference[:3]
if diffs_left != 0:
schema_difference.append("And {} more left.".format(diffs_left))
diff_to_show = "\n".join(schema_difference)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Since the diff_to_show = "\n".join(schema_difference) is duplicated, we can refactor this a bit.

Suggested change
if len(schema_difference) < 4:
diff_to_show = "\n".join(schema_difference)
else:
diffs_left = len(schema_difference) - 3
schema_difference = schema_difference[:3]
if diffs_left != 0:
schema_difference.append("And {} more left.".format(diffs_left))
diff_to_show = "\n".join(schema_difference)
if len(schema_difference) > 3:
diffs_left = len(schema_difference) - 3
schema_difference = schema_difference[:3]
schema_difference.append("And {} more left.".format(diffs_left))
diff_to_show = "\n".join(schema_difference)

"""Calculates difference in dataframe and BigQuery schemas.

Compares dataframe and BigQuery schemas to identify exact differences
in each field (field can be missing in the dataframe or field can have
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing fields in the dataframe should be OK, right? It's extra fields in the dataframe that can be a problem (unless we allow field addition, as requested here: #107)

for field_remote in fields_remote:
for field_local in fields_local:
if field_local["name"] == field_remote["name"]:
if field_local["type"] != field_remote["type"]:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might miss type mismatches if the order of the columns is different in either field_remote or field_local (or does _clean_schema_fields sort them?). I think we'd want:

  1. Loop through only fields_local (dataframe)
  2. Check if the field name isn't in fields_remote (table)
  3. Check if the field types don't match

),
(
[
{"name": "A", "type": "FLOAT"},
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we'll want some tests where the order of the columns doesn't line up. That way we can be sure we aren't showing errors that aren't actually errors.

Copy link
Collaborator

@tswast tswast left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops. Last review was meant to be request changes

@product-auto-label product-auto-label bot added the api: bigquery Issues related to the googleapis/python-bigquery-pandas API. label Jul 17, 2021
@meredithslota
Copy link
Contributor

Checking in - this PR has set for some time. Close? Revise?

@artemrys
Copy link
Author

Hey, sorry, unfortunately, I do not have time to work on this one

@tswast tswast closed this Dec 23, 2021
@artemrys artemrys deleted the feature/better-message-when-different-schemas branch December 31, 2021 14:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: bigquery Issues related to the googleapis/python-bigquery-pandas API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve InvalidSchema exception by adding specific fields that do not match
3 participants