Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement BigQuery Table Schema Update Operator #15367

Merged
merged 1 commit into from May 4, 2021
Merged

Conversation

thejens
Copy link
Contributor

@thejens thejens commented Apr 14, 2021

With this change we implement a new operator that handles patching of table schemas in bigquery.

This is needed as typing out an entire schema data structure (schema), in order to set e.g. a field description on a single field requires a lot of overhead. Also, many times the schema is not known or very complex as it may be the result of a Query or parsed automatically when importing files as tables.

This operator is useful for a workflow like:
Upstream: Create a BigQuery table as the output of a Query or import operator. Writer of job/operator knows the names of the fields, perhaps the types, but not necessarily how other schema fields are defined.

Downstream (this operator): Supply a partial schema definition that only contains field names and description values that will be patched on to the "generated by bigquery" schema from upstream.

@thejens thejens force-pushed the master branch 4 times, most recently from 0f8cca8 to 791d158 Compare April 14, 2021 12:17
airflow/providers/google/cloud/operators/bigquery.py Outdated Show resolved Hide resolved
airflow/providers/google/cloud/operators/bigquery.py Outdated Show resolved Hide resolved
airflow/providers/google/cloud/operators/bigquery.py Outdated Show resolved Hide resolved
airflow/providers/google/cloud/operators/bigquery.py Outdated Show resolved Hide resolved
airflow/providers/google/cloud/operators/bigquery.py Outdated Show resolved Hide resolved
@thejens thejens force-pushed the master branch 2 times, most recently from 630d250 to 0fb0247 Compare April 16, 2021 15:29
@github-actions
Copy link

The Workflow run is cancelling this PR. It has some failed jobs matching ^Pylint$,^Static checks,^Build docs$,^Spell check docs$,^Provider packages,^Checks: Helm tests$,^Test OpenAPI*.

@thejens
Copy link
Contributor Author

thejens commented Apr 19, 2021

@marcosmarxm
Before I change more; I believe these are the three outstanding items from your review. How strongly do you feel about these? As in - before I go ahead and modify I'm keen to present some "why's" for my decisions...

How important would you say it is to move functionality into a hook? Solving the use-case was easy using two already existing hooks and I don't think I need to create an additional hook given I don't implement any new functionality in how we talk with BigQuery

Further I think mutating a mutable object shouldn't be a problem in this case and I am not too keen to deepcopy it for the sake of it.

Lastly the naming of the schema_fields parameter, I wanted to be consistent with naming and input format of that field between this and other operators, do you think it's a problem? Changing it into updates_to_schema_fields or similar is an easy possibility.

@marcosmarxm
Copy link
Contributor

@marcosmarxm
Before I change more; I believe these are the three outstanding items from your review. How strongly do you feel about these? As in - before I go ahead and modify I'm keen to present some "why's" for my decisions...

How important would you say it is to move functionality into a hook? Solving the use-case was easy using two already existing hooks and I don't think I need to create an additional hook given I don't implement any new functionality in how we talk with BigQuery

Further I think mutating a mutable object shouldn't be a problem in this case and I am not too keen to deepcopy it for the sake of it.

Lastly the naming of the schema_fields parameter, I wanted to be consistent with naming and input format of that field between this and other operators, do you think it's a problem? Changing it into updates_to_schema_fields or similar is an easy possibility.

  1. This is my point of view, but makes more sense you create a function call update_schema in the BigQueryHook. Why? Follow the convention, all functions that interact with bigquery are in the Hook not with Operators, in the future if someone search for a method to update_schema in the hook they won't find it. I know that the function you created doesn't interact directly with bigquery but you are creating the action update_schema and this interacts. Imagine a user reading the docs: BigQuery Hook and BigQuery Operators. Maybe you can talk to a PMC/Committer to get better feedback on this topic =D
  2. It's not a problem and maybe tricky is not the best word to describe. It's just a suggestion to be more readable and easy to follow the execute flow. Again maybe a PMC/Committer can you better on this also
  3. I read other operators and it's not a problem keeping the schema_fields. It's a suggestion to think more about users because keep the same name users think they need to send the info as in other Operators? (so it's a minor in my opinion)

The CI broke because of pylint can you run pre-commit locally to organize imports?

@thejens thejens force-pushed the master branch 2 times, most recently from ee376d1 to 3f3f5b7 Compare April 23, 2021 16:51
@thejens thejens changed the title Implement BigQuery Table Schema Patch Operator Implement BigQuery Table Schema Update Operator Apr 23, 2021
@thejens
Copy link
Contributor Author

thejens commented Apr 23, 2021

@marcosmarxm
Thanks for the feedback, I broke out the functionality into a new hook and added a unit-test for it. I also changed the name of the parameter and made the recursive function not rely on mutability.

@thejens thejens force-pushed the master branch 2 times, most recently from b059345 to bd9af54 Compare April 23, 2021 17:44
@thejens thejens force-pushed the master branch 2 times, most recently from 22a7093 to 7e128e2 Compare April 27, 2021 07:53
@github-actions
Copy link

The Workflow run is cancelling this PR. Building images for the PR has failed. Follow the workflow link to check the reason.

Copy link
Member

@potiuk potiuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice!

@potiuk potiuk merged commit cf6324e into apache:master May 4, 2021
@potiuk
Copy link
Member

potiuk commented May 4, 2021

Thanks for the thorough reviews @marcosmarxm and @tswast. Great jiob @thejens !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:providers kind:documentation provider:google Google (including GCP) related issues
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants