fix: Handle nested fields from BigQuery source when getting default column_names #522

ivanmkc · 2021-07-01T20:06:31Z

Fixes 'ds.column_names doesn't contain nested fields in BigQuery table' issue.

According to the product team, only "leaf node" fields should be passed to AutoML tabular.

Added unit test for it.

Fixes b/191864144 🦕
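For readers following along, here is a minimal sketch of what "leaf node" fields means for a nested schema. It is not the code in this PR: `Field` is a hypothetical stand-in that only mirrors the `name` and `fields` attributes of a BigQuery schema field, and joining parent and child names with a dot is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Field:
    """Hypothetical stand-in for a BigQuery schema field (a name plus child fields)."""

    name: str
    fields: List["Field"] = field(default_factory=list)


def leaf_field_names(schema_field: Field) -> Set[str]:
    """Return the names of all leaf fields under schema_field.

    Assumption: nested names are joined with "." (parent.child).
    """
    if not schema_field.fields:
        # No children, so this field is itself a leaf.
        return {schema_field.name}
    names: Set[str] = set()
    for child in schema_field.fields:
        for child_name in leaf_field_names(child):
            names.add(f"{schema_field.name}.{child_name}")
    return names


schema = [
    Field("column_1"),
    Field("column_2"),
    Field(
        "column_3",
        [Field("nested_3_1"), Field("nested_3_2", [Field("nested_3_2_1")])],
    ),
]

# Union the leaf names of every top-level field.
leaves = set().union(*(leaf_field_names(f) for f in schema))
```

Note that the record fields themselves (`column_3` and `column_3.nested_3_2`) do not appear in `leaves`; only fields without children do.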


geraint0923

Thanks for fixing this!


sasha-gitg · 2021-07-02T15:45:11Z

google/cloud/aiplatform/datasets/tabular_dataset.py

@@ -40,7 +42,7 @@ class TabularDataset(datasets._Dataset):
    )

    @property
-    def column_names(self) -> List[str]:


Changing the return type introduces a breaking change. Perhaps convert back to a list after deduping with a set. What is the scenario where the same column name is populated more than once? I'm assuming that's the motivation for using a set. I'm trying to understand whether this is worth introducing a breaking change.

The set comparison is trivial because col_names are unique and order is not a factor.

List implies an order which is not relevant for column names and makes the unit tests a tiny (very tiny) bit more complicated to write.

I see your point about a breaking change. What do you recommend?

I agree with the motivation to change the return type to a Set. Let's do the following:

1. Leave the return type as List to avoid the breaking change.
2. Proceed with this PR mainly as is.
3. Open a ticket to track the return type change to Set.
4. Tentatively plan to implement the return type change when we get closer to a larger breaking change and major version rev.

Agreed on all points.

Tracked in b/193044977

I kept the private method return types as Set as I assume that it's acceptable to make breaking changes to private methods.
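A rough sketch of the split described here, with a private helper that returns a Set and a public property that still returns a List. `TabularDatasetSketch`, `_unique_column_names`, and `raw_names` are hypothetical names for illustration only, not the actual names in `tabular_dataset.py`.

```python
from typing import List, Set


class TabularDatasetSketch:
    """Hypothetical stand-in, not the real TabularDataset class."""

    def __init__(self, raw_names: List[str]):
        self._raw_names = raw_names

    def _unique_column_names(self) -> Set[str]:
        # Private helper: free to return a Set, since callers outside
        # the class are not expected to depend on it.
        return set(self._raw_names)

    @property
    def column_names(self) -> List[str]:
        # Public API: keep returning a List so existing callers do not break.
        # The order of the resulting list is not guaranteed.
        return list(self._unique_column_names())


ds = TabularDatasetSketch(["column_1", "column_2", "column_1"])
names = ds.column_names
```

With this shape, code that indexes or concatenates `column_names` keeps working, while the deduplication stays an internal detail.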

@sasha-gitg I've made the changes.


ivanmkc · 2021-07-07T17:55:21Z

tests/unit/aiplatform/test_datasets.py

@@ -1045,7 +1098,16 @@ def test_tabular_dataset_column_name_bq_with_creds(self, bq_client_mock):
    def test_tabular_dataset_column_name_bigquery(self):
        my_dataset = datasets.TabularDataset(dataset_name=_TEST_NAME)

-        assert my_dataset.column_names == ["column_1", "column_2"]
+        assert my_dataset.column_names == set(


@sasha-gitg Using a set means that when I write a unit-test, I don't have to know about the implementation details on how the list is ordered.

There are workarounds, but using a set seems cleanest.
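For context, one possible workaround when the property returns a List is an order-agnostic comparison in the test. The helper below is a hypothetical sketch, not part of this PR's test suite.

```python
from typing import Iterable, List


def assert_same_columns(actual: List[str], expected: Iterable[str]) -> None:
    """Assert that two column collections match, ignoring order."""
    expected = list(expected)
    # The length check catches duplicate names that set() alone would hide.
    assert len(actual) == len(expected), f"{actual!r} != {expected!r}"
    assert set(actual) == set(expected), f"{actual!r} != {expected!r}"


# Passes regardless of the order in which the names are returned.
assert_same_columns(["column_2", "column_1"], ["column_1", "column_2"])
```

The standard library's `unittest.TestCase.assertCountEqual` performs a similar order-insensitive check for tests written with `unittest`.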


Handle nested fields from BigQuery source

d1a1c4d

ivanmkc requested a review from a team as a code owner July 1, 2021 20:06

product-auto-label bot added the "api: aiplatform" label (Issues related to the AI Platform API.) Jul 1, 2021

google-cla bot added the "cla: yes" label (This human has signed the Contributor License Agreement.) Jul 1, 2021

ivanmkc requested a review from sirtorry July 1, 2021 20:06

ivanmkc changed the title ~~Handle nested fields from BigQuery source when getting default column_names~~ [WIP] Handle nested fields from BigQuery source when getting default column_names Jul 1, 2021

ivanmkc changed the title ~~[WIP] Handle nested fields from BigQuery source when getting default column_names~~ Handle nested fields from BigQuery source when getting default column_names Jul 1, 2021

ivanmkc requested a review from tswast July 1, 2021 21:21

Added unit test for nested BigQuery fields and refactored column_names to return a Set instead of a List

e581093

ivanmkc force-pushed the imkc--tabular-default-bq-nested-columns branch from 549357e to e581093 Compare July 1, 2021 21:30

ivanmkc changed the title ~~Handle nested fields from BigQuery source when getting default column_names~~ fix: Handle nested fields from BigQuery source when getting default column_names Jul 1, 2021

Added comment

dec0f8c

ivanmkc requested a review from sasha-gitg July 1, 2021 21:52

geraint0923 approved these changes Jul 1, 2021


sasha-gitg requested changes Jul 2, 2021


Fixed minor issues with tabular_dataset

49d6eb6

ivanmkc commented Jul 7, 2021


Switched TabularDataset.column_names back to returning a List as to not introduce a breaking change at this time

cc90645

ivanmkc force-pushed the imkc--tabular-default-bq-nested-columns branch from 40c6a30 to cc90645 Compare July 7, 2021 21:42

sasha-gitg approved these changes Jul 8, 2021


ivanmkc merged commit 3fc1d44 into googleapis:master Jul 8, 2021

sirtorry mentioned this pull request Jul 29, 2021

ds.column_names doesn't contain nested fields in BigQuery table #504

Closed
