
feat(bigquery): unit and system test for dataframe with int column with NaN values #39

Merged
merged 12 commits on May 13, 2020
Changes from 2 commits
60 changes: 60 additions & 0 deletions tests/system.py
@@ -736,6 +736,66 @@ def test_load_table_from_dataframe_w_automatic_schema(self):
        )
        self.assertEqual(table.num_rows, 3)

    @unittest.skipIf(
        pandas is None or pandas.__version__ < "1.0.1",
Contributor:

Thought - I think comparing plain version strings will actually work in this particular case, but it doesn't generalize (e.g. if somebody copies this pattern over to another test, or if we have to change the version for some reason).

Let's instead add a library like packaging as a test dependency, and use that for parsing and semantically comparing versions?
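
For illustration, a minimal sketch of that suggestion, assuming `packaging` is added as a test dependency (the helper name is hypothetical, not from this PR):

```python
import unittest

import packaging.version

try:
    import pandas
except ImportError:  # pandas is an optional dependency of the library
    pandas = None

_MIN_PANDAS = packaging.version.parse("1.0.1")


def _pandas_too_old():
    # Semantic comparison: parse() knows that "1.10.0" > "1.9.0", whereas a
    # plain string comparison would get that case backwards.
    return pandas is None or packaging.version.parse(pandas.__version__) < _MIN_PANDAS


@unittest.skipIf(_pandas_too_old(), "Requires `pandas` >= 1.0.1")
def test_example():
    ...
```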

Contributor:

BTW, why pandas version 1.0.1+? The code sample from the issue description also seems to work in pandas 1.0.0.

Contributor Author:

Agreed, but I need a suggestion on the version comparison: should I use `parse(pandas.__version__) < parse("1.0.0")`, should I raise the minimum required pandas version in the setup.py file and skip the test on PY2, or something else?

@plamut (Contributor) Feb 20, 2020:

Since the issue was due to a bug in Pandas, it's probably best to directly compare the version of Pandas rather than a proxy value such as the Python version.

Re: bumping the version - if the new Pandas version is still compatible with Python 2.7, we can bump it (we have tests...), otherwise we will have to wait until April or so when we officially drop Python 2 support (IIRC).

Contributor Author:

The new version of pandas isn't compatible with Python 2.7; it requires Python >=3.6.1.

Contributor:

I see. We could still conditionally bump the pandas version for Python 3.5+ in setup.py, though, to at least guarantee correct behavior in Python 3, even though we cannot fix the bug in Python 2.7 due to pandas incompatibility.

@tswast What do you think?

P.S.: If there is an efficient way of detecting whether a column contains a NaN value, we could also emit a user warning if too old a pandas version is installed... just a thought.
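
A rough sketch of both ideas, under stated assumptions. First, the conditional bump via PEP 508 environment markers in setup.py (the pins shown are illustrative, not the repository's actual ones; pandas 1.0 itself requires Python >= 3.6.1):

```python
# Hypothetical setup.py excerpt: the newer pandas pin is gated on the
# interpreter version, since pandas >= 1.0 does not support Python 2.7.
extras_require = {
    "pandas": [
        'pandas >= 0.23.0, < 1.0.0; python_full_version < "3.6.1"',
        'pandas >= 1.0.1; python_full_version >= "3.6.1"',
    ],
}
```

And the warning idea, using `Series.isna` for missing-value detection (the helper and message are hypothetical):

```python
import warnings

import packaging.version
import pandas


def _warn_if_nullable_ints_unsupported(dataframe):
    # Hypothetical helper: warn when a nullable-Int64 column contains missing
    # values but the installed pandas predates the fix.
    if packaging.version.parse(pandas.__version__) >= packaging.version.parse("1.0.1"):
        return
    for name, column in dataframe.items():
        if str(column.dtype) == "Int64" and column.isna().any():
            warnings.warn(
                "Column {!r} contains missing values; pandas < 1.0.1 may "
                "serialize them incorrectly.".format(name)
            )
```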

Contributor:

@tswast bump

Contributor:

We have setuptools as a required dependency, so why aren't you using pkg_resources?

https://github.com/pydata/pandas-gbq/blob/97e9a9ee3a8f65ede29f25954b1af3f3bc606f58/pandas_gbq/gbq.py#L46
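
For reference, a short sketch of that approach, mirroring the linked pandas-gbq pattern (the variable names are illustrative):

```python
import pkg_resources

# pkg_resources ships with setuptools, which is already a required
# dependency, so this adds no new test requirement.
PANDAS_MIN_VERSION = pkg_resources.parse_version("1.0.1")

try:
    PANDAS_INSTALLED_VERSION = pkg_resources.get_distribution("pandas").parsed_version
except pkg_resources.DistributionNotFound:
    PANDAS_INSTALLED_VERSION = None

pandas_too_old = (
    PANDAS_INSTALLED_VERSION is None or PANDAS_INSTALLED_VERSION < PANDAS_MIN_VERSION
)
```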

Contributor:

A lot of places (including Google) are pretty slow to update pandas, so I'd hesitate to require 1.0+ just yet.

"Only `pandas version >=1.0.1` is supported",
)
@unittest.skipIf(pyarrow is None, "Requires `pyarrow`")
def test_load_table_from_dataframe_w_none_value(self):
Contributor:

The nullable Int64 data type is the real reason for the test. There are plenty of existing tests & samples that load None values with object dtype.
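
To illustrate the distinction being drawn, a standalone pandas sketch (not part of the diff):

```python
import pandas

# object dtype: None is stored as a plain Python object; no integer semantics.
obj_col = pandas.Series([1, 2, None, 4], dtype="object")
print(obj_col.dtype)            # object

# Nullable Int64 extension dtype: integer semantics are kept and the missing
# entry becomes pandas.NA, which is what this test exercises.
int_col = pandas.Series([1, 2, None, 4], dtype="Int64")
print(int_col.dtype)            # Int64
print(int_col.isna().tolist())  # [False, False, True, False]
```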

"""Test that a DataFrame containing None-type values can be uploaded
if a BigQuery schema is specified.

https://github.com/googleapis/python-bigquery/issues/22
"""

dataset_id = _make_dataset_id("bq_load_test")
self.temp_dataset(dataset_id)
table_id = "{}.{}.load_table_from_dataframe_w_none_value".format(
Config.CLIENT.project, dataset_id
)
table_schema = (bigquery.SchemaField("x", "INTEGER", mode="NULLABLE"),)
table = retry_403(Config.CLIENT.create_table)(
Table(table_id, schema=table_schema)
)
self.to_delete.insert(0, table)

df_data = collections.OrderedDict(
[("x", pandas.Series([1, 2, None, 4], dtype="Int64"))]
)
dataframe = pandas.DataFrame(df_data, columns=df_data.keys())
load_job = Config.CLIENT.load_table_from_dataframe(dataframe, table_id)
load_job.result()
table = Config.CLIENT.get_table(table_id)
self.assertEqual(tuple(table.schema), (bigquery.SchemaField("x", "INTEGER"),))
self.assertEqual(table.num_rows, 4)

@unittest.skipIf(
pandas is None or pandas.__version__ < "1.0.1",
"Only `pandas version >=1.0.1` is supported",
)
@unittest.skipIf(pyarrow is None, "Requires `pyarrow`")
def test_load_table_from_dataframe_w_automatic_schema_w_none_value(self):
"""Test that a DataFrame containing None-type values can be uploaded
without specifying a schema.

https://github.com/googleapis/python-bigquery/issues/22
"""

dataset_id = _make_dataset_id("bq_load_test")
self.temp_dataset(dataset_id)
table_id = "{}.{}.load_table_from_dataframe_w_automatic_schema_w_none_value".format(
Config.CLIENT.project, dataset_id
)
df_data = collections.OrderedDict(
[("x", pandas.Series([1, 2, None, 4], dtype="Int64"))]
)
dataframe = pandas.DataFrame(df_data, columns=df_data.keys())
load_job = Config.CLIENT.load_table_from_dataframe(dataframe, table_id)
load_job.result()
table = Config.CLIENT.get_table(table_id)
self.assertEqual(tuple(table.schema), (bigquery.SchemaField("x", "INTEGER"),))
self.assertEqual(table.num_rows, 4)

@unittest.skipIf(pandas is None, "Requires `pandas`")
@unittest.skipIf(pyarrow is None, "Requires `pyarrow`")
def test_load_table_from_dataframe_w_nulls(self):
92 changes: 92 additions & 0 deletions tests/unit/test_client.py
@@ -6631,6 +6631,98 @@ def test_load_table_from_dataframe_no_schema_warning_wo_pyarrow(self):
        ]
        assert matches, "A missing schema deprecation warning was not raised."

    @unittest.skipIf(
        pandas is None or pandas.__version__ < "1.0.1",
        "Only `pandas version >=1.0.1` is supported",
    )
    @unittest.skipIf(pyarrow is None, "Requires `pyarrow`")
    def test_load_table_from_dataframe_w_none_value(self):
        from google.cloud.bigquery.client import _DEFAULT_NUM_RETRIES
        from google.cloud.bigquery import job
        from google.cloud.bigquery.schema import SchemaField

        client = self._make_client()
        dataframe = pandas.DataFrame({"x": [1, 2, None, 4]}, dtype="Int64")
        load_patch = mock.patch(
            "google.cloud.bigquery.client.Client.load_table_from_file", autospec=True
        )

        get_table_patch = mock.patch(
            "google.cloud.bigquery.client.Client.get_table",
            autospec=True,
            return_value=mock.Mock(schema=[SchemaField("x", "INT64", "NULLABLE")]),
        )

        with load_patch as load_table_from_file, get_table_patch:
            client.load_table_from_dataframe(
                dataframe, self.TABLE_REF, location=self.LOCATION
            )

        load_table_from_file.assert_called_once_with(
            client,
            mock.ANY,
            self.TABLE_REF,
            num_retries=_DEFAULT_NUM_RETRIES,
            rewind=True,
            job_id=mock.ANY,
            job_id_prefix=None,
            location=self.LOCATION,
            project=None,
            job_config=mock.ANY,
        )

        sent_config = load_table_from_file.mock_calls[0][2]["job_config"]
        assert sent_config.source_format == job.SourceFormat.PARQUET
        assert tuple(sent_config.schema) == (
            SchemaField("x", "INT64", "NULLABLE", None),
        )

    @unittest.skipIf(
        pandas is None or pandas.__version__ < "1.0.1",
        "Only `pandas version >=1.0.1` is supported",
    )
    @unittest.skipIf(pyarrow is None, "Requires `pyarrow`")
    def test_load_table_from_dataframe_w_automatic_schema_w_none_value(self):
        from google.cloud.bigquery.client import _DEFAULT_NUM_RETRIES
        from google.cloud.bigquery import job
        from google.cloud.bigquery.schema import SchemaField

        client = self._make_client()
        dataframe = pandas.DataFrame({"x": [1, 2, None, 4]}, dtype="Int64")
        load_patch = mock.patch(
            "google.cloud.bigquery.client.Client.load_table_from_file", autospec=True
        )

        get_table_patch = mock.patch(
            "google.cloud.bigquery.client.Client.get_table",
            autospec=True,
            side_effect=google.api_core.exceptions.NotFound("Table not found"),
        )

        with load_patch as load_table_from_file, get_table_patch:
            client.load_table_from_dataframe(
                dataframe, self.TABLE_REF, location=self.LOCATION
            )

        load_table_from_file.assert_called_once_with(
            client,
            mock.ANY,
            self.TABLE_REF,
            num_retries=_DEFAULT_NUM_RETRIES,
            rewind=True,
            job_id=mock.ANY,
            job_id_prefix=None,
            location=self.LOCATION,
            project=None,
            job_config=mock.ANY,
        )

        sent_config = load_table_from_file.mock_calls[0][2]["job_config"]
        assert sent_config.source_format == job.SourceFormat.PARQUET
        assert tuple(sent_config.schema) == (
            SchemaField("x", "INT64", "NULLABLE", None),
        )

    @unittest.skipIf(pandas is None, "Requires `pandas`")
    @unittest.skipIf(pyarrow is None, "Requires `pyarrow`")
    def test_load_table_from_dataframe_struct_fields_error(self):