New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(bigquery): unit and system test for dataframe with int column with Nan values #39
Changes from 1 commit
b620f85
b906717
626cc8c
a1be872
b3a9781
a3547de
62eb1ad
db5fdc4
95ddeb8
734f895
d9cd2f1
66cf1fe
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change | ||||
---|---|---|---|---|---|---|
|
@@ -736,6 +736,65 @@ def test_load_table_from_dataframe_w_automatic_schema(self): | |||||
) | ||||||
self.assertEqual(table.num_rows, 3) | ||||||
|
||||||
@unittest.skipIf( | ||||||
pandas is None or pandas.__version__ < "1.0.1", | ||||||
"Only `pandas version >=1.0.1` supports", | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Suggested change
Seems like more correct? (disclaimer: not a native speaker) |
||||||
) | ||||||
@unittest.skipIf(pyarrow is None, "Requires `pyarrow`") | ||||||
def test_load_table_from_dataframe_w_none_value(self): | ||||||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. The nullable |
||||||
"""Test that a DataFrame with dtypes that map well to BigQuery types | ||||||
can be uploaded without specifying a schema. | ||||||
|
||||||
https://github.com/googleapis/python-bigquery/issues/22 | ||||||
""" | ||||||
df_data = collections.OrderedDict( | ||||||
[("x", pandas.Series([1, 2, None, 4], dtype="Int64"))] | ||||||
) | ||||||
dataframe = pandas.DataFrame(df_data, columns=df_data.keys()) | ||||||
table_schema = (bigquery.SchemaField("x", "INTEGER", mode="NULLABLE"),) | ||||||
dataset_id = _make_dataset_id("bq_load_test") | ||||||
self.temp_dataset(dataset_id) | ||||||
table_id = "{}.{}.load_table_from_dataframe_w_none_value".format( | ||||||
Config.CLIENT.project, dataset_id | ||||||
) | ||||||
table = retry_403(Config.CLIENT.create_table)( | ||||||
Table(table_id, schema=table_schema) | ||||||
) | ||||||
self.to_delete.insert(0, table) | ||||||
|
||||||
load_job = Config.CLIENT.load_table_from_dataframe(dataframe, table_id) | ||||||
load_job.result() | ||||||
table = Config.CLIENT.get_table(table_id) | ||||||
self.assertEqual(tuple(table.schema), ((bigquery.SchemaField("x", "INTEGER"),))) | ||||||
self.assertEqual(table.num_rows, 4) | ||||||
|
||||||
@unittest.skipIf( | ||||||
pandas is None or pandas.__version__ < "1.0.1", | ||||||
"Only `pandas version >=1.0.1` supports", | ||||||
) | ||||||
@unittest.skipIf(pyarrow is None, "Requires `pyarrow`") | ||||||
def test_load_table_from_dataframe_w_automatic_schema_w_none_value(self): | ||||||
"""Test that a DataFrame with dtypes that map well to BigQuery types | ||||||
can be uploaded without specifying a schema. | ||||||
|
||||||
https://github.com/googleapis/python-bigquery/issues/22 | ||||||
""" | ||||||
df_data = collections.OrderedDict( | ||||||
[("x", pandas.Series([1, 2, None, 4], dtype="Int64"))] | ||||||
) | ||||||
dataframe = pandas.DataFrame(df_data, columns=df_data.keys()) | ||||||
dataset_id = _make_dataset_id("bq_load_test") | ||||||
self.temp_dataset(dataset_id) | ||||||
table_id = "{}.{}.load_table_from_dataframe_w_automatic_schema_w_none_value".format( | ||||||
Config.CLIENT.project, dataset_id | ||||||
) | ||||||
|
||||||
load_job = Config.CLIENT.load_table_from_dataframe(dataframe, table_id) | ||||||
load_job.result() | ||||||
table = Config.CLIENT.get_table(table_id) | ||||||
self.assertEqual(tuple(table.schema), ((bigquery.SchemaField("x", "INTEGER"),))) | ||||||
self.assertEqual(table.num_rows, 4) | ||||||
|
||||||
@unittest.skipIf(pandas is None, "Requires `pandas`") | ||||||
@unittest.skipIf(pyarrow is None, "Requires `pyarrow`") | ||||||
def test_load_table_from_dataframe_w_nulls(self): | ||||||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thought - I think comparing plain version strings will actually work in this particular case, but cannot be generalized (e.g. somebody copies this pattern over to another test, or if we have to change the version for some reason).
Let's instead add a library like packaging as a test dependency, and use that for parsing and semantically comparing versions?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW, why pandas version
1.0.1+
, as the code sample from the issue description also seems to works in pandas 1.0.0?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agreed, but need suggestion on version comparison like should i use
parse(pandas.__version__) < parse("1.0.0")
or should i change the pandas required minimum version in setup.py file and skip that if PY2 or something else?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since the issue was due to a bug in Pandas, it's probably best to directly compare the version of Pandas rather than a proxy value such as the Python version.
Re: bumping the version - if the new Pandas version is still compatible with Python 2.7, we can bump it (we have tests...), otherwise we will have to wait until April or so when we officially drop Python 2 support (IIRC).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New version of pandas isn't compatible with Python 2.7, it Requires: Python >=3.6.1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. We could still conditionally bump the pandas version for Python 3.5+ in
setup.py
, though, to at least guarantee correct behavior in Python 3, even though we cannot fix the bug in Python 2.7 due to pandas incompatibility.@tswast What do you think?
P.S.: If there is an efficient way of detecting whether a column contains a
NaN
value, we could also emit a user warning if too old a pandas version is installed... just a thought.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tswast bump
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We have setuptools as a required dependency, so why aren't you using
pkg_resources
?https://github.com/pydata/pandas-gbq/blob/97e9a9ee3a8f65ede29f25954b1af3f3bc606f58/pandas_gbq/gbq.py#L46
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A lot of places (including Google) are pretty slow to update pandas, so I'd hesitate to require 1.0+ just yet.