deps: require pyarrow for pandas support #314

Merged (4 commits) on Oct 12, 2020
4 changes: 0 additions & 4 deletions docs/snippets.py
@@ -26,10 +26,6 @@

import pytest

-try:
-    import fastparquet
-except (ImportError, AttributeError):
-    fastparquet = None
try:
    import pandas
except (ImportError, AttributeError):
3 changes: 1 addition & 2 deletions google/cloud/bigquery/client.py
@@ -2152,8 +2152,7 @@ def load_table_from_dataframe(
        Raises:
            ImportError:
                If a usable parquet engine cannot be found. This method
-                requires :mod:`pyarrow` or :mod:`fastparquet` to be
-                installed.
+                requires :mod:`pyarrow` to be installed.
            TypeError:
                If ``job_config`` is not an instance of :class:`~google.cloud.bigquery.job.LoadJobConfig`
                class.
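For context, a minimal usage sketch of the call this docstring documents (project, dataset, and table IDs are placeholders, not from this PR):

# Minimal usage sketch with placeholder IDs; with pyarrow installed the
# DataFrame is serialized to parquet and loaded, otherwise ImportError is raised.
import pandas
from google.cloud import bigquery

client = bigquery.Client()  # assumes default credentials are configured
dataframe = pandas.DataFrame({"name": ["alice", "bob"], "score": [1.0, 2.5]})
job = client.load_table_from_dataframe(dataframe, "my-project.my_dataset.my_table")
job.result()  # block until the load job finishes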
5 changes: 1 addition & 4 deletions noxfile.py
@@ -49,10 +49,7 @@ def default(session):
        constraints_path,
    )

-    # fastparquet is not included in .[all] because, in general, it's
-    # redundant with pyarrow. We still want to run some unit tests with
-    # fastparquet serialization, though.
-    session.install("-e", ".[all,fastparquet]", "-c", constraints_path)
+    session.install("-e", ".[all]", "-c", constraints_path)

    session.install("ipython", "-c", constraints_path)

8 changes: 0 additions & 8 deletions setup.py
@@ -53,7 +53,6 @@
"pyarrow >= 1.0.0, < 2.0dev",
],
"tqdm": ["tqdm >= 4.7.4, <5.0.0dev"],
"fastparquet": ["fastparquet", "python-snappy", "llvmlite>=0.34.0"],
Contributor:
I'd like to see us add "pyarrow" to the "pandas" extras now, since it's needed for both uploads and downloads to DataFrames.

We could also refactor the "pyarrow >= 1.0.0, < 2.0dev" pin into a variable, since it will now appear three times in setup.py.
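A sketch of what that suggestion could look like in setup.py (hypothetical; only the pyarrow pin comes from this PR, the other entries are illustrative):

# Hypothetical setup.py fragment sketching the suggestion above; only the
# pyarrow pin is taken from this PR, the surrounding entries are illustrative.
pyarrow_dep = "pyarrow >= 1.0.0, < 2.0dev"

extras = {
    "bqstorage": [pyarrow_dep],  # plus the BigQuery Storage client pin
    "pandas": ["pandas >= 0.23.0", pyarrow_dep],  # needed for uploads and downloads
    "tqdm": ["tqdm >= 4.7.4, <5.0.0dev"],
}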

"opentelemetry": [
"opentelemetry-api==0.9b0",
"opentelemetry-sdk==0.9b0",
@@ -64,13 +63,6 @@
all_extras = []

for extra in extras:
-    if extra in (
-        # Skip fastparquet from "all" because it is redundant with pyarrow and
-        # creates a dependency on pre-release versions of numpy. See:
-        # https://github.com/googleapis/google-cloud-python/issues/8549
-        "fastparquet",
-    ):
-        continue
    all_extras.extend(extras[extra])

extras["all"] = all_extras
1 change: 0 additions & 1 deletion testing/constraints-3.6.txt
@@ -1,4 +1,3 @@
-fastparquet==0.4.1
google-api-core==1.22.2
google-cloud-bigquery-storage==2.0.0
google-cloud-core==1.4.1
7 changes: 0 additions & 7 deletions tests/unit/test_client.py
@@ -32,10 +32,6 @@
import pytz
import pkg_resources

-try:
-    import fastparquet
-except (ImportError, AttributeError):  # pragma: NO COVER
-    fastparquet = None
try:
    import pandas
except (ImportError, AttributeError):  # pragma: NO COVER
@@ -7839,7 +7835,6 @@ def test_load_table_from_dataframe_unknown_table(self):
        )

    @unittest.skipIf(pandas is None, "Requires `pandas`")
-    @unittest.skipIf(fastparquet is None, "Requires `fastparquet`")
    def test_load_table_from_dataframe_no_pyarrow_warning(self):
Contributor:

I'm a bit surprised to see this test passing. I guess we still have some code that falls back to the default pandas parquet serialization?

Can you look into whether we can remove that code path?

Related: We should be able to simplify this docstring now:

    parquet_compression (Optional[str]):
        [Beta] The compression method to use if intermittently
        serializing ``dataframe`` to a parquet file.

        If ``pyarrow`` and job config schema are used, the argument
        is directly passed as the ``compression`` argument to the
        underlying ``pyarrow.parquet.write_table()`` method (the
        default value "snappy" gets converted to uppercase).
        https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html#pyarrow-parquet-write-table

        If either ``pyarrow`` or job config schema are missing, the
        argument is directly passed as the ``compression`` argument
        to the underlying ``DataFrame.to_parquet()`` method.
        https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_parquet.html#pandas.DataFrame.to_parquet
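With pyarrow a hard requirement, the ``DataFrame.to_parquet()`` fallback branch (and the second half of that docstring) should be removable. A minimal sketch of the remaining pyarrow-only path, assuming a hypothetical helper name and signature:

# Hypothetical pyarrow-only serialization helper; name and signature are
# assumptions, not the library's actual code.
import pyarrow
import pyarrow.parquet

def dataframe_to_parquet(dataframe, filepath, parquet_compression="SNAPPY"):
    """Serialize a pandas DataFrame to parquet via pyarrow, with no pandas fallback."""
    arrow_table = pyarrow.Table.from_pandas(dataframe)
    pyarrow.parquet.write_table(arrow_table, filepath, compression=parquet_compression)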

        from google.cloud.bigquery.client import PyarrowMissingWarning

@@ -7875,7 +7870,6 @@ def test_load_table_from_dataframe_no_pyarrow_warning(self):
        assert matches, "A missing pyarrow deprecation warning was not raised."

    @unittest.skipIf(pandas is None, "Requires `pandas`")
-    @unittest.skipIf(fastparquet is None, "Requires `fastparquet`")
    def test_load_table_from_dataframe_no_schema_warning_wo_pyarrow(self):
        client = self._make_client()

@@ -8182,7 +8176,6 @@ def test_load_table_from_dataframe_w_partial_schema_extra_types(self):
assert "unknown_col" in message

@unittest.skipIf(pandas is None, "Requires `pandas`")
@unittest.skipIf(fastparquet is None, "Requires `fastparquet`")
def test_load_table_from_dataframe_w_partial_schema_missing_types(self):
from google.cloud.bigquery.client import _DEFAULT_NUM_RETRIES
from google.cloud.bigquery import job