[FEATURE] Enable Expectations tests for BigQuery (#3219)
Co-authored-by: John DiMatteo <jdimatteo@verily.com>
William Shin committed Aug 18, 2021
1 parent 5b33a4c commit 2c89de6
Showing 17 changed files with 422 additions and 246 deletions.
49 changes: 49 additions & 0 deletions azure-pipelines-cloud-db-integration.yml
@@ -0,0 +1,49 @@
stages:
- stage: cloud_db_integration
pool:
vmImage: 'ubuntu-latest'

jobs:
- job: bigquery_expectations
timeoutInMinutes: 0 # Maximize the time that pipelines remain open (6 hours currently)
variables:
python.version: '3.8'

strategy:
matrix:
expectations_cfe:
test_script: 'tests/test_definitions/test_expectations_cfe.py'
expectations:
test_script: 'tests/test_definitions/test_expectations.py'
maxParallel: 1

steps:
- task: UsePythonVersion@0
inputs:
versionSpec: '$(python.version)'
displayName: 'Use Python $(python.version)'

- bash: python -m pip install --upgrade pip==20.2.4
displayName: 'Update pip'

- script: |
pip install -r requirements-dev.txt
displayName: 'Install dependencies'
- task: DownloadSecureFile@1
name: gcp_authkey
displayName: 'Download Google Service Account'
inputs:
secureFile: 'superconductive-service-acct.json'
retryCount: '2'

- script: |
pip install pytest pytest-azurepipelines
pytest -v --no-spark --no-postgresql --bigquery --napoleon-docstrings --junitxml=junit/test-results.xml --cov=. --cov-report=xml --cov-report=html --ignore=tests/cli --ignore=tests/integration/usage_statistics $(test_script)
displayName: 'pytest'
env:
GOOGLE_APPLICATION_CREDENTIALS: $(gcp_authkey.secureFilePath)
GE_TEST_BIGQUERY_PROJECT: $(GE_TEST_BIGQUERY_PROJECT)
GE_TEST_BIGQUERY_DATASET: $(GE_TEST_BIGQUERY_DATASET)
19 changes: 18 additions & 1 deletion docs/contributing/contributing_test.md
@@ -17,6 +17,23 @@ For example, you can run `pytest --no-spark --no-sqlalchemy` to skip all local b

Note: as of early 2020, the tests generate many warnings. Most of these are generated by dependencies (pandas, sqlalchemy, etc.). You can suppress them with pytest’s `--disable-pytest-warnings` flag: `pytest --no-spark --no-sqlalchemy --disable-pytest-warnings`.

#### BigQuery tests

To run BigQuery tests, first complete the following steps:

1. [Select or create a Cloud Platform project](https://console.cloud.google.com/project).
2. [Set up authentication](https://googleapis.dev/python/google-api-core/latest/auth.html).
3. In your project, [create a BigQuery dataset](https://cloud.google.com/bigquery/docs/datasets) named `test_ci` and [set the dataset default table expiration](https://cloud.google.com/bigquery/docs/updating-datasets#table-expiration) to `0.1` days, so that test tables are cleaned up automatically.

After setting up authentication, you can run the tests against your project by setting the environment variable `GE_TEST_BIGQUERY_PROJECT`, e.g.

```bash
GE_TEST_BIGQUERY_PROJECT=<YOUR_GOOGLE_CLOUD_PROJECT> \
  pytest tests/test_definitions/test_expectations_cfe.py --bigquery --no-spark --no-postgresql
```

Note that if you prefer to use a dataset other than `test_ci`, you can specify it with the environment variable `GE_TEST_BIGQUERY_DATASET`.
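As a hypothetical example, both variables can be combined in one invocation (the project and dataset values below are placeholders, not names from this repository):

```bash
GE_TEST_BIGQUERY_PROJECT=<YOUR_GOOGLE_CLOUD_PROJECT> \
GE_TEST_BIGQUERY_DATASET=<YOUR_DATASET> \
  pytest tests/test_definitions/test_expectations_cfe.py --bigquery --no-spark --no-postgresql
```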

### Writing unit and integration tests

Production code in Great Expectations must be thoroughly tested. In general, we insist on unit tests for all branches of every method, including likely error states. Most new feature contributions should include several unit tests. Contributions that modify or extend existing features should include a test of the new behavior.
@@ -25,7 +42,7 @@ Experimental code in Great Expectations need only be tested lightly. We are movi

Most of Great Expectations’ integration testing is in the CLI, which naturally exercises most of the core code paths. Because integration tests require a lot of developer time to maintain, most contributions should not include new integration tests, unless they change the CLI itself.

Note: we do not currently test Great Expectations against all types of SQL database. CI test coverage for SQL is limited to postgresql and sqlite. We have observed some bugs because of unsupported features or differences in SQL dialects, and we are actively working to improve dialect-specific support and testing.
Note: we do not currently test Great Expectations against all types of SQL database. CI test coverage for SQL is limited to postgresql, sqlite, mssql, and BigQuery. We have observed some bugs because of unsupported features or differences in SQL dialects, and we are actively working to improve dialect-specific support and testing.

### Unit tests for Expectations
One of Great Expectations’ important promises is that the same Expectation will produce the same result across all supported execution environments: pandas, sqlalchemy, and Spark.
@@ -1,3 +1,4 @@

---
title: How to implement a custom Validation Action
---
28 changes: 25 additions & 3 deletions docs_rtd/contributing/testing.rst
@@ -21,7 +21,7 @@ If you did not configure optional backends for testing, tests against these back
You can suppress these tests by adding the following flags:

- ``--no-postgresql`` will skip postgres tests
- ``--no-spark`` will skip spark tests
- ``--no-sqlalchemy`` will skip all tests using sqlalchemy (i.e. all database backends)

For example, you can run ``pytest --no-spark --no-sqlalchemy`` to skip all local backend tests (with the exception of the pandas backend). Please note that these tests will still be run by the CI as soon as you open a PR, so some tests might fail if your code changes affected them.
@@ -34,6 +34,28 @@ Note: as of early 2020, the tests generate many warnings. Most of these are gene
.. _contributing_testing__writing_unit_tests:

Running BigQuery tests
----------------------

To run BigQuery tests, first complete the following steps:

1. `Select or create a Cloud Platform project.`_
2. `Setup Authentication.`_
3. `In your project, create a BigQuery dataset named "test_ci"`_ and `set the dataset default table expiration to .1 days`_

.. _Select or create a Cloud Platform project.: https://console.cloud.google.com/project
.. _Setup Authentication.: https://googleapis.dev/python/google-api-core/latest/auth.html
.. _`In your project, create a BigQuery dataset named "test_ci"`: https://cloud.google.com/bigquery/docs/datasets
.. _`set the dataset default table expiration to .1 days`: https://cloud.google.com/bigquery/docs/updating-datasets#table-expiration

After setting up authentication, you can run the tests against your project by setting the environment variable ``GE_TEST_BIGQUERY_PROJECT``, e.g.

.. code-block:: bash

    GE_TEST_BIGQUERY_PROJECT=<YOUR_GOOGLE_CLOUD_PROJECT> pytest tests/test_definitions/test_expectations_cfe.py --bigquery --no-spark --no-postgresql -k bigquery

Note that if you prefer to use a dataset other than "test_ci", you can specify it with the environment variable ``GE_TEST_BIGQUERY_DATASET``.
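For example, a hypothetical invocation that also overrides the dataset (both values below are placeholders):

.. code-block:: bash

    GE_TEST_BIGQUERY_PROJECT=<YOUR_GOOGLE_CLOUD_PROJECT> GE_TEST_BIGQUERY_DATASET=<YOUR_DATASET> pytest tests/test_definitions/test_expectations_cfe.py --bigquery --no-spark --no-postgresql -k bigquery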

Writing unit and integration tests
----------------------------------

@@ -43,7 +65,7 @@ Experimental code in Great Expectations need only be tested lightly. We are movi

Most of Great Expectations' integration testing is in the CLI, which naturally exercises most of the core code paths. Because integration tests require a lot of developer time to maintain, most contributions should *not* include new integration tests, unless they change the CLI itself.

Note: we do not currently test Great Expectations against all types of SQL database. CI test coverage for SQL is limited to postgresql and sqlite. We have observed some bugs because of unsupported features or differences in SQL dialects, and we are actively working to improve dialect-specific support and testing.
Note: we do not currently test Great Expectations against all types of SQL database. CI test coverage for SQL is limited to postgresql, sqlite, mssql, and BigQuery. We have observed some bugs because of unsupported features or differences in SQL dialects, and we are actively working to improve dialect-specific support and testing.


Unit tests for Expectations
@@ -113,7 +135,7 @@ Each item under ``datasets`` includes three entries: ``data``, ``schemas``, and
**tests**

...define the tests to be executed against the dataframe. Each item in ``tests`` must have ``title``, ``exact_match_out``, ``in``, and ``out``. The test runner will execute the named Expectation once for each item, with the values in ``in`` supplied as kwargs.

The test passes if the values in the expectation validation result correspond with the values in ``out``. If ``exact_match_out`` is true, then every field in the Expectation output must have a corresponding, matching field in ``out``. If it's false, then only the fields specified in ``out`` need to match. For most use cases, false is a better fit, because it allows narrower targeting of the relevant output.

``suppress_test_for`` is an optional parameter to disable an Expectation for a specific list of backends.
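To make the expected shape concrete, here is a minimal sketch of such a test definition; the column name, values, and test title are purely illustrative, not taken from the repository:

```json
{
  "datasets": [
    {
      "data": {"col_a": [1, 2, 2]},
      "schemas": {},
      "tests": [
        {
          "title": "basic_value_set_test",
          "exact_match_out": false,
          "in": {"column": "col_a", "value_set": [1, 2]},
          "out": {"success": true},
          "suppress_test_for": ["bigquery"]
        }
      ]
    }
  ]
}
```

Because ``exact_match_out`` is false, only the ``success`` field is compared against the validation result; all other output fields are ignored.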
33 changes: 23 additions & 10 deletions great_expectations/dataset/sqlalchemy_dataset.py
@@ -93,6 +93,16 @@
try:
import pybigquery.sqlalchemy_bigquery

###
# NOTE: 20210816 - jdimatteo: A convention we rely on is for SqlAlchemy dialects
# to define an attribute "dialect". A PR has been submitted to fix this upstream
# with https://github.com/googleapis/python-bigquery-sqlalchemy/pull/251. If that
# fix isn't present, add this "dialect" attribute here:
if not hasattr(pybigquery.sqlalchemy_bigquery, "dialect"):
pybigquery.sqlalchemy_bigquery.dialect = (
pybigquery.sqlalchemy_bigquery.BigQueryDialect
)

# Sometimes "pybigquery.sqlalchemy_bigquery" fails to self-register in certain environments, so we do it explicitly.
# (see https://stackoverflow.com/questions/53284762/nosuchmoduleerror-cant-load-plugin-sqlalchemy-dialectssnowflake)
registry.register("bigquery", "pybigquery.sqlalchemy_bigquery", "BigQueryDialect")
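The compatibility shim above can be sketched in isolation. A stand-in module object is used here instead of the real `pybigquery.sqlalchemy_bigquery` (which may not be installed), so the names below are illustrative only:

```python
import types

# Stand-in for pybigquery.sqlalchemy_bigquery: the real module defines
# BigQueryDialect but (before the upstream fix in PR 251) no "dialect" attribute.
fake_module = types.ModuleType("fake_sqlalchemy_bigquery")

class BigQueryDialect:
    """Stand-in for the real SQLAlchemy dialect class."""

fake_module.BigQueryDialect = BigQueryDialect

# The convention relied on here: a dialect module exposes a "dialect" attribute
# pointing at its dialect class. Alias it if the module does not define it.
if not hasattr(fake_module, "dialect"):
    fake_module.dialect = fake_module.BigQueryDialect

assert fake_module.dialect is BigQueryDialect
```

The `hasattr` guard makes the patch a no-op once the upstream fix ships, so it is safe to leave in place across library versions.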
@@ -542,32 +552,33 @@ def __init__(
self._table = sa.Table(table_name, sa.MetaData(), schema=schema)

# Get the dialect **for purposes of identifying types**
if self.engine.dialect.name.lower() in [
dialect_name: str = self.engine.dialect.name.lower()

if dialect_name in [
"postgresql",
"mysql",
"sqlite",
"oracle",
"mssql",
"bigquery",
]:
# These are the officially included and supported dialects by sqlalchemy
self.dialect = import_library_module(
module_name="sqlalchemy.dialects." + self.engine.dialect.name
)

elif self.engine.dialect.name.lower() == "snowflake":
elif dialect_name == "snowflake":
self.dialect = import_library_module(
module_name="snowflake.sqlalchemy.snowdialect"
)

elif self.engine.dialect.name.lower() == "redshift":
elif dialect_name == "redshift":
self.dialect = import_library_module(
module_name="sqlalchemy_redshift.dialect"
)
elif self.engine.dialect.name.lower() == "bigquery":
elif dialect_name == "bigquery":
self.dialect = import_library_module(
module_name="pybigquery.sqlalchemy_bigquery"
)
elif self.engine.dialect.name.lower() == "awsathena":
elif dialect_name == "awsathena":
self.dialect = import_library_module(
module_name="pyathena.sqlalchemy_athena"
)
@@ -2093,14 +2104,16 @@ def _get_dialect_like_pattern_expression(self, column, like_pattern, positive=Tr

try:
# Bigquery
if isinstance(
self.sql_engine_dialect, pybigquery.sqlalchemy_bigquery.BigQueryDialect
):
if hasattr(self.sql_engine_dialect, "BigQueryDialect"):
dialect_supported = True
except (
AttributeError,
TypeError,
): # TypeError can occur if the driver was not installed and so is None
logger.debug(
"Unable to load BigQueryDialect dialect while running _get_dialect_like_pattern_expression",
exc_info=True,
)
pass

if isinstance(
14 changes: 12 additions & 2 deletions great_expectations/execution_engine/sqlalchemy_execution_engine.py
@@ -85,10 +85,20 @@
try:
import pybigquery.sqlalchemy_bigquery

# Sometimes "pybigquery.sqlalchemy_bigquery" fails to self-register in certain environments, so we do it explicitly.
###
# NOTE: 20210816 - jdimatteo: A convention we rely on is for SqlAlchemy dialects
# to define an attribute "dialect". A PR has been submitted to fix this upstream
# with https://github.com/googleapis/python-bigquery-sqlalchemy/pull/251. If that
# fix isn't present, add this "dialect" attribute here:
if not hasattr(pybigquery.sqlalchemy_bigquery, "dialect"):
pybigquery.sqlalchemy_bigquery.dialect = (
pybigquery.sqlalchemy_bigquery.BigQueryDialect
)

# Sometimes "pybigquery.sqlalchemy_bigquery" fails to self-register in Azure (our CI/CD pipeline) in certain cases, so we do it explicitly.
# (see https://stackoverflow.com/questions/53284762/nosuchmoduleerror-cant-load-plugin-sqlalchemy-dialectssnowflake)
sa.dialects.registry.register(
"bigquery", "pybigquery.sqlalchemy_bigquery", "BigQueryDialect"
"bigquery", "pybigquery.sqlalchemy_bigquery", "dialect"
)
try:
getattr(pybigquery.sqlalchemy_bigquery, "INTEGER")
@@ -56,9 +56,19 @@
try:
import pybigquery.sqlalchemy_bigquery

# Sometimes "pybigquery.sqlalchemy_bigquery" fails to self-register in certain environments, so we do it explicitly.
###
# NOTE: 20210816 - jdimatteo: A convention we rely on is for SqlAlchemy dialects
# to define an attribute "dialect". A PR has been submitted to fix this upstream
# with https://github.com/googleapis/python-bigquery-sqlalchemy/pull/251. If that
# fix isn't present, add this "dialect" attribute here:
if not hasattr(pybigquery.sqlalchemy_bigquery, "dialect"):
pybigquery.sqlalchemy_bigquery.dialect = (
pybigquery.sqlalchemy_bigquery.BigQueryDialect
)

# Sometimes "pybigquery.sqlalchemy_bigquery" fails to self-register in Azure (our CI/CD pipeline) in certain cases, so we do it explicitly.
# (see https://stackoverflow.com/questions/53284762/nosuchmoduleerror-cant-load-plugin-sqlalchemy-dialectssnowflake)
registry.register("bigquery", "pybigquery.sqlalchemy_bigquery", "BigQueryDialect")
registry.register("bigquery", "pybigquery.sqlalchemy_bigquery", "dialect")
try:
getattr(pybigquery.sqlalchemy_bigquery, "INTEGER")
bigquery_types_tuple = None
29 changes: 21 additions & 8 deletions great_expectations/expectations/metrics/util.py
@@ -49,9 +49,18 @@
try:
import pybigquery.sqlalchemy_bigquery

# Sometimes "pybigquery.sqlalchemy_bigquery" fails to self-register in certain environments, so we do it explicitly.
###
# NOTE: 20210816 - jdimatteo: A convention we rely on is for SqlAlchemy dialects
# to define an attribute "dialect". A PR has been submitted to fix this upstream
# with https://github.com/googleapis/python-bigquery-sqlalchemy/pull/251. If that
# fix isn't present, add this "dialect" attribute here:
if not hasattr(pybigquery.sqlalchemy_bigquery, "dialect"):
pybigquery.sqlalchemy_bigquery.dialect = (
pybigquery.sqlalchemy_bigquery.BigQueryDialect
)
# Sometimes "pybigquery.sqlalchemy_bigquery" fails to self-register in Azure (our CI/CD pipeline) in certain cases, so we do it explicitly.
# (see https://stackoverflow.com/questions/53284762/nosuchmoduleerror-cant-load-plugin-sqlalchemy-dialectssnowflake)
registry.register("bigquery", "pybigquery.sqlalchemy_bigquery", "BigQueryDialect")
registry.register("bigquery", "pybigquery.sqlalchemy_bigquery", "dialect")
try:
getattr(pybigquery.sqlalchemy_bigquery, "INTEGER")
bigquery_types_tuple = None
@@ -123,7 +132,7 @@ def get_dialect_regex_expression(column, regex, dialect, positive=True):

try:
# Bigquery
if issubclass(dialect.dialect, pybigquery.sqlalchemy_bigquery.BigQueryDialect):
if hasattr(dialect, "BigQueryDialect"):
if positive:
return sa.func.REGEXP_CONTAINS(column, literal(regex))
else:
@@ -132,6 +141,10 @@
AttributeError,
TypeError,
): # TypeError can occur if the driver was not installed and so is None
logger.debug(
"Unable to load BigQueryDialect dialect while running get_dialect_regex_expression in expectations.metrics.util",
exc_info=True,
)
pass

return None
@@ -251,10 +264,10 @@ def column_reflection_fallback(
columns_query: str = f"""
SELECT
SCHEMA_NAME(tab.schema_id) AS schema_name,
tab.name AS table_name,
col.column_id AS column_id,
col.name AS column_name,
t.name AS column_data_type,
col.max_length AS column_max_length,
col.precision AS column_precision
FROM sys.tables AS tab
@@ -264,7 +277,7 @@
ON col.user_type_id = t.user_type_id
WHERE tab.name = '{selectable}'
ORDER BY schema_name,
table_name,
column_id
"""
col_info_query: TextClause = sa.text(columns_query)
@@ -301,7 +314,7 @@ def get_dialect_like_pattern_expression(column, dialect, like_pattern, positive=

try:
# Bigquery
if isinstance(dialect, pybigquery.sqlalchemy_bigquery.BigQueryDialect):
if hasattr(dialect, "BigQueryDialect"):
dialect_supported = True
except (
AttributeError,