[FEATURE] Enable Expectations tests for BigQuery (#3219)
Co-authored-by: John DiMatteo <jdimatteo@verily.com>
William Shin committed Aug 18, 2021
1 parent 5b33a4c commit 2c89de6
Showing 17 changed files with 422 additions and 246 deletions.
49 changes: 49 additions & 0 deletions azure-pipelines-cloud-db-integration.yml
@@ -0,0 +1,49 @@
stages:
- stage: cloud_db_integration
pool:
vmImage: 'ubuntu-latest'

jobs:
- job: bigquery_expectations
timeoutInMinutes: 0 # Maximize the time that pipelines remain open (6 hours currently)
variables:
python.version: '3.8'

strategy:
matrix:
expectations_cfe:
test_script: 'tests/test_definitions/test_expectations_cfe.py'
expectations:
test_script: 'tests/test_definitions/test_expectations.py'
maxParallel: 1

steps:
- task: UsePythonVersion@0
inputs:
versionSpec: '$(python.version)'
displayName: 'Use Python $(python.version)'

- bash: python -m pip install --upgrade pip==20.2.4
displayName: 'Update pip'

- script: |
pip install -r requirements-dev.txt
displayName: 'Install dependencies'
- task: DownloadSecureFile@1
name: gcp_authkey
displayName: 'Download Google Service Account'
inputs:
secureFile: 'superconductive-service-acct.json'
retryCount: '2'

- script: |
pip install pytest pytest-azurepipelines
pytest -v --no-spark --no-postgresql --bigquery --napoleon-docstrings --junitxml=junit/test-results.xml --cov=. --cov-report=xml --cov-report=html --ignore=tests/cli --ignore=tests/integration/usage_statistics $(test_script)
displayName: 'pytest'
env:
GOOGLE_APPLICATION_CREDENTIALS: $(gcp_authkey.secureFilePath)
GE_TEST_BIGQUERY_PROJECT: $(GE_TEST_BIGQUERY_PROJECT)
GE_TEST_BIGQUERY_DATASET: $(GE_TEST_BIGQUERY_DATASET)
19 changes: 18 additions & 1 deletion docs/contributing/contributing_test.md
@@ -17,6 +17,23 @@ For example, you can run `pytest --no-spark --no-sqlalchemy` to skip all local b

Note: as of early 2020, the tests generate many warnings. Most of these are generated by dependencies (pandas, sqlalchemy, etc.). You can suppress them with pytest’s `--disable-pytest-warnings` flag: `pytest --no-spark --no-sqlalchemy --disable-pytest-warnings`.

#### BigQuery tests

To run BigQuery tests, first complete the following steps:

1. [Select or create a Cloud Platform project](https://console.cloud.google.com/project).
2. [Set up authentication](https://googleapis.dev/python/google-api-core/latest/auth.html).
3. In your project, [create a BigQuery dataset](https://cloud.google.com/bigquery/docs/datasets) named `test_ci` and [set the dataset default table expiration](https://cloud.google.com/bigquery/docs/updating-datasets#table-expiration) to `0.1` days, so that test tables are cleaned up automatically.

After setting up authentication, you can run the tests against your project by setting the environment variable `GE_TEST_BIGQUERY_PROJECT`, e.g.

```bash
GE_TEST_BIGQUERY_PROJECT=<YOUR_GOOGLE_CLOUD_PROJECT> \
  pytest tests/test_definitions/test_expectations_cfe.py --bigquery --no-spark --no-postgresql
```

Note that if you prefer to use a dataset other than `test_ci`, you can specify it with the environment variable `GE_TEST_BIGQUERY_DATASET`.
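As a hypothetical example, both variables can be combined in one invocation (the project and dataset values below are placeholders, not names from this repository):

```bash
GE_TEST_BIGQUERY_PROJECT=<YOUR_GOOGLE_CLOUD_PROJECT> \
GE_TEST_BIGQUERY_DATASET=<YOUR_DATASET> \
  pytest tests/test_definitions/test_expectations_cfe.py --bigquery --no-spark --no-postgresql
```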

### Writing unit and integration tests

Production code in Great Expectations must be thoroughly tested. In general, we insist on unit tests for all branches of every method, including likely error states. Most new feature contributions should include several unit tests. Contributions that modify or extend existing features should include a test of the new behavior.
@@ -25,7 +42,7 @@ Experimental code in Great Expectations need only be tested lightly. We are movi

Most of Great Expectations’ integration testing is in the CLI, which naturally exercises most of the core code paths. Because integration tests require a lot of developer time to maintain, most contributions should not include new integration tests, unless they change the CLI itself.

Note: we do not currently test Great Expectations against all types of SQL database. CI test coverage for SQL is limited to postgresql and sqlite. We have observed some bugs because of unsupported features or differences in SQL dialects, and we are actively working to improve dialect-specific support and testing.
Note: we do not currently test Great Expectations against all types of SQL database. CI test coverage for SQL is limited to postgresql, sqlite, mssql, and BigQuery. We have observed some bugs because of unsupported features or differences in SQL dialects, and we are actively working to improve dialect-specific support and testing.

### Unit tests for Expectations
One of Great Expectations’ important promises is that the same Expectation will produce the same result across all supported execution environments: pandas, sqlalchemy, and Spark.
@@ -1,3 +1,4 @@

---
title: How to implement a custom Validation Action
---
28 changes: 25 additions & 3 deletions docs_rtd/contributing/testing.rst
@@ -21,7 +21,7 @@ If you did not configure optional backends for testing, tests against these back
You can suppress these tests by adding the following flags:

- ``--no-postgresql`` will skip postgres tests
- ``--no-spark`` will skip spark tests
- ``--no-sqlalchemy`` will skip all tests using sqlalchemy (i.e. all database backends)

For example, you can run ``pytest --no-spark --no-sqlalchemy`` to skip all local backend tests (with the exception of the pandas backend). Please note that these tests will still be run by the CI as soon as you open a PR, so some tests might fail if your code changes affected them.
@@ -34,6 +34,28 @@ Note: as of early 2020, the tests generate many warnings. Most of these are gene
.. _contributing_testing__writing_unit_tests:

Running BigQuery tests
----------------------

To run BigQuery tests, first complete the following steps:

1. `Select or create a Cloud Platform project.`_
2. `Setup Authentication.`_
3. `In your project, create a BigQuery dataset named "test_ci"`_ and `set the dataset default table expiration to .1 days`_

.. _Select or create a Cloud Platform project.: https://console.cloud.google.com/project
.. _Setup Authentication.: https://googleapis.dev/python/google-api-core/latest/auth.html
.. _`In your project, create a BigQuery dataset named "test_ci"`: https://cloud.google.com/bigquery/docs/datasets
.. _`set the dataset default table expiration to .1 days`: https://cloud.google.com/bigquery/docs/updating-datasets#table-expiration

After setting up authentication, you can run the tests against your project by setting the environment variable ``GE_TEST_BIGQUERY_PROJECT``, e.g.

.. code-block:: bash

    GE_TEST_BIGQUERY_PROJECT=<YOUR_GOOGLE_CLOUD_PROJECT> pytest tests/test_definitions/test_expectations_cfe.py --bigquery --no-spark --no-postgresql -k bigquery

Note that if you prefer to use a dataset other than "test_ci", you can specify it with the environment variable ``GE_TEST_BIGQUERY_DATASET``.
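For example, a hypothetical invocation that also overrides the dataset (both values below are placeholders):

.. code-block:: bash

    GE_TEST_BIGQUERY_PROJECT=<YOUR_GOOGLE_CLOUD_PROJECT> GE_TEST_BIGQUERY_DATASET=<YOUR_DATASET> pytest tests/test_definitions/test_expectations_cfe.py --bigquery --no-spark --no-postgresql -k bigquery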

Writing unit and integration tests
----------------------------------

@@ -43,7 +65,7 @@ Experimental code in Great Expectations need only be tested lightly. We are movi

Most of Great Expectations' integration testing is in the CLI, which naturally exercises most of the core code paths. Because integration tests require a lot of developer time to maintain, most contributions should *not* include new integration tests, unless they change the CLI itself.

Note: we do not currently test Great Expectations against all types of SQL database. CI test coverage for SQL is limited to postgresql and sqlite. We have observed some bugs because of unsupported features or differences in SQL dialects, and we are actively working to improve dialect-specific support and testing.
Note: we do not currently test Great Expectations against all types of SQL database. CI test coverage for SQL is limited to postgresql, sqlite, mssql, and BigQuery. We have observed some bugs because of unsupported features or differences in SQL dialects, and we are actively working to improve dialect-specific support and testing.


Unit tests for Expectations
@@ -113,7 +135,7 @@ Each item under ``datasets`` includes three entries: ``data``, ``schemas``, and
**tests**

...define the tests to be executed against the dataframe. Each item in ``tests`` must have ``title``, ``exact_match_out``, ``in``, and ``out``. The test runner will execute the named Expectation once for each item, with the values in ``in`` supplied as kwargs.

The test passes if the values in the expectation validation result correspond with the values in ``out``. If ``exact_match_out`` is true, then every field in the Expectation output must have a corresponding, matching field in ``out``. If it's false, then only the fields specified in ``out`` need to match. For most use cases, false is a better fit, because it allows narrower targeting of the relevant output.

``suppress_test_for`` is an optional parameter to disable an Expectation for a specific list of backends.
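To make the expected shape concrete, here is a minimal sketch of such a test definition; the column name, values, and test title are purely illustrative, not taken from the repository:

```json
{
  "datasets": [
    {
      "data": {"col_a": [1, 2, 2]},
      "schemas": {},
      "tests": [
        {
          "title": "basic_value_set_test",
          "exact_match_out": false,
          "in": {"column": "col_a", "value_set": [1, 2]},
          "out": {"success": true},
          "suppress_test_for": ["bigquery"]
        }
      ]
    }
  ]
}
```

Because ``exact_match_out`` is false, only the ``success`` field is compared against the validation result; all other output fields are ignored.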
33 changes: 23 additions & 10 deletions great_expectations/dataset/sqlalchemy_dataset.py
@@ -93,6 +93,16 @@
try:
import pybigquery.sqlalchemy_bigquery

###
# NOTE: 20210816 - jdimatteo: A convention we rely on is for SqlAlchemy dialects
# to define an attribute "dialect". A PR has been submitted to fix this upstream
# with https://github.com/googleapis/python-bigquery-sqlalchemy/pull/251. If that
# fix isn't present, add this "dialect" attribute here:
if not hasattr(pybigquery.sqlalchemy_bigquery, "dialect"):
pybigquery.sqlalchemy_bigquery.dialect = (
pybigquery.sqlalchemy_bigquery.BigQueryDialect
)

# Sometimes "pybigquery.sqlalchemy_bigquery" fails to self-register in certain environments, so we do it explicitly.
# (see https://stackoverflow.com/questions/53284762/nosuchmoduleerror-cant-load-plugin-sqlalchemy-dialectssnowflake)
registry.register("bigquery", "pybigquery.sqlalchemy_bigquery", "BigQueryDialect")
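The compatibility shim above can be sketched in isolation. A stand-in module object is used here instead of the real `pybigquery.sqlalchemy_bigquery` (which may not be installed), so the names below are illustrative only:

```python
import types

# Stand-in for pybigquery.sqlalchemy_bigquery: the real module defines
# BigQueryDialect but (before the upstream fix in PR 251) no "dialect" attribute.
fake_module = types.ModuleType("fake_sqlalchemy_bigquery")

class BigQueryDialect:
    """Stand-in for the real SQLAlchemy dialect class."""

fake_module.BigQueryDialect = BigQueryDialect

# The convention relied on here: a dialect module exposes a "dialect" attribute
# pointing at its dialect class. Alias it if the module does not define it.
if not hasattr(fake_module, "dialect"):
    fake_module.dialect = fake_module.BigQueryDialect

assert fake_module.dialect is BigQueryDialect
```

The `hasattr` guard makes the patch a no-op once the upstream fix ships, so it is safe to leave in place across library versions.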
@@ -542,32 +552,33 @@ def __init__(
self._table = sa.Table(table_name, sa.MetaData(), schema=schema)

# Get the dialect **for purposes of identifying types**
if self.engine.dialect.name.lower() in [
dialect_name: str = self.engine.dialect.name.lower()

if dialect_name in [
"postgresql",
"mysql",
"sqlite",
"oracle",
"mssql",
"bigquery",
]:
# These are the officially included and supported dialects by sqlalchemy
self.dialect = import_library_module(
module_name="sqlalchemy.dialects." + self.engine.dialect.name
)

elif self.engine.dialect.name.lower() == "snowflake":
elif dialect_name == "snowflake":
self.dialect = import_library_module(
module_name="snowflake.sqlalchemy.snowdialect"
)

elif self.engine.dialect.name.lower() == "redshift":
elif dialect_name == "redshift":
self.dialect = import_library_module(
module_name="sqlalchemy_redshift.dialect"
)
elif self.engine.dialect.name.lower() == "bigquery":
elif dialect_name == "bigquery":
self.dialect = import_library_module(
module_name="pybigquery.sqlalchemy_bigquery"
)
elif self.engine.dialect.name.lower() == "awsathena":
elif dialect_name == "awsathena":
self.dialect = import_library_module(
module_name="pyathena.sqlalchemy_athena"
)
@@ -2093,14 +2104,16 @@ def _get_dialect_like_pattern_expression(self, column, like_pattern, positive=Tr

try:
# Bigquery
if isinstance(
self.sql_engine_dialect, pybigquery.sqlalchemy_bigquery.BigQueryDialect
):
if hasattr(self.sql_engine_dialect, "BigQueryDialect"):
dialect_supported = True
except (
AttributeError,
TypeError,
): # TypeError can occur if the driver was not installed and so is None
logger.debug(
"Unable to load BigQueryDialect dialect while running _get_dialect_like_pattern_expression",
exc_info=True,
)
pass

if isinstance(
14 changes: 12 additions & 2 deletions great_expectations/execution_engine/sqlalchemy_execution_engine.py
@@ -85,10 +85,20 @@
try:
import pybigquery.sqlalchemy_bigquery

# Sometimes "pybigquery.sqlalchemy_bigquery" fails to self-register in certain environments, so we do it explicitly.
###
# NOTE: 20210816 - jdimatteo: A convention we rely on is for SqlAlchemy dialects
# to define an attribute "dialect". A PR has been submitted to fix this upstream
# with https://github.com/googleapis/python-bigquery-sqlalchemy/pull/251. If that
# fix isn't present, add this "dialect" attribute here:
if not hasattr(pybigquery.sqlalchemy_bigquery, "dialect"):
pybigquery.sqlalchemy_bigquery.dialect = (
pybigquery.sqlalchemy_bigquery.BigQueryDialect
)

# Sometimes "pybigquery.sqlalchemy_bigquery" fails to self-register in Azure (our CI/CD pipeline) in certain cases, so we do it explicitly.
# (see https://stackoverflow.com/questions/53284762/nosuchmoduleerror-cant-load-plugin-sqlalchemy-dialectssnowflake)
sa.dialects.registry.register(
"bigquery", "pybigquery.sqlalchemy_bigquery", "BigQueryDialect"
"bigquery", "pybigquery.sqlalchemy_bigquery", "dialect"
)
try:
getattr(pybigquery.sqlalchemy_bigquery, "INTEGER")
@@ -56,9 +56,19 @@
try:
import pybigquery.sqlalchemy_bigquery

# Sometimes "pybigquery.sqlalchemy_bigquery" fails to self-register in certain environments, so we do it explicitly.
###
# NOTE: 20210816 - jdimatteo: A convention we rely on is for SqlAlchemy dialects
# to define an attribute "dialect". A PR has been submitted to fix this upstream
# with https://github.com/googleapis/python-bigquery-sqlalchemy/pull/251. If that
# fix isn't present, add this "dialect" attribute here:
if not hasattr(pybigquery.sqlalchemy_bigquery, "dialect"):
pybigquery.sqlalchemy_bigquery.dialect = (
pybigquery.sqlalchemy_bigquery.BigQueryDialect
)

# Sometimes "pybigquery.sqlalchemy_bigquery" fails to self-register in Azure (our CI/CD pipeline) in certain cases, so we do it explicitly.
# (see https://stackoverflow.com/questions/53284762/nosuchmoduleerror-cant-load-plugin-sqlalchemy-dialectssnowflake)
registry.register("bigquery", "pybigquery.sqlalchemy_bigquery", "BigQueryDialect")
registry.register("bigquery", "pybigquery.sqlalchemy_bigquery", "dialect")
try:
getattr(pybigquery.sqlalchemy_bigquery, "INTEGER")
bigquery_types_tuple = None
29 changes: 21 additions & 8 deletions great_expectations/expectations/metrics/util.py
@@ -49,9 +49,18 @@
try:
import pybigquery.sqlalchemy_bigquery

# Sometimes "pybigquery.sqlalchemy_bigquery" fails to self-register in certain environments, so we do it explicitly.
###
# NOTE: 20210816 - jdimatteo: A convention we rely on is for SqlAlchemy dialects
# to define an attribute "dialect". A PR has been submitted to fix this upstream
# with https://github.com/googleapis/python-bigquery-sqlalchemy/pull/251. If that
# fix isn't present, add this "dialect" attribute here:
if not hasattr(pybigquery.sqlalchemy_bigquery, "dialect"):
pybigquery.sqlalchemy_bigquery.dialect = (
pybigquery.sqlalchemy_bigquery.BigQueryDialect
)
# Sometimes "pybigquery.sqlalchemy_bigquery" fails to self-register in Azure (our CI/CD pipeline) in certain cases, so we do it explicitly.
# (see https://stackoverflow.com/questions/53284762/nosuchmoduleerror-cant-load-plugin-sqlalchemy-dialectssnowflake)
registry.register("bigquery", "pybigquery.sqlalchemy_bigquery", "BigQueryDialect")
registry.register("bigquery", "pybigquery.sqlalchemy_bigquery", "dialect")
try:
getattr(pybigquery.sqlalchemy_bigquery, "INTEGER")
bigquery_types_tuple = None
@@ -123,7 +132,7 @@ def get_dialect_regex_expression(column, regex, dialect, positive=True):

try:
# Bigquery
if issubclass(dialect.dialect, pybigquery.sqlalchemy_bigquery.BigQueryDialect):
if hasattr(dialect, "BigQueryDialect"):
if positive:
return sa.func.REGEXP_CONTAINS(column, literal(regex))
else:
@@ -132,6 +141,10 @@
AttributeError,
TypeError,
): # TypeError can occur if the driver was not installed and so is None
logger.debug(
"Unable to load BigQueryDialect dialect while running get_dialect_regex_expression in expectations.metrics.util",
exc_info=True,
)
pass

return None
@@ -251,10 +264,10 @@ def column_reflection_fallback(
columns_query: str = f"""
SELECT
SCHEMA_NAME(tab.schema_id) AS schema_name,
tab.name AS table_name,
col.column_id AS column_id,
col.name AS column_name,
t.name AS column_data_type,
col.max_length AS column_max_length,
col.precision AS column_precision
FROM sys.tables AS tab
@@ -264,7 +277,7 @@
ON col.user_type_id = t.user_type_id
WHERE tab.name = '{selectable}'
ORDER BY schema_name,
table_name,
column_id
"""
col_info_query: TextClause = sa.text(columns_query)
@@ -301,7 +314,7 @@ def get_dialect_like_pattern_expression(column, dialect, like_pattern, positive=

try:
# Bigquery
if isinstance(dialect, pybigquery.sqlalchemy_bigquery.BigQueryDialect):
if hasattr(dialect, "BigQueryDialect"):
dialect_supported = True
except (
AttributeError,