
[FEATURE] Enable Expectations tests for BigQuery #3219

Merged

Conversation

Shinnnyshinshin
Contributor

@Shinnnyshinshin Shinnnyshinshin commented Aug 11, 2021

This PR is a response to #3122 in collaboration with @jdimatteo. We attempt to increase test coverage of Great Expectations with BigQuery by creating a CI/CD pipeline that runs Expectation tests (both CFE and non-CFE Expectations) directly against a BigQuery backend.

In the process we have added a number of bugfixes, but a small number of tests still fail against BigQuery (and not against other database backends). Fixes for those Expectations will be part of upcoming PRs.

Relevant Issues and JIRA items

What has changed?

Added azure-pipelines-cloud-db-integration.yml

  • Currently the pipeline is explicitly not part of CI/CD or PR validation. It can be run manually, and is currently scheduled as a cron job that runs weekly (every Sunday).
  • V2 Expectation tests (test_expectations.py) and V3 tests (test_expectations_cfe.py) run as separate stages in the pipeline. While building the pipeline, we found that BigQuery was hitting a rate limit for loading data into our database, caused by the fixture parameterization in Expectation tests. Separating the tests into stages minimizes the number of concurrent requests.
  • Tests run against the test_ci dataset in Superconductive's internal BigQuery project. New tables currently expire after 0.1 day (~2.4 hours), so temp tables do not accumulate.
  • The pipeline runs in ~2 hours, with V2 Expectation tests taking longer (~1.5 hr) than V3 CFE Expectation tests (~20 min).
  • Note: the credentials are loaded using an Azure Secure file.
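The "manual plus weekly cron" behavior described above can be sketched in azure-pipelines-cloud-db-integration.yml roughly as follows; the display name and branch are illustrative assumptions, not copied from the PR:

```yaml
# Sketch only: keep the pipeline out of CI/PR validation and run it weekly.
trigger: none   # no runs on pushes
pr: none        # no runs on pull requests

schedules:
  - cron: "0 0 * * 0"          # every Sunday at 00:00 UTC
    displayName: Weekly cloud-db Expectation tests   # assumed name
    branches:
      include:
        - develop              # assumed branch
    always: true               # run even if there are no new commits
```

The pipeline itself can still be queued manually from the Azure DevOps UI.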

Added --bigquery flag to pytest

  • This resulted in changes to conftest.py and test_util.py
  • Tests for the BigQuery backend can now be run using either:
pytest tests/test_definitions/test_expectations_cfe.py --bigquery --no-spark --no-postgresql

# and

pytest tests/test_definitions/test_expectations.py --bigquery --no-spark --no-postgresql
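A minimal sketch of how a backend flag like --bigquery can be registered in conftest.py via pytest's pytest_addoption hook. Only the flag names come from the PR; the helper backend_enabled is a hypothetical illustration, not the PR's actual code:

```python
# Sketch of wiring backend flags into pytest; everything except the flag
# names themselves is illustrative.

def pytest_addoption(parser):
    # pytest calls this hook at startup; each addoption adds a CLI flag.
    parser.addoption(
        "--bigquery",
        action="store_true",
        default=False,
        help="Run Expectation tests against a BigQuery backend",
    )
    parser.addoption(
        "--no-spark",
        action="store_true",
        default=False,
        help="Skip tests that require a Spark backend",
    )


def backend_enabled(config, flag):
    # Hypothetical helper: True when the given backend flag was passed.
    return bool(config.getoption(flag))
```

Test collection code can then consult backend_enabled(config, "--bigquery") to decide whether BigQuery-backed fixtures should be generated.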

Changes to Documentation

Equivalent content has been added to docs (new Docusaurus site) and docs_rtd (legacy docs site) for the time being. The content, written by @jdimatteo, describes how users can set up a Google Cloud Platform project and dataset and run Expectation tests through pytest.

  • docs/contributing/contributing_test.md
  • docs_rtd/contributing/testing.rst

Changes to Code

  • A number of bugfixes related to an AttributeError raised when an object does not have dialect as an attribute. This happens because for pybigquery.sqlalchemy_bigquery, the entire module is registered as a dialect, unlike other backends where we only register [module].dialect.
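The pattern behind these bugfixes can be illustrated with a small, self-contained sketch. The fake module below stands in for pybigquery.sqlalchemy_bigquery, and resolve_dialect is a hypothetical helper, not the PR's actual code:

```python
import types


def resolve_dialect(module, fallback_class_name):
    """Return module.dialect, patching it from a named class when the
    driver does not follow the [module].dialect convention."""
    if not hasattr(module, "dialect"):
        setattr(module, "dialect", getattr(module, fallback_class_name))
    return module.dialect


# Stand-in for pybigquery.sqlalchemy_bigquery, which (at the time of this
# PR) exposed a BigQueryDialect class but no "dialect" attribute:
fake_driver = types.SimpleNamespace(
    BigQueryDialect=type("BigQueryDialect", (), {})
)
dialect = resolve_dialect(fake_driver, "BigQueryDialect")
```

Callers that assume the [module].dialect convention then work unchanged, whether or not the driver exports the attribute itself.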

Definition of Done

  • My code follows the Great Expectations style guide
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added unit tests where applicable and made sure that new and existing tests are passing.
  • I have run any local integration tests and made sure that nothing is broken.

Thank you for submitting!

Shinnnyshinshin and others added 30 commits July 29, 2021 23:17
#3158)

* Merged in my work from #3135:

1. added documentation
2. using helper function _create_bigquery_engine with gcp project environment variable
3. added bigquery specific list in candidate_test_is_on_temporary_notimplemented_list_cfe
4. removed bigquery specific test files, so we can just use the already existing ones (along with bigquery specific notimplemented list per #3)

* enabled the expect_column_values_to_be_unique test, which passes now with Will's changes
Add content on table expiration
* develop:
  [DOCS] Apply Docusaurus tabs to relevant pages in new docs
  [FEATURE] Implement MulticolumnMapExpectation class (#3134)
  [FEATURE] add python3.9 to python versions (#3143)
  [FEATURE]/MER-16/MER-75/add_route_for_validation_result (#3090)
  [BUGFIX] Enable `--v3-api suite edit` to proceed without selecting DataConnectors (#3165)
  Error messages must be friendly. (#3171)
  [FEATURE] Implement the MulticolumnSumEqual Metric for PandasExecutionEngine (#3130)
  [BUGFIX] allow reader_options in the CLI so can read `.csv.gz` files (#2695)
  [BUGFIX] Fix error when `RuntimeBatchRequest` is passed to `SimpleCheckpoint` with `RuntimeDataConnector` (#3152)
  [FEATURE] Enable BigQuery tests for Azure CI/CD (#3155)
  [MAINTENANCE] Improve Coding Practices in "great_expectations/expectations/expectation.py" (#3151)
  [BUGFIX] Snowflake connections are closed correctly by DOCS tests  (#3104)
  PMrelease-prep-20201-07-29 (#3144)
  [MAINTENANCE] Adhere to formalism: Use the defined Enum types for domain_type specification, and string key names for domain keys (#3146)
This reverts commit ce7ff46.
Shinnnyshinshin and others added 8 commits August 16, 2021 22:50
…ntion and dataset cleanup (#3259)

* remove duplicate inconsistent logic around bigquery dataset
* consistently provide a default value and document how to override
* Make BigQuery SqlAlchemy dialect conform to "dialect" attribute convention
…r-expectations

* develop:
  [BUGFIX] Fix deprecation warning for importing from collections (#3228)
  [FEATURE] Add schema validation for different GCS auth methods (#3258)
Member

@cdkini cdkini left a comment


This all looks really, really great! As long as we've documented which Expectations need work and what issues are still pending, I think this is in a great spot to merge.

Very impressive work @Shinnnyshinshin I know you've been at this for quite some time so all the effort is recognized and appreciated 🎉 🚀

Comment on lines +96 to +105
###
# NOTE: 20210816 - jdimatteo: A convention we rely on is for SqlAlchemy dialects
# to define an attribute "dialect". A PR has been submitted to fix this upstream
# with https://github.com/googleapis/python-bigquery-sqlalchemy/pull/251. If that
# fix isn't present, add this "dialect" attribute here:
if not hasattr(pybigquery.sqlalchemy_bigquery, "dialect"):
    pybigquery.sqlalchemy_bigquery.dialect = (
        pybigquery.sqlalchemy_bigquery.BigQueryDialect
    )

Member


If this hasn't yet been added as a note in the Jira ticket, could you please do so? 😄

Excited to see what the resolution is here. Let's also make an internal reminder to clean up this note upon the PR's resolution.

Contributor Author


Thank you :) It's GREAT-173


Comment on lines 106 to 107
# Sometimes "pybigquery.sqlalchemy_bigquery" fails to self-register in certain environments, so we do it explicitly.
# (see https://stackoverflow.com/questions/53284762/nosuchmoduleerror-cant-load-plugin-sqlalchemy-dialectssnowflake)
Member


Do we happen to have a reason why? If so, could we add it to this note?

Contributor Author


It appears to be something specific to how certain sqlalchemy packages are installed in Azure. I'll add a note :)

Comment on lines 108 to 143
    sqlalchemy.dialects.registry.register(
        "bigquery", "pybigquery.sqlalchemy_bigquery", "BigQueryDialect"
    )
    try:
        getattr(pybigquery.sqlalchemy_bigquery, "INTEGER")
        bigquery_types_tuple = {}
        BIGQUERY_TYPES = {
            "INTEGER": pybigquery.sqlalchemy_bigquery.INTEGER,
            "NUMERIC": pybigquery.sqlalchemy_bigquery.NUMERIC,
            "STRING": pybigquery.sqlalchemy_bigquery.STRING,
            "BIGNUMERIC": pybigquery.sqlalchemy_bigquery.BIGNUMERIC,
            "BYTES": pybigquery.sqlalchemy_bigquery.BYTES,
            "BOOL": pybigquery.sqlalchemy_bigquery.BOOL,
            "BOOLEAN": pybigquery.sqlalchemy_bigquery.BOOLEAN,
            "TIMESTAMP": pybigquery.sqlalchemy_bigquery.TIMESTAMP,
            "TIME": pybigquery.sqlalchemy_bigquery.TIME,
            "FLOAT": pybigquery.sqlalchemy_bigquery.FLOAT,
            "DATE": pybigquery.sqlalchemy_bigquery.DATE,
            "DATETIME": pybigquery.sqlalchemy_bigquery.DATETIME,
        }
    except AttributeError:
        # In older versions of the pybigquery driver, types were not exported,
        # so we use a hack
        logger.warning(
            "Old pybigquery driver version detected. Consider upgrading to 0.4.14 or later."
        )
        from collections import namedtuple

        BigQueryTypes = namedtuple(
            "BigQueryTypes", sorted(pybigquery.sqlalchemy_bigquery._type_map)
        )
        bigquery_types_tuple = BigQueryTypes(**pybigquery.sqlalchemy_bigquery._type_map)
except (ImportError, AttributeError):
    bigquery_types_tuple = None
    BigQueryDialect = None
    pybigquery = None

Member


If John signed off on this, it is alright by me. I'm always wary of try/except but it is more of an accepted pattern in Python. If this still bugs you, maybe put a ticket in the backlog? Definitely not necessary though.

Comment on lines +1196 to +1201
"expect_column_values_to_be_between", # unique to bigquery -- https://github.com/great-expectations/great_expectations/issues/3261
"expect_column_values_to_be_of_type", # unique to bigquery -- https://github.com/great-expectations/great_expectations/issues/3261
"expect_column_values_to_be_in_set", # unique to bigquery -- https://github.com/great-expectations/great_expectations/issues/3261
"expect_column_values_to_be_in_type_list", # unique to bigquery -- https://github.com/great-expectations/great_expectations/issues/3261
"expect_column_values_to_match_like_pattern_list", # unique to bigquery -- https://github.com/great-expectations/great_expectations/issues/3261
"expect_column_values_to_not_match_like_pattern_list", # unique to bigquery -- https://github.com/great-expectations/great_expectations/issues/3261
Member


Are these also noted in Jira/Confluence? If not, would you mind adding it there for improved visibility?

Contributor Author


GREAT-166 is the one for these

Comment on lines +1257 to +1262
"expect_column_kl_divergence_to_be_less_than", # TODO: Takes over 64 minutes to "collect" (haven't actually seen it complete yet) -- https://github.com/great-expectations/great_expectations/issues/3260
"expect_column_values_to_be_in_set", # TODO: No matching signature for operator and AssertionError: expected ['2018-01-01T00:00:00'] but got ['2018-01-01'] -- https://github.com/great-expectations/great_expectations/issues/3260
"expect_column_values_to_be_in_type_list", # TODO: AssertionError -- https://github.com/great-expectations/great_expectations/issues/3260
"expect_column_values_to_be_between", # TODO: "400 No matching signature for operator >=" -- https://github.com/great-expectations/great_expectations/issues/3260
"expect_column_quantile_values_to_be_between", # TODO: takes over 15 minutes to "collect" (haven't actually seen it complete yet) -- https://github.com/great-expectations/great_expectations/issues/3260
"expect_column_mean_to_be_between", # TODO: "400 No matching signature for operator *" -- https://github.com/great-expectations/great_expectations/issues/3260
Member


Same with these items

Contributor Author


and GREAT-159 for these



def _bigquery_dataset() -> str:
    return os.getenv("GE_TEST_BIGQUERY_DATASET", "test_ci")
Member


Just confirming that the default value "test_ci" is not equivalent to the env var?

Contributor Author


Shoot, I missed this one (yes, it's equivalent to the env var). I'll take it out before merging.

@Shinnnyshinshin Shinnnyshinshin enabled auto-merge (squash) August 18, 2021 00:16
@Shinnnyshinshin Shinnnyshinshin merged commit 2c89de6 into develop Aug 18, 2021
@Shinnnyshinshin Shinnnyshinshin deleted the feature/GREAT-66/enable-bigquery-tests-for-expectations branch August 18, 2021 00:47