[FEATURE] Enable Expectations tests for BigQuery #3219
Conversation
(#3158)
* Merged in my work from #3135:
  1. added documentation
  2. used the helper function _create_bigquery_engine with a GCP project environment variable
  3. added a BigQuery-specific list in candidate_test_is_on_temporary_notimplemented_list_cfe
  4. removed BigQuery-specific test files, so we can just use the already existing ones (along with the BigQuery-specific notimplemented list per #3)
* enabled the expect_column_values_to_be_unique test, which now passes with Will's changes
Add content on table expiration
* develop:
  [DOCS] Apply Docusaurus tabs to relevant pages in new docs
  [FEATURE] Implement MulticolumnMapExpectation class (#3134)
  [FEATURE] add python3.9 to python versions (#3143)
  [FEATURE]/MER-16/MER-75/add_route_for_validation_result (#3090)
  [BUGFIX] Enable `--v3-api suite edit` to proceed without selecting DataConnectors (#3165)
  Error messages must be friendly. (#3171)
  [FEATURE] Implement the MulticolumnSumEqual Metric for PandasExecutionEngine (#3130)
  [BUGFIX] allow reader_options in the CLI so can read `.csv.gz` files (#2695)
  [BUGFIX] Fix error when `RuntimeBatchRequest` is passed to `SimpleCheckpoint` with `RuntimeDataConnector` (#3152)
  [FEATURE] Enable BigQuery tests for Azure CI/CD (#3155)
  [MAINTENANCE] Improve Coding Practices in "great_expectations/expectations/expectation.py" (#3151)
  [BUGFIX] Snowflake connections are closed correctly by DOCS tests (#3104)
  PMrelease-prep-20201-07-29 (#3144)
  [MAINTENANCE] Adhere to formalism: Use the defined Enum types for domain_type specification, and string key names for domain keys (#3146)
This reverts commit ce7ff46.
Co-authored-by: John DiMatteo <jdimatteo@verily.com>
This reverts commit c25846d.
…ntion and dataset cleanup (#3259) * remove duplicate inconsistent logic around bigquery dataset * consistently provide a default value and document how to override * Make BigQuery SqlAlchemy dialect conform to "dialect" attribute convention
This all looks really, really great! As long as we've documented which expectations need work and what issues are still pending, I think this is in a great spot to merge.
Very impressive work @Shinnnyshinshin I know you've been at this for quite some time so all the effort is recognized and appreciated 🎉 🚀
# NOTE: 20210816 - jdimatteo: A convention we rely on is for SqlAlchemy dialects
# to define an attribute "dialect". A PR has been submitted to fix this upstream
# with https://github.com/googleapis/python-bigquery-sqlalchemy/pull/251. If that
# fix isn't present, add this "dialect" attribute here:
if not hasattr(pybigquery.sqlalchemy_bigquery, "dialect"):
    pybigquery.sqlalchemy_bigquery.dialect = (
        pybigquery.sqlalchemy_bigquery.BigQueryDialect
    )
If this hasn't yet been added as a note in the Jira ticket, could you please do so? 😄
Excited to see what the resolution is here. Let's also make an internal reminder to clean up this note upon the PR's resolution.
Thank you :) It's GREAT-173
FYI, this is fixed upstream with googleapis/python-bigquery-sqlalchemy#291 and is available in https://pypi.org/project/sqlalchemy-bigquery/1.1.0/.
# Sometimes "pybigquery.sqlalchemy_bigquery" fails to self-register in certain environments, so we do it explicitly.
# (see https://stackoverflow.com/questions/53284762/nosuchmoduleerror-cant-load-plugin-sqlalchemy-dialectssnowflake)
Do we happen to have a reason why? If so, could we add it to this note?
It appears to be something specific to how certain sqlalchemy packages are installed in Azure. I'll add a note :)
sqlalchemy.dialects.registry.register(
    "bigquery", "pybigquery.sqlalchemy_bigquery", "BigQueryDialect"
)
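For readers unfamiliar with the registry call above, here is a toy model (not SQLAlchemy's actual implementation) of what it does: registration records a module path and attribute name, and the module is only imported when the dialect is looked up. The "demo" name and stdlib class are placeholders for illustration:

```python
import importlib

# Toy dialect registry: register() stores a module path plus attribute
# name; load() imports lazily, mirroring how sqlalchemy.dialects.registry
# resolves "bigquery" to pybigquery's BigQueryDialect above.
_registry = {}

def register(name, module_path, obj_name):
    _registry[name] = (module_path, obj_name)

def load(name):
    module_path, obj_name = _registry[name]
    return getattr(importlib.import_module(module_path), obj_name)

# A stand-in "dialect" backed by a stdlib class, just to show the flow.
register("demo", "collections", "OrderedDict")
print(load("demo").__name__)
```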
    try:
        getattr(pybigquery.sqlalchemy_bigquery, "INTEGER")
        bigquery_types_tuple = {}
        BIGQUERY_TYPES = {
            "INTEGER": pybigquery.sqlalchemy_bigquery.INTEGER,
            "NUMERIC": pybigquery.sqlalchemy_bigquery.NUMERIC,
            "STRING": pybigquery.sqlalchemy_bigquery.STRING,
            "BIGNUMERIC": pybigquery.sqlalchemy_bigquery.BIGNUMERIC,
            "BYTES": pybigquery.sqlalchemy_bigquery.BYTES,
            "BOOL": pybigquery.sqlalchemy_bigquery.BOOL,
            "BOOLEAN": pybigquery.sqlalchemy_bigquery.BOOLEAN,
            "TIMESTAMP": pybigquery.sqlalchemy_bigquery.TIMESTAMP,
            "TIME": pybigquery.sqlalchemy_bigquery.TIME,
            "FLOAT": pybigquery.sqlalchemy_bigquery.FLOAT,
            "DATE": pybigquery.sqlalchemy_bigquery.DATE,
            "DATETIME": pybigquery.sqlalchemy_bigquery.DATETIME,
        }
    except AttributeError:
        # In older versions of the pybigquery driver, types were not exported, so we use a hack
        logger.warning(
            "Old pybigquery driver version detected. Consider upgrading to 0.4.14 or later."
        )
        from collections import namedtuple

        BigQueryTypes = namedtuple(
            "BigQueryTypes", sorted(pybigquery.sqlalchemy_bigquery._type_map)
        )
        bigquery_types_tuple = BigQueryTypes(**pybigquery.sqlalchemy_bigquery._type_map)
except (ImportError, AttributeError):
    bigquery_types_tuple = None
    BigQueryDialect = None
    pybigquery = None
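The namedtuple fallback in the AttributeError branch can be illustrated in isolation. The `_type_map` dict here is a hypothetical stand-in for `pybigquery.sqlalchemy_bigquery._type_map`, which older driver versions expose instead of module-level type attributes:

```python
from collections import namedtuple

# Hypothetical stand-in for the old driver's internal _type_map dict.
_type_map = {"INTEGER": int, "STRING": str, "FLOAT": float}

# Same hack as above: turn the dict into a namedtuple so each type can be
# reached as an attribute (bigquery_types_tuple.INTEGER, etc.).
BigQueryTypes = namedtuple("BigQueryTypes", sorted(_type_map))
bigquery_types_tuple = BigQueryTypes(**_type_map)

print(bigquery_types_tuple.INTEGER is int)
```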
If John signed off on this, it is alright by me. I'm always wary of try/except, but it is more of an accepted pattern in Python. If this still bugs you, maybe put a ticket in the backlog? Definitely not necessary though.
"expect_column_values_to_be_between",  # unique to bigquery -- https://github.com/great-expectations/great_expectations/issues/3261
"expect_column_values_to_be_of_type",  # unique to bigquery -- https://github.com/great-expectations/great_expectations/issues/3261
"expect_column_values_to_be_in_set",  # unique to bigquery -- https://github.com/great-expectations/great_expectations/issues/3261
"expect_column_values_to_be_in_type_list",  # unique to bigquery -- https://github.com/great-expectations/great_expectations/issues/3261
"expect_column_values_to_match_like_pattern_list",  # unique to bigquery -- https://github.com/great-expectations/great_expectations/issues/3261
"expect_column_values_to_not_match_like_pattern_list",  # unique to bigquery -- https://github.com/great-expectations/great_expectations/issues/3261
Are these also noted in Jira/Confluence? If not, would you mind adding it there for improved visibility?
GREAT-166 is the one for these
"expect_column_kl_divergence_to_be_less_than",  # TODO: Takes over 64 minutes to "collect" (haven't actually seen it complete yet) -- https://github.com/great-expectations/great_expectations/issues/3260
"expect_column_values_to_be_in_set",  # TODO: No matching signature for operator and AssertionError: expected ['2018-01-01T00:00:00'] but got ['2018-01-01'] -- https://github.com/great-expectations/great_expectations/issues/3260
"expect_column_values_to_be_in_type_list",  # TODO: AssertionError -- https://github.com/great-expectations/great_expectations/issues/3260
"expect_column_values_to_be_between",  # TODO: "400 No matching signature for operator >=" -- https://github.com/great-expectations/great_expectations/issues/3260
"expect_column_quantile_values_to_be_between",  # TODO: takes over 15 minutes to "collect" (haven't actually seen it complete yet) -- https://github.com/great-expectations/great_expectations/issues/3260
"expect_column_mean_to_be_between",  # TODO: "400 No matching signature for operator *" -- https://github.com/great-expectations/great_expectations/issues/3260
Same with these items
and GREAT-159 for these
def _bigquery_dataset() -> str:
    return os.getenv("GE_TEST_BIGQUERY_DATASET", "test_ci")
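A quick sketch of how an env-var-with-default helper like this feeds a connection string. The `bigquery://<project>/<dataset>` URL shape and the "my-project" id are illustrative assumptions, not taken from this PR:

```python
import os

# Same helper shape as above: env var wins, "test_ci" is the fallback.
def _bigquery_dataset() -> str:
    return os.getenv("GE_TEST_BIGQUERY_DATASET", "test_ci")

os.environ.pop("GE_TEST_BIGQUERY_DATASET", None)  # ensure the default applies
url = f"bigquery://my-project/{_bigquery_dataset()}"  # assumed URL shape
print(url)
```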
Just confirming that the default value "test_ci" is not equivalent to the env var?
Shoot, I missed this one (yes, it's equivalent to the env var). I'll take it out before merging.
This PR is a response to #3122, in collaboration with @jdimatteo. We attempt to increase test coverage of Great Expectations with BigQuery by creating a CI/CD pipeline that runs Expectations tests (both CFE and non-CFE expectations) directly against a BigQuery backend.
In the process we have added a number of bugfixes, but a few tests still fail against BigQuery while passing against other database backends. Fixes for those Expectations will be part of upcoming PRs.
Relevant Issues and JIRA items
What has changed?
Added azure-pipelines-cloud-db-integration.yml. V2 tests (test_expectations.py) and V3 tests (test_expectations_cfe.py) are run as separate stages in the pipeline. While building the pipeline, we found that BigQuery was hitting a rate limit for loading data into our database, caused by the fixture parameterization in Expectations tests. By separating the tests into separate stages, we minimize the number of concurrent requests.
Tests run against the test_ci dataset in Superconductive's internal BigQuery project. New tables currently have a table expiration of 0.1 day, so temp tables do not accumulate.
Added a --bigquery flag to pytest (in conftest.py and test_util.py), so the tests can be run with:
pytest tests/test_definitions/test_expectations_cfe.py --bigquery --no-spark --no-postgresql
# and
pytest tests/test_definitions/test_expectations.py --bigquery --no-spark --no-postgresql
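For reference, BigQuery's dataset-level default table expiration is expressed in milliseconds, so a 0.1-day expiration works out as below. The google-cloud-bigquery calls are shown only in comments (they need credentials and a real project), and the "project.test_ci" id is a placeholder:

```python
# 0.1 day expressed in milliseconds, the unit BigQuery's
# default_table_expiration_ms dataset property expects.
expiration_ms = int(0.1 * 24 * 60 * 60 * 1000)
print(expiration_ms)

# With google-cloud-bigquery (requires credentials; not run here):
# dataset = client.get_dataset("project.test_ci")
# dataset.default_table_expiration_ms = expiration_ms
# client.update_dataset(dataset, ["default_table_expiration_ms"])
```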
Changes to Documentation
Equivalent content has been added to
docs
(new Docusaurus site) anddocs_rtd
(legacy Docs site) for the time being. The content written by @JDMatteo describes how users can set up a google platform project and dataset and get Expectation tests to run throughpytest
.docs/contributing/contributing_test.md
docs_rtd/contributing/testing.rst
Changes to Code:
Fixed an AttributeError raised when an object does not have dialect as an attribute. This happens because for pybigquery.sqlalchemy_bigquery, the entire BigQueryDialect module is registered as a dialect, unlike other backends where we only register [module].dialect.
Definition of Done
Thank you for submitting!