
[FEATURE] Enable Expectations tests for BigQuery #3219

Merged

Conversation

Shinnnyshinshin
Contributor

@Shinnnyshinshin Shinnnyshinshin commented Aug 11, 2021

This PR is a response to #3122 in collaboration with @jdimatteo. We attempt to increase test coverage of Great Expectations with BigQuery by creating a CI/CD pipeline that runs Expectation tests (both CFE and non-CFE Expectations) directly against a BigQuery backend.

In the process we have added a number of bugfixes, but a small number of tests still fail against BigQuery (and not against other database backends). Fixes for those Expectations will be part of upcoming PRs.

Relevant Issues and JIRA items

What has changed?

Added azure-pipelines-cloud-db-integration.yml

  • Currently the pipeline is explicitly not part of CI/CD or PR validation. It can be run manually, and is currently scheduled as a cron job that runs weekly (every Sunday).
  • V2 Expectation tests (test_expectations.py) and V3 tests (test_expectations_cfe.py) run as separate stages in the pipeline. While building the pipeline, we found that BigQuery was hitting a rate limit for loading data into our database, caused by the fixture parameterization in Expectation tests. Separating the tests into stages minimizes the number of concurrent requests.
  • Tests run against the test_ci dataset in Superconductive's internal BigQuery project. New tables currently expire after 0.1 day (~2.4 hours), so temp tables do not accumulate.
  • The pipeline runs in ~2 hours, with V2 Expectation tests taking longer (~1.5 hr) than V3 CFE Expectation tests (~20 min).
  • Note: the credentials are loaded using an Azure Secure file.
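The "manual plus weekly cron" behavior described above can be sketched in azure-pipelines-cloud-db-integration.yml roughly as follows; the display name and branch are illustrative assumptions, not copied from the PR:

```yaml
# Sketch only: keep the pipeline out of CI/PR validation and run it weekly.
trigger: none   # no runs on pushes
pr: none        # no runs on pull requests

schedules:
  - cron: "0 0 * * 0"          # every Sunday at 00:00 UTC
    displayName: Weekly cloud-db Expectation tests   # assumed name
    branches:
      include:
        - develop              # assumed branch
    always: true               # run even if there are no new commits
```

The pipeline itself can still be queued manually from the Azure DevOps UI.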

Added --bigquery flag to pytest

  • This resulted in changes to conftest.py and test_util.py
  • Tests for the BigQuery backend can now be run using either:
pytest tests/test_definitions/test_expectations_cfe.py --bigquery --no-spark --no-postgresql

# and

pytest tests/test_definitions/test_expectations.py --bigquery --no-spark --no-postgresql
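A minimal sketch of how a backend flag like --bigquery can be registered in conftest.py via pytest's pytest_addoption hook. Only the flag names come from the PR; the helper backend_enabled is a hypothetical illustration, not the PR's actual code:

```python
# Sketch of wiring backend flags into pytest; everything except the flag
# names themselves is illustrative.

def pytest_addoption(parser):
    # pytest calls this hook at startup; each addoption adds a CLI flag.
    parser.addoption(
        "--bigquery",
        action="store_true",
        default=False,
        help="Run Expectation tests against a BigQuery backend",
    )
    parser.addoption(
        "--no-spark",
        action="store_true",
        default=False,
        help="Skip tests that require a Spark backend",
    )


def backend_enabled(config, flag):
    # Hypothetical helper: True when the given backend flag was passed.
    return bool(config.getoption(flag))
```

Test collection code can then consult backend_enabled(config, "--bigquery") to decide whether BigQuery-backed fixtures should be generated.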

Changes to Documentation

Equivalent content has been added to docs (new Docusaurus site) and docs_rtd (legacy docs site) for the time being. The content, written by @jdimatteo, describes how users can set up a Google Cloud Platform project and dataset and run Expectation tests through pytest.

  • docs/contributing/contributing_test.md
  • docs_rtd/contributing/testing.rst

Changes to Code

  • A number of bugfixes related to an AttributeError raised when an object does not have dialect as an attribute. This happens because for pybigquery.sqlalchemy_bigquery, the entire module is registered as a dialect, unlike other backends where we only register [module].dialect.
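The pattern behind these bugfixes can be illustrated with a small, self-contained sketch. The fake module below stands in for pybigquery.sqlalchemy_bigquery, and resolve_dialect is a hypothetical helper, not the PR's actual code:

```python
import types


def resolve_dialect(module, fallback_class_name):
    """Return module.dialect, patching it from a named class when the
    driver does not follow the [module].dialect convention."""
    if not hasattr(module, "dialect"):
        setattr(module, "dialect", getattr(module, fallback_class_name))
    return module.dialect


# Stand-in for pybigquery.sqlalchemy_bigquery, which (at the time of this
# PR) exposed a BigQueryDialect class but no "dialect" attribute:
fake_driver = types.SimpleNamespace(
    BigQueryDialect=type("BigQueryDialect", (), {})
)
dialect = resolve_dialect(fake_driver, "BigQueryDialect")
```

Callers that assume the [module].dialect convention then work unchanged, whether or not the driver exports the attribute itself.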

Definition of Done

  • My code follows the Great Expectations style guide
  • I have performed a self-review of my own code
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added unit tests where applicable and made sure that new and existing tests are passing.
  • I have run any local integration tests and made sure that nothing is broken.

Thank you for submitting!

Shinnnyshinshin and others added 30 commits July 29, 2021 23:17
#3158)

* Merged in my work from #3135:

1. added documentation
2. using helper function _create_bigquery_engine with gcp project environment variable
3. added bigquery specific list in candidate_test_is_on_temporary_notimplemented_list_cfe
4. removed bigquery specific test files, so we can just use the already existing ones (along with bigquery specific notimplemented list per #3)

* enabled the expect_column_values_to_be_unique test, which passes now with Will's changes
Add content on table expiration
* develop:
  [DOCS] Apply Docusaurus tabs to relevant pages in new docs
  [FEATURE] Implement MulticolumnMapExpectation class (#3134)
  [FEATURE] add python3.9 to python versions (#3143)
  [FEATURE]/MER-16/MER-75/add_route_for_validation_result (#3090)
  [BUGFIX] Enable `--v3-api suite edit` to proceed without selecting DataConnectors (#3165)
  Error messages must be friendly. (#3171)
  [FEATURE] Implement the MulticolumnSumEqual Metric for PandasExecutionEngine (#3130)
  [BUGFIX] allow reader_options in the CLI so can read `.csv.gz` files (#2695)
  [BUGFIX] Fix error when `RuntimeBatchRequest` is passed to `SimpleCheckpoint` with `RuntimeDataConnector` (#3152)
  [FEATURE] Enable BigQuery tests for Azure CI/CD (#3155)
  [MAINTENANCE] Improve Coding Practices in "great_expectations/expectations/expectation.py" (#3151)
  [BUGFIX] Snowflake connections are closed correctly by DOCS tests  (#3104)
  PMrelease-prep-20201-07-29 (#3144)
  [MAINTENANCE] Adhere to formalism: Use the defined Enum types for domain_type specification, and string key names for domain keys (#3146)
This reverts commit ce7ff46.
Shinnnyshinshin and others added 8 commits August 16, 2021 22:50
…ntion and dataset cleanup (#3259)

* remove duplicate inconsistent logic around bigquery dataset
* consistently provide a default value and document how to override
* Make BigQuery SqlAlchemy dialect conform to "dialect" attribute convention
…r-expectations

* develop:
  [BUGFIX] Fix deprecation warning for importing from collections (#3228)
  [FEATURE] Add schema validation for different GCS auth methods (#3258)
Member

@cdkini cdkini left a comment


This all looks really, really great! As long as we've documented which Expectations need work and what issues are still pending, I think this is in a great spot to merge.

Very impressive work @Shinnnyshinshin I know you've been at this for quite some time so all the effort is recognized and appreciated 🎉 🚀

Comment on lines +96 to +105
###
# NOTE: 20210816 - jdimatteo: A convention we rely on is for SqlAlchemy dialects
# to define an attribute "dialect". A PR has been submitted to fix this upstream
# with https://github.com/googleapis/python-bigquery-sqlalchemy/pull/251. If that
# fix isn't present, add this "dialect" attribute here:
if not hasattr(pybigquery.sqlalchemy_bigquery, "dialect"):
    pybigquery.sqlalchemy_bigquery.dialect = (
        pybigquery.sqlalchemy_bigquery.BigQueryDialect
    )

Member


If this hasn't yet been added as a note in the Jira ticket, could you please do so? 😄

Excited to see what the resolution is here. Let's also make an internal reminder to clean up this note upon the PR's resolution.

Contributor Author


Thank you :) It's GREAT-173


Comment on lines 106 to 107
# Sometimes "pybigquery.sqlalchemy_bigquery" fails to self-register in certain environments, so we do it explicitly.
# (see https://stackoverflow.com/questions/53284762/nosuchmoduleerror-cant-load-plugin-sqlalchemy-dialectssnowflake)
Member


Do we happen to have a reason why? If so, could we add it to this note?

Contributor Author


It appears to be something specific to how certain sqlalchemy packages are installed in Azure. I'll add a note :)

Comment on lines 108 to 143
    sqlalchemy.dialects.registry.register(
        "bigquery", "pybigquery.sqlalchemy_bigquery", "BigQueryDialect"
    )
    try:
        getattr(pybigquery.sqlalchemy_bigquery, "INTEGER")
        bigquery_types_tuple = {}
        BIGQUERY_TYPES = {
            "INTEGER": pybigquery.sqlalchemy_bigquery.INTEGER,
            "NUMERIC": pybigquery.sqlalchemy_bigquery.NUMERIC,
            "STRING": pybigquery.sqlalchemy_bigquery.STRING,
            "BIGNUMERIC": pybigquery.sqlalchemy_bigquery.BIGNUMERIC,
            "BYTES": pybigquery.sqlalchemy_bigquery.BYTES,
            "BOOL": pybigquery.sqlalchemy_bigquery.BOOL,
            "BOOLEAN": pybigquery.sqlalchemy_bigquery.BOOLEAN,
            "TIMESTAMP": pybigquery.sqlalchemy_bigquery.TIMESTAMP,
            "TIME": pybigquery.sqlalchemy_bigquery.TIME,
            "FLOAT": pybigquery.sqlalchemy_bigquery.FLOAT,
            "DATE": pybigquery.sqlalchemy_bigquery.DATE,
            "DATETIME": pybigquery.sqlalchemy_bigquery.DATETIME,
        }
    except AttributeError:
        # In older versions of the pybigquery driver, types were not exported,
        # so we use a hack
        logger.warning(
            "Old pybigquery driver version detected. Consider upgrading to 0.4.14 or later."
        )
        from collections import namedtuple

        BigQueryTypes = namedtuple(
            "BigQueryTypes", sorted(pybigquery.sqlalchemy_bigquery._type_map)
        )
        bigquery_types_tuple = BigQueryTypes(**pybigquery.sqlalchemy_bigquery._type_map)
except (ImportError, AttributeError):
    bigquery_types_tuple = None
    BigQueryDialect = None
    pybigquery = None

Member


If John signed off on this, it is alright by me. I'm always wary of try/except but it is more of an accepted pattern in Python. If this still bugs you, maybe put a ticket in the backlog? Definitely not necessary though.

Comment on lines +1196 to +1201
"expect_column_values_to_be_between", # unique to bigquery -- https://github.com/great-expectations/great_expectations/issues/3261
"expect_column_values_to_be_of_type", # unique to bigquery -- https://github.com/great-expectations/great_expectations/issues/3261
"expect_column_values_to_be_in_set", # unique to bigquery -- https://github.com/great-expectations/great_expectations/issues/3261
"expect_column_values_to_be_in_type_list", # unique to bigquery -- https://github.com/great-expectations/great_expectations/issues/3261
"expect_column_values_to_match_like_pattern_list", # unique to bigquery -- https://github.com/great-expectations/great_expectations/issues/3261
"expect_column_values_to_not_match_like_pattern_list", # unique to bigquery -- https://github.com/great-expectations/great_expectations/issues/3261
Member


Are these also noted in Jira/Confluence? If not, would you mind adding it there for improved visibility?

Contributor Author


GREAT-166 is the one for these

Comment on lines +1257 to +1262
"expect_column_kl_divergence_to_be_less_than", # TODO: Takes over 64 minutes to "collect" (haven't actually seen it complete yet) -- https://github.com/great-expectations/great_expectations/issues/3260
"expect_column_values_to_be_in_set", # TODO: No matching signature for operator and AssertionError: expected ['2018-01-01T00:00:00'] but got ['2018-01-01'] -- https://github.com/great-expectations/great_expectations/issues/3260
"expect_column_values_to_be_in_type_list", # TODO: AssertionError -- https://github.com/great-expectations/great_expectations/issues/3260
"expect_column_values_to_be_between", # TODO: "400 No matching signature for operator >=" -- https://github.com/great-expectations/great_expectations/issues/3260
"expect_column_quantile_values_to_be_between", # TODO: takes over 15 minutes to "collect" (haven't actually seen it complete yet) -- https://github.com/great-expectations/great_expectations/issues/3260
"expect_column_mean_to_be_between", # TODO: "400 No matching signature for operator *" -- https://github.com/great-expectations/great_expectations/issues/3260
Member


Same with these items

Contributor Author


and GREAT-159 for these



def _bigquery_dataset() -> str:
    return os.getenv("GE_TEST_BIGQUERY_DATASET", "test_ci")
Member


Just confirming that the default value "test_ci" is not equivalent to the env var?

Contributor Author


Shoot, I missed this one (yes, it's equivalent to the env var). I'll take it out before merging.

@Shinnnyshinshin Shinnnyshinshin enabled auto-merge (squash) August 18, 2021 00:16
@Shinnnyshinshin Shinnnyshinshin merged commit 2c89de6 into develop Aug 18, 2021
@Shinnnyshinshin Shinnnyshinshin deleted the feature/GREAT-66/enable-bigquery-tests-for-expectations branch August 18, 2021 00:47