Merge branch 'feature/add_performance_test' of https://github.com/jdimatteo/great_expectations into feature/add_performance_test

* 'feature/add_performance_test' of https://github.com/jdimatteo/great_expectations:
  [MAINTENANCE] Tests for RuntimeDataConnector at Datasource-level (Spark and Pandas) (great-expectations#3318)
  [MAINTENANCE] Tests for RuntimeDataConnector at DataContext-level (great-expectations#3304)
  [BUGFIX] SQL dialect doesn't register for BigQuery for V2 (great-expectations#3324)
  [WIP] [FEATURE] add backend args to run_diagnostics (great-expectations#3257)
  Release Prep release-prep-2021-08-26 (great-expectations#3320)
  [DOCS] hide stubbed core skills (great-expectations#3316)
  [MAINTENANCE] Write integration/E2E tests for both `GCSDataConnectors` (great-expectations#3301)
  [DOCS] Standardize capitalization of various technologies in `docs` (great-expectations#3312)
  [DOCS] Fix misc errors in "How to create renderers for Custom Expectations" (great-expectations#3315)
  docs: Remove misc TODOs to tidy up docs (great-expectations#3313)
  [DOCS] GDOC-217 remove stub links (great-expectations#3314)
  [FEATURE] Enable `GCS DataConnector` integration with `PandasExecutionEngine` (great-expectations#3264)
Shinnnyshinshin committed Aug 31, 2021
2 parents 23f3c4d + f3d430d commit 97d2dd1
Showing 100 changed files with 32,727 additions and 631 deletions.
@@ -1,4 +1,6 @@
import json
import logging
import traceback
from typing import Any, Dict, Optional, Tuple

import numpy as np
@@ -26,11 +28,13 @@
ColumnMetricProvider,
column_aggregate_value,
)
from great_expectations.expectations.metrics.import_manager import F, sa
from great_expectations.expectations.metrics.metric_provider import (
MetricProvider,
metric_value,
from great_expectations.expectations.metrics.column_aggregate_metric_provider import (
ColumnAggregateMetricProvider,
column_aggregate_partial,
column_aggregate_value,
)
from great_expectations.expectations.metrics.import_manager import F, sa
from great_expectations.expectations.metrics.metric_provider import metric_value
from great_expectations.expectations.util import render_evaluation_parameter_string
from great_expectations.render.renderer.renderer import renderer
from great_expectations.render.types import RenderedStringTemplateContent
@@ -42,6 +46,32 @@
)
from great_expectations.validator.validation_graph import MetricConfiguration

logger = logging.getLogger(__name__)

try:
from sqlalchemy.exc import ProgrammingError
from sqlalchemy.sql import Select
except ImportError:
logger.debug(
"Unable to load SqlAlchemy context; install optional sqlalchemy dependency for support"
)
ProgrammingError = None
Select = None

try:
from sqlalchemy.engine.row import Row
except ImportError:
try:
from sqlalchemy.engine.row import RowProxy

Row = RowProxy
except ImportError:
logger.debug(
            "Unable to load SqlAlchemy Row class; please upgrade your sqlalchemy installation to the latest version."
)
RowProxy = None
Row = None


class ColumnSkew(ColumnMetricProvider):
    """MetricProvider class for the Column Skew metric"""
@@ -55,34 +85,80 @@ def _pandas(cls, column, abs=False, **kwargs):
return np.abs(stats.skew(column))
return stats.skew(column)

#
# @metric_value(engine=SqlAlchemyExecutionEngine, metric_fn_type="value")
# def _sqlalchemy(
# cls,
# execution_engine: "SqlAlchemyExecutionEngine",
# metric_domain_kwargs: Dict,
# metric_value_kwargs: Dict,
# metrics: Dict[Tuple, Any],
# runtime_configuration: Dict,
# ):
# (
# selectable,
# compute_domain_kwargs,
# accessor_domain_kwargs,
# ) = execution_engine.get_compute_domain(
# metric_domain_kwargs, MetricDomainTypes.COLUMN
# )
# column_name = accessor_domain_kwargs["column"]
# column = sa.column(column_name)
# sqlalchemy_engine = execution_engine.engine
# dialect = sqlalchemy_engine.dialect
#
# column_median = None
#
# # TODO: compute the value and return it
#
# return column_median
#
@metric_value(engine=SqlAlchemyExecutionEngine)
def _sqlalchemy(
cls,
execution_engine: "SqlAlchemyExecutionEngine",
metric_domain_kwargs: Dict,
metric_value_kwargs: Dict,
metrics: Dict[Tuple, Any],
runtime_configuration: Dict,
):
(
selectable,
compute_domain_kwargs,
accessor_domain_kwargs,
) = execution_engine.get_compute_domain(
metric_domain_kwargs, MetricDomainTypes.COLUMN
)

column_name = accessor_domain_kwargs["column"]
column = sa.column(column_name)
sqlalchemy_engine = execution_engine.engine
dialect = sqlalchemy_engine.dialect

column_mean = _get_query_result(
func=sa.func.avg(column * 1.0),
selectable=selectable,
sqlalchemy_engine=sqlalchemy_engine,
)

column_count = _get_query_result(
func=sa.func.count(column),
selectable=selectable,
sqlalchemy_engine=sqlalchemy_engine,
)

if dialect.name.lower() == "mssql":
standard_deviation = sa.func.stdev(column)
else:
standard_deviation = sa.func.stddev_samp(column)

column_std = _get_query_result(
func=standard_deviation,
selectable=selectable,
sqlalchemy_engine=sqlalchemy_engine,
)

column_third_moment = _get_query_result(
func=sa.func.sum(sa.func.pow(column - column_mean, 3)),
selectable=selectable,
sqlalchemy_engine=sqlalchemy_engine,
)

column_skew = column_third_moment / (column_std ** 3) / (column_count - 1)
if metric_value_kwargs["abs"]:
return np.abs(column_skew)
else:
return column_skew
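The SQL path above computes sample skewness as `sum((x - mean)^3) / stddev_samp^3 / (n - 1)`, while the pandas path uses `scipy.stats.skew`. A standalone sketch (assuming NumPy and SciPy are installed) showing that the two estimators agree for large samples:

```python
# Standalone sketch: compare the SQL-style sample skewness against
# scipy.stats.skew, which the pandas implementation uses.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
x = rng.exponential(size=50_000)  # a clearly right-skewed sample

n = x.size
mean = x.mean()
std_samp = x.std(ddof=1)  # sample standard deviation, i.e. stddev_samp
third_moment = np.sum((x - mean) ** 3)
skew_sql = third_moment / std_samp**3 / (n - 1)

skew_scipy = stats.skew(x)  # biased Fisher-Pearson coefficient

# The two estimators differ only by a factor of sqrt((n - 1) / n),
# which is negligible for large n.
```

For the exponential distribution the true skewness is 2, so both estimates land near 2; the `tolerance` keys added to the test examples in this diff absorb exactly this kind of small estimator discrepancy across backends.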


def _get_query_result(func, selectable, sqlalchemy_engine):
simple_query: Select = sa.select(func).select_from(selectable)

try:
result: Row = sqlalchemy_engine.execute(simple_query).fetchone()[0]
return result
except ProgrammingError as pe:
exception_message: str = "An SQL syntax Exception occurred."
exception_traceback: str = traceback.format_exc()
exception_message += (
f'{type(pe).__name__}: "{str(pe)}". Traceback: "{exception_traceback}".'
)
logger.error(exception_message)
        raise pe
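`_get_query_result` issues a single-aggregate `SELECT` and unwraps the scalar result. A minimal standalone sketch of that pattern, assuming SQLAlchemy 1.4+ and using an in-memory SQLite table in place of `selectable`:

```python
import sqlalchemy as sa

# In-memory SQLite table standing in for the `selectable` the
# execution engine would normally provide.
engine = sa.create_engine("sqlite://")
with engine.begin() as conn:
    conn.execute(sa.text("CREATE TABLE t (a FLOAT)"))
    conn.execute(sa.text("INSERT INTO t VALUES (1.0), (2.0), (4.0)"))

# Mirrors sa.select(func).select_from(selectable) in _get_query_result.
selectable = sa.table("t", sa.column("a"))
query = sa.select(sa.func.avg(sa.column("a") * 1.0)).select_from(selectable)

with engine.connect() as conn:
    column_mean = conn.execute(query).fetchone()[0]  # (1 + 2 + 4) / 3
```

Note that `_get_query_result` executes directly on the `Engine`, a 1.x-style API; the sketch goes through an explicit `Connection`, which also works on SQLAlchemy 2.0.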

#
# @metric_value(engine=SparkDFExecutionEngine, metric_fn_type="value")
# def _spark(
# cls,
@@ -229,27 +305,31 @@ class ExpectColumnSkewToBeBetween(ColumnExpectation):
"title": "positive_test_positive_skew",
"exact_match_out": False,
"include_in_gallery": True,
"tolerance": 0.1,
"in": {"column": "a", "min_value": 0.25, "max_value": 10},
"out": {"success": True, "observed_value": 1.6974323016687487},
},
{
"title": "negative_test_no_skew",
"exact_match_out": False,
"include_in_gallery": True,
"tolerance": 0.1,
"in": {"column": "b", "min_value": 0.25, "max_value": 10},
"out": {"success": False, "observed_value": -0.07638895580386174},
},
{
"title": "positive_test_negative_skew",
"exact_match_out": False,
"include_in_gallery": True,
"tolerance": 0.1,
"in": {"column": "c", "min_value": -10, "max_value": -0.5},
"out": {"success": True, "observed_value": -0.9979514313860596},
},
{
"title": "negative_test_abs_skew",
"exact_match_out": False,
"include_in_gallery": True,
"tolerance": 0.1,
"in": {
"column": "c",
"abs": True,
@@ -262,6 +342,7 @@ class ExpectColumnSkewToBeBetween(ColumnExpectation):
"title": "positive_test_abs_skew",
"exact_match_out": False,
"include_in_gallery": True,
"tolerance": 0.1,
"in": {
"column": "c",
"abs": True,
Expand All @@ -271,7 +352,17 @@ class ExpectColumnSkewToBeBetween(ColumnExpectation):
"out": {"success": True, "observed_value": 0.9979514313860596},
},
],
},
"test_backends": [
{
"backend": "pandas",
"dialects": None,
},
{
"backend": "sqlalchemy",
"dialects": ["mysql", "postgresql"],
},
],
}
]

# This dictionary contains metadata for display in the public gallery
@@ -401,4 +492,5 @@ def _validate(

if __name__ == "__main__":
self_check_report = ExpectColumnSkewToBeBetween().run_diagnostics()

print(json.dumps(self_check_report, indent=2))
10 changes: 5 additions & 5 deletions docs/README.md
@@ -51,12 +51,12 @@ If you are using GitHub pages for hosting, this command is a convenient way to b

## Other relevant files

The following are a few details about other files docusaurus uses that you may wish to be familiar with.
The following are a few details about other files Docusaurus uses that you may wish to be familiar with.

- `../sidebars.js`: javascript that specifies the sidebar/navigation used in docs pages
- `../sidebars.js`: JavaScript that specifies the sidebar/navigation used in docs pages
- `../src`: non-docs pages live here
- `../static`: static assets used in docs pages (such as css) live here
- `../docusaurus.config.js`: the configuration file for docusaurus
- `../babel.config.js`: babel config file used when building
- `../static`: static assets used in docs pages (such as CSS) live here
- `../docusaurus.config.js`: the configuration file for Docusaurus
- `../babel.config.js`: Babel config file used when building
- `../package.json`: dependencies and scripts
- `../yarn.lock`: dependency lock file that ensures reproducibility
17 changes: 17 additions & 0 deletions docs/changelog.md
@@ -3,6 +3,23 @@ title: Changelog
---

### Develop
* [FEATURE] Add "test_backends" key to Expectation.examples for specifying test backends and dialects (#3257)


### 0.13.31
* [FEATURE] Enable `GCS DataConnector` integration with `PandasExecutionEngine` (#3264)
* [FEATURE] Enable column_pair expectations and tests for Spark (#3294)
* [FEATURE] Implement `InferredAssetGCSDataConnector` (#3284)
* [FEATURE]/CHANGE run time format (#3272) (Thanks @serialbandicoot)
* [DOCS] Fix misc errors in "How to create renderers for Custom Expectations" (#3315)
* [DOCS] GDOC-217 remove stub links (#3314)
* [DOCS] Remove misc TODOs to tidy up docs (#3313)
* [DOCS] Standardize capitalization of various technologies in `docs` (#3312)
* [DOCS] Fix broken link to Contributor docs (#3295) (Thanks @discdiver)
* [MAINTENANCE] Additional tests for RuntimeDataConnector at Datasource-level (query) (#3288)
* [MAINTENANCE] Update GCSStoreBackend + tests (#2630) (Thanks @hmandsager)
* [MAINTENANCE] Write integration/E2E tests for `ConfiguredAssetAzureDataConnector` (#3204)
* [MAINTENANCE] Write integration/E2E tests for both `GCSDataConnectors` (#3301)

### 0.13.30
* [FEATURE] Implement Spark Decorators and Helpers; Demonstrate on MulticolumnSumEqual Metric (#3289)
Expand Down
4 changes: 2 additions & 2 deletions docs/contributing/contributing_checklist.md
@@ -32,15 +32,15 @@ Once your code is ready, please go through the following checklist before submit

* You can also rebase your branch from upstream/develop. In general, the steps are:

* Run git fetch upstream then git rebase upstream/develop.
	* Run `git fetch upstream` then `git rebase upstream/develop`.

* Fix any merge conflicts that arise from the rebase.

* Make sure to add and commit all your changes in this step.

* Re-run tests to ensure the rebase did not introduce any new issues.

* Atlassian and Github both have good tutorials for rebasing: [Atlassian’s tutorial](https://www.atlassian.com/git/tutorials/git-forks-and-upstreams), [Github’s tutorial](https://help.github.com/en/github/collaborating-with-issues-and-pull-requests/syncing-a-fork).
* Atlassian and GitHub both have good tutorials for rebasing: [Atlassian’s tutorial](https://www.atlassian.com/git/tutorials/git-forks-and-upstreams), [GitHub’s tutorial](https://help.github.com/en/github/collaborating-with-issues-and-pull-requests/syncing-a-fork).

#### 4. Have you written and run all the tests you need?

8 changes: 4 additions & 4 deletions docs/contributing/contributing_misc.md
@@ -19,7 +19,7 @@ GE core team members use this checklist to ship releases.

1. If this is a major release (incrementing either the first or second version number) the manual acceptance testing must be completed.

* This [private google doc](https://docs.google.com/document/d/16QJPSCawEkwuEjShZeHa01TlQm9nbUwS6GwmFewJ3EY) outlines the procedure. (Note this will be made public eventually)
* This [private Google Doc](https://docs.google.com/document/d/16QJPSCawEkwuEjShZeHa01TlQm9nbUwS6GwmFewJ3EY) outlines the procedure. (Note this will be made public eventually)

2. Merge all approved PRs into `develop`.

@@ -48,17 +48,17 @@ GE core team members use this checklist to ship releases.
11. Check [PyPI](https://pypi.org/project/great-expectations/#history) for the new release


12. Create an annotated git tag:
12. Create an annotated Git tag:

* Run `git tag -a $VERSION -m $VERSION` with the correct new version.

* Push the tag up by running `git push origin $VERSION` with the correct new version.

* Merge main into develop so that the tagged commit becomes part of the history for develop: git checkout develop; git pull; git merge main
* Merge main into develop so that the tagged commit becomes part of the history for develop: `git checkout develop; git pull; git merge main`

* On develop, add a new “Develop” section header to changelog.md, and push the updated file with message “Update changelog for develop”

13. [Create the release on GitHub](https://github.com/great-expectations/great_expectations/releases) with the version number. Copy the changelog notes into the release notes, and update any rst-specific links to use github issue numbers.
13. [Create the release on GitHub](https://github.com/great-expectations/great_expectations/releases) with the version number. Copy the changelog notes into the release notes, and update any rst-specific links to use GitHub issue numbers.

* The deploy step will automatically create a draft for the release.

14 changes: 7 additions & 7 deletions docs/contributing/contributing_setup.md
@@ -65,7 +65,7 @@ This is not required, but highly recommended.

* This will ensure that sure you have the right libraries installed in your Python environment.

* Note that you can also substitute requirements-dev-test.txt to only install requirements required for testing all backends, and requirements-dev-spark.txt or requirements-dev-sqlalchemy.txt if you would like to add support for Spark or sqlalchemy tests, respectively. For some database backends, such as MSSQL additional driver installation may required in your environment; see below for more information.
* Note that you can also substitute requirements-dev-test.txt to only install requirements required for testing all backends, and requirements-dev-spark.txt or requirements-dev-sqlalchemy.txt if you would like to add support for Spark or SQLAlchemy tests, respectively. For some database backends, such as MSSQL, additional driver installation may be required in your environment; see below for more information.

* [Installing Microsoft ODBC driver for MacOS](https://docs.microsoft.com/en-us/sql/connect/odbc/linux-mac/install-microsoft-odbc-driver-sql-server-macos)

@@ -78,11 +78,11 @@ This is not required, but highly recommended.
* `-e` will install Great Expectations in “editable” mode. This is not required, but is often very convenient as a developer.

### (Optional) Configure resources for testing and documentation
Depending on which features of Great Expectations you want to work on, you may want to configure different backends for local testing, such as postgresql and Spark. Also, there are a couple of extra steps if you want to build documentation locally.
Depending on which features of Great Expectations you want to work on, you may want to configure different backends for local testing, such as PostgreSQL and Spark. Also, there are a couple of extra steps if you want to build documentation locally.

#### If you want to develop against local postgresql:
#### If you want to develop against local PostgreSQL:

* To simplify setup, the repository includes a `docker-compose` file that can stand up a local postgresql container. To use it, you’ll need to have [docker installed](https://docs.docker.com/install/).
* To simplify setup, the repository includes a `docker-compose` file that can stand up a local PostgreSQL container. To use it, you’ll need to have [Docker installed](https://docs.docker.com/install/).

* Navigate to `assets/docker/postgresql` in your `great_expectations` repo and run `docker-compose up -d`

@@ -96,7 +96,7 @@ Depending on which features of Great Expectations you want to work on, you may w

* Once you’re done testing, you can shut down your PostgreSQL container by running `docker-compose down` from the same directory.

* Caution: If another service is using port 5432, docker may start the container but silently fail to set up the port. In that case, you will probably see errors like this:
* Caution: If another service is using port 5432, Docker may start the container but silently fail to set up the port. In that case, you will probably see errors like this:

````console
psycopg2.OperationalError: could not connect to server: Connection refused
@@ -116,7 +116,7 @@ Depending on which features of Great Expectations you want to work on, you may w

#### If you want to develop against local mysql:

* To simplify setup, the repository includes a `docker-compose` file that can stand up a local mysqldb container. To use it, you’ll need to have [docker installed](https://docs.docker.com/install/).
* To simplify setup, the repository includes a `docker-compose` file that can stand up a local mysqldb container. To use it, you’ll need to have [Docker installed](https://docs.docker.com/install/).

* Navigate to `assets/docker/mysql` in your `great_expectations` repo and run `docker-compose up -d`

@@ -130,7 +130,7 @@ Depending on which features of Great Expectations you want to work on, you may w

* Once you’re done testing, you can shut down your mysql container by running `docker-compose down` from the same directory.

* Caution: If another service is using port 3306, docker may start the container but silently fail to set up the port.
* Caution: If another service is using port 3306, Docker may start the container but silently fail to set up the port.
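When a container starts but the port mapping silently fails, a quick way to tell is to probe the port directly. A minimal sketch (the helper name is illustrative, not part of the repo) that checks whether anything is listening on a backend's default port — 5432 for PostgreSQL, 3306 for MySQL:

```python
import socket

def port_open(host: str = "localhost", port: int = 5432, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, or unreachable
        return False

# e.g. port_open(port=5432) for PostgreSQL, port_open(port=3306) for MySQL
```

If this returns `False` right after `docker-compose up -d`, the container is likely running without the port actually exposed, which matches the errors described above.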

#### If you want to develop against local Spark:

2 changes: 1 addition & 1 deletion docs/contributing/contributing_style.md
@@ -95,7 +95,7 @@ Within the table of contents, each section has specific role to play. Broadly sp

* **Tutorials** help users and contributors get started quickly. Along the way they orient new users to concepts that will be important to know later.

* **How-to guides** help users accomplish specific goals that go beyond the generic tutorials. Article titles within this section always start with “How to”: “How to create custom Expectations”. They often reference specific tools or infrastructure: “How to validate Expectations from within a notebook”, “How to build data docs in S3.”
* **How-to guides** help users accomplish specific goals that go beyond the generic tutorials. Article titles within this section always start with “How to”: “How to create custom Expectations”. They often reference specific tools or infrastructure: “How to validate Expectations from within a notebook”, “How to build Data Docs in S3.”

* **Reference** articles explain the architecture of Great Expectations. These articles explain core concepts, discuss alternatives and options, and provide context, history, and direction for the project. Reference articles avoid giving specific technical advice. They also avoid implementation details that can be captured in docstrings instead.

