
Commit

Merge branch 'main' into 0.17.x
# Conflicts:
#	VERSION
noamzbr committed Jun 14, 2023
2 parents e1682c1 + 62a73ae commit f5c10d3
Showing 36 changed files with 324 additions and 264 deletions.
80 changes: 44 additions & 36 deletions README.md
@@ -35,11 +35,11 @@ Deepchecks is a holistic open-source solution for all of your AI & ML validation
enabling you to thoroughly test your data and models from research to production.


<a target="_blank" href="https://deepchecks.com/?utm_source=github.com&utm_medium=referral&utm_campaign=readme&utm_content=logo">
<a target="_blank" href="https://docs.deepchecks.com/?utm_source=github.com&utm_medium=referral&utm_campaign=readme&utm_content=logo">
<picture>
<source media="(prefers-color-scheme: dark)" srcset="docs/source/_static/images/readme/cont_validation_dark.png">
<source media="(prefers-color-scheme: light)" srcset="docs/source/_static/images/readme/cont_validation_light.png">
<img alt="Deepchecks continuous validation parts." src="docs/source/_static/images//readme/cont_validation_light.png">
<source media="(prefers-color-scheme: dark)" srcset="docs/source/_static/images/readme/deepchecks_continuous_validation_dark.png">
<source media="(prefers-color-scheme: light)" srcset="docs/source/_static/images/readme/deepchecks_continuous_validation_light.png">
<img alt="Deepchecks continuous validation parts." src="docs/source/_static/images//readme/deepchecks_continuous_validation_light.png">
</picture>
</a>

@@ -56,29 +56,6 @@ enabling you to thoroughly test your data and models from research to production
</p>


## 🧮 How does it work?

At its core, deepchecks includes a wide variety of built-in Checks,
for testing all types of data and model related issues.
These checks are implemented for various models and data types (Tabular, NLP, Vision),
and can easily be customized and expanded.

The check results can be used to automatically make informed decisions
about your model's production-readiness, and for monitoring it over time in production.
The check results can be examined with visual reports (by saving them to an HTML file, or seeing them in Jupyter),
processed with code (using their pythonic / json output), and inspected and collaborated on with Deepchecks' dynamic UI
(for examining test results and for production monitoring).

<!---
At its core, Deepchecks has a wide variety of built-in Checks and Suites (lists of checks)
for all data types (Tabular, NLP, Vision).
These include checks for validating your model's performance (e.g. identify weak segments), the data's
distribution (e.g. detect drifts or leakages), data integrity (e.g. find conflicting labels) and more.
These checks can be run manually (e.g. during research) or triggered automatically (e.g. during CI
and production monitoring) and enable automatically making informed decisions regarding your model pipelines'
production-readiness, and behavior over time.
--->

## 🧩 Components

Deepchecks includes:
@@ -126,18 +103,23 @@ Check out the full installation instructions for deepchecks testing [here](https

#### Deepchecks Monitoring Installation

To use deepchecks for production monitoring, you can either use our SaaS service, or deploy a local instance in one line on Linux/MacOS (Windows is WIP!) with Docker:
To use deepchecks for production monitoring, you can either use our SaaS service, or deploy a local instance in one line on Linux/MacOS (Windows is WIP!) with Docker.
Create a new directory for the installation files, open a terminal within that directory and run the following:

```
/bin/bash -c "$(curl -fsSL https://raw.githubusercontent.com/deepchecks/monitoring/main/deploy/deploy-oss.sh)"
pip install deepchecks-installer
deepchecks-installer monitoring-install
```

This will automatically download the necessary dependencies and start the application locally.
This will automatically download the necessary dependencies, run the installation process
and then start the application locally.

The installation will take a few minutes. Then you can open the deployment url (default is http://localhost),
and start the system onboarding. Check out the full monitoring [open source installation & quickstart](https://docs.deepchecks.com/monitoring/stable/getting-started/deploy_self_host_open_source.html).

Note that the open source product is built such that each deployment supports monitoring of
a single model.

Check out the full installation instructions for deepchecks monitoring [here](https://docs.deepchecks.com/monitoring/stable/installation/index.html).

</details>

### 🏃‍♀️ Quickstarts
@@ -173,7 +155,7 @@ suite_result.save_as_html() # replace this with suite_result.show() or suite_res
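For context, the hunk above ends with `suite_result.save_as_html()`. A minimal sketch of the kind of evaluation run that produces such a result, assuming the deepchecks tabular API and scikit-learn (the demo dataset and model below are illustrative, not necessarily the exact snippet collapsed in the README):

```python
from sklearn.ensemble import RandomForestClassifier
from deepchecks.tabular.datasets.classification import iris
from deepchecks.tabular.suites import model_evaluation

# Load a built-in demo dataset split into train/test Dataset objects
train_ds, test_ds = iris.load_data(data_format='Dataset', as_train_test=True)

# Fit any scikit-learn compatible model on the training data
model = RandomForestClassifier(random_state=0)
model.fit(train_ds.data[train_ds.features], train_ds.data[train_ds.label_name])

# Run the built-in model evaluation suite and save the report
suite_result = model_evaluation().run(train_dataset=train_ds, test_dataset=test_ds, model=model)
suite_result.save_as_html()  # replace this with suite_result.show() or suite_result.to_json()
```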
The output will be a report that enables you to inspect the status and results of the chosen checks:

<p align="center">
<img src="docs/source/_static/images/general/model_evaluation_suite.gif" width="800">
<img src="docs/source/_static/images/readme/model-evaluation-suite.gif" width="600">
</p>

</details>
@@ -186,13 +168,13 @@ The output will be a report that enables you to inspect the status and results o
</summary>

Jump right into the
[monitoring quickstart docs](https://docs.deepchecks.com/monitoring/stable/user-guide/tabular/auto_quickstarts/plot_quickstart.html)
[open source monitoring quickstart docs](https://docs.deepchecks.com/monitoring/stable/getting-started/deploy_self_host_open_source.html)
to have it up and running on your data.
You'll then be able to see the checks results over time, set alerts, and interact
with the dynamic deepchecks UI that looks like this:

<p align="center">
<img src="docs/source/_static/images/general/monitoring-app-ui.gif" width="800">
<img src="docs/source/_static/images/general/monitoring-app-ui.gif" width="600">
</p>

</details>
@@ -208,14 +190,39 @@ Deepchecks managed CI & Testing management is currently in closed preview.
[Book a demo](https://deepchecks.com/book-demo/) for more information about the offering.

<p align="center">
<img src="docs/source/_static/images/general/deepchecks-ci-checks.png" width="800">
<img src="docs/source/_static/images/general/deepchecks-ci-checks.png" width="600">
</p>

To build and maintain your own CI process with Deepchecks Testing,
check out our [docs for Using Deepchecks in CI/CD](https://docs.deepchecks.com/stable/general/usage/ci_cd.html).

</details>


## 🧮 How does it work?

At its core, deepchecks includes a wide variety of built-in Checks,
for testing all types of data and model related issues.
These checks are implemented for various models and data types (Tabular, NLP, Vision),
and can easily be customized and expanded.

The check results can be used to automatically make informed decisions
about your model's production-readiness, and for monitoring it over time in production.
The check results can be examined with visual reports (by saving them to an HTML file, or seeing them in Jupyter),
processed with code (using their pythonic / json output), and inspected and collaborated on with Deepchecks' dynamic UI
(for examining test results and for production monitoring).
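
As a concrete illustration, here is a minimal sketch (assuming the deepchecks tabular API; the check, condition and threshold shown are one arbitrary example) of running a single check, gating on its condition, and consuming the result as a report or as JSON:

```python
from deepchecks.tabular.checks import FeatureLabelCorrelation
from deepchecks.tabular.datasets.classification import iris

# Load a demo dataset as a single Dataset object
ds = iris.load_data(data_format='Dataset', as_train_test=False)

# Run a built-in check with a condition attached
check = FeatureLabelCorrelation().add_condition_feature_pps_less_than(0.8)
result = check.run(ds)

result.save_as_html('feature_label_correlation.html')  # visual report (or result.show() in a notebook)
print(result.to_json())                                # machine-readable output
passed = all(c.is_pass() for c in result.conditions_results)  # use for automated go/no-go decisions
```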

<!---
At its core, Deepchecks has a wide variety of built-in Checks and Suites (lists of checks)
for all data types (Tabular, NLP, Vision).
These include checks for validating your model's performance (e.g. identify weak segments), the data's
distribution (e.g. detect drifts or leakages), data integrity (e.g. find conflicting labels) and more.
These checks can be run manually (e.g. during research) or triggered automatically (e.g. during CI
and production monitoring) and enable automatically making informed decisions regarding your model pipelines'
production-readiness, and behavior over time.
--->


<details open>
<summary>
<h2>
@@ -250,6 +257,7 @@ processed with code (using their json output), and inspected and collaborated u
Optional conditions can be added to each check, to automatically validate whether it passed or not.
--->


## 📜 Open Source vs Paid

Deepchecks' projects (``deepchecks/deepchecks`` & ``deepchecks/monitoring``) are open source and are released under [AGPL 3.0](./LICENSE).
2 changes: 1 addition & 1 deletion VERSION
@@ -1 +1 @@
0.17.2
0.17.3
144 changes: 77 additions & 67 deletions deepchecks/nlp/checks/data_integrity/text_property_outliers.py
@@ -17,7 +17,7 @@

from deepchecks import ConditionCategory, ConditionResult
from deepchecks.core import CheckResult, DatasetKind
from deepchecks.core.errors import NotEnoughSamplesError
from deepchecks.core.errors import DeepchecksValueError, NotEnoughSamplesError
from deepchecks.nlp import Context, SingleDatasetCheck
from deepchecks.nlp.utils.nlp_plot import get_text_outliers_graph
from deepchecks.utils.dataframes import hide_index_for_display
@@ -46,7 +46,7 @@ class TextPropertyOutliers(SingleDatasetCheck):
sharp_drop_ratio : float, default : 0.9
The size of the sharp drop to detect categorical outliers
min_samples : int , default : 10
Minimum number of samples required to calculate IQR. If there are not enough non-null samples a specific
Minimum number of samples required to calculate IQR. If there are not enough non-null samples for a specific
property, the check will skip it. If all properties are skipped, the check will raise a NotEnoughSamplesError.
"""

@@ -73,85 +73,99 @@ def run_logic(self, context: Context, dataset_kind: DatasetKind) -> CheckResult:
cat_properties = dataset.categorical_properties
properties = df_properties.to_dict(orient='list')

if all(len(np.hstack(v).squeeze()) < self.min_samples for v in properties.values()):
raise NotEnoughSamplesError(f'Need at least {self.min_samples} non-null samples to calculate outliers.')

# The values are in the same order as the batch, so keep that order so the original sample
# can be accessed at the same index location
for name, values in properties.items():
# If the property is single value per sample, then wrap the values in list in order to work on fixed
# structure
if not isinstance(values[0], list):
values = [[x] for x in values]

is_numeric = name not in cat_properties

if is_numeric:
values_arr = np.hstack(values).astype(float).squeeze()
values_arr = np.array([x for x in values_arr if pd.notnull(x)])
else:
values_arr = np.hstack(values).astype(str).squeeze()
try:
if not isinstance(values[0], list):
if is_numeric:
# Check for non numeric data in the column
curr_nan_count = pd.isnull(values).sum()
values = pd.to_numeric(values, errors='coerce')
updated_nan_count = pd.isnull(values).sum()
if updated_nan_count > curr_nan_count:
raise DeepchecksValueError('Numeric property contains non-numeric values.')
# If the property is single value per sample, then wrap the values in list in order to
# work on fixed structure
values = [[x] for x in values]

if is_numeric:
values_arr = np.hstack(values).astype(float).squeeze()
values_arr = np.array([x for x in values_arr if pd.notnull(x)])
else:
values_arr = np.hstack(values).astype(str).squeeze()

if len(values_arr) < self.min_samples:
result[name] = 'Not enough non-null samples to calculate outliers.'
continue
if len(values_arr) < self.min_samples:
raise NotEnoughSamplesError(f'Not enough non-null samples to calculate outliers '
f'(min_samples={self.min_samples}).')

if is_numeric:
lower_limit, upper_limit = iqr_outliers_range(values_arr, self.iqr_percentiles,
self.iqr_scale, self.sharp_drop_ratio)
else:
# Counting the frequency of each category. Normalizing because distribution graph shows the percentage.
counts_map = pd.Series(values_arr.astype(str)).value_counts(normalize=True).to_dict()
lower_limit = sharp_drop_outliers_range(sorted(list(counts_map.values()), reverse=True),
self.sharp_drop_ratio) or 0
upper_limit = len(values_arr) # No upper limit for categorical properties
values_arr = np.array([counts_map[x] for x in values_arr]) # replace the values with the counts

# Get the indices of the outliers
top_outliers = np.argwhere(values_arr > upper_limit).squeeze(axis=1)
# Sort the indices of the outliers by the original values
top_outliers = top_outliers[
np.apply_along_axis(lambda i, sort_arr=values_arr: sort_arr[i], axis=0, arr=top_outliers).argsort()
]

# Doing the same for bottom outliers
bottom_outliers = np.argwhere(values_arr < lower_limit).squeeze(axis=1)
# Sort the indices of the outliers by the original values
bottom_outliers = bottom_outliers[
np.apply_along_axis(lambda i, sort_arr=values_arr: sort_arr[i], axis=0, arr=bottom_outliers).argsort()
]

text_outliers = np.concatenate([bottom_outliers, top_outliers])

result[name] = {
'indices': [dataset.get_original_text_indexes()[i] for i in text_outliers],
# For the upper and lower limits, don't show values that are smaller/larger than the actual values
# we have in the data
'lower_limit': max(lower_limit, min(values_arr)),
'upper_limit': min(upper_limit, max(values_arr)) if is_numeric else None,
'outlier_ratio': len(text_outliers) / len(values_arr)
}
if is_numeric:
lower_limit, upper_limit = iqr_outliers_range(values_arr, self.iqr_percentiles,
self.iqr_scale, self.sharp_drop_ratio)
else:
# Counting the frequency of each category. Normalizing because distribution graph shows percentage.
counts_map = pd.Series(values_arr.astype(str)).value_counts(normalize=True).to_dict()
lower_limit = sharp_drop_outliers_range(sorted(list(counts_map.values()), reverse=True),
self.sharp_drop_ratio) or 0
upper_limit = len(values_arr) # No upper limit for categorical properties
values_arr = np.array([counts_map[x] for x in values_arr]) # replace the values with the counts

# Get the indices of the outliers
top_outliers = np.argwhere(values_arr > upper_limit).squeeze(axis=1)
# Sort the indices of the outliers by the original values
top_outliers = top_outliers[
np.apply_along_axis(lambda i, sort_arr=values_arr: sort_arr[i], axis=0, arr=top_outliers).argsort()
]

# Doing the same for bottom outliers
bottom_outliers = np.argwhere(values_arr < lower_limit).squeeze(axis=1)
# Sort the indices of the outliers by the original values
bottom_outliers = bottom_outliers[
np.apply_along_axis(lambda i, sort_arr=values_arr: sort_arr[i],
axis=0, arr=bottom_outliers).argsort()
]

text_outliers = np.concatenate([bottom_outliers, top_outliers])

result[name] = {
'indices': [dataset.get_original_text_indexes()[i] for i in text_outliers],
# For the upper and lower limits, don't show values that are smaller/larger than the actual values
# we have in the data
'lower_limit': max(lower_limit, min(values_arr)),
'upper_limit': min(upper_limit, max(values_arr)) if is_numeric else None,
'outlier_ratio': len(text_outliers) / len(values_arr),
}
except Exception as exp: # pylint: disable=broad-except
result[name] = f'{exp}'

# Create display
if context.with_display:
display = []
no_outliers = pd.Series([], dtype='object')

sorted_result_items = sorted(result.items(), key=lambda x: len(x[1]['indices']), reverse=True)
# Sort the result map based on the number of outlier indices; if an error message is associated with
# a property, keep that property at the very end.
sorted_result_items = sorted(result.items(),
key=lambda x: len(x[1].get('indices', [])) if isinstance(x[1], dict) else 0,
reverse=True)

for property_name, info in sorted_result_items:

# If info is a string it means there was an error
if isinstance(info, str):
no_outliers = pd.concat([no_outliers, pd.Series(property_name, index=[info])])
elif len(info['indices']) == 0:
no_outliers = pd.concat([no_outliers, pd.Series(property_name, index=['No outliers found.'])])
else:
if len(display) < self.n_show_top:
dist = df_properties[property_name]
if len(dist[~pd.isnull(dist)]) >= self.min_samples:
lower_limit = info['lower_limit']
upper_limit = info['upper_limit']
if property_name not in cat_properties:
dist = df_properties[property_name].astype(float)
else:
dist = df_properties[property_name]
lower_limit = info['lower_limit']
upper_limit = info['upper_limit']

try:
fig = get_text_outliers_graph(
dist=dist,
data=dataset.text,
@@ -162,13 +176,9 @@ def run_logic(self, context: Context, dataset_kind: DatasetKind) -> CheckResult:
)

display.append(fig)
else:
no_outliers = pd.concat(
[no_outliers, pd.Series(property_name, index=[
f'Not enough non-null samples to compute'
f' properties (min_samples={self.min_samples}).'
])]
)
except Exception as exp: # pylint: disable=broad-except
result[property_name] = f'{exp}'
no_outliers = pd.concat([no_outliers, pd.Series(property_name, index=[exp])])
else:
no_outliers = pd.concat([no_outliers, pd.Series(property_name, index=[
f'Outliers found but not shown in graphs (n_show_top={self.n_show_top}).'])])
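
For readers unfamiliar with the IQR logic this check relies on for numeric properties, here is a standalone sketch in plain numpy (not the deepchecks internals; the percentile and scale values are illustrative):

```python
import numpy as np

def iqr_range(values, percentiles=(25, 75), scale=1.5):
    """Return (lower, upper) bounds; values outside them are treated as outliers."""
    q1, q3 = np.percentile(values, percentiles)
    iqr = q3 - q1
    return q1 - scale * iqr, q3 + scale * iqr

values = np.array([0.9, 0.95, 1.0, 1.05, 1.1, 1.2, 8.0])  # 8.0 is the obvious outlier
lower, upper = iqr_range(values)
print([v for v in values if v < lower or v > upper])  # -> [8.0]
```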
