
[FEA] MLOps for Dataset Drift detection #533

Open
rnyak opened this issue Jan 21, 2021 · 5 comments
Labels
Iterative Deployment Iterative training, monitoring, and deployment

Comments

@rnyak
Contributor

rnyak commented Jan 21, 2021

When a model is deployed in production, detecting changes and anomalies in new incoming data is critical to make sure that the predictions are valid and can be safely consumed. Therefore, users should be able to analyze drift in their data to understand how it changes over time. Data drift is one of the main reasons for degradation in model accuracy over time. Data drift occurs when the statistical properties of the input variables (model input data) change, e.g., due to seasonality, personal preferences, or changing trends.
One type of data drift is covariate shift, which refers to a change in the distribution of the input variables between the training data and the new data. Another type of drift in ML is concept drift, which is a shift in the relationship between the independent variables and the target variable. Simply put, the statistical properties of what we are trying to predict (the target variable) change over time.

There are existing tools for data and model monitoring, for example the Azure ML DataDriftDetector module, scikit-multiflow, Databricks + MLflow, and Amazon SageMaker Model Monitor.

We'll want to add some of the commonly measured and monitored components to the dataset evaluator and to the dataset generation tool.

To detect data drift, we may want to collect some stats (see the sketch after this list):

  • min, max, median, variance; distinct counts for string data (categoricals)
  • IQR (interquartile range), which can be used for outlier detection
  • data types
  • number of classes in the target variable
  • mean computed within a window that stores the most recent W items received
  • missing column check: number of observed columns compared to baseline columns
  • missing and invalid value checks: % of nulls/NaN/invalid samples > threshold
  • relative frequencies of categorical features
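
As a rough sketch (not existing NVTabular code; the helper name collect_drift_stats and the pandas/NumPy approach are assumptions), these per-column stats could be gathered like this:

```python
import pandas as pd

def collect_drift_stats(df: pd.DataFrame, window: int = 1000) -> dict:
    """Gather per-column stats to compare against a baseline snapshot."""
    stats = {}
    for col in df.columns:
        s = df[col]
        col_stats = {
            "dtype": str(s.dtype),
            "per_nan": float(s.isna().mean()),  # % of nulls/NaN
        }
        if pd.api.types.is_numeric_dtype(s):
            q1, q3 = s.quantile([0.25, 0.75])
            col_stats.update({
                "min": float(s.min()),
                "max": float(s.max()),
                "median": float(s.median()),
                "variance": float(s.var()),
                "iqr": float(q3 - q1),                        # for outlier detection
                "window_mean": float(s.tail(window).mean()),  # mean of the most recent W items
            })
        else:  # string / categorical columns
            col_stats.update({
                "distinct_count": int(s.nunique()),
                "relative_freqs": s.value_counts(normalize=True).to_dict(),
            })
        stats[col] = col_stats
    return stats
```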

To test for distribution differences between the baseline and new data, some candidate tests are (a scipy-based sketch follows the list):

  • Two-sample Kolmogorov–Smirnov test
  • Wasserstein distance
  • Kullback-Leibler Divergence
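
As a minimal sketch (assuming baseline and current are 1-D numeric samples of the same feature; this is not NVTabular code), all three are readily available via scipy:

```python
import numpy as np
from scipy.stats import ks_2samp, wasserstein_distance, entropy

# baseline and current: samples of the same feature from training data
# and from newly observed data, respectively (synthetic here for illustration).
baseline = np.random.normal(0.0, 1.0, size=10_000)
current = np.random.normal(0.3, 1.1, size=10_000)

# Two-sample Kolmogorov-Smirnov test: a small p-value suggests a distribution shift.
ks_stat, p_value = ks_2samp(baseline, current)

# Wasserstein (earth mover's) distance between the empirical distributions.
w_dist = wasserstein_distance(baseline, current)

# KL divergence on histograms over a shared binning (KL needs discrete
# probability vectors; a small epsilon avoids division by zero).
bins = np.histogram_bin_edges(np.concatenate([baseline, current]), bins=50)
p, _ = np.histogram(baseline, bins=bins, density=True)
q, _ = np.histogram(current, bins=bins, density=True)
kl = entropy(p + 1e-12, q + 1e-12)

print(f"KS={ks_stat:.3f} (p={p_value:.3g}), Wasserstein={w_dist:.3f}, KL={kl:.3f}")
```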

For model drift detection, we should save stats about model accuracy (we can call them accuracy drift metrics):

  • F1, precision, recall, accuracy, and AUC values for the reference model and for inference runs

Basically, compare baseline predictions with collected predictions.

  • Thresholds can be set by the user so that if the difference exceeds the threshold, an alert can be raised (see the sketch below).
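
A minimal sketch of such a check, assuming binary classification and scikit-learn metrics (the helper name check_accuracy_drift is hypothetical, not an existing API):

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

def check_accuracy_drift(reference_metrics, y_true, y_pred, y_score, threshold=0.05):
    """Compare accuracy-drift metrics from an inference run against a reference run
    and alert when any metric degrades by more than a user-set threshold."""
    current = {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "auc": roc_auc_score(y_true, y_score),
    }
    alerts = {}
    for name, ref_value in reference_metrics.items():
        drop = ref_value - current[name]
        if drop > threshold:
            alerts[name] = drop
            print(f"ALERT: {name} dropped by {drop:.3f} "
                  f"(reference={ref_value:.3f}, current={current[name]:.3f})")
    return current, alerts
```
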
@rnyak
Contributor Author

rnyak commented Jan 21, 2021

@albert17 could you please review the PR and write down here what stats are already collected? Thanks.

@benfred benfred mentioned this issue Jan 21, 2021
@albert17
Contributor

These are the stats being collected right now:

num_rows:

conts:
  col_name:
      dtype:
      min_val:
      max_val:
      mean:
      std:
      per_nan:
cats:
  col_name:
      dtype:
      cardinality:
      min_entry_size:
      max_entry_size:
      avg_entry_size:
      per_nan:
      multi_min:
      multi_max:
      multi_avg:

labels:
  col_name:
      dtype:
      cardinality:
      per_nan:
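
A drift check built on stats shaped like the above could look roughly like this (a hypothetical sketch assuming the stats are available as nested Python dicts; compare_stats is not an existing function):

```python
def compare_stats(baseline, current, mean_tolerance=3.0, nan_tolerance=0.05):
    """Flag continuous columns whose mean moved by more than a few baseline
    standard deviations, and categorical columns whose cardinality or NaN
    percentage changed, using stats dicts shaped like the listing above."""
    alerts = []
    for col, base in baseline.get("conts", {}).items():
        cur = current["conts"][col]
        if base["std"] > 0 and abs(cur["mean"] - base["mean"]) > mean_tolerance * base["std"]:
            alerts.append(f"{col}: mean shifted from {base['mean']:.3f} to {cur['mean']:.3f}")
    for col, base in baseline.get("cats", {}).items():
        cur = current["cats"][col]
        if cur["cardinality"] != base["cardinality"]:
            alerts.append(f"{col}: cardinality changed from {base['cardinality']} to {cur['cardinality']}")
        if cur["per_nan"] - base["per_nan"] > nan_tolerance:
            alerts.append(f"{col}: per_nan increased from {base['per_nan']:.3f} to {cur['per_nan']:.3f}")
    return alerts
```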

@benfred benfred added this to To do in v0.5 Release via automation Jan 25, 2021
@benfred benfred removed this from To do in v0.5 Release Mar 9, 2021
@vinhngx
Contributor

vinhngx commented Mar 16, 2021

A very interesting and well-written example by Google:
https://cloud.google.com/blog/topics/developers-practitioners/event-triggered-detection-data-drift-ml-workflows

@EvenOldridge EvenOldridge added the Iterative Deployment Iterative training, monitoring, and deployment label Apr 20, 2021
@viswa-nvidia viswa-nvidia added this to the NVTabular v0.7 milestone Apr 26, 2021
@karlhigley karlhigley moved this from 21.09 to Future (TBD) in Future Releases Sketch Jun 22, 2021
@karlhigley karlhigley removed this from the NVTabular v0.7 milestone Jun 22, 2021
@vs385

vs385 commented Mar 6, 2024

Does NVTabular currently support a data drift detection module? Or does it integrate with an existing tool such as Evidentlyai.com?

@rnyak
Contributor Author

rnyak commented Mar 8, 2024

@vs385 we don't have a specifically designed drift detection module, and we don't have an integration with Evidentlyai.com either. You can look into that script to see whether it might be useful.
