
[FEA] MLOps for Dataset Drift detection #533

Open
rnyak opened this issue Jan 21, 2021 · 5 comments
Labels
Iterative Deployment Iterative training, monitoring, and deployment

Comments

@rnyak
Contributor

rnyak commented Jan 21, 2021

When a model is deployed in production, detecting changes and anomalies in new incoming data is critical to make sure that the predictions are valid and can be safely consumed. Therefore, users should be able to analyze drift in their data to understand how it changes over time. Data drift is one of the main reasons for degradation in model accuracy over time. Data drift occurs when the statistical properties of the input variables (model input data) change, e.g., due to seasonality, personal preferences, or changing trends.
One type of data drift is covariate shift, which refers to a change in the distribution of the input variables between the training data and the new data. Another type of drift in ML is concept drift, which is a shift in the relationship between the independent variables and the target variable. Simply put, the statistical properties of what we are trying to predict (the target variable) change over time.

There are existing tools for data and model monitoring, for example the Azure ML DataDriftDetector module, scikit-multiflow, Databricks + MLflow, and Amazon SageMaker Model Monitor.

We'll want to add some of the commonly measured and monitored components to the dataset evaluator and to the dataset generation tool.

To detect data drift, we may want to collect some stats (see the sketch after this list):

  • min, max, median, variance; distinct counts for string data (categoricals)
  • IQR (interquartile range), which can be used for outlier detection
  • data types
  • number of classes in the target variable
  • mean computed within a window that stores the most recent W items received
  • missing column check: number of observed columns compared to baseline columns
  • missing and invalid value checks: % of nulls/NaN/invalid samples > threshold
  • relative frequencies of categorical features
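
As a rough sketch (not existing NVTabular code; the helper name collect_drift_stats and the pandas/NumPy approach are assumptions), these per-column stats could be gathered like this:

```python
import pandas as pd

def collect_drift_stats(df: pd.DataFrame, window: int = 1000) -> dict:
    """Gather per-column stats to compare against a baseline snapshot."""
    stats = {}
    for col in df.columns:
        s = df[col]
        col_stats = {
            "dtype": str(s.dtype),
            "per_nan": float(s.isna().mean()),  # % of nulls/NaN
        }
        if pd.api.types.is_numeric_dtype(s):
            q1, q3 = s.quantile([0.25, 0.75])
            col_stats.update({
                "min": float(s.min()),
                "max": float(s.max()),
                "median": float(s.median()),
                "variance": float(s.var()),
                "iqr": float(q3 - q1),                        # for outlier detection
                "window_mean": float(s.tail(window).mean()),  # mean of the most recent W items
            })
        else:  # string / categorical columns
            col_stats.update({
                "distinct_count": int(s.nunique()),
                "relative_freqs": s.value_counts(normalize=True).to_dict(),
            })
        stats[col] = col_stats
    return stats
```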

To test for distribution differences between the baseline and new data, some candidate tests are (a scipy-based sketch follows the list):

  • Two-sample Kolmogorov–Smirnov test
  • Wasserstein distance
  • Kullback-Leibler Divergence
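
As a minimal sketch (assuming baseline and current are 1-D numeric samples of the same feature; this is not NVTabular code), all three are readily available via scipy:

```python
import numpy as np
from scipy.stats import ks_2samp, wasserstein_distance, entropy

# baseline and current: samples of the same feature from training data
# and from newly observed data, respectively (synthetic here for illustration).
baseline = np.random.normal(0.0, 1.0, size=10_000)
current = np.random.normal(0.3, 1.1, size=10_000)

# Two-sample Kolmogorov-Smirnov test: a small p-value suggests a distribution shift.
ks_stat, p_value = ks_2samp(baseline, current)

# Wasserstein (earth mover's) distance between the empirical distributions.
w_dist = wasserstein_distance(baseline, current)

# KL divergence on histograms over a shared binning (KL needs discrete
# probability vectors; a small epsilon avoids division by zero).
bins = np.histogram_bin_edges(np.concatenate([baseline, current]), bins=50)
p, _ = np.histogram(baseline, bins=bins, density=True)
q, _ = np.histogram(current, bins=bins, density=True)
kl = entropy(p + 1e-12, q + 1e-12)

print(f"KS={ks_stat:.3f} (p={p_value:.3g}), Wasserstein={w_dist:.3f}, KL={kl:.3f}")
```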

For model drift detection, we should save stats about model accuracy (we can call them accuracy drift metrics):

  • F1, precision, recall, accuracy, and AUC values for the reference model and for inference runs

Basically, compare baseline predictions with collected predictions.

  • Thresholds can be set by the user so that if the difference exceeds the threshold, an alert can be raised (see the sketch below).
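
A minimal sketch of such a check, assuming binary classification and scikit-learn metrics (the helper name check_accuracy_drift is hypothetical, not an existing API):

```python
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

def check_accuracy_drift(reference_metrics, y_true, y_pred, y_score, threshold=0.05):
    """Compare accuracy-drift metrics from an inference run against a reference run
    and alert when any metric degrades by more than a user-set threshold."""
    current = {
        "accuracy": accuracy_score(y_true, y_pred),
        "precision": precision_score(y_true, y_pred),
        "recall": recall_score(y_true, y_pred),
        "f1": f1_score(y_true, y_pred),
        "auc": roc_auc_score(y_true, y_score),
    }
    alerts = {}
    for name, ref_value in reference_metrics.items():
        drop = ref_value - current[name]
        if drop > threshold:
            alerts[name] = drop
            print(f"ALERT: {name} dropped by {drop:.3f} "
                  f"(reference={ref_value:.3f}, current={current[name]:.3f})")
    return current, alerts
```
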
@rnyak
Contributor Author

rnyak commented Jan 21, 2021

@albert17 could you please review the PR and write down here what stats are already collected? Thanks.

@benfred benfred mentioned this issue Jan 21, 2021
@albert17
Contributor

These are the stats being collected right now:

num_rows:

conts:
  col_name:
      dtype:
      min_val:
      max_val:
      mean:
      std:
      per_nan:
cats:
  col_name:
      dtype:
      cardinality:
      min_entry_size:
      max_entry_size:
      avg_entry_size:
      per_nan:
      multi_min:
      multi_max:
      multi_avg:

labels:
  col_name:
      dtype:
      cardinality:
      per_nan:
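
A drift check built on stats shaped like the above could look roughly like this (a hypothetical sketch assuming the stats are available as nested Python dicts; compare_stats is not an existing function):

```python
def compare_stats(baseline, current, mean_tolerance=3.0, nan_tolerance=0.05):
    """Flag continuous columns whose mean moved by more than a few baseline
    standard deviations, and categorical columns whose cardinality or NaN
    percentage changed, using stats dicts shaped like the listing above."""
    alerts = []
    for col, base in baseline.get("conts", {}).items():
        cur = current["conts"][col]
        if base["std"] > 0 and abs(cur["mean"] - base["mean"]) > mean_tolerance * base["std"]:
            alerts.append(f"{col}: mean shifted from {base['mean']:.3f} to {cur['mean']:.3f}")
    for col, base in baseline.get("cats", {}).items():
        cur = current["cats"][col]
        if cur["cardinality"] != base["cardinality"]:
            alerts.append(f"{col}: cardinality changed from {base['cardinality']} to {cur['cardinality']}")
        if cur["per_nan"] - base["per_nan"] > nan_tolerance:
            alerts.append(f"{col}: per_nan increased from {base['per_nan']:.3f} to {cur['per_nan']:.3f}")
    return alerts
```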

@benfred benfred added this to To do in v0.5 Release via automation Jan 25, 2021
@benfred benfred removed this from To do in v0.5 Release Mar 9, 2021
@vinhngx
Contributor

vinhngx commented Mar 16, 2021

A very interesting and well-written example by Google:
https://cloud.google.com/blog/topics/developers-practitioners/event-triggered-detection-data-drift-ml-workflows

@EvenOldridge EvenOldridge added the Iterative Deployment Iterative training, monitoring, and deployment label Apr 20, 2021
@viswa-nvidia viswa-nvidia added this to the NVTabular v0.7 milestone Apr 26, 2021
@karlhigley karlhigley moved this from 21.09 to Future (TBD) in Future Releases Sketch Jun 22, 2021
@karlhigley karlhigley removed this from the NVTabular v0.7 milestone Jun 22, 2021
@vs385

vs385 commented Mar 6, 2024

Does NVTabular currently support a data drift detection module? Or does it integrate with an existing tool such as Evidentlyai.com?

@rnyak
Contributor Author

rnyak commented Mar 8, 2024

@vs385 we don't have a specifically designed drift detection module, and we don't have an integration with Evidentlyai.com either. You can look into that script to see whether it might be useful.
