Merge pull request #62 from prio-data/drift_detection
Drift detection
jimdale committed Apr 29, 2024
2 parents d1daaff + f0dcc20 commit 3149910
Showing 7 changed files with 799 additions and 8 deletions.
86 changes: 81 additions & 5 deletions README.md
@@ -74,13 +74,23 @@ Show docstring for a particular transform:

`viewser transforms show <transform-name>`

List querysets stored in the queryset database:

`viewser querysets list`

Produce the code required to generate a queryset:

`viewser querysets show <queryset-name>`

## Via API

The full functionality of viewser is exposed via its API for use in scripts and notebooks.

The two fundamental objects used to define what data is fetched by the client are the *Queryset* and the *Column*, where a
Queryset consists of one or more Columns.

### Defining a new queryset from scratch

To define a queryset, one first imports the Queryset and Column classes:

`from viewser import Queryset, Column`
@@ -124,7 +134,7 @@ If the wrong loa is specified, the queryset will be rejected by the server and a

The final argument to the Column instance is the name of the raw column to be fetched from the database. If a non-existent column is requested, the queryset will be rejected by the server and an error message detailing which columns are unavailable will be returned.

#### Aggregation/disaggregation

The definition of a queryset must include the *target* level of analysis, at which the resulting data will be presented to the user.

@@ -153,7 +163,7 @@ It is up to users to ensure that they select the correct aggregation functions.

If a non-existent aggregation function is specified, the queryset will be rejected by the server and an error message detailing which columns have incorrect aggregation functions will be returned.

#### Transforms

Any queryset column may specify an arbitrary number of transforms to be done to the raw data *after* any necessary aggregation/disaggregation has been done.

@@ -174,13 +184,29 @@ A list of available transforms can be obtained using the viewser CLI. A notebook

Note that not all transforms are available at all levels of analysis. If a transform is requested at an inappropriate loa, the queryset will be rejected by the server and an error message detailing which columns have requested incompatible transforms and loas will be returned.

### Making a new queryset by merging two or more existing querysets

It is sometimes desirable to make a larger queryset by merging several existing querysets. This can be done with the `from_merger` method, which requires at minimum a list of the querysets to be merged and a name for the merged queryset. Optionally, a theme and a description can also be passed, as can a boolean `verbose` flag, described below.

For example:

`querysets_to_merge = ['queryset1', 'queryset2', 'queryset3']`

`merged_queryset = Queryset.from_merger(querysets_to_merge, 'my_merged_queryset', theme='my_theme', description='description')`

Before merging, some checks are performed. The querysets to be merged must all have the same target loa. If the querysets to be merged contain two or more columns with the same name, the method checks that all definitions of that column are exactly the same (same raw data, same transforms with the same parameters). If so, a single copy of the column is included in the merged queryset (if the `verbose` flag is True, the method reports that this has been done). If columns sharing the same name have differing definitions, the merge is aborted.
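
The duplicate-column check described above can be sketched in isolation. This is a hypothetical illustration, not viewser's actual code: column definitions are modelled here as plain tuples of (raw column, transforms).

```python
def check_mergeable(querysets, verbose=False):
    """Merge column definitions, aborting on conflicting duplicates."""
    merged = {}
    for qs in querysets:
        for name, definition in qs.items():
            if name in merged and merged[name] != definition:
                # same name, different definition: abort the merge
                raise ValueError(f"conflicting definitions for column '{name}'")
            if name in merged and verbose:
                print(f"duplicate column '{name}' collapsed to a single copy")
            merged[name] = definition
    return merged

qs_a = {'col_a': ('raw_a', ('transform_1',))}
qs_b = {'col_a': ('raw_a', ('transform_1',)), 'col_b': ('raw_b', ())}
merged = check_mergeable([qs_a, qs_b])  # one copy of 'col_a', plus 'col_b'
```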

### Recreating a queryset from storage

If a queryset has already been published to the queryset store (see below), the queryset object can be regenerated by doing:

`queryset = Queryset.from_storage(queryset_name)`

### Publishing a queryset

Before a new queryset (written from scratch or created by merging existing querysets) can be fetched, it must be published to a permanent database on the server. This is done using the `publish()` method:

`data = new_queryset.publish()`

### Fetching a queryset

A published queryset can be fetched using the `.fetch()` method:

@@ -269,6 +295,56 @@ This message indicates only that the queryset is waiting in the transform queue.

When all transforms have completed, downloading of the completed dataframe begins. At this point the status message ceases updating, which can make it appear that the client has hung in the case of large querysets. Users are asked to be patient :).

## Input drift detection

The viewser package is able to perform a series of tests on the data fetched from the remote service which are designed to detect, and issue warnings about, anomalies in the data.

Two broad types of anomaly can be monitored:

### Global anomalies

These examine the whole dataset, whatever its dimensions (thought of in terms of time_units x space_units x features). The available anomaly detectors are:

- global_missingness: reports if the total fraction of missing (i.e. NaN) values across the whole dataset exceeds a threshold. Threshold should be a small number between 0 and 1, e.g. 0.05.

- global_zeros: reports if the total fraction of zero values across the whole dataset exceeds a threshold. Threshold should be a number between 0 and 1 and close to 1, e.g. 0.95.

- time_missingness: reports if the fraction of missingness across any (space_units x features) slice exceeds a threshold. Threshold should be a small number between 0 and 1, e.g. 0.05.

- space_missingness: reports if the fraction of missingness across any (time_units x features) slice exceeds a threshold. Threshold should be a small number between 0 and 1, e.g. 0.05.

- feature_missingness: reports if the fraction of missingness for any feature (over all time and space units) exceeds a threshold. Threshold should be a small number between 0 and 1, e.g. 0.05.

- time_zeros: reports if the fraction of zeros across any (space_units x features) slice exceeds a threshold. Threshold should be a number between 0 and 1 and close to 1, e.g. 0.95.

- space_zeros: reports if the fraction of zeros across any (time_units x features) slice exceeds a threshold. Threshold should be a number between 0 and 1 and close to 1, e.g. 0.95.

- feature_zeros: reports if the fraction of zeros for any feature (over all time and space units) exceeds a threshold. Threshold should be a number between 0 and 1 and close to 1, e.g. 0.95.
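
As an illustration of how the slice-wise detectors above work, here is a minimal numpy sketch (not viewser's internal implementation) of the time_missingness computation:

```python
import numpy as np

def time_nan_fracs(tensor):
    """NaN fraction of each (space_units x features) slice of a
    (time_units, space_units, features) tensor."""
    flat = tensor.reshape(tensor.shape[0], -1)  # one row per time unit
    return np.isnan(flat).mean(axis=1)

tensor = np.zeros((3, 4, 2))          # 3 time units, 4 space units, 2 features
tensor[1, :, 0] = np.nan              # one feature entirely missing in time unit 1
fracs = time_nan_fracs(tensor)        # -> [0.0, 0.5, 0.0]
breaches = np.where(fracs > 0.05)[0]  # time units exceeding the 0.05 threshold
```

The other slice-wise detectors follow the same pattern, flattening over a different pair of axes or counting zeros instead of NaNs.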


### Recent data anomalies

These tests partition the dataset along the time axis into three pieces, defined by two integers n and m. If the most recent time unit in the dataset is k, the test partition consists of the most recent n time units, i.e. time units k-n+1 to k inclusive (usually n is 1, so that the test partition is simply the most recent time unit k). The standard partition consists of the m time units immediately preceding the test partition, i.e. time units k-m-n+1 to k-n inclusive. Time units earlier than the standard partition are discarded. The available anomaly detectors are:

- delta_completeness: reports, for each feature, if the ratio of the missingness fractions in the test and standard partitions exceeds a threshold. Threshold should be a number between 0 and 1, e.g. 0.25.


- delta_zeros: reports, for each feature, if the ratio of the fractions of zeros in the test and standard partitions exceeds a threshold. Threshold should be a number between 0 and 1, e.g. 0.25.

- extreme_values: reports, for each feature, if the most extreme value in the test partition is more than (threshold) standard deviations from the mean of the data in the test partition. Threshold should be a number in the range 2-7, e.g. 5.

- ks_drift: for each feature, performs a two-sample Kolmogorov-Smirnov test (https://en.wikipedia.org/wiki/Kolmogorov–Smirnov_test#Two-sample_Kolmogorov–Smirnov_test) between the data in the test and standard partitions and reports if (1/the returned p-value) exceeds a threshold. Threshold should be a large number, e.g. 100.

- ecod_drift: for all features simultaneously, reports if the fraction of data points considered outliers in the test partition, according to an ECOD model (https://pyod.readthedocs.io/en/latest/_modules/pyod/models/ecod.html#ECOD) trained on the standard partition, exceeds the corresponding fraction in the standard partition by more than a threshold. Threshold should be a number between 0 and 1, e.g. 0.25.
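
The partitioning scheme and the ks_drift check can be sketched as follows (a standalone sketch assuming a 1D per-feature array, not viewser's actual implementation):

```python
import numpy as np
from scipy.stats import ks_2samp

def ks_drift(feature, n, m, threshold=100.0):
    """feature: one feature's values ordered by time unit, most recent last.
    Test partition = last n units; standard partition = the m units before it."""
    test = feature[-n:]
    standard = feature[-(n + m):-n]
    p_value = ks_2samp(test, standard).pvalue
    return (1.0 / p_value) > threshold  # alarm if 1/p exceeds the threshold

rng = np.random.default_rng(0)
shifted = np.concatenate([rng.normal(0.0, 1.0, 15),   # standard partition
                          rng.normal(10.0, 1.0, 5)])  # recent values jump
drifted = ks_drift(shifted, n=5, m=15)  # completely separated samples: alarm
```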

## Funding

The contents of this repository is the outcome of projects that have received funding from the European Research Council (ERC) under the European Union’s Horizon 2020 research and innovation programme (Grant agreement No. 694640, *ViEWS*) and Horizon Europe (Grant agreement No. 101055176, *ANTICIPATE*; and No. 101069312, *ViEWS* (ERC-2022-POC1)), Riksbankens Jubileumsfond (Grant agreement No. M21-0002, *Societies at Risk*), Uppsala University, Peace Research Institute Oslo, the United Nations Economic and Social Commission for Western Asia (*ViEWS-ESCWA*), the United Kingdom Foreign, Commonwealth & Development Office (GSRA – *Forecasting Fatalities in Armed Conflict*), the Swedish Research Council (*DEMSCORE*), the Swedish Foundation for Strategic Environmental Research (*MISTRA Geopolitics*), the Norwegian MFA (*Conflict Trends* QZA-18/0227), and the United Nations High Commissioner for Refugees (*the Sahel Predictive Analytics project*).
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -1,6 +1,6 @@
[tool.poetry]
name = "viewser"
version = "6.4.2"
version = "6.5.0"
description = "The Views 3 CLI tool"
authors = ["peder2911 <pglandsverk@gmail.com>"]
readme = "README.md"
65 changes: 65 additions & 0 deletions viewser/commands/queryset/config_drift.py
@@ -0,0 +1,65 @@
import numpy as np
from . import integrity_checks as ic

default_dne = -np.inf
default_missing = np.nan


default_config_dict = {

'global_missingness': {'threshold': 0.05,
'test_function': ic.get_global_nan_fracs,
'message': 'dataset missingness'},

'global_zeros': {'threshold': 0.95,
'test_function': ic.get_global_zero_fracs,
'message': 'dataset zero'},

'time_missingness': {'threshold': 0.01,
'test_function': ic.get_time_nan_fracs,
'message': 'time-unit missingness'},

'space_missingness': {'threshold': 0.03,
'test_function': ic.get_space_nan_fracs,
'message': 'space-unit missingness'},

'feature_missingness': {'threshold': 0.01,
'test_function': ic.get_feature_nan_fracs,
'message': 'feature missingness'},

'time_zeros': {'threshold': 0.95,
'test_function': ic.get_time_zero_fracs,
'message': 'time-unit zero'},

'space_zeros': {'threshold': 0.95,
'test_function': ic.get_space_zero_fracs,
'message': 'space-unit zero'},

'feature_zeros': {'threshold': 0.95,
'test_function': ic.get_feature_zero_fracs,
'message': 'feature zero'},

'delta_completeness': {'threshold': 1.25,
'test_function': ic.get_delta_completeness,
'message': 'feature delta_completeness'},

'delta_zeroes': {'threshold': 1.25,
'test_function': ic.get_delta_zeroes,
'message': 'feature delta_zeroes'},

'extreme_values': {'threshold': 4.0,
'test_function': ic.get_extreme_values,
'message': 'feature extreme values'},

'ks_drift': {'threshold': 100.,
'test_function': ic.get_ks_drift,
'message': 'feature KS drift'},

'ecod_drift': {'threshold': 0.05,
'test_function': ic.get_ecod_drift,
'message': 'dataset ECOD drift'},

'standard_partition_length': 10,
'test_partition_length': 1

}
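
When a drift_config_dict is passed to InputGate (defined in drift_detection.py below), the threshold values it supplies override the defaults above: requested detectors keep their default spec with the user's threshold, unrequested detectors are skipped, and non-detector entries such as the partition lengths are carried over. A self-contained sketch of that merge, using simplified stand-in dictionaries (assumed behaviour, mirroring InputGate.assemble_alerts):

```python
defaults = {
    'global_missingness': {'threshold': 0.05, 'message': 'dataset missingness'},
    'time_missingness': {'threshold': 0.01, 'message': 'time-unit missingness'},
    'standard_partition_length': 10,
    'test_partition_length': 1,
}
user_config = {'global_missingness': {'threshold': 0.10}}

merged = {}
for key, value in defaults.items():
    if isinstance(value, dict):
        if key in user_config:
            # requested detector: default spec with the user's threshold
            merged[key] = {**value, 'threshold': user_config[key]['threshold']}
        # detectors the user did not request are skipped
    else:
        merged[key] = value  # partition lengths are always carried over
```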
167 changes: 167 additions & 0 deletions viewser/commands/queryset/drift_detection.py
@@ -0,0 +1,167 @@
import numpy as np
import scipy
from views_tensor_utilities import objects, mappings
from . import config_drift as config
import datetime


class InputAlarm:
def __repr__(self):
return f"Input alarm: {self.message} Severity: {self.severity} Timestamp: {self.timestamp}\n"

def __str__(self):
return f"Input alarm: {self.message} Severity: {self.severity} Timestamp: {self.timestamp}\n"

def __init__(self, message, severity=1):
self.message = message
self.severity = severity
self.timestamp = datetime.datetime.now().strftime("%Y-%m-%d %H:%M:%S")


class Tester:

"""
Tester
Class that mediates between the InputGate class and the testing functions. Brings together the relevant
test function, the partitions, the threshold, the message describing the test in the alarms, and the input data
- tensor, index, and feature names (required for useful reporting in the alarms)
"""

def __init__(self, test_function=None,
test_partition_length=1,
standard_partition_length=1,
threshold=0,
message='',
data=None,
index=None,
features=None,
):

self.test_function = test_function
self.test_partition_length = test_partition_length
self.standard_partition_length = standard_partition_length
self.threshold = threshold
self.message = message
self.data = data
self.index = index
self.features = features

def generate_alarms(self):

"""
generate alarms
Calls the object's assigned testing function with a kwarg dict that the function picks and chooses from
as needed.
The function returns an array of results and a dictionary translating the indexes of the results in the
array into units of analysis or features for reporting in the alarms.
"""

results, translation_dict = self.test_function(
tensor=self.data,
index=self.index,
features=self.features,
test_partition_length=self.test_partition_length,
standard_partition_length=self.standard_partition_length)

results /= self.threshold

try:
offenders = np.where(results > 1)[0]
severities = results[offenders]
        except TypeError:
            # the test function returned nothing usable to compare
            return None

if len(offenders) > 0:
alarms = []
for offender, severity in zip(offenders, severities):
if translation_dict is not None:
offender_id = translation_dict[offender]
else:
offender_id = offender

al = InputAlarm(
f"{self.message}; offender: {offender_id}, "
f"threshold: {self.threshold}",
int(1+severity))
alarms.append(al)

return alarms
else:
return f"{self.message} passed"


class InputGate:

"""
InputGate
Class which superintends the input warning machinery. Accepts a dataframe containing the data to be examined
and a configuration dictionary which users can use to override the default settings in config_drift.
The df is converted to a tensor container and non-numeric parts of the data are stripped out.
"""

def __init__(self, df, drift_config_dict=None):
self.config_dict = drift_config_dict
self.tensor_container = objects.ViewsDataframe(df).to_numpy_time_space()
self.numeric_part = self.tensor_container.get_numeric_part()
self.tensor = self.numeric_part.tensor
self.index = self.tensor_container.index
self.columns = self.numeric_part.columns

self.default_config_dict = config.default_config_dict
self.testers = []

def assemble_alerts(self):
"""
assemble_alerts
Method which compares the default configuration dictionary with that supplied when the InputGate object is
instantiated, removes test functions that are not required and updates thresholds of those requested.
The resulting configuration dictionary is then used to generate a list of Tester objects, whose
generate_alarm methods are then called in sequence.
"""

        # override defaults
        if self.config_dict is None:
            self.config_dict = self.default_config_dict
        else:
            for key in self.default_config_dict.keys():
                if key in self.config_dict.keys():
                    try:
                        # requested detector: adopt the default spec but
                        # override its threshold with the user-supplied value
                        detector_dict = self.default_config_dict[key]
                        detector_dict['threshold'] = self.config_dict[key]['threshold']
                        self.config_dict[key] = detector_dict
                    except (KeyError, TypeError):
                        pass
                else:
                    try:
                        _ = self.default_config_dict[key]['threshold']
                    except TypeError:
                        # non-detector entries (e.g. the partition lengths) are
                        # not subscriptable and are always carried over
                        self.config_dict[key] = self.default_config_dict[key]

testers = []
for key in self.config_dict.keys():
try:
tester_dict = self.config_dict[key]
testers.append(Tester(test_function=tester_dict['test_function'],
test_partition_length=self.config_dict['test_partition_length'],
standard_partition_length=self.config_dict['standard_partition_length'],
threshold=tester_dict['threshold'],
message=tester_dict['message'],
data=self.tensor,
index=self.index,
features=self.columns,
))
            except (KeyError, TypeError):
                # skip non-detector entries such as the partition lengths
                pass

return [tester.generate_alarms() for tester in testers]
