
"model_monitor_compute_histogram_buckets" Crashes When Getting Columns in Common #1983

Open
sraza-onshape opened this issue Dec 18, 2023 · 4 comments
Labels: bug (Something isn't working)

@sraza-onshape commented Dec 18, 2023

Steps to reproduce

  1. Create a data asset in Azure ML out of a CSV file. Use it for your training data - it should have 2 features and 1 target column (so, 3 columns in total).
  2. Train a supervised regression model on 1 of the features, so it can learn to predict the target.
  3. Deploy a production online endpoint - in the scoring script, implement custom logging to collect the data points that clients send in their requests.
  4. Implement a model monitor to compute data drift. Provide the data asset as your ReferenceData, and don't provide any argument for the ProductionData (a minimal sketch of this setup follows this list).
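
For reference, here is a minimal sketch of the step 4 setup, written with the azure.ai.ml entities that appear in the code later in this thread; the asset, endpoint, schedule, and column names are placeholders rather than our real resources:

from azure.ai.ml import Input, MLClient
from azure.ai.ml.constants import MonitorDatasetContext
from azure.ai.ml.entities import (
    DataDriftSignal,
    MonitorDefinition,
    MonitorSchedule,
    MonitoringTarget,
    RecurrenceTrigger,
    ReferenceData,
    ServerlessSparkCompute,
)

ml_client = MLClient(...)  # workspace/credential details omitted

# Training data asset registered in step 1, used as the reference (baseline) data.
reference_data = ReferenceData(
    input_data=Input(type="mltable", path="azureml:<data_asset_name>:<version>"),
    data_context=MonitorDatasetContext.TRAINING,
    target_column_name="<target_column>",
)

# Note: no production_data argument here -- this is the configuration that crashes.
drift_signal = DataDriftSignal(reference_data=reference_data)

monitor_definition = MonitorDefinition(
    compute=ServerlessSparkCompute(
        instance_type="standard_e4s_v3", runtime_version="3.2"
    ),
    monitoring_target=MonitoringTarget(
        ml_task="regression",
        endpoint_deployment_id="azureml:<endpoint>:<deployment>",
    ),
    monitoring_signals={"data_drift": drift_signal},
)

monitor_schedule = MonitorSchedule(
    name="<monitor_name>",
    trigger=RecurrenceTrigger(frequency="day", interval=1),
    create_monitor=monitor_definition,
)
ml_client.schedules.begin_create_or_update(monitor_schedule).result()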

Expected behavior

When the pipeline runs, it should succeed.

Actual behavior

In the pipeline, we have an error where the DataDriftSignal is computed:

Screenshot 2023-12-18 at 2 44 19 PM

Within the "sub-pipeline", the error itself occurs in the node that does compute_histogram_buckets:

Screenshot 2023-12-18 at 2 46 01 PM

And this is the info provided by the stderrorlogs.txt:

[2023-12-18 16:11:48Z] Job failed, job RunId is 22010c80-6a8e-4544-8269-aebb277c2e92. 
Error: {
    "Error" : {
        "Code":"UserError",
        "Severity":null,
        "Message":"'NoneType' object has no attribute 'dtypes'",
        "MessageFormat":null,
        "MessageParameters":{},
        "ReferenceCode":null,
        "DetailsUri":null,
        "Target":null,
        "Details":[],
        "InnerError":null,
        "DebugInfo":{
            "Type":"AttributeError",
            "Message":"'NoneType' object has no attribute 'dtypes'",
            "StackTrace":"  
                File \"/home/trusted-service-user/cluster-env/env/lib/python3.10/site-packages/context_manager_injector.py\", line 243, in execute_with_context\n
                      runpy.run_path(sys.argv[0], globals(), run_name=\"__main__\")\n
                File \"/home/trusted-service-user/cluster-env/env/lib/python3.10/runpy.py\", line 289, in run_path\n
                        return _run_module_code(code, init_globals, run_name,\n
                File \"/home/trusted-service-user/cluster-env/env/lib/python3.10/runpy.py\", line 96, in _run_module_code\n
                        _run_code(code, mod_globals, init_globals,\n
                File \"/home/trusted-service-user/cluster-env/env/lib/python3.10/runpy.py\", line 86, in _run_code\n
                        exec(code, run_globals)\n
                File \"model_monitor_compute_histogram_buckets/run.py\", line 47, in <module>\n
                        run()\n
                File \"model_monitor_compute_histogram_buckets/run.py\", line 42, in run\n
                    histogram_buckets = compute_histogram_buckets(df1, df2)\n
                File \"/mnt/var/hadoop/tmp/nm-local-dir/usercache/trusted-service-user/appcache/application_1702915399663_0002/container_1702915399663_0002_01_000001/model_monitor_compute_histogram_buckets/histogram_buckets.py\", line 51, in compute_histogram_buckets\n
                    bin_edges = compute_numerical_bins(df1, df2)\n
                File \"/mnt/var/hadoop/tmp/nm-local-dir/usercache/trusted-service-user/appcache/application_1702915399663_0002/container_1702915399663_0002_01_000001/model_monitor_compute_histogram_buckets/histogram_buckets.py\", line 25, in compute_numerical_bins\n
                    common_columns_dict = get_common_columns(df1, df2)\n
                File \"/mnt/var/hadoop/tmp/nm-local-dir/usercache/trusted-service-user/appcache/application_1702915399663_0002/container_1702915399663_0002_01_000001/source.zip/shared_utilities/df_utils.py\", line 93, in get_common_columns\n
                    production_df_dtypes = dict(production_df.dtypes)\n",
            "InnerException":null,
            "Data":null,
            "ErrorResponse":null
        },
        "AdditionalInfo":null
    },
    "Correlation":null,
    "Environment":null,
    "Location":null,
    "Time":"0001-01-01T00:00:00+00:00",
    "ComponentName":null
}
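
Reading the traceback, the crash happens in shared_utilities/df_utils.py: get_common_columns receives a production DataFrame that is None, and the first .dtypes access fails. Below is a minimal sketch (our reading of the failure, with assumed parameter names, not the repo's actual implementation) of the kind of guard that would surface this as a clearer validation error:

def get_common_columns(baseline_df, production_df):
    """Return {column: dtype} for the columns present in both DataFrames (sketch)."""
    # Guard against the case hit in this issue: no production data was provided,
    # so the upstream component passes None instead of a Spark DataFrame.
    if baseline_df is None or production_df is None:
        raise ValueError(
            "Both a baseline and a production DataFrame are required to compute "
            "common columns; got production_df=None. Was ProductionData supplied "
            "to the monitoring signal?"
        )

    baseline_df_dtypes = dict(baseline_df.dtypes)
    production_df_dtypes = dict(production_df.dtypes)

    # Keep only the columns present in both DataFrames.
    return {
        column: dtype
        for column, dtype in baseline_df_dtypes.items()
        if column in production_df_dtypes
    }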

Additional information

Please let us know if this bug is due to an error on our end, i.e. a misunderstanding of how to use Azure Machine Learning. For context, here are the tutorials we've followed so far (for Steps 1-4) while learning the tool:

  1. Creating data assets
  2. Training models
  3. Deploying online endpoints and collecting inferencing data from them.
  4. Implementing advanced model monitoring
sraza-onshape added the bug (Something isn't working) label Dec 18, 2023
@VivienneTang (Contributor)

Hi @sraza-onshape, you are hitting this issue because we expect both a production data set and a reference data set. Let us loop in our PM to decide whether we will support the case where the production data is null.

@VivienneTang (Contributor) commented Dec 19, 2023

@sraza-onshape, just to clarify: to compute data drift we need two datasets, a baseline dataset and a target dataset. Data drift compares the distribution of the training data (referred to as the baseline or reference data) against the target (or production) data, so you have to provide production data in your case.
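
Concretely, both datasets get wired into the signal. A minimal sketch, reusing the ProductionData / ReferenceData entities shown in the other snippets in this thread (the paths and target column are placeholders):

from azure.ai.ml import Input
from azure.ai.ml.constants import MonitorDatasetContext
from azure.ai.ml.entities import DataDriftSignal, ProductionData, ReferenceData

# Target (production) data: what the deployed model actually receives.
production_data = ProductionData(
    input_data=Input(type="uri_folder", path="azureml:<collected_model_inputs>:1"),
    data_context=MonitorDatasetContext.MODEL_INPUTS,
)

# Baseline (reference) data: the training data the model was fitted on.
reference_data = ReferenceData(
    input_data=Input(type="mltable", path="azureml:<training_data_asset>:<version>"),
    data_context=MonitorDatasetContext.TRAINING,
    target_column_name="<target_column>",
)

# The drift signal compares the two distributions.
drift_signal = DataDriftSignal(
    production_data=production_data,
    reference_data=reference_data,
)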

@sraza-onshape (Author)

Hi @VivienneTang and Team, thanks for the reply. I redefined the monitoring pipeline and it will run again soon.

In the meantime, this is the code we used to initialize ProductionData. I believe it's correct, but if you have any feedback on whether it's being done correctly, please let me know. For context, we are currently collecting the data sent to the production model in the default workspaceblobstore.

from azure.ai.ml import MLClient
from azure.ai.ml import Input
from azure.ai.ml.constants import (
    MonitorDatasetContext,
)
from azure.ai.ml.entities import (
    ProductionData,
)

ml_client = MLClient(...)

production_data_metadata = ml_client.datastores.get(name="workspaceblobstore")
production_data_metadata_dict = production_data_metadata._to_dict()
storage_uri = f"{production_data_metadata_dict['protocol']}://{production_data_metadata_dict['account_name']}.blob.{production_data_metadata_dict['endpoint']}/{production_data_metadata_dict['container_name']}"

production_data = ProductionData(
    input_data=Input(
        type="uri_folder",
        path=storage_uri,
    ),
    data_context=MonitorDatasetContext.MODEL_INPUTS,
)

@sraza-onshape (Author) commented Dec 29, 2023

Update: we have reimplemented our pipeline to include both ProductionData and ReferenceData, and we are still receiving errors.

  • So far I can confirm that both data assets passed to the pipeline have no permissions issues and are not empty.
  • I removed the DataQualitySignal so we can zero in on just debugging the data drift. Here's what the new pipeline definition looks like:
Screenshot 2023-12-29 at 1 21 55 PM

And then, this is the sub-pipeline:

Screenshot 2023-12-29 at 1 22 22 PM

The code for this pipeline is the following (I hope it reflects the changes stated in the previous bullets):

from azure.ai.ml.constants._monitoring import (
    MonitorFeatureDataType,
)
# Imports for the monitoring entities used below; these live in azure.ai.ml.entities,
# as in the "advanced model monitoring" tutorial referenced above. ProductionData,
# Input, MonitorDatasetContext, and ml_client carry over from the earlier snippet.
from azure.ai.ml.entities import (
    AlertNotification, CategoricalDriftMetrics, DataDriftMetricThreshold,
    DataDriftSignal, DataQualityMetricThreshold, DataQualityMetricsCategorical,
    DataQualityMetricsNumerical, DataQualitySignal, MonitorDefinition,
    MonitorSchedule, MonitoringTarget, NumericalDriftMetrics, RecurrencePattern,
    RecurrenceTrigger, ReferenceData, ServerlessSparkCompute,
)

feature_dtype_spec = {
    "intermediary_V": MonitorFeatureDataType.NUMERICAL,
}

new_production_data = ProductionData(
    input_data=Input(
        type="uri_folder",
        path="azureml:<data_asset_name>:1",
    ),
    data_context=MonitorDatasetContext.MODEL_INPUTS,
)

monitoring_target = MonitoringTarget(
    ml_task="regression",
    # in general - this follows a pattern of azureml (see above)
    endpoint_deployment_id="azureml:<endpoint>:<deployment>"
)

new_training_data_asset = ml_client.data.get(
    name="<name>",
    version="<version>"
)

spark_compute = ServerlessSparkCompute(
    instance_type="standard_e4s_v3",
    runtime_version="3.2"
)

# monitoring_target was defined above

# training data to be used as baseline dataset
reference_data_training = ReferenceData(
    input_data=Input(
        type="mltable",  # note that is MUST == "mltable", even if the asset isn't technically that type 
                         # (in this case, the type is a "uri_file")
        # just updating the path here - the data is the same, but the previous asset
        # that was being used here got corrupted. It doesn't work
        # anymore in our "test" workspace b/c I deleted the job that created it
        path=f"azureml:{new_training_data_asset.name}:{new_training_data_asset.version}"
    ),
    data_context=MonitorDatasetContext.TRAINING,
    target_column_name="target_W",
)

# create an advanced data drift signal
features_list = ['intermediary_V']
metric_thresholds = DataDriftMetricThreshold(
    numerical=NumericalDriftMetrics(
        jensen_shannon_distance=0.01
    ),
    categorical=CategoricalDriftMetrics(
        pearsons_chi_squared_test=0.02
    )
)

advanced_data_drift = DataDriftSignal(
    production_data=new_production_data,
    reference_data=reference_data_training,
    features=features_list,
    metric_thresholds=metric_thresholds,
    feature_type_override=feature_dtype_spec,
)


# create an advanced data quality signal
metric_thresholds = DataQualityMetricThreshold(
    numerical=DataQualityMetricsNumerical(
        null_value_rate=0.01
    ),
    categorical=DataQualityMetricsCategorical(
        out_of_bounds_rate=0.02
    )
)

advanced_data_quality = DataQualitySignal(
    reference_data=reference_data_training,
    features=features_list,
    metric_thresholds=metric_thresholds,
    feature_type_override=feature_dtype_spec,
)


# put all monitoring signals in a dictionary
monitoring_signals = {
    'data_drift_advanced':advanced_data_drift,
    # 'data_quality_advanced':advanced_data_quality,  # commenting out for now, to avoid overscoping the experiment
}

# create alert notification object
alert_notification = AlertNotification(
    emails=['sraza@ptc.com']
)

# Finally monitor definition
monitor_definition = MonitorDefinition(
    compute=spark_compute,
    monitoring_target=monitoring_target,
    monitoring_signals=monitoring_signals,
    alert_notification=alert_notification
)
recurrence_trigger = RecurrenceTrigger(
    frequency="day",
    interval=1,
    schedule=RecurrencePattern(hours=3, minutes=15)
)

model_monitor_v7 = MonitorSchedule(
    name="project_ultron_model_monitoring_advanced",
    trigger=recurrence_trigger,
    create_monitor=monitor_definition
)

poller = ml_client.schedules.begin_create_or_update(model_monitor_v7)
created_monitor = poller.result()
  • And these are the error logs (I believe they're very similar to the ones before):
[2023-12-29 16:23:18Z] Job failed, job RunId is 1611c9c1-8818-44a0-850f-c64d643616cf. 
Error: {
    "Error":{
        "Code":"UserError",
        "Severity":null,
        "Message":"'NoneType' object has no attribute 'dtypes'",
        "MessageFormat":null,
        "MessageParameters":{},
        "ReferenceCode":null,
        "DetailsUri":null,
        "Target":null,"
        Details":[],
        "InnerError":null,
        "DebugInfo":{
            "Type":"AttributeError",
            "Message":"'NoneType' object has no attribute 'dtypes'",
            "StackTrace":"  File \"/home/trusted-service-user/cluster-env/env/lib/python3.10/site-packages/context_manager_injector.py\", line 243, in execute_with_context\n    runpy.run_path(sys.argv[0], globals(), run_name=\"__main__\")\n  File \"/home/trusted-service-user/cluster-env/env/lib/python3.10/runpy.py\", line 289, in run_path\n    return _run_module_code(code, init_globals, run_name,\n  File \"/home/trusted-service-user/cluster-env/env/lib/python3.10/runpy.py\", line 96, in _run_module_code\n    _run_code(code, mod_globals, init_globals,\n  File \"/home/trusted-service-user/cluster-env/env/lib/python3.10/runpy.py\", line 86, in _run_code\n    exec(code, run_globals)\n  File \"model_monitor_compute_histogram_buckets/run.py\", line 47, in <module>\n    run()\n  File \"model_monitor_compute_histogram_buckets/run.py\", line 42, in run\n    histogram_buckets = compute_histogram_buckets(df1, df2)\n  File \"/mnt/var/hadoop/tmp/nm-local-dir/usercache/trusted-service-user/appcache/application_1703866631910_0002/container_1703866631910_0002_01_000001/model_monitor_compute_histogram_buckets/histogram_buckets.py\", line 51, in compute_histogram_buckets\n    bin_edges = compute_numerical_bins(df1, df2)\n  File \"/mnt/var/hadoop/tmp/nm-local-dir/usercache/trusted-service-user/appcache/application_1703866631910_0002/container_1703866631910_0002_01_000001/model_monitor_compute_histogram_buckets/histogram_buckets.py\", line 25, in compute_numerical_bins\n    common_columns_dict = get_common_columns(df1, df2)\n  File \"/mnt/var/hadoop/tmp/nm-local-dir/usercache/trusted-service-user/appcache/application_1703866631910_0002/container_1703866631910_0002_01_000001/source.zip/shared_utilities/df_utils.py\", line 93, in get_common_columns\n    production_df_dtypes = dict(production_df.dtypes)\n",
            "InnerException":null,
            "Data":null,
            "ErrorResponse":null
        },
        "AdditionalInfo":null
    },
    "Correlation":null,
    "Environment":null,
    "Location":null,
    "Time":"0001-01-01T00:00:00+00:00",
    "ComponentName":null
}
