
"model_monitor_compute_histogram_buckets" Crashes When Getting Columns in Common #1983

Open
sraza-onshape opened this issue Dec 18, 2023 · 4 comments
Labels: bug (Something isn't working)

@sraza-onshape commented Dec 18, 2023

Steps to reproduce

  1. Create a data asset in Azure ML out of a CSV file. Use it for your training data - it should have 2 features and 1 target column (so, 3 columns in total).
  2. Train a supervised regression model on 1 of the features, so it can learn to predict the target.
  3. Deploy a production online endpoint - in the scoring script, implement custom logging to collect the data points that clients send in their requests.
  4. Implement a model monitor to compute data drift. Provide the data asset as your ReferenceData, and don't provide any argument for the ProductionData (a minimal sketch of this setup follows this list).
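
For reference, here is a minimal sketch of the step 4 setup, written with the azure.ai.ml entities that appear in the code later in this thread; the asset, endpoint, schedule, and column names are placeholders rather than our real resources:

from azure.ai.ml import Input, MLClient
from azure.ai.ml.constants import MonitorDatasetContext
from azure.ai.ml.entities import (
    DataDriftSignal,
    MonitorDefinition,
    MonitorSchedule,
    MonitoringTarget,
    RecurrenceTrigger,
    ReferenceData,
    ServerlessSparkCompute,
)

ml_client = MLClient(...)  # workspace/credential details omitted

# Training data asset registered in step 1, used as the reference (baseline) data.
reference_data = ReferenceData(
    input_data=Input(type="mltable", path="azureml:<data_asset_name>:<version>"),
    data_context=MonitorDatasetContext.TRAINING,
    target_column_name="<target_column>",
)

# Note: no production_data argument here -- this is the configuration that crashes.
drift_signal = DataDriftSignal(reference_data=reference_data)

monitor_definition = MonitorDefinition(
    compute=ServerlessSparkCompute(
        instance_type="standard_e4s_v3", runtime_version="3.2"
    ),
    monitoring_target=MonitoringTarget(
        ml_task="regression",
        endpoint_deployment_id="azureml:<endpoint>:<deployment>",
    ),
    monitoring_signals={"data_drift": drift_signal},
)

monitor_schedule = MonitorSchedule(
    name="<monitor_name>",
    trigger=RecurrenceTrigger(frequency="day", interval=1),
    create_monitor=monitor_definition,
)
ml_client.schedules.begin_create_or_update(monitor_schedule).result()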

Expected behavior

When the pipeline runs, it should succeed.

Actual behavior

In the pipeline, we have an error where the DataDriftSignal is computed:

Screenshot 2023-12-18 at 2 44 19 PM

Within the "sub-pipeline", the error itself occurs in the node that does compute_histogram_buckets:

Screenshot 2023-12-18 at 2 46 01 PM

And this is the info provided by the stderrorlogs.txt:

[2023-12-18 16:11:48Z] Job failed, job RunId is 22010c80-6a8e-4544-8269-aebb277c2e92. 
Error: {
    "Error" : {
        "Code":"UserError",
        "Severity":null,
        "Message":"'NoneType' object has no attribute 'dtypes'",
        "MessageFormat":null,
        "MessageParameters":{},
        "ReferenceCode":null,
        "DetailsUri":null,
        "Target":null,
        "Details":[],
        "InnerError":null,
        "DebugInfo":{
            "Type":"AttributeError",
            "Message":"'NoneType' object has no attribute 'dtypes'",
            "StackTrace":"  
                File \"/home/trusted-service-user/cluster-env/env/lib/python3.10/site-packages/context_manager_injector.py\", line 243, in execute_with_context\n
                      runpy.run_path(sys.argv[0], globals(), run_name=\"__main__\")\n
                File \"/home/trusted-service-user/cluster-env/env/lib/python3.10/runpy.py\", line 289, in run_path\n
                        return _run_module_code(code, init_globals, run_name,\n
                File \"/home/trusted-service-user/cluster-env/env/lib/python3.10/runpy.py\", line 96, in _run_module_code\n
                        _run_code(code, mod_globals, init_globals,\n
                File \"/home/trusted-service-user/cluster-env/env/lib/python3.10/runpy.py\", line 86, in _run_code\n
                        exec(code, run_globals)\n
                File \"model_monitor_compute_histogram_buckets/run.py\", line 47, in <module>\n
                        run()\n
                File \"model_monitor_compute_histogram_buckets/run.py\", line 42, in run\n
                    histogram_buckets = compute_histogram_buckets(df1, df2)\n
                File \"/mnt/var/hadoop/tmp/nm-local-dir/usercache/trusted-service-user/appcache/application_1702915399663_0002/container_1702915399663_0002_01_000001/model_monitor_compute_histogram_buckets/histogram_buckets.py\", line 51, in compute_histogram_buckets\n
                    bin_edges = compute_numerical_bins(df1, df2)\n
                File \"/mnt/var/hadoop/tmp/nm-local-dir/usercache/trusted-service-user/appcache/application_1702915399663_0002/container_1702915399663_0002_01_000001/model_monitor_compute_histogram_buckets/histogram_buckets.py\", line 25, in compute_numerical_bins\n
                    common_columns_dict = get_common_columns(df1, df2)\n
                File \"/mnt/var/hadoop/tmp/nm-local-dir/usercache/trusted-service-user/appcache/application_1702915399663_0002/container_1702915399663_0002_01_000001/source.zip/shared_utilities/df_utils.py\", line 93, in get_common_columns\n
                    production_df_dtypes = dict(production_df.dtypes)\n",
            "InnerException":null,
            "Data":null,
            "ErrorResponse":null
        },
        "AdditionalInfo":null
    },
    "Correlation":null,
    "Environment":null,
    "Location":null,
    "Time":"0001-01-01T00:00:00+00:00",
    "ComponentName":null
}
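
Reading the traceback, the crash happens in shared_utilities/df_utils.py: get_common_columns receives a production DataFrame that is None, and the first .dtypes access fails. Below is a minimal sketch (our reading of the failure, with assumed parameter names, not the repo's actual implementation) of the kind of guard that would surface this as a clearer validation error:

def get_common_columns(baseline_df, production_df):
    """Return {column: dtype} for the columns present in both DataFrames (sketch)."""
    # Guard against the case hit in this issue: no production data was provided,
    # so the upstream component passes None instead of a Spark DataFrame.
    if baseline_df is None or production_df is None:
        raise ValueError(
            "Both a baseline and a production DataFrame are required to compute "
            "common columns; got production_df=None. Was ProductionData supplied "
            "to the monitoring signal?"
        )

    baseline_df_dtypes = dict(baseline_df.dtypes)
    production_df_dtypes = dict(production_df.dtypes)

    # Keep only the columns present in both DataFrames.
    return {
        column: dtype
        for column, dtype in baseline_df_dtypes.items()
        if column in production_df_dtypes
    }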

Additional information

Please let us know if this bug is due to an error on our end, i.e. a misunderstanding of how to use Azure Machine Learning. For context, here are the tutorials we've followed so far (for Steps 1-4) while learning the tool:

  1. Creating data assets
  2. Training models
  3. Deploying online endpoints and collecting inferencing data from them.
  4. Implementing advanced model monitoring
sraza-onshape added the bug (Something isn't working) label Dec 18, 2023
@VivienneTang (Contributor)

Hi @sraza-onshape, you are hitting this issue because we expect both a production data set and a reference data set. Let us loop in our PM to decide whether we will support the case where the production data is null.

@VivienneTang (Contributor) commented Dec 19, 2023

@sraza-onshape, just to clarify: to compute data drift we need two datasets, a baseline dataset and a target dataset. Data drift compares the distribution of the training data (referred to as the baseline or reference data) against the target (or production) data, so you have to provide production data in your case.
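
Concretely, both datasets get wired into the signal. A minimal sketch, reusing the ProductionData / ReferenceData entities shown in the other snippets in this thread (the paths and target column are placeholders):

from azure.ai.ml import Input
from azure.ai.ml.constants import MonitorDatasetContext
from azure.ai.ml.entities import DataDriftSignal, ProductionData, ReferenceData

# Target (production) data: what the deployed model actually receives.
production_data = ProductionData(
    input_data=Input(type="uri_folder", path="azureml:<collected_model_inputs>:1"),
    data_context=MonitorDatasetContext.MODEL_INPUTS,
)

# Baseline (reference) data: the training data the model was fitted on.
reference_data = ReferenceData(
    input_data=Input(type="mltable", path="azureml:<training_data_asset>:<version>"),
    data_context=MonitorDatasetContext.TRAINING,
    target_column_name="<target_column>",
)

# The drift signal compares the two distributions.
drift_signal = DataDriftSignal(
    production_data=production_data,
    reference_data=reference_data,
)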

@sraza-onshape (Author)

Hi @VivienneTang and Team, thanks for the reply. I redefined the monitoring pipeline and it will run again soon.

In the meantime, this is the code we used to initialize ProductionData. I believe it's correct, but if you have any feedback on whether it's being done correctly, please let me know. For context, we are currently collecting the data sent to the production model in the default workspaceblobstore.

from azure.ai.ml import MLClient
from azure.ai.ml import Input
from azure.ai.ml.constants import (
    MonitorDatasetContext,
)
from azure.ai.ml.entities import (
    ProductionData,
)

ml_client = MLClient(...)

production_data_metadata = ml_client.datastores.get(name="workspaceblobstore")
production_data_metadata_dict = production_data_metadata._to_dict()
storage_uri = f"{production_data_metadata_dict['protocol']}://{production_data_metadata_dict['account_name']}.blob.{production_data_metadata_dict['endpoint']}/{production_data_metadata_dict['container_name']}"

production_data = ProductionData(
    input_data=Input(
        type="uri_folder",
        path=storage_uri,
    ),
    data_context=MonitorDatasetContext.MODEL_INPUTS,
)

@sraza-onshape (Author) commented Dec 29, 2023

Update: we have reimplemented our pipeline to include both ProductionData and ReferenceData, and we are still receiving errors.

  • So far I can confirm that both data assets passed to the pipeline have no permissions issues and are not empty.
  • I removed the DataQualitySignal so we can zero in on just debugging the data drift. Here's what the new pipeline definition looks like:
Screenshot 2023-12-29 at 1 21 55 PM

And then, this is the sub-pipeline:

Screenshot 2023-12-29 at 1 22 22 PM

The code for this pipeline is the following (I hope it reflects the changes stated in the previous bullets):

from azure.ai.ml.constants._monitoring import (
    MonitorFeatureDataType,
)
# Imports for the monitoring entities used below; these live in azure.ai.ml.entities,
# as in the "advanced model monitoring" tutorial referenced above. ProductionData,
# Input, MonitorDatasetContext, and ml_client carry over from the earlier snippet.
from azure.ai.ml.entities import (
    AlertNotification, CategoricalDriftMetrics, DataDriftMetricThreshold,
    DataDriftSignal, DataQualityMetricThreshold, DataQualityMetricsCategorical,
    DataQualityMetricsNumerical, DataQualitySignal, MonitorDefinition,
    MonitorSchedule, MonitoringTarget, NumericalDriftMetrics, RecurrencePattern,
    RecurrenceTrigger, ReferenceData, ServerlessSparkCompute,
)

feature_dtype_spec = {
    "intermediary_V": MonitorFeatureDataType.NUMERICAL,
}

new_production_data = ProductionData(
    input_data=Input(
        type="uri_folder",
        path="azureml:<data_asset_name>:1",
    ),
    data_context=MonitorDatasetContext.MODEL_INPUTS,
)

monitoring_target = MonitoringTarget(
    ml_task="regression",
    # in general - this follows a pattern of azureml (see above)
    endpoint_deployment_id="azureml:<endpoint>:<deployment>"
)

new_training_data_asset = ml_client.data.get(
    name="<name>",
    version="<version>"
)

spark_compute = ServerlessSparkCompute(
    instance_type="standard_e4s_v3",
    runtime_version="3.2"
)

# monitoring_target was defined above

# training data to be used as baseline dataset
reference_data_training = ReferenceData(
    input_data=Input(
        type="mltable",  # note that is MUST == "mltable", even if the asset isn't technically that type 
                         # (in this case, the type is a "uri_file")
        # just updating the path here - the data is the same, but the previous asset
        # that was being used here got corrupted. It doesn't work
        # anymore in our "test" workspace b/c I deleted the job that created it
        path=f"azureml:{new_training_data_asset.name}:{new_training_data_asset.version}"
    ),
    data_context=MonitorDatasetContext.TRAINING,
    target_column_name="target_W",
)

# create an advanced data drift signal
features_list = ['intermediary_V']
metric_thresholds = DataDriftMetricThreshold(
    numerical=NumericalDriftMetrics(
        jensen_shannon_distance=0.01
    ),
    categorical=CategoricalDriftMetrics(
        pearsons_chi_squared_test=0.02
    )
)

advanced_data_drift = DataDriftSignal(
    production_data=new_production_data,
    reference_data=reference_data_training,
    features=features_list,
    metric_thresholds=metric_thresholds,
    feature_type_override=feature_dtype_spec,
)


# create an advanced data quality signal
metric_thresholds = DataQualityMetricThreshold(
    numerical=DataQualityMetricsNumerical(
        null_value_rate=0.01
    ),
    categorical=DataQualityMetricsCategorical(
        out_of_bounds_rate=0.02
    )
)

advanced_data_quality = DataQualitySignal(
    reference_data=reference_data_training,
    features=features_list,
    metric_thresholds=metric_thresholds,
    feature_type_override=feature_dtype_spec,
)


# put all monitoring signals in a dictionary
monitoring_signals = {
    'data_drift_advanced':advanced_data_drift,
    # 'data_quality_advanced':advanced_data_quality,  # commenting out for now, to avoid overscoping the experiment
}

# create alert notification object
alert_notification = AlertNotification(
    emails=['sraza@ptc.com']
)

# Finally monitor definition
monitor_definition = MonitorDefinition(
    compute=spark_compute,
    monitoring_target=monitoring_target,
    monitoring_signals=monitoring_signals,
    alert_notification=alert_notification
)
recurrence_trigger = RecurrenceTrigger(
    frequency="day",
    interval=1,
    schedule=RecurrencePattern(hours=3, minutes=15)
)

model_monitor_v7 = MonitorSchedule(
    name="project_ultron_model_monitoring_advanced",
    trigger=recurrence_trigger,
    create_monitor=monitor_definition
)

poller = ml_client.schedules.begin_create_or_update(model_monitor_v7)
created_monitor = poller.result()
  • And these are the error logs (I believe they're very similar to the ones before):
[2023-12-29 16:23:18Z] Job failed, job RunId is 1611c9c1-8818-44a0-850f-c64d643616cf. 
Error: {
    "Error":{
        "Code":"UserError",
        "Severity":null,
        "Message":"'NoneType' object has no attribute 'dtypes'",
        "MessageFormat":null,
        "MessageParameters":{},
        "ReferenceCode":null,
        "DetailsUri":null,
        "Target":null,"
        Details":[],
        "InnerError":null,
        "DebugInfo":{
            "Type":"AttributeError",
            "Message":"'NoneType' object has no attribute 'dtypes'",
            "StackTrace":"  File \"/home/trusted-service-user/cluster-env/env/lib/python3.10/site-packages/context_manager_injector.py\", line 243, in execute_with_context\n    runpy.run_path(sys.argv[0], globals(), run_name=\"__main__\")\n  File \"/home/trusted-service-user/cluster-env/env/lib/python3.10/runpy.py\", line 289, in run_path\n    return _run_module_code(code, init_globals, run_name,\n  File \"/home/trusted-service-user/cluster-env/env/lib/python3.10/runpy.py\", line 96, in _run_module_code\n    _run_code(code, mod_globals, init_globals,\n  File \"/home/trusted-service-user/cluster-env/env/lib/python3.10/runpy.py\", line 86, in _run_code\n    exec(code, run_globals)\n  File \"model_monitor_compute_histogram_buckets/run.py\", line 47, in <module>\n    run()\n  File \"model_monitor_compute_histogram_buckets/run.py\", line 42, in run\n    histogram_buckets = compute_histogram_buckets(df1, df2)\n  File \"/mnt/var/hadoop/tmp/nm-local-dir/usercache/trusted-service-user/appcache/application_1703866631910_0002/container_1703866631910_0002_01_000001/model_monitor_compute_histogram_buckets/histogram_buckets.py\", line 51, in compute_histogram_buckets\n    bin_edges = compute_numerical_bins(df1, df2)\n  File \"/mnt/var/hadoop/tmp/nm-local-dir/usercache/trusted-service-user/appcache/application_1703866631910_0002/container_1703866631910_0002_01_000001/model_monitor_compute_histogram_buckets/histogram_buckets.py\", line 25, in compute_numerical_bins\n    common_columns_dict = get_common_columns(df1, df2)\n  File \"/mnt/var/hadoop/tmp/nm-local-dir/usercache/trusted-service-user/appcache/application_1703866631910_0002/container_1703866631910_0002_01_000001/source.zip/shared_utilities/df_utils.py\", line 93, in get_common_columns\n    production_df_dtypes = dict(production_df.dtypes)\n",
            "InnerException":null,
            "Data":null,
            "ErrorResponse":null
        },
        "AdditionalInfo":null
    },
    "Correlation":null,
    "Environment":null,
    "Location":null,
    "Time":"0001-01-01T00:00:00+00:00",
    "ComponentName":null
}
