Handling sensitive data sent to remote services #646

yoavkatz · 2024-03-11T07:48:10Z

With the introduction of metrics that can send data to remote services, we need a safe way to avoid accidentally sending proprietary or confidential data to external services.

In unitxt, metrics and datasets are commonly developed by different people, who may not be aware of each other's implementations, so this is extremely error prone.

To address this, we need a way for:

- the dataset owner to specify the data classification for each dataset (or instance). The taxonomy should be defined by the user.
- the metric owner to define whether the metric is safe for all data (e.g. because it runs locally), or to let the user of the metric specify which data classifications may be used with it.

Suggested approach:
Each loader will have an additional list[str] parameter called 'data_classification'. Different loaders can have different defaults. For example, LoadHF can default to "public", while another loader can default to "proprietary". The user can override these for specific datasets, e.g. with "PII".

loader=LoadFromIBMCloud(
        endpoint_url_env="MY_COS_URL",
        aws_access_key_id_env="MY_COS_ACCESS_KEY_ID",
        aws_secret_access_key_env="MY_COS_SECRET_ACCESS_KEY",
        bucket_name="...",
        data_dir="....",
        data_files=["train.jsonl", "test.jsonl"],
        data_classification=["proprietary", "pii"]
    ),

The loaders will add the list as a field to all the instances in the loaded datasets.
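A minimal sketch of that step, assuming instances are plain dicts and the field is named `data_classification`. The helper `add_data_classification` is hypothetical and only illustrates the idea; it is not existing unitxt API.

```python
from typing import Any, Dict, Iterable, Iterator, List


def add_data_classification(
    instances: Iterable[Dict[str, Any]],
    data_classification: List[str],
) -> Iterator[Dict[str, Any]]:
    """Yield copies of the instances, each tagged with the loader's classification."""
    for instance in instances:
        tagged = dict(instance)  # shallow copy, so the caller's dict is left untouched
        # Assumption: the field name mirrors the loader parameter name.
        tagged["data_classification"] = list(data_classification)
        yield tagged


raw_instances = [{"text": "hello"}, {"text": "world"}]
tagged_instances = list(add_data_classification(raw_instances, ["public", "pii"]))
```

Copying the list per instance avoids several instances sharing one mutable list, which matters if a later step edits the classification of a single instance.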

Each base metric class will check, in its compute() function, that the data classifications of every instance are allowed, by calling check_allowed_data_classification(instance).

The default implementation of check_allowed_data_classification will read the list of allowed data classifications from a metric-specific environment variable.

If an instance has a classification that is not allowed, an error message of this type will be generated:

"The following instance has data classification of '{instance_data_classification}', however the {metric} is only configured to support data with classification '{allowed_data_classification}'. To allow this, set the environment variable {env_var} to include '{instance_data_classification}'."
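The check could look roughly like the sketch below. Several details are assumptions that the proposal does not settle: the environment variable holds a comma-separated list, an unset variable allows nothing, an instance without the field passes, and the function takes the metric name and variable name as extra arguments. The variable name `MY_METRIC_ALLOWED_DATA_CLASSIFICATION` is made up for the example.

```python
import os
from typing import Any, Dict, List


def check_allowed_data_classification(
    instance: Dict[str, Any], metric_name: str, env_var: str
) -> None:
    """Raise ValueError if the instance has a classification the metric may not see."""
    # Assumption: the variable holds a comma-separated list, e.g. "public,pii".
    raw = os.environ.get(env_var, "")
    allowed: List[str] = [c.strip() for c in raw.split(",") if c.strip()]
    # Assumption: an instance with no classification field is treated as unrestricted.
    instance_classification = instance.get("data_classification", [])
    disallowed = [c for c in instance_classification if c not in allowed]
    if disallowed:
        raise ValueError(
            f"The following instance has data classification of "
            f"'{instance_classification}', however the {metric_name} is only "
            f"configured to support data with classification '{allowed}'. "
            f"To allow this, set the environment variable {env_var} to "
            f"include '{disallowed}'."
        )


ENV_VAR = "MY_METRIC_ALLOWED_DATA_CLASSIFICATION"  # hypothetical name
os.environ[ENV_VAR] = "public"

# Passes: every classification on the instance is in the allowed list.
check_allowed_data_classification({"data_classification": ["public"]}, "my_metric", ENV_VAR)

# Fails: "pii" is not in the allowed list, so the metric refuses the instance.
try:
    check_allowed_data_classification(
        {"data_classification": ["public", "pii"]}, "my_metric", ENV_VAR
    )
except ValueError as err:
    rejection_message = str(err)
```

Requiring every classification on the instance to be allowed, rather than any one of them, is the conservative reading of the proposal: an instance tagged both "public" and "pii" is still rejected by a metric that only accepts "public".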

@elronbandel @eladven @perlitz - Please review.


elronbandel · 2024-03-13T07:53:01Z

I agree @yoavkatz . This is a good solution.

yoavkatz changed the title ~~Handling data access in metrics~~ Handling sensitive data sent to remote services Mar 11, 2024

pawelknes mentioned this issue May 8, 2024

Support for handling sensitive data sent to remote services #806

Open
