Handling sensitive data sent to remote services #646

Open
yoavkatz opened this issue Mar 11, 2024 · 1 comment

Comments

@yoavkatz
Member

yoavkatz commented Mar 11, 2024

With the introduction of metrics that can send data to remote services, we need a safe way to avoid accidentally sending proprietary/confidential data to external services.

In unitxt, metrics and datasets are commonly developed by different people, who may not be aware of each other or of each other's implementations, so this is extremely error prone.

To address this, we need a way for:

  1. the dataset owner to specify the data classification for each dataset (or instance). The taxonomy should be defined by the user.
  2. the metric owner to define whether the metric is safe for all data (e.g. because it runs locally), or to let the user of the metric specify which data classifications are allowed to be used with it.

Suggested approach:
Each loader will have an additional list[str] parameter called 'data_classification'. Different loaders can have different defaults. For example, LoadHF can set the default to "public", while another loader can set it to "proprietary". The user can override these for specific datasets, e.g. with "PII".

loader=LoadFromIBMCloud(
        endpoint_url_env="MY_COS_URL",
        aws_access_key_id_env="MY_COS_ACCESS_KEY_ID",
        aws_secret_access_key_env="MY_COS_SECRET_ACCESS_KEY",
        bucket_name="...",
        data_dir="....",
        data_files=["train.jsonl", "test.jsonl"],
        data_classification=["proprietary", "pii"],
    ),

The loaders will add the list as a field to all the instances in the loaded datasets.
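The loader behaviour described above could be sketched roughly as follows. This is only an illustration of the proposal, not unitxt code: `SketchLoader`, `SketchPublicLoader`, the `tag` method, and the `default_data_classification` attribute are hypothetical names, and the labels used are just examples of a user-defined taxonomy.

```python
from typing import Any, Dict, Iterable, Iterator, List, Optional


class SketchLoader:
    """Hypothetical stand-in for a unitxt loader; not the real base class."""

    # Assumed per-loader default; subclasses may choose a different one.
    default_data_classification: List[str] = ["confidential"]

    def __init__(self, data_classification: Optional[List[str]] = None):
        # An explicit user value overrides the loader's default.
        if data_classification is None:
            data_classification = self.default_data_classification
        self.data_classification = list(data_classification)

    def tag(self, instances: Iterable[Dict[str, Any]]) -> Iterator[Dict[str, Any]]:
        # Copy each instance and attach the classification list as a field.
        for instance in instances:
            yield {**instance, "data_classification": list(self.data_classification)}


class SketchPublicLoader(SketchLoader):
    """Hypothetical loader for public sources (in the spirit of LoadHF)."""

    default_data_classification = ["public"]


# Default classification comes from the loader class.
tagged_default = list(SketchPublicLoader().tag([{"text": "hello"}]))

# The user overrides the default for a specific dataset.
tagged_override = list(
    SketchLoader(data_classification=["confidential", "pii"]).tag([{"text": "hi"}])
)
```

Storing the list on every instance, rather than once per dataset, is what would let a metric check classifications without knowing which loader produced the data.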

In its compute() function, each base metric class will verify that every instance's data classifications are allowed, by calling check_allowed_data_classification(instance).

The default implementation of check_allowed_data_classification will read a metric-specific environment variable that holds the list of allowed data classifications.
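A minimal sketch of such a default check might look like the following. Several details are assumptions, since the issue does not specify them: the environment variable naming scheme in `env_var_name`, the comma-separated value format, the fallback of allowing only "public" when the variable is unset, and treating an instance with no classification field as allowed.

```python
import os
from typing import Any, Dict, List, Optional, Sequence


def env_var_name(metric_name: str) -> str:
    # Hypothetical naming scheme; the issue does not specify one.
    return f"UNITXT_{metric_name.upper()}_ALLOWED_DATA_CLASSIFICATION"


def read_allowed(metric_name: str) -> Optional[List[str]]:
    # Assumed format: a comma-separated list. Returns None when unset.
    raw = os.environ.get(env_var_name(metric_name))
    if raw is None:
        return None
    return [item.strip() for item in raw.split(",") if item.strip()]


def check_allowed_data_classification(
    instance: Dict[str, Any],
    metric_name: str,
    default_allowed: Sequence[str] = ("public",),
) -> bool:
    # Fall back to an assumed conservative default when nothing is configured.
    allowed = read_allowed(metric_name)
    if allowed is None:
        allowed = list(default_allowed)
    # Every label on the instance must be in the allowed list.
    return all(label in allowed for label in instance.get("data_classification", []))
```

Requiring every label to be allowed (rather than any one of them) is the stricter reading of the proposal: an instance tagged with both a public and a PII label would be rejected unless both are configured.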

If any classification is not allowed, an error message of this form will be generated:

"The following instance has data classification of '{instance_data_classification}', however the {metric} is only configured to support data with classification '{allowed_data_classification}'. To allow this, set the environment variable {env_var} to include '{instance_data_classification}'."

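As a rough illustration, raising an error along the lines of the template above could look like this. The exception type `DataClassificationError` and the function `verify_data_classification` are hypothetical names, and the allowed list and environment variable name are passed in explicitly to keep the sketch self-contained.

```python
from typing import Any, Dict, List


class DataClassificationError(ValueError):
    """Hypothetical exception type; the issue does not name one."""


def verify_data_classification(
    instance: Dict[str, Any],
    metric: str,
    allowed: List[str],
    env_var: str,
) -> None:
    # Labels attached to the instance by the loader (empty if missing).
    instance_labels = instance.get("data_classification", [])
    if all(label in allowed for label in instance_labels):
        return
    # The message tells the user which variable to change and how.
    raise DataClassificationError(
        f"The following instance has data classification of '{instance_labels}', "
        f"however the {metric} is only configured to support data with "
        f"classification '{allowed}'. To allow this, set the environment "
        f"variable {env_var} to include '{instance_labels}'."
    )
```

Failing loudly before any data leaves the process, instead of silently skipping the instance, keeps the safeguard visible to whoever runs the metric.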
@elronbandel @eladven @perlitz - Please review.

@yoavkatz changed the title from "Handling data access in metrics" to "Handling sensitive data sent to remote services" on Mar 11, 2024
@elronbandel
Member

I agree, @yoavkatz. This is a good solution.
