With the introduction of metrics that can send data to remote services, we need a safe way to avoid accidentally sending proprietary/confidential data to external services.
In the common case in unitxt, metrics and datasets are developed by different people, who may not be aware of each other or of each other's implementations, so this becomes extremely error prone.
To address this we need a way for:
the dataset owner to specify the data classification for each dataset (or instance). The taxonomy should be defined by the user.
the metric owner to define whether the metric is safe for all data (e.g. because it runs locally), or to let the user of the metric specify which data classifications may be used with it.
Suggested approach:
Each loader will have an additional list[str] parameter called 'data_classification'. Different loaders can have different defaults. For example, LoadHF can set the default to "public", while another loader can set it to "proprietary". The user can override these for specific datasets, e.g. with "PII".
The loaders will add the list as a field to all the instances in the loaded datasets.
Each base metric class will check in the compute() function that all instance data classifications are allowed, by calling check_allowed_data_classification(instance).
The default implementation of check_allowed_data_classification will read a metric-specific environment variable that holds the list of allowed data classifications.
If an instance's classification is not in that list, an error message of this type will be generated:
"The following instance has a data classification of '{instance_data_classification}', but the metric {metric} is configured to accept only data classified as '{allowed_data_classification}'. To allow it, set the environment variable {env_var} to include '{instance_data_classification}'."
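To make the proposal concrete, here is a rough sketch of both sides: the loader tagging instances and the metric checking them. It is only an illustration and does not use the existing unitxt classes. The helper add_data_classification, the comma-separated format of the environment variable, the default_allowed parameter, the variable name REMOTE_METRIC_ALLOWED_DATA, and the choice of ValueError are all assumptions made for the example.

```python
import os
from typing import Any, Dict, Iterable, Iterator, List, Optional


def add_data_classification(
    instances: Iterable[Dict[str, Any]], data_classification: List[str]
) -> Iterator[Dict[str, Any]]:
    """Loader side (hypothetical helper): copy the list onto every instance."""
    for instance in instances:
        yield {**instance, "data_classification": list(data_classification)}


def check_allowed_data_classification(
    instance: Dict[str, Any],
    metric: str,
    env_var: str,
    default_allowed: Optional[List[str]] = None,
) -> None:
    """Metric side: raise if the instance has a classification that is not allowed.

    default_allowed=None is assumed to mean "safe for all data" (e.g. a local metric).
    """
    raw = os.environ.get(env_var)
    if raw is not None:
        # Assumed format: comma-separated classifications, e.g. "public,PII".
        allowed: Optional[List[str]] = [
            item.strip() for item in raw.split(",") if item.strip()
        ]
    else:
        allowed = default_allowed
    if allowed is None:
        return
    for classification in instance.get("data_classification", []):
        if classification not in allowed:
            raise ValueError(
                f"The following instance has a data classification of "
                f"'{classification}', but the metric {metric} is configured to "
                f"accept only data classified as '{allowed}'. To allow it, set "
                f"the environment variable {env_var} to include "
                f"'{classification}'."
            )


# Usage: a loader tags its data as "proprietary"; a remote metric that only
# accepts "public" data by default rejects the instance.
tagged = list(add_data_classification([{"text": "hello"}], ["proprietary"]))
try:
    check_allowed_data_classification(
        tagged[0],
        metric="remote_metric",
        env_var="REMOTE_METRIC_ALLOWED_DATA",
        default_allowed=["public"],
    )
except ValueError as error:
    print(error)
```

In this sketch an instance with no data_classification field passes the check. Whether missing classifications should instead be rejected is an open design question for the review.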
@elronbandel @eladven @perlitz - Please review.