
Code for "Distance Matters for Improving Performance Estimation Under Covariate Shift", ICCV Workshop on Uncertainty Quantification 2023, Roschewitz & Glocker.

Mélanie Roschewitz & Ben Glocker.
Accepted at ICCV - Workshop on Uncertainty Quantification for Computer Vision 2023.

If you like this repository, please consider citing our work:

@inproceedings{roschewitz2023distance,
  title={Distance Matters For Improving Performance Estimation Under Covariate Shift},
  author={Roschewitz, M{\'e}lanie and Glocker, Ben},
  booktitle={Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops},
  pages={4549--4559},
  year={2023}
}

Abstract

Performance estimation under covariate shift is a crucial component of safe AI model deployment, especially for sensitive use-cases. Recently, several solutions were proposed to tackle this problem, most leveraging model predictions or softmax confidence to derive accuracy estimates. However, under dataset shifts, confidence scores may become ill-calibrated if samples are too far from the training distribution. In this work, we show that taking into account distances of test samples to their expected training distribution can significantly improve performance estimation under covariate shift. Precisely, we introduce a "distance-check" to flag samples that lie too far from the expected distribution, to avoid relying on their untrustworthy model outputs in the accuracy estimation step. We demonstrate the effectiveness of this method on 13 image classification tasks, across a wide range of natural and synthetic distribution shifts and hundreds of models.

(Figure 1 from the paper)

This repository contains all the necessary code to reproduce our model evaluation, training and plots. The paper can be found here.
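
To give a rough intuition for the distance-check idea described in the abstract, here is a minimal, schematic sketch. It is not the implementation used in this repository (the actual estimators live in the evaluation folder); the kNN distance measure, the quantile threshold, the handling of flagged samples and all names below are illustrative assumptions, and the snippet only requires numpy and scikit-learn.

```python
# Schematic illustration of a "distance-check" for confidence-based accuracy
# estimation. This is NOT the code used in this repository (see evaluation/);
# the kNN distance, the quantile threshold and the treatment of flagged
# samples below are illustrative choices only.
import numpy as np
from sklearn.neighbors import NearestNeighbors


def estimate_accuracy_with_distance_check(
    train_features: np.ndarray,    # (n_train, d), e.g. penultimate-layer features
    test_features: np.ndarray,     # (n_test, d)
    test_confidences: np.ndarray,  # (n_test,), e.g. max softmax probability
    k: int = 5,
    quantile: float = 0.99,
) -> float:
    """Average-confidence accuracy estimate that ignores the (potentially
    ill-calibrated) confidence of samples lying far from the training set."""
    # Reference distances: each training sample's mean distance to its k
    # nearest training neighbours (excluding itself) defines "close".
    ref = NearestNeighbors(n_neighbors=k + 1).fit(train_features)
    d_ref, _ = ref.kneighbors(train_features)
    threshold = np.quantile(d_ref[:, 1:].mean(axis=1), quantile)

    # Distance of each test sample to the training feature distribution.
    nn = NearestNeighbors(n_neighbors=k).fit(train_features)
    d_test, _ = nn.kneighbors(test_features)
    too_far = d_test.mean(axis=1) > threshold

    # One simple choice: trust the confidence of close samples and count
    # flagged (far-away) samples as errors.
    estimated_correct = np.where(too_far, 0.0, test_confidences)
    return float(estimated_correct.mean())
```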

Overview

The repository is divided into the following sub-folders:

  • evaluation contains the most important part of this codebase, defining all the necessary tools for accuracy estimation. In particular, the entry point for running the evaluation benchmark, evaluation/evaluation_confidence_based.py, and the plotting notebook evaluation/plotting_notebook.ipynb live here (see the step-by-step example below).

  • classification contains all the necessary code to train and define models, as well as all the code to load specific experimental configurations. The configs/general subfolder contains all training configurations used in this work. Our code uses PyTorch Lightning, and the main classification module is defined in classification_module.py. The main entry point for training is train_all_models_for_dataset.py, which trains all models used in the paper for a given task. All outputs will be placed in [REPO_ROOT]/outputs by default.

  • data_handling contains all the code related to data loading and augmentations.

Prerequisites

  1. Start by creating our conda environment, as specified by the environment_full.yml file at the root of the repository.
  2. Make sure you update the paths to your datasets in default_paths.py.
  3. Make sure the root directory is in your PYTHONPATH environment variable.

Ready to go!
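
As an optional sanity check before training (a minimal sketch, not part of the repository; it only assumes that the repository root is on PYTHONPATH), you can verify that the path configuration module is importable:

```python
# Illustrative sanity check: if the repository root is on PYTHONPATH,
# the path configuration module should be importable from anywhere.
import importlib

try:
    importlib.import_module("default_paths")
    print("default_paths found -- remember to point its dataset paths to your local copies.")
except ImportError as exc:
    print(f"default_paths not importable -- is the repository root on PYTHONPATH? ({exc})")
```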

Step-by-step example

In this section, we walk you through all the steps necessary to reproduce the experiments for Living17. The procedure is identical for all other experiments; you just need to change which dataset you want to use.

Assuming your current work directory is the root of the repository:

  1. Train all models for this dataset (this will take a few days!) with python classification/train_all_models_for_dataset.py --dataset living17.
  2. You are then ready to run the evaluation benchmark with python evaluation/evaluation_confidence_based.py --dataset living17.
  3. The outputs can be found in the outputs/{DATASET_NAME}/{MODEL_NAME}/{RUN_NAME} folder. There you will find metrics.csv, which contains all predictions and errors for all models for this dataset.
  4. If you then want to reproduce the plots and the aggregated results over all models, as in Table 1 of the paper, you will need to run the evaluation/plotting_notebook.ipynb notebook.
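
If you prefer to explore the raw results programmatically rather than through the notebook, the sketch below shows one way to collect the per-run metrics.csv files (hypothetical, not part of the repository; it assumes pandas is installed and makes no assumption about the columns inside metrics.csv):

```python
# Hypothetical helper: gather the metrics.csv files written to
# outputs/{DATASET_NAME}/{MODEL_NAME}/{RUN_NAME}/ into a single dataframe.
from pathlib import Path

import pandas as pd

outputs_root = Path("outputs") / "living17"

frames = []
for metrics_file in outputs_root.glob("*/*/metrics.csv"):
    df = pd.read_csv(metrics_file)
    # Record which model and run each row came from, based on the folder layout.
    df["model"] = metrics_file.parts[-3]
    df["run"] = metrics_file.parts[-2]
    frames.append(df)

all_metrics = pd.concat(frames, ignore_index=True)
print(all_metrics.head())
```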
