MotasemAlfarra/Online_Test_Time_Adaptation

Revisiting Test Time Adaptation Under Online Evaluation

Accepted at the International Conference on Machine Learning (ICML 2024)

Preprint: Link_To_Paper

This benchmark is a step towards standardizing the evaluation of Test Time Adaptation (TTA) methods. It includes implementations of 14 different TTA methods from the literature. The following table reports the average episodic error rate (%) of the implemented methods under the offline and online evaluation schemes on ImageNet-C.

| Method | Venue | Paper | Code | Offline Eval. (%) | Online Eval. (%) |
|---|---|---|---|---|---|
| ETA / EATA | ICML'22 | (paper) | (code) | 52.0 | 55.6 |
| SHOT / SHOT-IM | ICML'20 | (paper) | (code) | 59.9 | 59.1 |
| TENT | ICLR'21 | (paper) | (code) | 57.3 | 61.6 |
| SAR | ICLR'23 | (paper) | (code) | 56.2 | 63.4 |
| PL | ICMLW'13 | (paper) | (code) | 65.0 | 65.3 |
| TTAC-NQ | NeurIPS'22 | (paper) | (code) | 59.0 | 66.5 |
| BN Adaptation | NeurIPS'20 | (paper) | (code) | 66.7 | 66.7 |
| CoTTA | CVPR'22 | (paper) | (code) | 61.5 | 68.0 |
| AdaBN | ICLR'17 | (paper) | (code) | 68.5 | 68.5 |
| MEMO | NeurIPS'22 | (paper) | (code) | 76.3 | 81.9 |
| DDA | CVPR'23 | (paper) | (code) | 64.4 | 82.0 |
| Source | - | (paper) | (code) | 82.0 | 82.0 |
| LAME | CVPR'22 | (paper) | (code) | 82.7 | 82.7 |

We fixed the architecture to ResNet-50 throughout all our experiments and used the torchvision pretrained weights.

Environment Installation

To use our code, first create the conda environment by running:

conda env create -f environment.yml

Datasets used for Evaluation

Our results are reported on 3 different datasets: ImageNet-C, ImageNet-R, and ImageNet-3DCC. All datasets are publicly available and can be downloaded from their corresponding repositories.

For ImageNet-C and ImageNet-3DCC, the data should be organized as PATH/CORRUPTION/SEVERITY/*.

Online Evaluation of TTA Methods

Our paper evaluates the efficacy of TTA methods when data arrives as a stream at a constant speed. We simulate this by assuming that the stream reveals new data at a rate of $\eta \cdot r$, where $r$ is the speed of the forward pass of the non-adapted model and $\eta \in [0, 1]$. As $\eta \rightarrow 0$, the stream reveals data so slowly that every TTA method can adapt to all revealed samples. As $\eta \rightarrow 1$, the stream reveals data quickly, which penalizes slow TTA methods by letting them adapt on fewer samples.
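To make the effect of $\eta$ concrete, here is a minimal timing sketch. It is not the repository's implementation. It assumes time is measured in units of one non-adapted forward pass (so $r = 1$), that a hypothetical parameter `slowdown` says how many such units one adaptation step takes, and that a batch arriving while the method is still busy is not adapted on.

```python
from fractions import Fraction


def simulate_stream(num_batches, eta, slowdown):
    """Return one boolean per batch: True if the method adapted on it.

    eta      -- stream speed factor in (0, 1]; batches arrive every 1/eta time units.
    slowdown -- hypothetical cost of one adaptation step, in forward-pass units (>= 1).
    """
    eta = Fraction(eta)            # exact arithmetic avoids float comparison issues
    slowdown = Fraction(slowdown)
    if not 0 < eta <= 1:
        raise ValueError("eta must be in (0, 1]")
    if slowdown < 1:
        raise ValueError("slowdown must be at least 1")

    interval = 1 / eta             # time between two consecutive batches
    free_at = Fraction(0)          # time at which the method finishes its current step
    adapted = []
    for i in range(num_batches):
        arrival = i * interval
        if arrival >= free_at:
            # The method is idle, so it adapts on this batch.
            adapted.append(True)
            free_at = arrival + slowdown
        else:
            # The method is still busy; this batch is missed by the adaptation.
            adapted.append(False)
    return adapted


fast_stream = simulate_stream(6, 1, 3)       # eta = 1: adapts on every third batch
slow_stream = simulate_stream(4, "0.25", 3)  # eta = 0.25: adapts on every batch
print(fast_stream, slow_stream)
```

Under these assumptions, a method that is three times slower than the plain forward pass adapts on one batch out of three at $\eta = 1$, and on every batch once $\eta$ is small enough.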

Evaluating TTA Methods

We considered two different evaluation schemes in our work: episodic evaluation and continual evaluation. Episodic evaluation evaluates a given TTA method on a single domain shift, e.g. one corruption. Continual evaluation evaluates a given TTA method on a sequence of domain shifts, without resetting the parameters of the model between them. Finally, we also considered single model evaluation. In this setup, a random prediction is assigned to every batch that the TTA method missed, i.e. did not adapt to.
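The cost of the random predictions in the single model setup can be estimated with simple arithmetic. The sketch below is illustrative only: it assumes a uniformly random guess over `num_classes` classes (wrong with probability $1 - 1/K$), and the function name and its inputs are hypothetical, not part of the repository.

```python
def expected_single_model_error(error_on_seen, fraction_missed, num_classes=1000):
    """Expected overall error when missed batches receive a uniform random prediction.

    error_on_seen   -- error rate (0..1) on the batches the method did process.
    fraction_missed -- fraction (0..1) of batches the method missed.
    num_classes     -- number of classes; 1000 for ImageNet-C.
    """
    if not 0.0 <= error_on_seen <= 1.0 or not 0.0 <= fraction_missed <= 1.0:
        raise ValueError("rates must lie in [0, 1]")
    random_error = 1.0 - 1.0 / num_classes  # a uniform guess is right 1/K of the time
    return (1.0 - fraction_missed) * error_on_seen + fraction_missed * random_error


# Made-up numbers: 50% error on processed batches, half of the batches missed.
overall = expected_single_model_error(0.5, 0.5)
print(f"{overall:.4f}")
```

The more batches a slow method misses, the closer its error moves to that of random guessing, which is what makes this scheme a strict penalty on adaptation speed.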

Episodic Evaluation

To evaluate a TTA method under different stream speeds, run:

python main.py --eta [ETA] --method [METHOD] --dataset [DATASET] --corruption [CORRUPTION] --level [LEVEL] --imagenetc_path [PATH] --batch_size [BATCH_SIZE] --output [OUTPUT_PATH]

where

  • ETA: is a float between 0 and 1 representing $\eta$ in our paper for varying the stream speed. Default value is $\eta = 1$ which corresponds to online evaluation.
  • METHOD: is a TTA method which should belong to ['basic', 'tent', 'eta', 'eata', 'cotta', 'ttac_nq', 'memo', 'adabn', 'shot', 'shotim', 'lame', 'bn_adaptation', 'pl', 'sar', 'dda'].
  • DATASET: should belong to [imagenetc, imagenetr, imagenet3dcc].
  • CORRUPTION: is the type of corruption you would like to evaluate on.
    • ImageNet-C corruptions: ['gaussian_noise', 'shot_noise', 'impulse_noise', 'defocus_blur', 'glass_blur', 'motion_blur', 'zoom_blur', 'snow', 'frost', 'fog', 'brightness', 'contrast', 'elastic_transform', 'pixelate', 'jpeg_compression'].

    • ImageNet-3DCC corruptions: ['bit_error', 'color_quant', 'far_focus', 'flash', 'fog_3d', 'h265_abr', 'h265_crf', 'iso_noise', 'low_light', 'near_focus', 'xy_motion_blur', 'z_motion_blur'].

    • For ImageNet-R, do not pass the --corruption.

  • LEVEL: is an integer between 1 and 5 to determine how severe the corruption is. All our results are done with a severity of 5 (default value).
  • PATH: is the path to the ImageNet-C dataset. The data should be in the format PATH/CORRUPTION/SEVERITY/*. If you are evaluating on ImageNet-3DCC or ImageNet-R, then replace --imagenetc_path with --imagenet3dcc_path or --imagenetr_path.
  • BATCH_SIZE: is the batch size of the validation loader. For all of our experiments, we fixed the batch size to 64.
  • OUTPUT_PATH: is the output path to save the results of the evaluation. The output of the code is OUTPUT_PATH/DATASET/METHOD/eta_ETA/CORRUPTION.txt, which reports both $\eta$ and the error rate.
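To run one method over every ImageNet-C corruption, the command above can be generated in a loop. The following is a small sketch that only builds and prints the commands, using the flags and corruption names listed above. The helper name, the data path `/data/ImageNet-C`, and the output directory `results` are placeholders chosen for illustration.

```python
import shlex

# Corruption names as listed for ImageNet-C above.
IMAGENETC_CORRUPTIONS = [
    "gaussian_noise", "shot_noise", "impulse_noise", "defocus_blur",
    "glass_blur", "motion_blur", "zoom_blur", "snow", "frost", "fog",
    "brightness", "contrast", "elastic_transform", "pixelate",
    "jpeg_compression",
]


def build_episodic_command(method, corruption, data_path, output_dir,
                           eta=1.0, level=5, batch_size=64):
    """Build the argument list for one episodic evaluation run on ImageNet-C."""
    return [
        "python", "main.py",
        "--eta", str(eta),
        "--method", method,
        "--dataset", "imagenetc",
        "--corruption", corruption,
        "--level", str(level),
        "--imagenetc_path", data_path,   # placeholder path, adjust to your setup
        "--batch_size", str(batch_size),
        "--output", output_dir,          # placeholder output directory
    ]


commands = [
    build_episodic_command("tent", corruption, "/data/ImageNet-C", "results")
    for corruption in IMAGENETC_CORRUPTIONS
]
for cmd in commands:
    print(shlex.join(cmd))  # shell-quoted form, ready to paste into a terminal
```

Each list can also be passed to `subprocess.run(cmd, check=True)` from the repository root to execute the runs one after another.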

Continual Evaluation

To test a given TTA method under a continual sequence of domain shifts, run:

python main.py --exp_type continual --test_val --eta [ETA] --method [METHOD] --dataset [DATASET] --corruption [CORRUPTION] --level [LEVEL] --imagenetc_path [PATH] --batch_size [BATCH_SIZE] --output [OUTPUT_PATH]

Note that the main difference is passing --exp_type continual.

  • CORRUPTION: should belong to ['all', 'all_ordered'], where all_ordered sets the order of the corruptions similar to the one in Section 4.3 (Figure 3) of the paper, and all shuffles the corruptions randomly.
  • --test_val: also evaluates on the clean validation set of ImageNet at the end of the continual evaluation.

All the remaining arguments follow our episodic evaluation.

Single Model Experiments

To test a given TTA method in a single model evaluation scheme, following Section 4.6, run:

python main.py --single_model --eta [ETA] --method [METHOD] --dataset [DATASET] --corruption [CORRUPTION] --level [LEVEL] --imagenetc_path [PATH] --output [OUTPUT_PATH] --batch_size [BATCH_SIZE]

where all other arguments follow our episodic evaluation.

Adding New TTA Methods

To add additional TTA methods, please follow the example in our basic wrapper tta_methods/basic.py. Note that each TTA method is required to expose the non-adapted forward pass as the property self.model. This property allows the online evaluation to send batches that will not be adapted on through the normal forward pass. After adding your new method to the tta_methods directory, please import it in tta_methods/__init__.py and add it to the _all_methods dictionary. To test the efficacy of the newly implemented method in the episodic evaluation scheme, run:

python main.py --eta [ETA] --method [METHOD] --dataset [DATASET] --corruption [CORRUPTION] --level [LEVEL] --imagenetc_path [PATH] --batch_size [BATCH_SIZE] --output [OUTPUT_PATH]

where [METHOD] should be the added key in the _all_methods dictionary.

Citation

If you find our work useful, please consider citing our paper:

@misc{alfarra2023revisiting,
      title={Revisiting Test Time Adaptation under Online Evaluation}, 
      author={Motasem Alfarra and Hani Itani and Alejandro Pardo and Shyma Alhuwaider and Merey Ramazanova and Juan C. Pérez and Zhipeng Cai and Matthias Müller and Bernard Ghanem},
      year={2023},
      eprint={2304.04795},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}