
Evaluation-of-Complexity-Measures-for-Deep-Learning-Generalization-in-Medical-Image-Analysis


The code in this repository is based on our empirical study, which investigates the correlation between complexity measures and the generalization abilities of supervised deep learning classifiers for breast ultrasound images. The study is presented in the paper Evaluation of Complexity Measures for Deep Learning Generalization in Medical Image Analysis.

The performance of deep learning models for medical image analysis often decreases on images collected with different data acquisition devices, device settings, or patient populations. A better understanding of the generalization capacity on new images is crucial for clinicians’ trust in deep learning. Although significant research efforts have recently been directed toward establishing generalization bounds and complexity measures, there is still often a considerable discrepancy between the predicted and actual generalization performance. In addition, related large-scale empirical studies (e.g., Jiang et al. (2019), Dziugaite et al. (2020)) have primarily relied on validation with general-purpose image datasets.

In our empirical study, we evaluate the correlation between 25 complexity measures (adopted from Dziugaite et al. (2020)) and the generalization behavior of a family of deep learning networks on breast ultrasound images, using two types of prediction tasks: (i) classification, and (ii) joint classification and segmentation. In a controlled experimental setting, we vary the depth of the networks to analyze generalization performance. The results indicate that PAC-Bayes flatness-based and path norm-based measures produce the most consistent explanation for the combination of models and data. Furthermore, the comparative results show improved generalization by the multi-task approach on both independent and identically distributed (i.i.d.) and out-of-distribution (o.o.d.) images.

Code

The Jupyter notebooks in this repository implement the deep learning models in PyTorch. The implementation of the complexity measures is adopted from https://github.com/nitarshan/robust-generalization-measures.

The implementation in this repository demonstrates the experimental procedure; it differs from the full empirical study in the paper, which employs families of trained models with different architectures to evaluate the correlation between the complexity measures and generalization behavior.
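As a rough illustration (not code from this repository or the paper), the association between a complexity measure and the generalization gap across a family of trained models can be quantified with a rank-correlation coefficient such as Kendall's tau; the numeric values below are hypothetical placeholders.

```python
# Hedged sketch: rank correlation between a complexity measure and the generalization gap
# across a family of trained models. The numeric values are hypothetical placeholders.
import numpy as np
from scipy.stats import kendalltau

measure_values = np.array([1.2e4, 3.8e4, 9.1e4, 2.3e5, 6.7e5])   # one measure value per trained model
generalization_gaps = np.array([0.03, 0.05, 0.08, 0.12, 0.15])   # train accuracy - test accuracy

tau, p_value = kendalltau(measure_values, generalization_gaps)
print(f"Kendall's tau: {tau:.3f} (p = {p_value:.3f})")
```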

The dataset is organized as follows (a minimal PyTorch loading sketch is given after the list):

  • data/images – folder with breast ultrasound images.
  • data/masks – folder with the corresponding segmentation masks, used for multi-task learning.
  • data/labels – Excel file with labels for the tumor type in the images (benign or malignant).
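A minimal loading sketch for the folder layout above, using the standard PyTorch Dataset interface. The Excel file name (labels.xlsx), its column names (image, label), and the convention that masks share the image file names are assumptions, not the repository's exact choices.

```python
# Hedged sketch of a PyTorch Dataset for the data/ folder layout described above.
# File and column names ("labels.xlsx", "image", "label") are assumptions.
import os
import pandas as pd
import torch
from PIL import Image
from torch.utils.data import Dataset

class BreastUltrasoundDataset(Dataset):
    def __init__(self, root="data", transform=None):
        self.image_dir = os.path.join(root, "images")
        self.mask_dir = os.path.join(root, "masks")   # masks assumed to share the image file names
        self.labels = pd.read_excel(os.path.join(root, "labels.xlsx"))
        self.transform = transform

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        row = self.labels.iloc[idx]
        image = Image.open(os.path.join(self.image_dir, row["image"])).convert("L")
        mask = Image.open(os.path.join(self.mask_dir, row["image"])).convert("L")
        label = torch.tensor(1 if row["label"] == "malignant" else 0)   # benign -> 0, malignant -> 1
        if self.transform is not None:
            image, mask = self.transform(image), self.transform(mask)
        return image, mask, label
```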

Complexity Measures

The following set of complexity measures is evaluated (a computation sketch for several of them is given after the list):

  • VC dimension-based measure - number of network parameters.
  • Output-based measure - inverse of the squared margin of output logits.
  • Spectral norm-based measures - seven measures derived from sums, products, and margin-normalized spectral norms of the network parameters.
  • Frobenius norm-based measures - seven measures that, analogously to the spectral measures, are derived from Frobenius norms of the network parameters.
  • Path norm-based measures - two measures, calculated from the network outputs for all-one inputs after squaring the parameters.
  • Flatness-based measures - six measures based on PAC-Bayes theory, estimating the flatness of the loss landscape in the vicinity of the solution for the network parameters.
  • Optimization-based measure - the number of training iterations required to reach a classification error of 0.01 or 0.1.
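The sketch below shows, under simplifying assumptions, how a few of the listed measure families can be computed for a PyTorch model. It is not the repository's implementation (which follows https://github.com/nitarshan/robust-generalization-measures); the input shape is a placeholder, and the path-norm computation assumes a plain convolutional/fully connected network without normalization layers.

```python
# Hedged sketches of three measure families: parameter count (VC dimension-style proxy),
# a Frobenius norm product, and the path norm (all-ones input through squared parameters).
import copy
import torch
import torch.nn as nn

def num_parameters(model: nn.Module) -> int:
    # VC dimension-based proxy: total number of trainable parameters.
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

def frobenius_norm_product(model: nn.Module) -> torch.Tensor:
    # Product of Frobenius norms of the weight tensors (one variant of the Frobenius-based measures).
    prod = torch.tensor(1.0)
    for name, param in model.named_parameters():
        if "weight" in name:
            prod = prod * param.norm(p="fro")
    return prod

def path_norm(model: nn.Module, input_shape=(1, 1, 128, 128)) -> torch.Tensor:
    # Path norm: square all parameters of a copy of the model, feed an all-ones input,
    # and sum the outputs. Assumes no normalization layers; input_shape is a placeholder.
    squared = copy.deepcopy(model).eval()
    with torch.no_grad():
        for param in squared.parameters():
            param.pow_(2)
        out = squared(torch.ones(input_shape))
    return out.sum()
```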

Network Architecture

The architecture of the evaluated networks is shown in the following figure. Single-task learning models employ a VGG-like classification branch, consisting of a series of blocks with convolutional and max-pooling layers followed by fully connected layers. Multi-task learning models perform joint classification and segmentation by adding a U-Net-like decoder to the encoder of the classification branch.

[Figure: Network architecture]
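A minimal sketch of the described design, with illustrative layer sizes that differ from the trained networks in the paper: a shared VGG-like encoder, a fully connected classification head, and, for multi-task models, a U-Net-like decoder with skip connections that predicts the segmentation mask. A single-task model corresponds to the encoder and classification head alone.

```python
# Hedged sketch of the multi-task architecture; channel counts and depth are illustrative only.
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # Two 3x3 convolutions with ReLU, as in a VGG-style block.
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1), nn.ReLU(inplace=True),
    )

class MultiTaskNet(nn.Module):
    def __init__(self, in_channels=1, num_classes=2):
        super().__init__()
        # VGG-like encoder: convolutional blocks separated by max-pooling.
        self.enc1, self.enc2, self.enc3 = conv_block(in_channels, 32), conv_block(32, 64), conv_block(64, 128)
        self.pool = nn.MaxPool2d(2)
        # Classification head: fully connected layers on pooled encoder features.
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, 64), nn.ReLU(inplace=True), nn.Linear(64, num_classes),
        )
        # U-Net-like decoder with skip connections for the segmentation mask.
        self.up2, self.dec2 = nn.ConvTranspose2d(128, 64, kernel_size=2, stride=2), conv_block(128, 64)
        self.up1, self.dec1 = nn.ConvTranspose2d(64, 32, kernel_size=2, stride=2), conv_block(64, 32)
        self.seg_head = nn.Conv2d(32, 1, kernel_size=1)

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        e3 = self.enc3(self.pool(e2))
        logits = self.classifier(e3)                              # classification output
        d2 = self.dec2(torch.cat([self.up2(e3), e2], dim=1))      # decoder with skip connection
        d1 = self.dec1(torch.cat([self.up1(d2), e1], dim=1))
        mask = self.seg_head(d1)                                  # segmentation output
        return logits, mask
```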

Citation

If you use the code in this repository, please cite the following article:

@INPROCEEDINGS{9596501,
title={Evaluation of Complexity Measures for Deep Learning Generalization in Medical Image Analysis},
author={Vakanski, Aleksandar and Xian, Min},
booktitle={2021 IEEE 31st International Workshop on Machine Learning for Signal Processing (MLSP)},
year={2021},
month={October},
pages={1-6},
doi={10.1109/MLSP52302.2021.9596501}
}

License

MIT License

Acknowledgments

This work was supported by the Institute for Modeling Collaboration and Innovation (IMCI) at the University of Idaho through NIH Award #P20GM104420.
