ehsanyousefzadehasl/ISwED

Image Segmentation with Encoder Decoders

This repository contains the documentation and source code for experiments with encoder-decoder models for image segmentation on different datasets.

Team members

  • Ehsan Yousefzadeh-Asl-Miandoab
  • Marcel Thomas Rosier
  • Andreas Møller Belsager

Background

Computer vision is a sub-field of computer science that aims at gathering, processing, analyzing, and understanding digital images, and at extracting information that can be used in numerical decision-making processes.

Definition

Image segmentation is the process of partitioning a digital image into multiple segments (image regions or image objects), which are sets of pixels. In other words, image segmentation can also be viewed as pixel labeling: every pixel is assigned a class label.

Goal

Simplify the representation of a digital image, which is a grid of pixels, into something that is easier to understand and analyze.
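As a toy illustration of segmentation as pixel labeling (pure NumPy, with hypothetical pixel values), a segmentation of a 4×4 grayscale image is just a 4×4 array of class labels:

```python
import numpy as np

# A tiny 4x4 grayscale "image": bright pixels belong to an object,
# dark pixels to the background (values are made up for illustration).
image = np.array([
    [10, 12, 200, 210],
    [11, 13, 205, 215],
    [ 9, 190, 195, 220],
    [ 8, 10, 11, 12],
], dtype=np.uint8)

# A segmentation assigns every pixel a class label:
# 0 = background, 1 = object.
labels = (image > 128).astype(np.uint8)
print(labels)
```

The label map has the same shape as the image; each entry names the segment its pixel belongs to.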

Domain and Applications

  • Object detection tasks
    • Face detection
    • Pedestrian detection
    • Locating specific objects in satellite images
  • Recognition tasks
    • Face recognition
    • Fingerprint recognition
  • Object localization
  • Traffic control systems
  • Video surveillance systems

Different kinds of image segmentation

  • Semantic segmentation (e.g., person vs. background)
  • Instance segmentation (e.g., each person is identified individually)

Traditional Computer Vision Approaches

  • Thresholding method (region-based segmentation): converts a grayscale image into a binary image based on a threshold

  • Edge Detection method

    • Using weight matrices (filters) and convolving them with images
  • Clustering method, e.g., K-means clustering
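The clustering approach can be sketched in a few lines of NumPy: K-means over pixel intensities, where the cluster index of each pixel becomes its segment label. This is a minimal illustration, not production code; the function name, quantile-based initialization, and the sample image are all our own choices.

```python
import numpy as np

def kmeans_segment(image, k=2, iters=10):
    """Cluster pixel intensities with K-means; the cluster index of each
    pixel becomes its segment label."""
    pixels = image.reshape(-1, 1).astype(float)
    # Initialize centroids at evenly spaced intensity quantiles
    # (deterministic, spreads the centroids over the value range).
    centroids = np.quantile(pixels[:, 0], np.linspace(0, 1, k)).reshape(k, 1)
    for _ in range(iters):
        # Assign each pixel to its nearest centroid.
        dists = np.abs(pixels - centroids.T)   # shape (n_pixels, k)
        assign = dists.argmin(axis=1)
        # Recompute each centroid as the mean of its assigned pixels.
        for j in range(k):
            if np.any(assign == j):
                centroids[j, 0] = pixels[assign == j, 0].mean()
    return assign.reshape(image.shape)

# Hypothetical 4x4 grayscale image with two intensity populations.
img = np.array([[10, 12, 200, 210],
                [11, 13, 205, 215],
                [ 9, 190, 195, 220],
                [ 8, 10, 11, 12]], dtype=np.uint8)
segments = kmeans_segment(img, k=2)
```

With two well-separated intensity populations, the dark pixels end up in one cluster and the bright pixels in the other, i.e. a two-segment partition of the image.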

DL-based Approaches

Link to the paper: paper

  • Fully Convolutional Networks (FCNs) - These networks consist only of convolutional layers. Skip connections allow feature maps from final layers to be up-sampled and fused with feature maps of earlier layers; combining the semantic information from the deep, coarse layers with the appearance information from the shallow, fine layers helps the model produce very accurate and detailed segmentations.

  • Convolutional Models with Graphical Models

    • These models came into existence because deep CNNs have poor localization properties. The responses at the final CNN layers are therefore combined with a Conditional Random Field (CRF), which results in higher accuracy than plain FCNs.
  • Encoder-Decoder Based Models (our focus in this project) - These are divided into two main categories:

    1. Encoder-decoder models for general segmentation: they consist of two parts, an encoder and a decoder. The encoder uses convolutional layers, whereas the decoder uses a deconvolutional network that generates a map of pixel-wise class probabilities from the input feature vector (popular models: SegNet, HRNet).
    2. Encoder-decoder models for medical and biomedical image segmentation: U-Net and V-Net are the two most popular ones. U-Net is usually used for the segmentation of biological microscopy images; it uses data augmentation to learn from the available annotated images, and its architecture consists of a contracting path and a symmetric expanding path, for capturing context and enabling precise localization, respectively. V-Net is used for 3D medical image segmentation; it is trained with a new objective function based on the Dice coefficient, operates on MRI volumes, and predicts the segmentation for the whole MRI volume at once.

  • R-CNN based Models (instance segmentation)

    • Region-based Convolutional Neural Networks; the goal is to address the problem of instance segmentation.
  • Multi-scale and Pyramid Network based Models

  • Dilated Convolutional Models and DeepLab Family

  • Recurrent Neural Network based Models

  • Attention-based Models

  • Generative Models and Adversarial Training

  • CNN Models with Active Contour Models
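The encoder-decoder structure described above (a contracting path, an expanding path, a skip connection fusing the two, and pixel-wise class scores) can be sketched in PyTorch. This is a minimal toy with one downsampling step, not the project's actual model; the class name `TinyUNet` and the channel sizes are illustrative, and 32 output classes is only an example.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Minimal U-Net-style encoder-decoder: one downsampling step, one
    upsampling step, and a skip connection fusing the two paths."""
    def __init__(self, in_ch=3, n_classes=32):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(in_ch, 16, 3, padding=1), nn.ReLU())
        self.down = nn.MaxPool2d(2)
        self.bottleneck = nn.Sequential(nn.Conv2d(16, 32, 3, padding=1), nn.ReLU())
        self.up = nn.ConvTranspose2d(32, 16, 2, stride=2)
        # The decoder sees the upsampled features concatenated with the
        # skip features from the encoder (16 + 16 = 32 channels).
        self.dec = nn.Sequential(nn.Conv2d(32, 16, 3, padding=1), nn.ReLU())
        self.head = nn.Conv2d(16, n_classes, 1)   # pixel-wise class scores

    def forward(self, x):
        skip = self.enc(x)                        # (N, 16, H, W)
        b = self.bottleneck(self.down(skip))      # (N, 32, H/2, W/2)
        up = self.up(b)                           # (N, 16, H, W)
        fused = torch.cat([up, skip], dim=1)      # skip connection
        return self.head(self.dec(fused))         # (N, n_classes, H, W)

model = TinyUNet()
out = model(torch.randn(1, 3, 64, 64))
```

The output has one channel of scores per class at every pixel, which is exactly the "map of pixel-wise class probabilities" (after a softmax) that the decoder is meant to produce.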

Final comparison of the methods: comparison

Image Segmentation Datasets

The datasets for this task:

  1. The Cambridge-driving Labeled Video Database (CamVid) (our focus for this mini-project); a pre-made train/test/val split is available on kaggle
  2. The Cityscapes Dataset
  3. PASCAL Visual Object Classes (PASCAL VOC)
  4. Common Objects in COntext — Coco Dataset

Evaluation Metrics

  • Pixel Accuracy (PA): the ratio of properly classified pixels to the total number of pixels. Assume K foreground classes plus the background; pij is the number of pixels of class i predicted as belonging to class j.

PA

  • Mean Pixel Accuracy (MPA): the ratio of correct pixels is computed per class and then averaged over the total number of classes.

MPA

  • Intersection over Union (IoU): also known as the Jaccard Index; the overlap area between the predicted and ground-truth segmentation divided by the area of their union.

IoU

  • Mean-IoU: is defined as the average IoU over all classes. It is widely used in reporting the performance of modern segmentation algorithms.

  • Precision/ Recall/ F1 Score: precision and recall are defined as follows:

precision and recall

TP, FP, and FN stand for True Positive, False Positive, and False Negative, respectively.

  • F1 score: defined as the harmonic mean of precision and recall.

F1 score

  • Dice Coefficient: this metric is commonly used in medical image analysis. It is defined as twice the overlap area of predicted and ground-truth maps, divided by the total number of pixels in both images.
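The metrics above can all be computed from the confusion matrix p[i, j] defined earlier. A NumPy sketch (the function names and the tiny two-class example arrays are our own, for illustration only):

```python
import numpy as np

def confusion(pred, gt, n_classes):
    """m[i, j] = number of pixels of ground-truth class i predicted as class j."""
    m = np.zeros((n_classes, n_classes), dtype=np.int64)
    for i, j in zip(gt.ravel(), pred.ravel()):
        m[i, j] += 1
    return m

def pixel_accuracy(m):
    # Correctly classified pixels (diagonal) over all pixels.
    return np.trace(m) / m.sum()

def mean_iou(m):
    # Per-class IoU: TP / (TP + FP + FN), averaged over classes.
    tp = np.diag(m)
    iou = tp / (m.sum(axis=0) + m.sum(axis=1) - tp)
    return iou.mean()

def dice(pred, gt, cls=1):
    # Dice = 2 * |overlap| / (|pred region| + |gt region|) for one class.
    a, b = (pred == cls), (gt == cls)
    return 2 * np.logical_and(a, b).sum() / (a.sum() + b.sum())

gt   = np.array([[0, 0, 1, 1],
                 [0, 1, 1, 1]])
pred = np.array([[0, 1, 1, 1],
                 [0, 1, 1, 0]])
m = confusion(pred, gt, n_classes=2)
```

On this toy pair, 6 of 8 pixels agree, so PA is 0.75; the per-class IoUs are 1/2 and 2/3, giving a mean IoU of 7/12; and the Dice coefficient for class 1 is 0.8.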

Encoder-Decoder based DL Models

UNet Variations

For this project, we selected the UNet model and looked at the following variants of it.

  1. UNet + ReLU
  2. UNet + Leaky ReLU
  3. small UNet + ReLU
  4. small UNet + Leaky ReLU
  5. extended UNet + ReLU
  6. extended UNet + Leaky ReLU
  7. extended UNet + ReLU + Dropout
  8. overextended UNet + ReLU
  9. overextended UNet + Leaky ReLU
  10. overextended UNet + ReLU + Dropout

Pretrained FCN ResNet 50

Using a pretrained FCN_RESNET50 model provided by torchvision, we fine-tune all layers and change the classifier and aux classifier to have an output size of 32 channels (one per CamVid class). Adapted from here.

Trained variations

  1. default
  2. flip (Default with data augmentation Random Horizontal Flip)

Training

train loss plot

Eval

  • Augmentation did not improve results at first. However, this was due to buggy usage of the RandomHorizontalFlip class: it was not ensured that the flip is applied to both the input and the label (or to neither), so it was possible that only the input or only the label was flipped, which obviously worsened results. After a fix and a rerun, the augmentation helped slightly, though not significantly.
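The fix amounts to drawing a single random decision for the image/mask pair. A sketch (the class name `PairedRandomHorizontalFlip` is ours, not the repository's exact code):

```python
import random
import torch

class PairedRandomHorizontalFlip:
    """Flip the image and its segmentation mask together (or neither),
    so input and label always stay aligned."""
    def __init__(self, p=0.5):
        self.p = p

    def __call__(self, image, mask):
        if random.random() < self.p:              # one coin flip for the pair
            image = torch.flip(image, dims=[-1])  # flip along the width axis
            mask = torch.flip(mask, dims=[-1])
        return image, mask

flip = PairedRandomHorizontalFlip(p=1.0)          # always flip, for the demo
img = torch.arange(6.0).reshape(1, 2, 3)          # (C, H, W)
msk = torch.arange(6).reshape(2, 3)               # (H, W) label map
img2, msk2 = flip(img, msk)
```

Because both tensors are flipped under the same random draw, each pixel's label still describes the same pixel after augmentation.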

Example segmentations on Test set: segmentation examples

State of the art approach

Many of the papers we found mention UNet as the state of the art within image segmentation. Therefore, we focus here on alternatives to it.

CNN and Transformer mix

https://reader.elsevier.com/reader/sd/pii/S0031320322007075?token=83FCA21D1027C3BFBF95656B895BEF0A262DF1328A65F90845CA8D0D34707028CBA43303B74BCF9E3624B7F3937DE7AB&originRegion=eu-west-1&originCreation=20230504124206

In the domain of medical imaging, a big problem has been organ identification in images where some organs are small or thin while others are large. Many models have difficulty capturing especially the small and thin ones. Therefore, the paper proposes a model that combines a CNN with a transformer model, running the two in parallel. The motivation is that transformer models have proven to be good at capturing long-distance relationships, which makes it easier to capture small details that would normally be missed by standard CNN models. The model achieves fairly average results compared to other state-of-the-art models, yet especially for the smaller organs it gives improvements (while for bigger organs its performance is slightly lower).

CNN Transformer plot