Skip to content

xultaeculcis/coral-net

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

42 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

coral-net

Coral image dataset.

Why?

I was looking for a dataset that would be hard for the model to train on - something similar to ImageWoof https://github.com/fastai/imagenette but larger in size, with large class imbalances, something that would be a challenge.

Dataset

The dataset was created by web scraping images from Facebook, Instagram, Google Images, Duck Duck Go Images and various coral seller pages (WWC, White Corals, Euro Corals etc.) - around 30 different sources.

Final dataset consisted of around 200K unlabeled images. I had manually labeled them into 37 classes (initially, around 100 images per class) described below:

'Acanthastrea',
'Acanthophyllia & Cynarnia',
'Acropora',
'Alveopora & Goniopora',
'Blastomussa',
'Bubble Coral',
'Bubble Tip Anemone',
'Candy Cane Coral',
'Carpet Anemone',
'Chalice',
'Cyphastrea',
'Discosoma Mushroom',
'Elegance Coral',
'Euphyllia',
'Favia',
'Gorgonia',
'Leptastrea',
'Leptoseris',
'Lobophyllia & Trachyphyllia & Wellsophyllia',
'Maze Brain Coral Platygyra',
'Mini Carpet Anemone',
'Montipora',
'Pavona',
'Plate Coral Fungia',
'Porites',
'Psammacora',
'Rhodactis Mushroom',
'Ricordea Mushroom',
'Rock Flower Anemone',
'Scolymia',
'Scroll Corals Turbinaria',
'Star Polyps',
'Styllopora & Pocillipora & Seriatopora',
'Sun Corals',
'Toadstool & Leather Coral',
'Tridacna Clams',
'Zoa'

Sample images: samples

Some of the classes consist of couple of different genera since to distinguish between them takes a domain expert - which I am not...

Following Data preprocessing steps have been applied:

  • Rename all images into following format <guid with no "-" signs>.<extension>
  • Convert all images to *.jpg
  • Duplicate image removal using CNN method from imagededup package with 95% similarity threshold

After doing the training loop for couple of times I've managed to build a fairly clean dataset of 130K images, 4.3GB in size. The rest of them are still unlabeled. The model tends to give high confidence to images that do not represent a coral (images of people, equipment, text etc), possible tweaks are needed...

class counts

Due to possible legal stuff I am not providing a direct link for the dataset.

Training loop

Once every class had at least 100 images I trained a simple resnet34 ensemble to get better results.

  • 10-k Fold CV was used
  • I've tried to use bigger models like resnext50_32x4d, but the results did not improve in any meaningful way, and training time increased significantly (even with mixed precision training)
  • Mixed Precision slightly improved the results and significantly improved training time
  • Inference time decision was done using voting:
    • If all classifiers voted unanimously then no soft voting was performed
    • Otherwise soft voting was performed and decision threshold was set to 90%
  • After entire dataset was assigned weak labels, I manually went through the images and applied corrections if necessary
  • Weakly labeled images were then added to the training set and the whole process was repeated

This process is called Human-in-the-loop training. Other approach would be to use something called Self-training with Noisy Student https://arxiv.org/abs/1911.04252. But that requires way bigger dataset.

Tricks and tools

Results

The single best model reached F1 Score of 0.7474 (which was my target metric). Other metrics for this model were as follow:

Metrics

Metric Name Metric Value
F1-Score 0.7474
Overall Accuracy 0.9561
Top 1 Accuracy 0.9789
Top 5 Accuracy 0.9971
Precision 0.7483
Recall 0.7548
Loss 0.1055

Confusion Matrix

Confusion matrix on last training loop: Confusion matrix

Logs

Validation logs: Val logs

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published