
Machines Best Friend

Capstone Project


Problem Statement

Can an image be used to accurately describe itself?

  • Visual Question Answering (VQA)

  • Audience: Computer vision enthusiasts, dog lovers, security services, and the visually impaired

  • Image data is a rich source of information. This project aims to automate the task of extracting image descriptions.

    Questions to be explored:

  1. Is the dog inside or outside?
  2. Does it have a friend?
  3. What breed is it?
  4. What layers need to be pre-trained?
  5. What is a reasonable 'optical cue'?

Overview

This DSI module covers:

  • Machine Learning for Deep Neural Networks (TensorFlow, Keras API)
  • Binary Classification Predictive Modeling
  • Computer Vision (RGB image processing, image formation, feature detection, computational photography)
  • Convolutional Neural Networks (CNN): regularization, automated pattern recognition, ...
  • Transfer Learning with a pre-trained deep learning image classifier (VGG-16 CNN from the Visual Geometry Group, 2014)
  • Automatic photo captioning, Visual Question Answering (VQA)



Background

Background on transfer learning:

  • Transfer learning re-uses a pre-existing model that was trained on millions of images over a period of several weeks.
  • It eliminates the cost of training a deep learning model from scratch.
  • It is a training short-cut for deep CNNs: re-use model weights from pre-trained models previously developed for benchmark tasks in computer vision (VGG, Inception, ResNet).
  • Weight initialization: weights in the re-used layers serve as the starting point for training and are adapted in response to the new problem.
  • Two common usage patterns (see the sketch after this list):
  1. Use the model as-is to classify new photographs.
  2. Use it as a feature-extraction model: the output of a pre-trained layer prior to the output layer becomes the input to a new classifier model.
  • Tasks more similar to the original training can rely on output from layers deep in the model, such as the second-to-last fully connected layer.
  • What the layers learn:
  1. Layers closer to the input: low-level features such as lines and edges.
  2. Layers in the middle of the network: more complex, abstract features that combine the lower-level features extracted from the input.
  3. Layers closer to the output: interpret the extracted features in the context of the classification task.
  • Fine-tune the learning rate of the pre-trained model when its weights are allowed to update.
  • Transfer-learning architectures:
  1. Consistent and repeating structures (VGG)
  2. Inception modules (GoogLeNet)
  3. Residual modules (ResNet)
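A minimal sketch of the second usage pattern (feature extraction) with the Keras API, assuming TensorFlow 2.x; the new head sizes are illustrative assumptions, not the project's exact configuration:

```python
# Hedged sketch: use VGG-16 (pre-trained on ImageNet) as a frozen feature
# extractor and attach a new binary-classification head (dog / no dog).
from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models

# Convolutional base only; its weights were learned on ImageNet.
base = VGG16(weights="imagenet", include_top=False, input_shape=(224, 224, 3))
base.trainable = False  # re-used layers act as fixed weight initialization

# New layers interpret the extracted features for the new task.
model = models.Sequential([
    base,
    layers.Flatten(),
    layers.Dense(256, activation="relu"),   # head size is an assumption
    layers.Dense(1, activation="sigmoid"),  # binary output: dog vs. no dog
])
model.compile(optimizer="adam", loss="binary_crossentropy",
              metrics=["accuracy"])
```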

Data Dictionary

NOTE: Make sure you cross-reference your data with your data sources to eliminate any data collection or data entry issues.
See Acknowledgements and Contact section for starter code resources

| Feature | Type | Dataset | Category | Description |
| --- | --- | --- | --- | --- |
| IMAGE_HEIGHT | int | utils.py | Global Variable | 224 (pixels) |
| IMAGE_WIDTH | int | utils.py | Global Variable | 224 (pixels) |
| IMAGE_CHANNELS | int | utils.py | Global Variable | 3 (RGB channels) |

These globals define the input shape used throughout; a brief usage sketch follows.
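A minimal sketch, assuming TensorFlow/Keras is available, of how the utils.py globals above might be used to load a single image into the shape the network expects; the file path and helper name are hypothetical:

```python
# Hedged sketch: resize an arbitrary RGB image to the global dimensions
# defined in utils.py before feeding it to the network.
import numpy as np
from tensorflow.keras.preprocessing.image import load_img, img_to_array

IMAGE_HEIGHT, IMAGE_WIDTH, IMAGE_CHANNELS = 224, 224, 3  # from utils.py

def load_for_vgg(path):
    """Load one image as a (1, 224, 224, 3) float batch (helper is hypothetical)."""
    img = load_img(path, target_size=(IMAGE_HEIGHT, IMAGE_WIDTH))
    arr = img_to_array(img)             # (224, 224, 3)
    return np.expand_dims(arr, axis=0)  # add the batch dimension
```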
| VGG-16 Block | Name (Type) | Kernel Size | Nodes | Params # | Stride/Pool | Output (h x w x depth) |
| --- | --- | --- | --- | --- | --- | --- |
| 00-First | input1 (Input) | No Filter | None | 0 | None | (Batch, 224, 224, 3-RGB) |
| 01-Block 01 | conv1 (Conv2D) | (3 x 3) | 64 | 1,792 | (1 x 1) | (Batch, 224, 224, 64) |
| 02-Block 01 | conv2 (Conv2D) | (3 x 3) | 64 | 36,928 | (1 x 1) | (Batch, 224, 224, 64) |
| 03-Block 01 | pool1 (MaxPooling2D) | (2 x 2) | None | 0 | (2 x 2) | (Batch, 112, 112, 64) |
| 04-Block 02 | conv1 (Conv2D) | (3 x 3) | 128 | 73,856 | (1 x 1) | (Batch, 112, 112, 128) |
| 05-Block 02 | conv2 (Conv2D) | (3 x 3) | 128 | 147,584 | (1 x 1) | (Batch, 112, 112, 128) |
| 06-Block 02 | pool2 (MaxPooling2D) | (2 x 2) | None | 0 | (2 x 2) | (Batch, 56, 56, 128) |
| 07-Block 03 | conv1 (Conv2D) | (3 x 3) | 256 | 295,168 | (1 x 1) | (Batch, 56, 56, 256) |
| 08-Block 03 | conv2 (Conv2D) | (3 x 3) | 256 | 590,080 | (1 x 1) | (Batch, 56, 56, 256) |
| 09-Block 03 | conv3 (Conv2D) | (3 x 3) | 256 | 590,080 | (1 x 1) | (Batch, 56, 56, 256) |
| 10-Block 03 | pool3 (MaxPooling2D) | (2 x 2) | None | 0 | (2 x 2) | (Batch, 28, 28, 256) |
| 11-Block 04 | conv1 (Conv2D) | (3 x 3) | 512 | 1,180,160 | (1 x 1) | (Batch, 28, 28, 512) |
| 12-Block 04 | conv2 (Conv2D) | (3 x 3) | 512 | 2,359,808 | (1 x 1) | (Batch, 28, 28, 512) |
| 13-Block 04 | conv3 (Conv2D) | (3 x 3) | 512 | 2,359,808 | (1 x 1) | (Batch, 28, 28, 512) |
| 14-Block 04 | pool4 (MaxPooling2D) | (2 x 2) | None | 0 | (2 x 2) | (Batch, 14, 14, 512) |
| 15-Block 05 | conv1 (Conv2D) | (3 x 3) | 512 | 2,359,808 | (1 x 1) | (Batch, 14, 14, 512) |
| 16-Block 05 | conv2 (Conv2D) | (3 x 3) | 512 | 2,359,808 | (1 x 1) | (Batch, 14, 14, 512) |
| 17-Block 05 | conv3 (Conv2D) | (3 x 3) | 512 | 2,359,808 | (1 x 1) | (Batch, 14, 14, 512) |
| 18-Block 05 | pool5 (MaxPooling2D) | (2 x 2) | None | 0 | (2 x 2) | (Batch, 7, 7, 512) |
| 19-4D --> 2D | flatten (Flatten) | No Filter | None | 0 | None | (Batch, 25,088) |
| 20-Fully Connected | fcon1 (Dense) | No Filter | 4,096 | 102,764,544 | None | (Batch, 4,096) |
| 21-Fully Connected | fcon2 (Dense) | No Filter | 4,096 | 16,781,312 | None | (Batch, 4,096) |
| 22-Last Layer | Output (Dense) | No Filter | 1,000 | 4,097,000 | None | (Batch, 1,000) |
  • NOTE (a quick Python check of these formulas follows the parameter totals below):
    Conv2D: # Params = [ (kernel size x channel depth) + 1 ] x number of filters (nodes)
    Dense : # Params = [ input size + 1 ] x output size

  • Total params: 138,357,544
  • Trainable params: 138,357,544
  • Non-trainable params: 0
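The parameter formulas in the NOTE can be verified directly; a quick sketch reproducing a few rows of the table above (the layer choices are just examples):

```python
# Parameter-count formulas from the NOTE above.
def conv2d_params(kernel_h, kernel_w, in_channels, filters):
    # [ (kernel size x channel depth) + 1 bias ] x number of filters (nodes)
    return (kernel_h * kernel_w * in_channels + 1) * filters

def dense_params(input_size, output_size):
    # [ input size + 1 bias ] x output size
    return (input_size + 1) * output_size

print(conv2d_params(3, 3, 3, 64))       # 1,792       -> Block 01 conv1
print(conv2d_params(3, 3, 512, 512))    # 2,359,808   -> Block 05 conv layers
print(dense_params(7 * 7 * 512, 4096))  # 102,764,544 -> fcon1
print(dense_params(4096, 1000))         # 4,097,000   -> output layer
```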

| CNN Model | Split | Epoch | Loss | Accuracy |
| --- | --- | --- | --- | --- |
| Baseline MSE | Training | 01 | 0.0316 | 0.3251 |
| Baseline MSE | Validation | 01 | 0.0191 | 0.8220 |
| Baseline MSE | Training | 02 | 0.0266 | 0.3248 |
| Baseline MSE | Validation | 02 | 0.0205 | 0.8240 |

Data Acquisition & Cleaning

Cloning and Debugging

Cloud Computing / Computing with GPU

  • Google Colab Pro with a High-RAM runtime (27.4 GB of RAM available) plus a GPU had to be used to fit the transfer model without a batch generator (cost: $10 for the month). Even with the High-RAM runtime, the order in which variables were loaded into memory had to be managed carefully; the Colab kernel crashed many times, and every time the data loading had to start over from scratch.

Training the CNN

  • Network architecture: X layers, X convolution layers, X fully connected layers
  • Saved model: model_vgg16_flatten.h5 (a training-and-saving sketch follows this list)
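A hedged sketch of how the transfer model might be fit and written to model_vgg16_flatten.h5; `model` is the feature-extraction model sketched in the Background section, X_train / y_train / X_val / y_val are hypothetical in-memory arrays (no batch generator, matching the Colab note above), and the epoch and batch-size values are assumptions:

```python
# Hedged sketch: fit the transfer model on pre-loaded arrays and save it.
from tensorflow.keras.models import load_model

history = model.fit(
    X_train, y_train,                # hypothetical in-memory arrays
    validation_data=(X_val, y_val),
    epochs=2,                        # assumption; compare the results table
    batch_size=32,                   # assumption
)
model.save("model_vgg16_flatten.h5")          # Keras HDF5 format
reloaded = load_model("model_vgg16_flatten.h5")
```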

Exploratory Analysis

  • Insert EDA details...
  • 108,077 images total in Visual Genome (VG); 3,235 images labeled as containing dogs (or hot dogs, see below); 1,995 dog pics in the training dataset (part 1) and the remaining 1,240 dog pics in part 2 (a counting sketch follows this list)
  • Among the images that were supposed to contain dogs, I saw roughly 6-10 hot dogs. This is a problem because they are mislabeled at random, and it raises the question: what other common words introduce bias into the model through language?
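A hedged sketch of how the dog-image counts above might be reproduced from Visual Genome's objects.json annotation file, assuming the standard VG layout (a list of records with an "image_id" and a list of "objects", each carrying a "names" list); the project's actual filtering may differ:

```python
import json

# Hedged sketch: count Visual Genome images with an object named "dog".
with open("objects.json") as f:      # standard VG object-annotation file
    records = json.load(f)

dog_image_ids = {
    rec["image_id"]
    for rec in records
    if any("dog" in name.lower()
           for obj in rec["objects"]
           for name in obj["names"])
}

print(len(records), "images total;", len(dog_image_ids), "with a 'dog' label")
# Note: a substring match like this also catches "hot dog", which is exactly
# the mislabeling problem described above.
```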

Data Visualization


Findings and Recommendations

Answer the problem statement:

  1. YES. With an accuracy of 97.5%, the model can identify a dog in an image it has never seen before. That is about half a percentage point above the baseline score of roughly 97% that would result from predicting that every single image contains no dog, so the model is (just) better than no model. An important next step is a batch generator that reduces memory demands by cycling old batches of data out of RAM and new batches in while the model trains (see the sketch after this list). This brings two benefits: A.) data augmentation — I have already written a batch generator that augments images, which acts as a regularization technique and helps prevent overfitting, because the model never sees the exact same image twice, and a dog is still a dog even if it is flipped, shrunk, enlarged, or rotated; B.) batch size gains more freedom to take larger base-2 values, because the entire image dataset no longer needs to be loaded into memory.
  2. Consider the similarity of the images, specifically ImageNet images vs. Visual Genome data. Visual Genome images are quite varied, and the dog is often a minor object among many; on average, images have up to 35 objects identified. I did not look at any ImageNet data; extracting features from a layer lower in the network, nearer the input, might have produced less error.
  3. Predicting breeds would be a natural extension. All it would require is labeling the breed of dog in over 3K images. Ideally, an app hosted on Heroku would let users upload a dog pic and get back the model's top 5 breed predictions; top 5 because someone asking about their dog's breed probably doesn't own a purebred, and a mutt is better described by multiple breed labels.
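A minimal sketch of the augmentation-capable batch generator described in point 1, using Keras's built-in ImageDataGenerator rather than the project's own generator; the transform ranges, batch size, and array names are assumptions:

```python
# Hedged sketch: stream augmented batches instead of holding every
# transformed image in RAM at once; transform ranges are assumptions.
from tensorflow.keras.preprocessing.image import ImageDataGenerator

augmenter = ImageDataGenerator(
    horizontal_flip=True,   # a flipped dog is still a dog
    rotation_range=15,
    zoom_range=0.2,
    rescale=1.0 / 255,
)

# X_train / y_train / X_val / y_val are hypothetical arrays; flow() yields
# batches lazily, so batch_size can grow to larger base-2 values without
# loading the whole augmented dataset into memory.
train_batches = augmenter.flow(X_train, y_train, batch_size=64)
model.fit(train_batches, epochs=10, validation_data=(X_val, y_val))
```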

Next Steps:


Software Requirements:

https://www.quora.com/What-is-the-VGG-neural-network


Acknowledgements and Contact:

External Resources:

  • High quality images of dogs (Unsplash): (source)
  • VQA labeled images of dogs (Visual Genome): (source)
  • Google Open Source: Dog Detection (Open Images): (source)
  • Google Open Source: Dog Segmentation (Open Images): (source)
  • VGG-19 (Keras API): (source)
  • ImageNet ILSVRC Competition (Machine Learning Mastery): (source)

Photo by jesse orrico on Unsplash

Photo by Kasey McCoy on Unsplash

Papers:

  • VisualBackProp: efficient visualization of CNNs (arXiv): (source)
  • Very Deep Convolutional Networks For Large-Scale Image Recognition (arXiv): (source)
  • Transfer Learning in Keras with Computer Vision Models (Machine Learning Mastery): (source)

Contact:

Project Link: (source)


Submission

Materials must be submitted by 4:59 PST on Friday, December 11, 2020.


  • ImageNet Large Scale Visual Recognition Challenge (ILSVRC): evaluates algorithms for object localization and detection in images and videos at scale
  • VGG: Visual Geometry Group, University of Oxford, 2014

About

Deep learning for binary classification in Google Colaboratory with >97.5% accuracy on over 108,000 RGB images, including 3,000+ with dogs.
