
Multi-node GoogLeNet

This is part of the Multi-node guide. It is assumed you have completed the cluster configuration and Caffe build tutorials.

Introduction

This tutorial explains how to train GoogLeNet on multiple nodes. It extends the CIFAR10 tutorial, so please complete the CIFAR10 tutorial first if you haven't done so yet.

You can provide the ImageNet data set either as LMDB, compressed LMDB, or raw images.

With Image Data Layer

If you have chosen to use images, configure an image data layer in the train_val prototxt (i.e. models/bvlc_googlenet/train_val_client.prototxt). You need to replace the default data layers in the GoogLeNet model:

name: "GoogleNet"
layer {
  name: "data"
  type: "ImageData"
  top: "data"
  top: "label"
  include {
    phase: TRAIN
  }
  transform_param {
    mirror: true
    crop_size: 224
    mean_value: 104
    mean_value: 117
    mean_value: 123
  }
  image_data_param {
    source: "data/ilsvrc12/train.txt"
    batch_size: 512
    shuffle: true
  }
}
layer {
  name: "data"
  type: "ImageData"
  top: "data"
  top: "label"
  include {
    phase: TEST
  }
  transform_param {
    mirror: false
    crop_size: 224
    mean_value: 104
    mean_value: 117
    mean_value: 123
  }
  image_data_param {
    source: "data/ilsvrc12/val.txt"
    batch_size: 50
    new_width: 256
    new_height: 256
  }
}

Set the data layer type to ImageData and configure it through image_data_param. The file data/ilsvrc12/train.txt contains the list of all training images together with their class ids; an example of its format is shown below. The shuffle parameter ensures that each node shuffles the full training set with a different seed. You can train GoogLeNet with a total batch size of 1024 and a learning rate of 0.06, so set each node's batch size to B/K, where B is the desired total batch size (1024 here) and K is the number of nodes you train with. Here, 2 nodes are assumed, hence batch_size: 512.
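
As an illustration, each entry in the image list is an image path (relative to the image root folder) followed by a numeric class id. The file names below are placeholders, not part of the original tutorial:

n01440764/n01440764_10026.JPEG 0
n01440764/n01440764_10027.JPEG 0
n01443537/n01443537_1005.JPEG 1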

The solver definition in models/bvlc_googlenet/solver_client.prototxt should have an updated learning rate and maximum number of iterations:

net: "models/bvlc_googlenet/train_val_client.prototxt"
test_interval: 1000
test_iter: 1000
test_initialization: false
display: 40
average_loss: 40
base_lr: 0.06
lr_policy: "poly"
power: 0.5
max_iter: 91000
momentum: 0.9
weight_decay: 0.0002
solver_mode: CPU
snapshot: 10000
snapshot_prefix: "multinode_googlenet"
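
As a rough sanity check (assuming the standard ILSVRC12 training set of about 1.28 million images, which is not stated above): with a total batch size of 1024, max_iter: 91000 corresponds to roughly 91000 * 1024 / 1281167 ≈ 73 epochs.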

All you have to do now is run the training:

mpirun --hostfile path/to/hostfile -n 2 ./build/tools/caffe train \
--solver=models/bvlc_googlenet/solver_client.prototxt
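
The hostfile lists the machines MPI should launch on. A minimal sketch, assuming two nodes named node1 and node2 (host names and slot counts are placeholders for your cluster):

node1 slots=1
node2 slots=1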

With LMDB

To run with LMDB, change the data layers in train_val_client.prototxt:

layer {
  name: "data"
  type: "Data"
  top: "data"
  top: "label"
  include {
    phase: TRAIN
  }
  transform_param {
    mirror: true
    crop_size: 224
    mean_value: 104
    mean_value: 117
    mean_value: 123
  }
  data_param {
    source: "/home/data/lmdb_compressed/ilsvrc12_train_lmdb"
    shuffle: true
    batch_size: 512
    backend: LMDB
  }
}
layer {
  name: "data"
  type: "Data"
  top: "data"
  top: "label"
  include {
    phase: TEST
  }
  transform_param {
    mirror: false
    crop_size: 224
    mean_value: 104
    mean_value: 117
    mean_value: 123
  }
  data_param {
    source: "/home/data/lmdb_compressed/ilsvrc12_val_lmdb"
    batch_size: 50
    backend: LMDB
  }
}

Uncompressed LMDB works the same way as compressed LMDB, although the compressed variant should be faster.
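
If you need to build the LMDB yourself, Caffe's convert_imageset tool can create it from the same image list. A sketch, assuming placeholder paths; the --encoded and --encode_type flags produce the compressed (JPEG-encoded) variant, and omitting them gives an uncompressed LMDB:

./build/tools/convert_imageset --shuffle --resize_height=256 --resize_width=256 \
  --encoded --encode_type=jpg \
  /path/to/imagenet/train/ data/ilsvrc12/train.txt \
  /home/data/lmdb_compressed/ilsvrc12_train_lmdb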
