
How to train imagenet with reduced memory and batch size? #430

Closed
research2010 opened this issue May 21, 2014 · 23 comments

@research2010

Hi, thank you very much for this valuable library!

The hardware and software environments are as follows:

  1. NVIDIA GTX 750 Ti (2G)
  2. Ubuntu 12.04

With the default training configuration file for the ImageNet dataset, train_net.bin fails with "out of memory". So I changed the batch_size to 64 (128 also does not fit). Then it works!
The following is the output of train_net.bin:
[screenshot: output of train_net.bin]

And the results are as follows after 2000 iterations:
[screenshot: results after 2000 iterations]

It seems the testing scores do not change. As indicated in #218, @sguada said that the batch_size and the learning rate are linked. I have set the batch_size to 64, so maybe the learning rate should also be modified. Could anyone give any advice on this, please?

@shelhamer shelhamer changed the title report on my enviroment How to train imagenet with reduced memory and batch size? May 22, 2014
@sguada
Contributor

sguada commented May 22, 2014

@research2010 Did you change the batch_size in the validation prototxt? That would also help you reduce memory usage.
Are you using the latest dev? Since #355, training and testing share the data blobs, which saves quite a bit of memory.

Regarding batch_size=64 for training: it should be okay. Although base_lr is linked to the batch_size, it allows some variability. Originally base_lr = 0.01 with batch_size=128; we have also used it with batch_size=256 and it still works. In theory, when you reduce the batch_size by a factor of X you should increase the base_lr by a factor of sqrt(X), but Alex has used a factor of X (see http://arxiv.org/abs/1404.5997).

What you should change are the stepsize and max_iter, accordingly, to keep the same learning schedule. If you divide the batch_size by X then you should multiply those by X.
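
To make the scaling concrete, here is a sketch of illustrative solver.prototxt values, assuming the stock ImageNet solver settings (base_lr 0.01, stepsize 100000, max_iter 450000 with batch_size 256) and a batch_size reduced by a factor of X = 4 down to 64; the scaled numbers match the alexnet settings used later in this thread:

```
# Not a drop-in file; illustrative values assuming batch_size 256 -> 64 (X = 4).
base_lr: 0.02        # 0.01 * sqrt(4); scaling by X itself would give 0.04
lr_policy: "step"
gamma: 0.1
stepsize: 400000     # 100000 * 4, to keep the same schedule in epochs
max_iter: 1800000    # 450000 * 4
momentum: 0.9
weight_decay: 0.0005
```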

Pay attention to the loss: if it doesn't go below 6.9 (which is basically random guessing) after 10k-20k iterations, then your training is not learning anything.

@research2010
Author

@sguada, thank you very much for your kind comments and suggestions.

I used "git clone https://github.com/BVLC/caffe.git" to check out the latest version on 2014-05-20. So maybe it isn't the dev branch, but it seems to have been patched by https://github.com/BVLC/caffe/pull/355/commits. I'll check the dev branch and rerun the experiments.

Recently I have been using the GPU to run other experiments, so I couldn't provide the results in time. I'll give feedback as soon as the experiments on the ImageNet dataset restart.

@kloudkl
Contributor

kloudkl commented Jul 3, 2014

#355 is not merged into dev yet.

@research2010
Author

@sguada @kloudkl, thank you very much for replying!

I have been running the imagenet example again, and some results are as follows:

  1. When I use caffe-0.9 and the latest dev branch with train_imagenet.sh to train the model, the test score doesn't seem to decrease. As suggested by @sguada, I made the following modifications:
    (1) in imagenet_train.prototxt, the batch_size is 128,
    (2) in imagenet_val.prototxt, the batch_size is 16,
    (3) in imagenet_solver.prototxt, the learning rate is 0.014142, the stepsize is 200000 and the max_iter is 900000,
    and after 20k iterations the test score is still 6.9.
  2. When I use the latest dev branch with train_alexnet.sh to train the model, it works fine! The modifications are as follows:
    (1) in alexnet_train.prototxt, the batch_size is 64,
    (2) in alexnet_val.prototxt, the batch_size is 32,
    (3) in alexnet_solver.prototxt, the learning rate is 0.02, the stepsize is 400000 and the max_iter is 1800000,
    and after only 4k iterations:

[screenshot: alexnet training output after 4k iterations]

But when I use 128 as the training batch_size and 16 as the val batch_size, training with alexnet fails with out of memory.

It seems that training with alexnet works fine; I'm not sure what the problem with training caffenet is.
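
For reference, the batch_size being adjusted here is set in the data layer of the train/val prototxt. A minimal sketch, with illustrative paths and the newer "layer" syntax (the 2014-era files use a layers { ... } block with enum types, but the data_param batch_size field is the same):

```
layer {
  name: "data"
  type: "Data"
  top: "data"
  top: "label"
  include { phase: TRAIN }
  transform_param {
    mirror: true
    crop_size: 227
    mean_file: "data/ilsvrc12/imagenet_mean.binaryproto"  # illustrative path
  }
  data_param {
    source: "examples/imagenet/ilsvrc12_train_lmdb"  # illustrative path
    batch_size: 64  # reduced from the default to fit in 2 GB of GPU memory
  }
}
```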
The hardware and software environments are as follows:

  1. NVIDIA GTX 750 Ti (2G)
  2. Ubuntu 12.04
  3. cuda 6.0
    and make runtest passes, with just a warning that 2 tests are disabled.

@research2010
Author

And the two nets are:

[image: caffenet and alexnet network definitions]

@sguada
Contributor

sguada commented Jul 12, 2014

Try setting the bias to 0.1 in all the layers

Sergio
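
A sketch of what that change looks like in a layer definition, shown here for a convolution layer in the newer prototxt syntax (the 2014-era layers { ... } blocks take the same bias_filler field):

```
layer {
  name: "conv1"
  type: "Convolution"
  bottom: "data"
  top: "conv1"
  convolution_param {
    num_output: 96
    kernel_size: 11
    stride: 4
    weight_filler { type: "gaussian" std: 0.01 }
    bias_filler { type: "constant" value: 0.1 }  # bias initialized to 0.1 instead of 1
  }
}
```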


@research2010
Author

@sguada, OK, thank you!

I will try that after the training of the alexnet model is done.
It takes 2 hours for 7k iterations, so the total time will be about 21 days for all 1800000 iterations!
I hope the computer and the graphics card will hold up!

@research2010
Author

@sguada, I'm sorry, I made a mistake and typed "sergeyk" instead of your name; I have corrected that.

@research2010
Author

@sguada, oh, I just forgot that we can resume the training procedure. That's very convenient!

@research2010
Author

Hi @sguada, I got some results when I replaced the 1 with 0.1 in the bias fillers, but they are very different from the results published in #33:

[plot: caffenet training loss vs. iterations]

[plot: caffenet test accuracy vs. iterations]

@sguada
Contributor

sguada commented Jul 18, 2014

It looks good to me. Given your reduced batch size you will need to train for many more iterations, probably 1 million, and reduce the lr when necessary.


Sergio

@research2010
Author

@sguada, thanks for your kind comments.

I've been running the caffenet training for about one week, and the results below are similar to, but a little different from, those you presented in #33. With the reduced batch size it indeed needs more iterations, as you said. For this training run I just set max_iter to 900000 for 90 epochs. It does need more parameter adjustment; "to train these models is more of an art than a science", as Matthew Zeiler put it in http://www.wired.com/2014/07/clarifai/. Thank you very much for sharing your valuable experience and parameter-tuning results.

[plot: caffenet test accuracy vs. iterations (second run)]

[plot: caffenet training loss vs. iterations (second run)]

@research2010
Author

Finally, the training behaves similarly to that in #33, and the test accuracy is ~56%, ~1% lower than in #33 and ~3.9% lower than in Alex's 2012 paper.
It took about 14 days for ~660000 iterations, at ~90s per 5120 images, which is much slower than the 26s of a K20.

The configuration is:
Ubuntu 12.04
GTX 750 Ti (2G)
CUDA 6.0
Driver 331.44

[plot: caffenet test accuracy vs. iterations]

[plot: caffenet training loss vs. iterations]

@shelhamer
Member

Good to hear you got it working with the proper tuning!

@research2010
Author

@shelhamer, thanks for your comments!
Finally, the training took 17 days. However, there are only about 20 such 17-day periods in a year, so with the limited hardware I didn't try further parameter adjustment. Many thanks to @sguada and everyone who shared their parameter-tuning experience in #33; it helped me a lot!

[plot: caffenet test accuracy vs. iterations (final run)]

[plot: caffenet training loss vs. iterations (final run)]

@research2010
Author

It takes about 3 hours and 20 minutes to train the first 10000 iterations of the BVLC_reference_caffenet model with cuDNN, versus about 4 hours and 40 minutes for the run above.
It is therefore suggested to train with cuDNN switched on.

zheden added a commit to zheden/HandwritingAuthorRecognition that referenced this issue on Jun 9, 2015
@WoooHaa

WoooHaa commented Aug 28, 2015

@research2010 Hello, I see the accuracy curve you plotted has a "second increase phase" around iteration 200000.
How did you achieve that? My training has been running for one month, but it does not improve any more since it hit the first "bottleneck".
Thanks

@DAIK0N

DAIK0N commented Nov 9, 2015

@research2010
Hey, you commented on Jul 12, 2014 with the two pictures of caffenet and alexnet. Did you parse the prototxt files and print them out via graphviz, or how did you produce those two images?

@jstaker7

Sorry to chime in so late on a closed issue -- but I'm trying to understand the same thing that WoooHaa commented about. What is the cause of the "bottlenecks" and how are these overcome? It seems dangerously easy to wait so long and think that training has converged to an optimal value, when it hasn't yet.

@DAIK0N

DAIK0N commented Feb 22, 2016

That's the "step", a change in the learning rate. So when there is a failure it changes the weights with a stronger effect. If you started with that higher learning rate from the beginning, your program would start to bounce and would never get better, so you have to start with a lower learning rate and increase it when your system reaches saturation. In the plots you can see that he set his step value to 200000, because you can see these changes at 200000, 400000 and 600000.

@jstaker7

Thank you for the response! Just to clarify: I usually start with a higher learning rate and decrease it over time. But what you're saying is to actually increase the learning rate later on during training?

@DAIK0N

DAIK0N commented Feb 22, 2016

#430 (comment)
Oh, you are right... you have to drop the learning rate:
http://caffe.berkeleyvision.org/tutorial/solver.html
I made my own tests on the learning rate 4 months ago and got confused...
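
For reference, Caffe's "step" policy computes lr = base_lr * gamma ^ floor(iter / stepsize), so the drops visible at 200k, 400k and 600k in the plots above correspond to solver settings like the following sketch (base_lr illustrative):

```
base_lr: 0.01       # illustrative starting rate
lr_policy: "step"   # drop the rate every stepsize iterations
gamma: 0.1          # each drop multiplies the rate by 0.1
stepsize: 200000    # hence the visible changes at 200000, 400000 and 600000
```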

@jstaker7

Ah gotcha, it all makes sense now. Thank you!
