Mobilenet v1 with cifar10 unexpected behavior #21058

Closed
xiao1228 opened this issue Jul 23, 2018 · 5 comments
@xiao1228

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: N/A
  • TensorFlow installed from (source or binary): source
  • TensorFlow version (use command below): 1.8
  • Python version: 3.5

Hi,

I am using the mobilenet_v1 example code, but instead of the ImageNet dataset I changed the data to CIFAR-10 and trained from scratch. The only change to the architecture is that the first conv layer uses stride 1 instead of 2:

_CONV_DEFS = [
    # First conv uses stride 1 (the ImageNet default is 2) so the 32x32
    # CIFAR-10 inputs are not downsampled too aggressively in the first layer.
    Conv(kernel=[3, 3], stride=1, depth=32),
    DepthSepConv(kernel=[3, 3], stride=1, depth=64),
    DepthSepConv(kernel=[3, 3], stride=2, depth=128),
    DepthSepConv(kernel=[3, 3], stride=1, depth=128),
    DepthSepConv(kernel=[3, 3], stride=2, depth=256),
    DepthSepConv(kernel=[3, 3], stride=1, depth=256),
    DepthSepConv(kernel=[3, 3], stride=2, depth=512),
    DepthSepConv(kernel=[3, 3], stride=1, depth=512),
    DepthSepConv(kernel=[3, 3], stride=1, depth=512),
    DepthSepConv(kernel=[3, 3], stride=1, depth=512),
    DepthSepConv(kernel=[3, 3], stride=1, depth=512),
    DepthSepConv(kernel=[3, 3], stride=1, depth=512),
    DepthSepConv(kernel=[3, 3], stride=2, depth=1024),
    DepthSepConv(kernel=[3, 3], stride=1, depth=1024)
]
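For context, this is roughly how the modified definitions get used when building the network; a minimal sketch, assuming the slim models repo layout (nets/mobilenet_v1.py) and its existing conv_defs argument, with placeholder shapes chosen for CIFAR-10:

import tensorflow as tf

from nets import mobilenet_v1  # assumes models/research/slim is on the path

# CIFAR-10 images are 32x32x3; with the first layer at stride 1 the network
# keeps more spatial resolution than with the ImageNet default of stride 2.
images = tf.placeholder(tf.float32, [None, 32, 32, 3])
logits, end_points = mobilenet_v1.mobilenet_v1(
    images,
    num_classes=10,        # CIFAR-10 has 10 classes
    is_training=True,
    conv_defs=_CONV_DEFS)  # the modified definitions shown above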

Training was no problem: the loss decreases and the predictions look good. But in evaluation, where I use the same code as mobilenet_v1_eval with CIFAR-10 as the input data, I am getting the same output for every image I pass to the model. I have double-checked that my input is definitely different every time, so it is very strange to get exactly the same output for different images.

[[-0.11333117 -0.5380551 0.18907356 0.7664664 0.07711207 0.04618246
0.13568665 0.1360816 -0.36744678 -0.33176792]]
[[-0.11333117 -0.5380551 0.18907356 0.7664664 0.07711207 0.04618246
0.13568665 0.1360816 -0.36744678 -0.33176792]]
[[-0.11333118 -0.5380551 0.18907356 0.7664664 0.07711206 0.04618246
0.13568665 0.1360816 -0.36744678 -0.33176792]]
[[-0.11333118 -0.5380551 0.18907356 0.7664664 0.07711206 0.04618246
0.13568665 0.1360816 -0.36744678 -0.33176792]]
[[-0.11333118 -0.5380551 0.18907356 0.7664664 0.07711206 0.04618246
0.13568665 0.1360816 -0.36744678 -0.33176792]]

Please help; any suggestion would be helpful! Thank you in advance!

@xiao1228 (Author)

I have figured out that the issue is due to slim.batch_norm; other people have had the same problem (e.g. tensorflow/models#3556).
BUT in the mobilenet_v1 eval code, scope = mobilenet_v1.mobilenet_v1_arg_scope(is_training=False, weight_decay=0.0).
If I set is_training to True in eval, it outputs different predictions; if I set is_training to False (which I think I should), the predictions are the same for different images.
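For reference, a minimal sketch of the eval-time setup being discussed (assuming the slim API in TF 1.8 and the conv_defs change above); with is_training=False, slim.batch_norm reads the stored moving mean/variance instead of per-batch statistics, so if those statistics were never updated or never restored, every image goes through the same normalization:

import tensorflow as tf

from nets import mobilenet_v1

slim = tf.contrib.slim

images = tf.placeholder(tf.float32, [None, 32, 32, 3])
# is_training=False makes slim.batch_norm use moving_mean / moving_variance
# (frozen statistics) rather than the statistics of the current batch.
scope = mobilenet_v1.mobilenet_v1_arg_scope(is_training=False, weight_decay=0.0)
with slim.arg_scope(scope):
    logits, _ = mobilenet_v1.mobilenet_v1(
        images, num_classes=10, is_training=False,
        conv_defs=_CONV_DEFS)  # the CIFAR-10 definitions above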

I see other people mention that training with slim.learning.create_train_op can solve the problem; this is what I am using, but I am still having the issue.

So I am confused about slim.batch_norm in mobilenet_v1 now: should I set is_training to True in eval, or is there something else that I am missing?
Thank you in advance.

@NPetsky commented Jul 25, 2018

Hello @xiao1228, I had a similar error, and I found out that I hadn't saved the moving mean and moving variance variables from slim.batch_norm after training, so I couldn't use is_training=False.
Do you create a Saver with tf.trainable_variables()? If so, you should drop the tf.trainable_variables() argument and create the saver like this: saver = tf.train.Saver(). That way you save tf.global_variables(), including the moving mean and moving variance.
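A minimal sketch of the difference being suggested (standard TF 1.x; the stand-in layer and variable names are only illustrative):

import tensorflow as tf

slim = tf.contrib.slim

# Stand-in layer just so the graph has both trainable weights and
# non-trainable batch-norm statistics.
net = slim.fully_connected(tf.zeros([4, 8]), 16, normalizer_fn=slim.batch_norm)

# Only trainable variables: batch norm's moving_mean / moving_variance are
# NOT trainable, so they never reach the checkpoint, and eval with
# is_training=False falls back to their initial values (mean 0, variance 1).
saver_trainable_only = tf.train.Saver(tf.trainable_variables())

# Default Saver: saves all global variables, including the moving statistics
# that is_training=False needs at eval time.
saver_all = tf.train.Saver()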

@xiao1228 (Author)

Thank you @NPetsky, you are right, it is due to the moving mean and variance. I found another way to solve it: just adding 'updates_collections': None to the batch_norm_params, as suggested in #1122.
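For reference, the workaround amounts to adding that key to the batch-norm parameters built inside mobilenet_v1_arg_scope; a sketch of the relevant dict (the exact default values in the repo may differ slightly):

is_training = True  # set to False when building the eval graph

batch_norm_params = {
    'is_training': is_training,
    'center': True,
    'scale': True,
    'decay': 0.997,    # the ImageNet value mentioned in this thread
    'epsilon': 0.001,
    # None makes slim.batch_norm update moving_mean / moving_variance in place
    # as part of the forward pass, instead of registering the update ops in
    # the GraphKeys.UPDATE_OPS collection.
    'updates_collections': None,
}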
However, I am using the code directly from the example, so I am wondering: for ImageNet training and eval, did people have the same issue, or does only CIFAR-10 cause this problem?

@NPetsky commented Jul 26, 2018

Good question, the example code is supposed to work without modification :)
Does your eval work now with 'updates_collections': None set during training? I thought that if you use slim.learning.create_train_op, the moving mean and moving variance are updated anyway and you don't need 'updates_collections': None (that option only updates the variables in place instead of adding the update ops to the GraphKeys.UPDATE_OPS collection).
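For comparison, a minimal sketch of the create_train_op path (the stand-in model and loss here are only to make the snippet self-contained, not the actual MobileNet training script):

import tensorflow as tf

slim = tf.contrib.slim

# Stand-in model/loss; in the real script this is the MobileNet v1 forward
# pass and its cross-entropy loss.
inputs = tf.zeros([8, 32])
logits = slim.fully_connected(inputs, 10,
                              normalizer_fn=slim.batch_norm)  # adds UPDATE_OPS
labels = tf.zeros([8], dtype=tf.int64)
loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
total_loss = tf.losses.get_total_loss()
optimizer = tf.train.GradientDescentOptimizer(0.01)

# slim.batch_norm registers its moving-average update ops in
# GraphKeys.UPDATE_OPS; passing them to create_train_op makes them run on
# every training step, which is why updates_collections=None should not be
# needed on this path.
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
train_op = slim.learning.create_train_op(total_loss, optimizer,
                                         update_ops=update_ops)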
Another link that may be useful: #11965

@xiao1228 (Author)

Yes, everything seems to work fine with 'updates_collections': None in both training and eval. However, as other issues mention (#1122), this may make training slower because the in-place updates are less efficient. But when I use slim.learning.create_train_op with update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS), like the original code, my eval results were still the same for different images.

Another reason might be the batch_norm_decay; #11965 also mentions this. The original value was 0.997 for ImageNet, and I changed it to 0.9. With a value like 0.997 it may require many more steps before the eval results change, and we don't really know what a reasonable number of steps is for CIFAR-10. With decay 0.997 and update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS) (the original code) I ran up to 10k steps and still got the same prediction for different images. But after changing it to 0.9 with 'updates_collections': None, within the first 50 steps or fewer I could already see the eval predictions giving different labels.
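For reference, a sketch of the decay change in the batch-norm parameters; the moving averages follow roughly moving = decay * moving + (1 - decay) * batch, so a smaller decay lets them track the data in far fewer update steps:

batch_norm_params = {
    'is_training': True,
    'center': True,
    'scale': True,
    # Smaller decay: the moving mean/variance move 10% of the way toward the
    # current batch statistics on every update, instead of 0.3% with 0.997,
    # so they become useful for eval much sooner on a small dataset.
    'decay': 0.9,
    'epsilon': 0.001,
    'updates_collections': None,  # update the moving statistics in place
}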
