Mobilenet v1 with cifar10 unexpected behavior #21058

Closed
xiao1228 opened this issue Jul 23, 2018 · 5 comments
@xiao1228

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04
  • Mobile device (e.g. iPhone 8, Pixel 2, Samsung Galaxy) if the issue happens on mobile device: N/A
  • TensorFlow installed from (source or binary): source
  • TensorFlow version (use command below): 1.8
  • Python version: 3.5

Hi,

I am using the mobilenet_v1 example code, but instead of the ImageNet dataset I changed the data to CIFAR-10 and trained from scratch. The only change to the architecture is that the first conv layer uses stride 1 instead of 2:

_CONV_DEFS = [
    # First conv uses stride 1 (the ImageNet default is 2) so the 32x32
    # CIFAR-10 inputs are not downsampled too aggressively in the first layer.
    Conv(kernel=[3, 3], stride=1, depth=32),
    DepthSepConv(kernel=[3, 3], stride=1, depth=64),
    DepthSepConv(kernel=[3, 3], stride=2, depth=128),
    DepthSepConv(kernel=[3, 3], stride=1, depth=128),
    DepthSepConv(kernel=[3, 3], stride=2, depth=256),
    DepthSepConv(kernel=[3, 3], stride=1, depth=256),
    DepthSepConv(kernel=[3, 3], stride=2, depth=512),
    DepthSepConv(kernel=[3, 3], stride=1, depth=512),
    DepthSepConv(kernel=[3, 3], stride=1, depth=512),
    DepthSepConv(kernel=[3, 3], stride=1, depth=512),
    DepthSepConv(kernel=[3, 3], stride=1, depth=512),
    DepthSepConv(kernel=[3, 3], stride=1, depth=512),
    DepthSepConv(kernel=[3, 3], stride=2, depth=1024),
    DepthSepConv(kernel=[3, 3], stride=1, depth=1024)
]
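For context, this is roughly how the modified definitions get used when building the network; a minimal sketch, assuming the slim models repo layout (nets/mobilenet_v1.py) and its existing conv_defs argument, with placeholder shapes chosen for CIFAR-10:

import tensorflow as tf

from nets import mobilenet_v1  # assumes models/research/slim is on the path

# CIFAR-10 images are 32x32x3; with the first layer at stride 1 the network
# keeps more spatial resolution than with the ImageNet default of stride 2.
images = tf.placeholder(tf.float32, [None, 32, 32, 3])
logits, end_points = mobilenet_v1.mobilenet_v1(
    images,
    num_classes=10,        # CIFAR-10 has 10 classes
    is_training=True,
    conv_defs=_CONV_DEFS)  # the modified definitions shown above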

Training was no problem: the loss decreases and the predictions look good. But in evaluation, where I use the same code as mobilenet_v1_eval with CIFAR-10 as the input data, I am getting the same output for every image I pass to the model. I have double-checked that my input is definitely different every time, so it is very strange to get exactly the same output for different images.

[[-0.11333117 -0.5380551 0.18907356 0.7664664 0.07711207 0.04618246
0.13568665 0.1360816 -0.36744678 -0.33176792]]
[[-0.11333117 -0.5380551 0.18907356 0.7664664 0.07711207 0.04618246
0.13568665 0.1360816 -0.36744678 -0.33176792]]
[[-0.11333118 -0.5380551 0.18907356 0.7664664 0.07711206 0.04618246
0.13568665 0.1360816 -0.36744678 -0.33176792]]
[[-0.11333118 -0.5380551 0.18907356 0.7664664 0.07711206 0.04618246
0.13568665 0.1360816 -0.36744678 -0.33176792]]
[[-0.11333118 -0.5380551 0.18907356 0.7664664 0.07711206 0.04618246
0.13568665 0.1360816 -0.36744678 -0.33176792]]

Please help; any suggestion would be helpful! Thank you in advance!

@xiao1228 (Author)

I have figured out that the issue is due to slim.batch_norm; other people have had the same problem (e.g. tensorflow/models#3556).
BUT in the mobilenet_v1 eval code, scope = mobilenet_v1.mobilenet_v1_arg_scope(is_training=False, weight_decay=0.0).
If I set is_training to True in eval, it outputs different predictions; if I set is_training to False (which I think I should), the predictions are the same for different images.
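For reference, a minimal sketch of the eval-time setup being discussed (assuming the slim API in TF 1.8 and the conv_defs change above); with is_training=False, slim.batch_norm reads the stored moving mean/variance instead of per-batch statistics, so if those statistics were never updated or never restored, every image goes through the same normalization:

import tensorflow as tf

from nets import mobilenet_v1

slim = tf.contrib.slim

images = tf.placeholder(tf.float32, [None, 32, 32, 3])
# is_training=False makes slim.batch_norm use moving_mean / moving_variance
# (frozen statistics) rather than the statistics of the current batch.
scope = mobilenet_v1.mobilenet_v1_arg_scope(is_training=False, weight_decay=0.0)
with slim.arg_scope(scope):
    logits, _ = mobilenet_v1.mobilenet_v1(
        images, num_classes=10, is_training=False,
        conv_defs=_CONV_DEFS)  # the CIFAR-10 definitions above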

I see other people mention that training with slim.learning.create_train_op can solve the problem; this is what I am using, but I am still having the issue.

So I am confused about slim.batch_norm in mobilenet_v1 now: should I set is_training to True in eval, or is there something else that I am missing?
Thank you in advance.

@NPetsky commented Jul 25, 2018

Hello @xiao1228, I had a similar error, and I found out that I hadn't saved the moving mean and moving variance variables from slim.batch_norm after training, so I couldn't use is_training=False.
Do you create a Saver with tf.trainable_variables()? If so, you should drop the tf.trainable_variables() argument and create the saver like this: saver = tf.train.Saver(). That way you save tf.global_variables(), including the moving mean and moving variance.
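A minimal sketch of the difference being suggested (standard TF 1.x; the stand-in layer and variable names are only illustrative):

import tensorflow as tf

slim = tf.contrib.slim

# Stand-in layer just so the graph has both trainable weights and
# non-trainable batch-norm statistics.
net = slim.fully_connected(tf.zeros([4, 8]), 16, normalizer_fn=slim.batch_norm)

# Only trainable variables: batch norm's moving_mean / moving_variance are
# NOT trainable, so they never reach the checkpoint, and eval with
# is_training=False falls back to their initial values (mean 0, variance 1).
saver_trainable_only = tf.train.Saver(tf.trainable_variables())

# Default Saver: saves all global variables, including the moving statistics
# that is_training=False needs at eval time.
saver_all = tf.train.Saver()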

@xiao1228 (Author)

Thank you @NPetsky, you are right, it is due to the moving mean and variance. I found another way to solve it: just adding 'updates_collections': None to the batch_norm_params, as suggested in #1122.
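For reference, the workaround amounts to adding that key to the batch-norm parameters built inside mobilenet_v1_arg_scope; a sketch of the relevant dict (the exact default values in the repo may differ slightly):

is_training = True  # set to False when building the eval graph

batch_norm_params = {
    'is_training': is_training,
    'center': True,
    'scale': True,
    'decay': 0.997,    # the ImageNet value mentioned in this thread
    'epsilon': 0.001,
    # None makes slim.batch_norm update moving_mean / moving_variance in place
    # as part of the forward pass, instead of registering the update ops in
    # the GraphKeys.UPDATE_OPS collection.
    'updates_collections': None,
}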
However, I am using the code directly from the example, so I am wondering: for ImageNet training and eval, did people have the same issue, or does only CIFAR-10 cause this problem?

@NPetsky commented Jul 26, 2018

Good question, the example code is supposed to work without modification :)
Does your eval work now with 'updates_collections': None set during training? I thought that if you use slim.learning.create_train_op, the moving mean and moving variance are updated anyway and you don't need 'updates_collections': None (that option only updates the variables in place instead of adding the update ops to the GraphKeys.UPDATE_OPS collection).
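For comparison, a minimal sketch of the create_train_op path (the stand-in model and loss here are only to make the snippet self-contained, not the actual MobileNet training script):

import tensorflow as tf

slim = tf.contrib.slim

# Stand-in model/loss; in the real script this is the MobileNet v1 forward
# pass and its cross-entropy loss.
inputs = tf.zeros([8, 32])
logits = slim.fully_connected(inputs, 10,
                              normalizer_fn=slim.batch_norm)  # adds UPDATE_OPS
labels = tf.zeros([8], dtype=tf.int64)
loss = tf.losses.sparse_softmax_cross_entropy(labels=labels, logits=logits)
total_loss = tf.losses.get_total_loss()
optimizer = tf.train.GradientDescentOptimizer(0.01)

# slim.batch_norm registers its moving-average update ops in
# GraphKeys.UPDATE_OPS; passing them to create_train_op makes them run on
# every training step, which is why updates_collections=None should not be
# needed on this path.
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
train_op = slim.learning.create_train_op(total_loss, optimizer,
                                         update_ops=update_ops)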
Another link that may be useful: #11965

@xiao1228 (Author)

Yes, everything seems to work fine with 'updates_collections': None in both training and eval. However, as other issues mention (#1122), this may make training slower because the in-place updates are less efficient. But when I use slim.learning.create_train_op with update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS), like the original code, my eval results were still the same for different images.

Another reason might be the batch_norm_decay; #11965 also mentions this. The original value was 0.997 for ImageNet, and I changed it to 0.9. With a value like 0.997 it may require many more steps before the eval results change, and we don't really know what a reasonable number of steps is for CIFAR-10. With decay 0.997 and update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS) (the original code) I ran up to 10k steps and still got the same prediction for different images. But after changing it to 0.9 with 'updates_collections': None, within the first 50 steps or fewer I could already see the eval predictions giving different labels.
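For reference, a sketch of the decay change in the batch-norm parameters; the moving averages follow roughly moving = decay * moving + (1 - decay) * batch, so a smaller decay lets them track the data in far fewer update steps:

batch_norm_params = {
    'is_training': True,
    'center': True,
    'scale': True,
    # Smaller decay: the moving mean/variance move 10% of the way toward the
    # current batch statistics on every update, instead of 0.3% with 0.997,
    # so they become useful for eval much sooner on a small dataset.
    'decay': 0.9,
    'epsilon': 0.001,
    'updates_collections': None,  # update the moving statistics in place
}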
