
Moving average and moving variance in Batchnorm aren't updated #11965

Closed
idofr opened this issue Aug 2, 2017 · 13 comments
Labels
stat:community support (Status - Community Support), type:support (Support issues)

Comments

@idofr

idofr commented Aug 2, 2017

System information

  • Have I written custom code (as opposed to using a stock example script provided in TensorFlow): No
  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): Windows 10
  • TensorFlow installed from (source or binary): pip
  • TensorFlow version (use command below): 1.2.1
  • Python version: 3.5.3
  • Bazel version (if compiling from source): None
  • CUDA/cuDNN version: 8/5.1
  • GPU model and memory: GeForce 1080
  • Exact command to reproduce:

Describe the problem

I'm using the slim wrapper, which in turn returns an instance of BatchNormalization from layers/normalization.py. All parameters are left at their defaults except for scale, which is set to True (i.e. the gamma scaler is added). After training, when looking at the learned parameters, I notice that all the moving means in the network are still 0 and all the moving variances are still 1, i.e. they were never updated.

Neither variable shows up in tf.trainable_variables(), which might explain the lack of updates. However, since these are not actually learned but rather calculated from the batch statistics, I'm not sure whether the optimiser is supposed to update them at all.
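
For reference, a minimal sketch of this kind of check (the tiny network below is illustrative only, not the actual model):

    import tensorflow as tf
    slim = tf.contrib.slim

    x = tf.placeholder(tf.float32, [None, 8])
    net = slim.fully_connected(x, 16,
                               normalizer_fn=slim.batch_norm,
                               normalizer_params={'scale': True})

    # The moving statistics are created as (non-trainable) global variables ...
    moving_stats = [v for v in tf.global_variables()
                    if 'moving_mean' in v.name or 'moving_variance' in v.name]
    print([v.name for v in moving_stats])

    # ... but they are absent from the trainable set, so optimizer.minimize()
    # alone will never touch them:
    print([v.name for v in tf.trainable_variables() if 'moving' in v.name])  # []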

@idofr
Author

idofr commented Aug 2, 2017

I can't edit my original message, so I'll just add a comment.
I tried running the test function with is_training=False (but with the exact same checkpoint as before). The accuracy dropped from ~98% to roughly 12%.

My theory is that the batchnorm layer keeps the mean and variance variables somewhere other than the place it reports to the collection.

poxvoculi assigned mrry and unassigned mrry Aug 2, 2017
@ppwwyyxx
Contributor

ppwwyyxx commented Aug 3, 2017

You probably missed this note, which is in the documentation of batch_norm:

  Note: when training, the moving_mean and moving_variance need to be updated.
  By default the update ops are placed in `tf.GraphKeys.UPDATE_OPS`, so they
  need to be added as a dependency to the `train_op`. For example:

    update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
    with tf.control_dependencies(update_ops):
      train_op = optimizer.minimize(loss)

  One can set updates_collections=None to force the updates in place, but that
  can have a speed penalty, especially in distributed settings.
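
Put together with slim, a minimal end-to-end sketch (the model, loss and optimiser below are placeholders for illustration):

    import tensorflow as tf
    slim = tf.contrib.slim

    x = tf.placeholder(tf.float32, [None, 8])
    labels = tf.placeholder(tf.float32, [None, 2])

    logits = slim.fully_connected(x, 2,
                                  normalizer_fn=slim.batch_norm,
                                  normalizer_params={'scale': True,
                                                     'is_training': True})
    loss = tf.losses.softmax_cross_entropy(labels, logits)

    # The assign ops for moving_mean / moving_variance sit in UPDATE_OPS;
    # making train_op depend on them guarantees they run on every step.
    update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
    with tf.control_dependencies(update_ops):
        train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)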

@idofr
Author

idofr commented Aug 3, 2017

Ok, this was indeed the problem. Many thanks.

Do I need to merge this collection with any other one for normal training? Why does it make sense to have it set up like this?

Are the statistics (mean and variance) also updated without this change to the optimiser setup?

@ppwwyyxx
Contributor

ppwwyyxx commented Aug 3, 2017

Other layers don't have similar caveats, AFAIK.

It makes sense because the moving averages are not updated by gradient descent, so there has to be a separate mechanism for updating them.
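
As for the last question above: with the default settings they are not; the moving statistics only change when the ops in UPDATE_OPS are actually run. The in-place alternative mentioned in the doc excerpt looks roughly like this (minimal sketch):

    import tensorflow as tf
    slim = tf.contrib.slim

    x = tf.placeholder(tf.float32, [None, 8])
    # With updates_collections=None the moving-statistics updates are added as
    # control dependencies of the layer output, so any train_op built on top of
    # it runs them automatically -- at the speed cost mentioned above.
    net = slim.batch_norm(x, is_training=True, scale=True,
                          updates_collections=None)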

poxvoculi added the type:support (Support issues) and stat:community support (Status - Community Support) labels Aug 3, 2017
@idofr
Author

idofr commented Aug 7, 2017

Many thanks for the help and the info.
Yet I fear I'll have to ask for a re-open, as the problem only seems to be half solved.
The mean and variance are properly saved and loaded now (why does it save the variance, by the way, when the standard deviation is what's needed?). However, when evaluating the model with is_training=False the accuracy is still around 35%, while the same script with is_training=True gets around 97%.

I checked that all the weights and parameters are loaded properly, and everything seems to be in place.

@idofr
Author

idofr commented Aug 8, 2017

Same as #1122

https://stackoverflow.com/questions/42770757/tensorflow-batch-norm-does-not-work-properly-when-testing-is-training-false

https://stackoverflow.com/questions/39353503/tensorflow-tf-slim-model-with-is-training-true-and-false?rq=1

https://stackoverflow.com/questions/44211371/tensorflow-batch-norm-breaks-network-when-is-training-false?rq=1

I'm currently training again with a lower decay to confirm #1122 and will update tomorrow.

Update: the lower decay rate (0.9) together with updates_collections=None seemed to do the trick.

@keven425
Contributor

keven425 commented Aug 28, 2017

I am experiencing the same issue. My validation accuracy on CIFAR-10 is lower with batchnorm than without. I have added the tf.GraphKeys.UPDATE_OPS dependencies to the optimizer and set is_training=False during validation. I'm on TensorFlow 1.3.

Why is a particular decay rate required for batch_norm to work? Is there a bug in the batch_norm implementation?

@shahar-scopio

Make sure that you collect tf.GraphKeys.UPDATE_OPS with the right name scope:

    update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS, name_scope)
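
For example (the scope name here is purely illustrative):

    import tensorflow as tf
    slim = tf.contrib.slim

    x = tf.placeholder(tf.float32, [None, 8])
    with tf.variable_scope('tower_0'):
        net = slim.batch_norm(x, is_training=True)

    # Only the update ops created under 'tower_0' are returned:
    update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS, 'tower_0')
    print(update_ops)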

@huosan0123

Changing decay=0.999 to 0.9 works fine for me.
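
This makes sense given how the moving statistics are accumulated; a plain-Python sketch of the exponential moving average (the numbers are illustrative):

    # Per step: moving_stat <- decay * moving_stat + (1 - decay) * batch_stat
    def ema(moving, batch_stat, decay):
        return decay * moving + (1.0 - decay) * batch_stat

    moving_mean = 0.0                       # initial value used by batch norm
    for _ in range(1000):                   # e.g. 1000 training steps
        moving_mean = ema(moving_mean, 5.0, decay=0.999)
    print(moving_mean)                      # ~3.2: still far from the batch mean of 5.0

    moving_mean = 0.0
    for _ in range(1000):
        moving_mean = ema(moving_mean, 5.0, decay=0.9)
    print(moving_mean)                      # ~5.0: effectively converged

So with the default decay=0.999 the statistics only become reliable after many thousands of update steps; for shorter runs a lower decay brings is_training=False much closer to the training-time behaviour.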

@dishank-b

@ppwwyyxx I read https://github.com/tensorflow/tensorflow/blob/r1.8/tensorflow/python/layers/normalization.py to see the implementation of tf.layers.batch_normalization().
But in that code I could not find any control dependency being added for the moving mean and variance, and there is no line that puts the moving-average updates into the tf.GraphKeys.UPDATE_OPS collection.

@ppwwyyxx
Contributor

ppwwyyxx commented Jun 5, 2018

    self.add_update(mean_update, inputs=inputs)
    self.add_update(variance_update, inputs=inputs)

@dishank-b

dishank-b commented Jun 5, 2018

Can you please point me to the add_update function?

@facaiy
Member

facaiy commented Jun 6, 2018

I think its add_update method is inherited from base.Layer:

    def add_update(self, updates, inputs=None):
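
So the update ops are registered on the layer object via add_update rather than written into the collection inside normalization.py itself; with the functional tf.layers.batch_normalization wrapper they still end up in tf.GraphKeys.UPDATE_OPS, which a quick sketch can confirm:

    import tensorflow as tf

    x = tf.placeholder(tf.float32, [None, 4])
    y = tf.layers.batch_normalization(x, training=True)

    # The moving_mean / moving_variance assign ops registered via add_update
    # show up here (as AssignMovingAvg ops):
    print(tf.get_collection(tf.GraphKeys.UPDATE_OPS))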
