Hi, I've recently been using SWA to train my network for a Re-ID task, but I don't see any obvious improvement (the results are almost the same as without SWA) when training with Adam.
So, can we use Adam or another optimizer instead of SGD to train the network, if we want to improve it with SWA?
Hi, sorry for the delayed response. In my experience SWA works best with SGD. Adam sets the learning rates adaptively, which is not ideal for SWA. However, we did see some improvement with other optimizers as well. I recommend tuning the learning rate schedule (try increasing the learning rates during the SWA stage), or switching to SGD for the SWA stage.
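For concreteness, here is a minimal sketch of that recipe: train with Adam first, then switch to SGD with a constant, relatively high learning rate for the SWA stage. It uses the `torch.optim.swa_utils` helpers from recent PyTorch (>= 1.6) rather than the torchcontrib optimizer; the model, data, and hyperparameters are placeholders, not tuned values:

```python
import torch
import torch.nn as nn
from torch.optim.swa_utils import AveragedModel, SWALR

# Placeholder model and synthetic data, just to make the sketch runnable.
model = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 10))
loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(torch.randn(512, 128),
                                   torch.randint(0, 10, (512,))),
    batch_size=64)
criterion = nn.CrossEntropyLoss()

# Phase 1: ordinary training with Adam.
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
for epoch in range(20):
    for x, y in loader:
        optimizer.zero_grad()
        criterion(model(x), y).backward()
        optimizer.step()

# Phase 2 (SWA stage): switch to SGD with a constant, higher learning
# rate and average the weights collected along the trajectory.
swa_optimizer = torch.optim.SGD(model.parameters(), lr=0.05, momentum=0.9)
swa_model = AveragedModel(model)
swa_scheduler = SWALR(swa_optimizer, swa_lr=0.05)  # keeps the LR constant here
for epoch in range(10):
    for x, y in loader:
        swa_optimizer.zero_grad()
        criterion(model(x), y).backward()
        swa_optimizer.step()
    swa_scheduler.step()
    swa_model.update_parameters(model)
```

Whether an SWA learning rate like 0.05 is appropriate for a Re-ID model is something you would need to tune; the point of the sketch is only the structure of the two phases.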
As far as I can see, Adam in TensorFlow has trainable parameters of its own, so the question is: should we exclude these parameters from averaging? Same question for the BN trainable parameters.
Hey @mrgloom. The Adam parameters and BN statistics are not trainable parameters of the network: the former are tensors stored in the optimizer state, and the latter are buffers of the model (the running mean and variance). They should not be averaged. However, you do need to fix the batch-norm statistics for the SWA model at the end of training (https://pytorch.org/blog/stochastic-weight-averaging-in-pytorch/#batch-normalization).
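In recent PyTorch this BN fix is a one-liner via `torch.optim.swa_utils.update_bn`; the torchcontrib optimizer described in the linked blog post exposes the same idea as `opt.bn_update(train_loader, model)`. A minimal sketch, reusing the hypothetical `loader` and `swa_model` from the example above:

```python
from torch.optim.swa_utils import update_bn  # available in PyTorch >= 1.6

# One pass over the training data recomputes the BN running mean/variance
# buffers for the averaged weights; trainable parameters and optimizer
# state (e.g. Adam moments) are left untouched.
update_bn(loader, swa_model)
```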