Add a note in the docs about the momentum formulation used in optim #1099

Closed
keskarnitish opened this issue Mar 25, 2017 · 3 comments · Fixed by #1196

Comments

@keskarnitish
Contributor

I have been looking at the implementation of SGD + Momentum in PyTorch and noticed something a bit different from how other packages (and papers) describe it. For the moment, let's focus solely on (classical) momentum and not Nesterov's version.

At the time of writing, the implementation reads:

    if momentum != 0:
        param_state = self.state[p]
        if 'momentum_buffer' not in param_state:
            buf = param_state['momentum_buffer'] = d_p.clone()
        else:
            buf = param_state['momentum_buffer']
            buf.mul_(momentum).add_(1 - dampening, d_p)
        if nesterov:
            d_p = d_p.add(momentum, buf)
        else:
            d_p = buf

    p.data.add_(-group['lr'], d_p)

Mathematically, if we denote the momentum buffer by v and assume dampening=0, then at every iteration the buffer is updated as v = m*v + g and the step is ∆x = lr * v. Notice that the learning rate lr scales the momentum term v as well as the gradient. To me, this differs from the classical momentum formulation, and also from how other packages implement SGD+M.
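Written out as a minimal Python sketch (not the actual optim.SGD code; the function name is hypothetical, and it assumes dampening=0 and no Nesterov), the rule above is:

    def pytorch_style_step(x, g, v, lr, momentum):
        # v = m*v + g: the buffer accumulates raw gradients
        v = momentum * v + g
        # step = lr * v: the learning rate scales the whole velocity,
        # i.e. the accumulated momentum term as well as the new gradient
        x = x - lr * v
        return x, v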

Let us contrast this with the Sutskever et al. paper and other commonly used packages such as Lasagne, Keras, and Neon.

Sutskever et al.

[Screenshot of the relevant update equations from Sutskever et al. omitted; the rule it describes is summarized below.]

Retaining the notation from above, the algorithm updates v as v = m*v - lr * g with the step ∆x = v. So, the learning rate lr only scales the gradient; it does not (explicitly) scale the momentum term, which is in contrast with PyTorch's implementation.
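As a minimal Python sketch (same caveats as above: hypothetical function name, dampening=0, no Nesterov), the Sutskever-style rule is:

    def sutskever_style_step(x, g, v, lr, momentum):
        # v = m*v - lr*g: the learning rate scales only the new gradient
        v = momentum * v - lr * g
        # step = v: the accumulated velocity is applied as-is
        x = x + v
        return x, v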

Lasagne

Lasagne employs the same rule as the one suggested by Sutskever et al. for momentum.

    for param in params:
        value = param.get_value(borrow=True)
        velocity = theano.shared(np.zeros(value.shape, dtype=value.dtype),
                                 broadcastable=param.broadcastable)
        x = momentum * velocity + updates[param]
        updates[velocity] = x - param
        updates[param] = x  # i.e. param + new velocity

Keras

Same for Keras:

    for p, g, m in zip(params, grads, moments):
        v = self.momentum * m - lr * g  # velocity
        self.updates.append(K.update(m, v))

        if self.nesterov:
            new_p = p + self.momentum * v - lr * g
        else:
            new_p = p + v

Neon

and Neon.

    velocity[:] = self.momentum_coef * velocity - lrate * grad

    # Nesterov accelerated gradient (NAG) is implemented the same
    # as in torch's "sgd.lua". It's a reformulation of Sutskever's
    # NAG equation found in "On the importance of initialization
    # and momentum in deep learning".
    if self.nesterov:
        param[:] = param + self.momentum_coef * velocity -\
                   lrate * grad
    else:
        param[:] = param + velocity

Is this disparity real, or am I missing something important?

The difference between the two implementations is not insignificant, especially when lr is reduced along the way. If my claim is true, maybe we could update the reference (I'm not sure what it would be) or include the above version in the SGD code (I can take this up if necessary)?

@colesbury
Member

For a fixed learning rate, the two formulations are equivalent. The Torch formulation is chosen because the step size is directly proportional to the learning rate. This means that if you decrease the learning rate, the step size decreases immediately, not after some number of iterations, which is generally what you want.
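A tiny toy run (a hypothetical scalar example, not PyTorch code) makes both points concrete: with a fixed learning rate the two rules trace out the same iterates, but they diverge as soon as the rate is dropped mid-run.

    momentum = 0.9

    def run(lr_schedule, formulation):
        # Minimize f(x) = x^2 / 2, so the gradient is just x.
        x, v = 1.0, 0.0
        for lr in lr_schedule:
            g = x
            if formulation == 'torch':
                v = momentum * v + g       # buffer holds gradients
                x = x - lr * v             # lr scales the whole step
            else:                          # 'sutskever'
                v = momentum * v - lr * g  # buffer already has lr baked in
                x = x + v
        return x

    fixed = [0.1] * 10
    decayed = [0.1] * 5 + [0.01] * 5

    # Identical (up to rounding) while lr stays constant ...
    assert abs(run(fixed, 'torch') - run(fixed, 'sutskever')) < 1e-12
    # ... but not once the learning rate is cut mid-training.
    print(run(decayed, 'torch'), run(decayed, 'sutskever'))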

@apaszke changed the title from "Implementation of SGD + Momentum" to "Add a note in the docs about the momentum formulation used in optim" on Mar 25, 2017
@keskarnitish
Contributor Author

I agree. My only concern was that, given that the reference for the method is the Sutskever paper and there is no documentation to explain the difference, the current implementation could be a potential "gotcha" for folks moving to PyTorch from other frameworks.

@soumith
Member

soumith commented Apr 5, 2017

@keskarnitish if you send a PR adding a note to the docs, I am happy to merge.

keskarnitish added a commit to keskarnitish/pytorch that referenced this issue Apr 5, 2017