
some question about the position of 'optimizer.zero_grad()' #238

Open
languandong opened this issue Dec 1, 2021 · 4 comments

@languandong

I think the correct way to code the training loop is this:

    optimizer.zero_grad()
    # Forward pass
    outputs = model(images)
    loss = criterion(outputs, labels)
    
    # Backward and optimize
    loss.backward()
    optimizer.step()

not this:

    # Forward pass
    outputs = model(images)
    loss = criterion(outputs, labels)
    
    # Backward and optimize
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
@Vandaci

Vandaci commented Jul 15, 2022

any difference?

@silky1708

@languandong
You can use either; it doesn't matter as long as optimizer.zero_grad() is called before loss.backward().
Note that optimizer.zero_grad() zeroes out the gradients stored in the tensors' .grad field, while loss.backward() computes the gradients and accumulates them into that same .grad field.
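
To make that interplay concrete, here is a minimal sketch (a hypothetical one-parameter example, not code from this repo) showing that loss.backward() accumulates into .grad and that optimizer.zero_grad() is what resets it:

    import torch

    # Hypothetical toy parameter: backward() *accumulates* into .grad,
    # and zero_grad() is what clears it between iterations.
    w = torch.tensor([1.0], requires_grad=True)
    optimizer = torch.optim.SGD([w], lr=0.1)

    (2 * w).sum().backward()
    print(w.grad)          # tensor([2.])

    (2 * w).sum().backward()
    print(w.grad)          # tensor([4.]) -- accumulated, not overwritten

    optimizer.zero_grad()  # clears .grad (sets it to zero or None)
    (2 * w).sum().backward()
    print(w.grad)          # tensor([2.]) again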

@githraj

githraj commented Oct 31, 2023

As pointed out by @languandong, the critical factor is the correct sequence in which optimizer.zero_grad() and loss.backward() are called. Both code snippets are valid as long as optimizer.zero_grad() is invoked before loss.backward(). This ensures that the gradients are properly zeroed out and then computed and stored in the appropriate tensors' grad field.
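
For contrast, a minimal sketch (hypothetical tiny model, plain SGD) of the one ordering that actually breaks training: zeroing between loss.backward() and optimizer.step() discards the freshly computed gradients, so the step updates nothing.

    import torch
    import torch.nn as nn

    model = nn.Linear(4, 2)
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
    images, labels = torch.randn(8, 4), torch.randint(0, 2, (8,))

    outputs = model(images)
    loss = criterion(outputs, labels)
    loss.backward()
    optimizer.zero_grad()   # wrong place: erases the gradients just computed
    optimizer.step()        # steps with empty gradients, so the weights do not change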

@luyuwuli

@languandong I think the confusion originates from the misconception that gradients are computed and stored during the forward pass. In fact, the forward pass only constructs the autograd DAG. Gradients are computed lazily: nothing is written to .grad until loss.backward() is explicitly invoked.
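
A quick way to see this laziness (hypothetical tiny model, not from the repo): .grad is still None after the forward pass and is only populated once backward() runs.

    import torch
    import torch.nn as nn

    model = nn.Linear(3, 1)
    x = torch.randn(5, 3)

    loss = model(x).sum()           # forward pass: autograd graph is recorded, no gradients yet
    print(model.weight.grad)        # None

    loss.backward()                 # gradients are computed and stored here
    print(model.weight.grad.shape)  # torch.Size([1, 3])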
