
How to use gradient accumulate in BytePS torch DDP? #417

Open
wuyujiji opened this issue Nov 2, 2021 · 5 comments
Labels
enhancement New feature or request

Comments

@wuyujiji

wuyujiji commented Nov 2, 2021

Do you have a demo of gradient accumulation with BytePS torch DDP? I cannot find one in byteps/torch/example.

@aDecisionTree

I'm also interested in this~

ymjiang added the enhancement (New feature or request) label Nov 3, 2021
@ymjiang
Member

ymjiang commented Nov 3, 2021

bps.DistributedOptimizer supports gradient accumulation with the backward_passes_per_step option.

bps.DistributedDataParallel does not support it for now. We will add this feature.
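For reference, a minimal construction sketch, assuming backward_passes_per_step can also be passed to the constructor (mirroring Horovod's API; the exact signature may differ):

import byteps.torch as bps

# Assumption: the option is accepted at construction time; otherwise use
# optimizer.set_backward_passes_per_step(...) as shown later in this thread.
optimizer = bps.DistributedOptimizer(optimizer,
                                     named_parameters=model.named_parameters(),
                                     backward_passes_per_step=accumulation_steps)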

@wuyujiji
Author

wuyujiji commented Nov 3, 2021

Could you please share a complete gradient accumulation demo for bps.DistributedOptimizer?

@ymjiang
Member

ymjiang commented Nov 3, 2021

Here is a general workflow:

import byteps.torch as bps

optimizer = bps.DistributedOptimizer(optimizer)
optimizer.set_backward_passes_per_step(accumulation_steps)

model.zero_grad()
for i, (inputs, labels) in enumerate(training_set):
    predictions = model(inputs)
    loss = loss_function(predictions, labels)
    loss = loss / accumulation_steps               # optional: average over the accumulated steps
    loss.backward()
    if (i + 1) % accumulation_steps == 0:          # update only once every accumulation_steps passes
        optimizer.step()
        model.zero_grad()

We will consider adding an example later.

@wuyujiji
Copy link
Author

wuyujiji commented Nov 3, 2021

Thanks for the quick reply! If I want to use torch.cuda.amp with the code above, how should I add it?
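For reference, a sketch of one way to layer torch.cuda.amp onto the workflow above, using the standard autocast/GradScaler pattern; this has not been validated against BytePS's asynchronous push-pull, so treat it as an assumption rather than a confirmed recipe:

from torch.cuda.amp import autocast, GradScaler
import byteps.torch as bps

scaler = GradScaler()
optimizer = bps.DistributedOptimizer(optimizer)
optimizer.set_backward_passes_per_step(accumulation_steps)

model.zero_grad()
for i, (inputs, labels) in enumerate(training_set):
    with autocast():                               # run the forward pass in mixed precision
        predictions = model(inputs)
        loss = loss_function(predictions, labels)
        loss = loss / accumulation_steps           # optional: average over the accumulated steps
    scaler.scale(loss).backward()                  # scale the loss so fp16 gradients do not underflow
    if (i + 1) % accumulation_steps == 0:
        scaler.step(optimizer)                     # unscales gradients, then calls optimizer.step()
        scaler.update()
        model.zero_grad()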
