Running stats with gradient checkpointing #1035

Open
vovaf709 opened this issue Jul 20, 2022 · 8 comments
@vovaf709

vovaf709 commented Jul 20, 2022

According to the patch_batchnorm source code, if a layer that collects running stats (e.g. BatchNorm) is checkpointed, it accumulates statistics only when grad is enabled (i.e. during the recomputation on the backward pass). This introduces an inconsistency:

import torch
from torch import nn
from fairscale.nn.checkpoint import checkpoint_wrapper

torch.manual_seed(1337)
seq = nn.Sequential(nn.Conv2d(4, 4, 3), nn.BatchNorm2d(4))
torch.manual_seed(1337)
seq_checkpointed = checkpoint_wrapper(nn.Sequential(nn.Conv2d(4, 4, 3), nn.BatchNorm2d(4)))

inp = torch.randn(2, 4, 16, 16)

# forward pass only, no backward
seq(inp)
seq_checkpointed(inp)

# the plain BatchNorm has updated its running stats, the checkpointed one has not
print(seq[1].running_mean == seq_checkpointed[1].running_mean)
# tensor([False, False, False, False])

I think this behaviour should be changed so that statistics are accumulated on the first forward pass, or at least mentioned in the docs.
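
To make the current behaviour concrete, here is a rough sketch of the kind of patching described above. This is just an illustration, not the actual patch_batchnorm implementation, and the helper name is made up: the BatchNorm forward is wrapped so that the running-stat update is skipped whenever grad is disabled (the checkpoint's first, no-grad forward) and only happens during the grad-enabled recomputation on the backward pass.

import functools
import torch
from torch import nn

def patch_bn_skip_stats_when_no_grad(bn):
    # Illustration only: wrap a _BatchNorm forward so running stats are
    # not updated while autograd is disabled (the checkpoint's outer,
    # no-grad forward). The stats then only change during the grad-enabled
    # recomputation that happens on the backward pass.
    assert isinstance(bn, nn.modules.batchnorm._BatchNorm)
    orig_forward = bn.forward

    @functools.wraps(orig_forward)
    def new_forward(x):
        if not torch.is_grad_enabled():
            saved = bn.track_running_stats
            bn.track_running_stats = False
            try:
                return orig_forward(x)
            finally:
                bn.track_running_stats = saved
        return orig_forward(x)

    bn.forward = new_forward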

vovaf709 changed the title from "Runnings stats with gradient checkpointing" to "Running stats with gradient checkpointing" on Jul 20, 2022
@min-xu-ai
Contributor

Thanks for reporting. I slightly modified your code to demonstrate how it works:

import torch
from torch import nn
from fairscale.nn.checkpoint import checkpoint_wrapper

torch.manual_seed(1337)
seq = nn.Sequential(nn.Conv2d(4, 4, 3), nn.BatchNorm2d(4))
torch.manual_seed(1337)
seq_checkpointed = checkpoint_wrapper(nn.Sequential(nn.Conv2d(4, 4, 3), nn.BatchNorm2d(4)))

inp = torch.randn(2, 4, 16, 16).requires_grad_(True)

out = seq(inp)
out_ck = seq_checkpointed(inp)
torch.testing.assert_close(out, out_ck)

out.sum().backward()
out_ck.sum().backward()


print(seq[1].running_mean)
print(seq_checkpointed[1].running_mean)
torch.testing.assert_close(seq[1].running_mean, seq_checkpointed[1].running_mean)
# passes: the running stats match once backward has run

As you can see, you need to run the backward pass to make running_mean match; the forward pass alone is not enough. checkpoint_wrapper is meant for training, so only doing the forward pass does not make much sense IMHO. With the backward pass, the stats match correctly.

@vovaf709
Author

There is one Kaggle trick: you can run multiple forward passes on the test set to adapt the running stats to it. It's a weird case, but still :)

@min-xu-ai
Contributor

Oh I see. That’s interesting! Do you have example code or pseudo code for it?

min-xu-ai reopened this Jul 21, 2022
@vovaf709
Author

Code for this trick? If so, it is as simple as

# keep BatchNorm in train mode so it keeps updating its running stats,
# then run forward-only passes over (part of) the test set
model.train()
for (X, y), _ in zip(test_loader, range(n_iter)):
    model(X)

@min-xu-ai
Contributor

I see. Then after this loop, do you proceed with normal training for 1 epoch, or with the whole training for N epochs?

@vovaf709
Author

No, I run this loop on the test set (on which I want to get the highest target metric in the competition) after the whole training. The idea is to adapt the BN statistics to the test set, which can have a slightly different distribution.
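
If it helps, here is a slightly fuller sketch of the trick. The function name and the optional reset of the running stats are my own additions for illustration, not from any library:

from torch import nn

def adapt_bn_to_test_set(model, test_loader, n_iter):
    # optional: forget the training-set statistics before re-estimating them
    for m in model.modules():
        if isinstance(m, nn.modules.batchnorm._BatchNorm):
            m.reset_running_stats()

    model.train()  # BatchNorm only updates its running stats in train mode
    for (X, y), _ in zip(test_loader, range(n_iter)):
        model(X)   # forward only, no backward
    model.eval()

Note that with the current checkpoint_wrapper behaviour described above, this forward-only loop would not update the running stats at all, which is exactly the inconsistency I reported.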

@vovaf709
Author

I think I can come up with a solution next week, ok?

@nyngwang

@vovaf709 ok
