This PR fixes two errors I encountered when trying to resume training a model from a checkpoint. The first involves basis expansion, and the second involves batch normalization.
## Basis expansion
Here's a simplified example that reproduces the first error I encountered:
Here's the resulting stack trace:
The error ultimately happens because `BlocksBasisExpansion` uses `meshgrid()` to create its `in_indices_*` and `out_indices_*` buffers. The significance of `meshgrid()` is that it uses stride tricks to save space. For example, to make a tensor where every row is the same, `meshgrid()` will allocate only a single row, then set the stride such that that same memory gets reused for every row. This ends up causing problems because `Module.load_state_dict()` uses `Tensor.copy_()` to copy the checkpointed parameters/buffers back into the module. This operation fails if "more than one element of the written-to tensor refers to a single memory location", i.e. if the destination tensor is using stride tricks. Here's the actual code in question:
`escnn/escnn/nn/modules/basismanager/basisexpansion_blocks.py`, lines 128 to 134 (commit fec08a3)
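For context, the stride-trick behavior and the resulting `copy_()` failure can be reproduced in a standalone sketch (not escnn code; assumes a recent PyTorch with the `indexing` keyword on `meshgrid()`):

```python
import torch

# meshgrid() returns expanded views: the repeated dimension has stride 0,
# so every row (or column) aliases the same underlying memory.
rows, cols = torch.meshgrid(torch.arange(3), torch.arange(4), indexing="ij")
print(rows.stride())  # e.g. (1, 0): stride 0 along the repeated dimension

# copy_() refuses to write into a tensor whose elements alias each other,
# which is exactly what load_state_dict() does when restoring a buffer.
try:
    rows.copy_(torch.zeros(3, 4, dtype=rows.dtype))
except RuntimeError as err:
    print(err)  # "... more than one element of the written-to tensor ..."
```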
I didn't mention the `reshape()` calls above, but they're significant in that they usually—but don't always—get rid of the stride tricks. I constructed the $\psi_1 \oplus \psi_0 \oplus \psi_1$ representation in the above example specifically so that the stride tricks would be kept, and thus trigger the error.

The obvious solution is to do what the error message suggests, and call `clone()` after `reshape()`. I can confirm that this works, but after looking at the code more closely, I think the best solution is to simply not store these indices in a buffer at all. My understanding is that buffers are for things like the running averages in batch normalization layers: data that aren't subject to optimization (i.e. not parameters), but that still change over the course of a training run and need to be restored from checkpoints. These indices don't ever change, so there's no reason for them to be buffers. They can just be normal object attributes, and the whole problem of loading them from checkpoints goes away.

## Batch normalization
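Both errors in this PR come down to what lands in a module's state dict: registered buffers are checkpointed and restored via `copy_()`, while plain attributes are simply rebuilt by `__init__()`. A minimal sketch of that distinction (the module names here are made up for illustration):

```python
import torch
from torch import nn

class IndicesAsBuffer(nn.Module):
    def __init__(self):
        super().__init__()
        # appears in state_dict(); load_state_dict() will copy_() into it
        self.register_buffer("indices", torch.arange(4))

class IndicesAsAttribute(nn.Module):
    def __init__(self):
        super().__init__()
        # never checkpointed; rebuilt from scratch on every construction
        self.indices = torch.arange(4)

print(list(IndicesAsBuffer().state_dict()))     # ['indices']
print(list(IndicesAsAttribute().state_dict()))  # []
```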
Here's a simplified example that reproduces the second error I encountered:
To get the error, you first need to run this script with the `-k` option to create a checkpoint. You then need to run the script without the `-k` option a couple of times, because the crash isn't deterministic. Eventually, you'll get the following stack trace:

This error happens because `_IIDBatchNorm` registers its buffers in a random order. More specifically, `_IIDBatchNorm` uses a set to get rid of duplicate representations, and then registers its buffers within a loop over that set. But set iteration order actually changes each time Python runs, because Python chooses a different random value to incorporate into the hash values of some built-in types each time it starts. This apparently helps protect servers written in Python from DoS attacks.

The actual crash happens when the optimizer tries to update the parameters after having been restored from a checkpoint. The checkpoint contains some metadata on each parameter, stored in whatever order the parameters were originally generated in. When the optimizer is reconstituted with the parameters in a different order, the checkpointed metadata gets applied to the wrong parameters. The best-case scenario at this point is for the program to crash, which happens when the parameters have incompatible dimensions. The worst-case scenario is that the program doesn't crash, and instead effectively shuffles the metadata. I believe this will happen if each different representation has the same multiplicity. Models such as the example SE(3) CNN might exhibit this behavior.
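The per-run variation is easy to demonstrate without escnn. This sketch simulates separate interpreter runs by launching subprocesses with different `PYTHONHASHSEED` values (the representation names are made up):

```python
import subprocess
import sys

# Hash randomization means the iteration order of a set of strings can
# differ between interpreter runs; fixing PYTHONHASHSEED per subprocess
# stands in for "running the script again".
code = "print('|'.join({'psi_0', 'psi_1', 'psi_2', 'psi_3'}))"
orders = set()
for seed in range(5):
    out = subprocess.run(
        [sys.executable, "-c", code],
        env={"PYTHONHASHSEED": str(seed)},
        capture_output=True, text=True,
    ).stdout.strip()
    orders.add(out)
    print(f"PYTHONHASHSEED={seed}: {out}")

# Across several seeds, more than one distinct order almost always appears.
print(len(orders) > 1)
```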
The solution is to iterate over the full list of representations, and to manually remove duplicates. This guarantees that the parameters will be generated in the same order every time. While I was making this fix, I noticed the same bug in the `GnormBatchNorm` module, so I fixed it there, too.
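A minimal sketch of that kind of order-preserving deduplication (a hypothetical helper, not the actual escnn code):

```python
def unique_in_order(reps):
    """Deduplicate while preserving first-seen order, so that buffers and
    parameters are always registered in the same deterministic order."""
    seen = set()
    out = []
    for r in reps:
        if r not in seen:
            seen.add(r)
            out.append(r)
    return out

print(unique_in_order(["psi_1", "psi_0", "psi_1", "psi_2"]))
# -> ['psi_1', 'psi_0', 'psi_2']
```

For hashable items, `list(dict.fromkeys(reps))` achieves the same thing, since Python dicts preserve insertion order.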