I have a question regarding the number of recycling iterations used during training. In the AF2 paper they mention that the number of recycling iterations is a "shared value across the batch". However, from what I can tell, batch-level attributes during distributed training are actually defined at the micro-batch level here:

openfold/openfold/data/data_modules.py, lines 800 to 836 (commit ef0c9fa)
From my understanding, in both DDP and DeepSpeed each batch is split into micro-batches that are each sent to one GPU. The issue is that the batch splitting occurs in the `DistributedSampler` before the data even reaches the `OpenFoldDataLoader`. Ergo, all the properties that should be fixed at the batch level are actually defined at the micro-batch level, meaning that each GPU process could be running a different number of recycling iterations. Please let me know if I am reading this incorrectly, but apart from not matching the paper, wouldn't this be extremely wasteful, since at every gradient synchronization all GPUs would have to wait for the micro-batch with the largest `recycling_iters`?
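To make the waste concrete, here is a small back-of-the-envelope simulation (purely illustrative, not OpenFold code; the function name and the uniform sampling of recycling counts are my own assumptions). If each rank draws its own recycling count, the synchronized step cost is the per-step *maximum* across ranks, whereas a single shared draw would cost only the mean:

```python
import random

def expected_step_cost(world_size, max_recycles=3, trials=100_000, seed=0):
    """Estimate the average cost (in forward passes) of one synchronized
    training step when each rank samples its own number of recycles."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        # each rank independently draws 1..(max_recycles + 1) forward passes
        draws = [rng.randint(0, max_recycles) + 1 for _ in range(world_size)]
        # the gradient all-reduce waits for the slowest rank
        total += max(draws)
    return total / trials

# A single shared draw costs the mean, (1 + (max_recycles + 1)) / 2 = 2.5
# forward passes; with 8 independent ranks the expected max is close to 4,
# so most ranks idle through wasted recycling iterations.
```

So with 8 GPUs the per-step cost approaches the worst-case 4 forward passes instead of the 2.5 you would get from a batch-wide shared value, on top of the mismatch with the paper.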
For DDP we could simply use the `broadcast` API to send `recycling_iters` from rank 0 to the rest of the processes. Looking at the `DeepSpeedStrategy` code from Lightning, it seems to inherit from the `DDPStrategy` class, along with its `broadcast` method. The inherited method is actually used throughout the `DeepSpeedStrategy` class, so we should be fine to use it for both distributed training strategies.
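A minimal sketch of that idea using `torch.distributed` directly (the function name and the single-process fallback are my own, not OpenFold's actual API; it assumes the process group is already initialized, as it is under both DDP and DeepSpeed):

```python
import torch
import torch.distributed as dist

def broadcast_recycling_iters(recycling_iters: int, device="cpu") -> int:
    """Replace this rank's sampled recycling count with rank 0's value,
    so every process runs the same number of recycling iterations."""
    if not (dist.is_available() and dist.is_initialized()):
        # single-process training: nothing to synchronize
        return recycling_iters
    t = torch.tensor([recycling_iters], dtype=torch.long, device=device)
    dist.broadcast(t, src=0)  # in-place on non-source ranks
    return int(t.item())
```

Alternatively, since Lightning's `Strategy` classes expose a `broadcast(obj, src=0)` method that works on picklable Python objects, calling that from the `LightningModule` would presumably cover both strategies without touching raw tensors.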
Thanks for your help in advance :)