Add DDP support to hivemind.optim #475

Draft · wants to merge 2 commits into master
Conversation

borzunov (Member) commented May 31, 2022

Status: This PR is an early draft intended to validate the design of hivemind.DDPOptimizer. I haven't run the code even once yet.

Co-authored-by: @justheuristic


class DDPOptimizer(Optimizer):
    _DDP_LEADER_RANK = 0
    _BROADCAST_BUFFER_SIZE = 250 * 1024 ** 2
Review comment (Member):

New pytorch seems to have finally implemented broadcast_coalesced in distributed; we can use it directly (https://pytorch.org/docs/stable/_modules/torch/nn/parallel/comm.html#broadcast_coalesced) as long as we bump the minimal pytorch version. What do you think?
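
For context, a minimal sketch of the non-coalesced fallback this draft implies: the leader rank broadcasts each parameter/buffer tensor to the other DDP ranks with plain torch.distributed.broadcast. The helper name and the tensors argument are assumptions, not code from this PR; the coalesced variant linked above would bucket tensors before each collective instead.

import torch.distributed as dist

_DDP_LEADER_RANK = 0  # same constant as in the class above


def broadcast_state_from_leader(tensors):
    """Hypothetical helper: broadcast every tensor from the leader to all DDP ranks.

    A coalesced implementation (bucketing tensors up to _BROADCAST_BUFFER_SIZE bytes
    per collective, as torch's broadcast_coalesced does) would reduce the number of
    broadcast calls; this per-tensor loop is the simplest portable fallback.
    """
    if not dist.is_initialized():
        return  # not running under DDP, nothing to synchronize
    for tensor in tensors:
        dist.broadcast(tensor, src=_DDP_LEADER_RANK)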

    return torch.distributed.is_initialized()

@staticmethod
def is_ddp_leader():
Review comment (Member):

nit: would recommend reusing the same terminology as existing code, e.g. the one used inside DistributedDataParallel.

For instance, the above DDP uses:

  • leader rank -> authoritative rank
  • is_ddp_enabled -> _initialized

codecov bot commented May 31, 2022

Codecov Report

Merging #475 (83ff269) into master (97deaee) will decrease coverage by 0.92%.
The diff coverage is 5.00%.

@@            Coverage Diff             @@
##           master     #475      +/-   ##
==========================================
- Coverage   83.45%   82.53%   -0.93%     
==========================================
  Files          81       82       +1     
  Lines        8083     8175      +92     
==========================================
+ Hits         6746     6747       +1     
- Misses       1337     1428      +91     
Impacted Files                       Coverage Δ
hivemind/optim/ddp.py                0.00% <0.00%> (ø)
hivemind/optim/optimizer.py          69.40% <100.00%> (-0.26%) ⬇️
hivemind/optim/state_averager.py     86.09% <100.00%> (ø)
hivemind/optim/progress_tracker.py   97.80% <0.00%> (-1.10%) ⬇️
hivemind/averaging/matchmaking.py    84.52% <0.00%> (+0.59%) ⬆️
hivemind/averaging/averager.py       89.07% <0.00%> (+0.71%) ⬆️
hivemind/utils/asyncio.py            100.00% <0.00%> (+0.86%) ⬆️

    return self.is_ddp_leader() and super().is_alive()

def _compute_state_version(self) -> int:
    """Return a non-decreasing integer that goes up whenever model params and/or buffers were updated"""
Review comment (Member):

This function is meant as a workaround to catch the moment when the optimizer has updated parameters (loaded state from peers, applied an optimizer step, averaged params).

All changes to state are currently handled in StateAverager. Maybe we can implement StateAverager.local_version that gets incremented every time StateAverager loads, averages, or has its state updated by the optimizer.
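
A rough sketch of that local_version idea (hypothetical, not existing hivemind API): StateAverager keeps a monotonically increasing counter and bumps it after every load, averaging round, or optimizer step, so DDPOptimizer can compare counters instead of inspecting the tensors themselves.

class StateAverager:  # sketch only; the real class lives in hivemind/optim/state_averager.py
    def __init__(self, *args, **kwargs):
        self._local_version = 0  # bumped whenever params/buffers change

    @property
    def local_version(self) -> int:
        """Non-decreasing counter of local state updates."""
        return self._local_version

    def _mark_state_updated(self) -> None:
        # to be called after load_state_from_peers, averaging rounds, and optimizer steps
        self._local_version += 1

DDPOptimizer._compute_state_version() could then simply return the wrapped state averager's local_version.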

if self.is_ddp_leader():
    super().load_state_from_peers(**kwargs)

self._sync_among_ddp_ranks()
Review comment (Member):

We should not synchronize here: non-master ranks cannot call this, and we will deadlock.

We should only sync in step(): after the step, check whether the master updated/loaded/averaged state, and only then broadcast.
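
A sketch of the flow suggested here, where only the leader runs the hivemind step and a small flag broadcast tells followers whether a parameter broadcast follows, so every rank enters the collective together. Method and constant names are the ones from this draft; the exact placement is an assumption.

import torch.distributed as dist

def step(self, *args, **kwargs):
    if self.is_ddp_leader():
        old_version = self._compute_state_version()
        loss = super().step(*args, **kwargs)
        flag = [self._compute_state_version() != old_version]
    else:
        loss, flag = None, [None]

    # The leader tells followers whether its state changed (loaded / averaged / stepped),
    # so either all ranks enter the broadcast collective or none does; no deadlock.
    dist.broadcast_object_list(flag, src=self._DDP_LEADER_RANK)
    if flag[0]:
        self._sync_among_ddp_ranks()
    return loss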

if self.is_ddp_leader():
    super().load_state_dict(state_dict)

self._sync_among_ddp_ranks()
Review comment (Member):

We should not synchronize here: non-master ranks cannot call this and we will deadlock; see load_state_from_peers above.


def shutdown(self):
    if self.is_ddp_leader():
        super().shutdown()
Review comment (Member):

Optional: else raise NotImplementedError or warn?
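
For illustration, the warn variant could look like this; only the leader owns background workers to shut down, and logger is the module-level logger defined elsewhere in the file:

def shutdown(self):
    if self.is_ddp_leader():
        super().shutdown()
    else:
        # reviewer's alternative: warn (or raise NotImplementedError) on follower ranks
        logger.warning("shutdown() is a no-op on non-leader DDP ranks")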


def is_alive(self) -> bool:
    # On followers, this always returns False since there's nothing to shut down in __del__()
    return self.is_ddp_leader() and super().is_alive()
Review comment (Member):

if leader:
    return is_alive
else:
    raise NotImplementedError?

@@ -131,10 +131,10 @@ def __init__(
     )

     @staticmethod
-    def _check_params(
+    def check_params(
Review comment (borzunov, author):

Suggested change:
-    def check_params(
+    def prepare_params(

logger = get_logger(__name__)


class DDPOptimizer(Optimizer):
Review comment (borzunov, author):

Note to self: A better way to do it is (see the sketch after this list):

  • Don't inherit hivemind.Optimizer
  • Make _create_optimizer() method and forward __init__'s kwargs there
  • Make opt property
  • Maybe create __getattr__ that can forward attrs to opt
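
A hedged sketch of that composition-based layout; _create_optimizer, opt, and the __getattr__ forwarding are the hypothetical pieces from the note above, not existing hivemind API:

import torch.distributed as dist

import hivemind


class DDPOptimizer:
    """Wraps hivemind.Optimizer instead of inheriting from it (sketch only)."""

    _DDP_LEADER_RANK = 0

    def __init__(self, **kwargs):
        # Only the DDP leader actually runs hivemind.Optimizer; followers keep
        # self._opt = None and rely on broadcasts from the leader instead.
        self._opt = self._create_optimizer(**kwargs) if self.is_ddp_leader() else None

    @classmethod
    def is_ddp_leader(cls) -> bool:
        return not dist.is_initialized() or dist.get_rank() == cls._DDP_LEADER_RANK

    def _create_optimizer(self, **kwargs) -> hivemind.Optimizer:
        # forward __init__'s kwargs to the wrapped optimizer
        return hivemind.Optimizer(**kwargs)

    @property
    def opt(self) -> hivemind.Optimizer:
        return self._opt

    def __getattr__(self, name):
        # Forward any other attribute lookups to the wrapped hivemind.Optimizer.
        # (__getattr__ is only called for attributes not found the normal way.)
        if name == "_opt":
            raise AttributeError(name)
        return getattr(self._opt, name)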

