Distributed training setup #4

Draft
wants to merge 27 commits into main

Conversation

@MicPie commented Jan 12, 2022

PR for the distributed training setup.
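
For context, a minimal sketch of a PyTorch DDP setup of the kind this PR is about (assuming a launch via torchrun/torch.distributed.launch; the function and argument names are placeholders, not the actual code in this repo):

```python
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def setup_distributed(model: torch.nn.Module, local_rank: int) -> DDP:
    """Minimal DDP setup sketch: one process per GPU, gradients all-reduced with NCCL."""
    # torchrun sets RANK, WORLD_SIZE and MASTER_ADDR/PORT in the environment.
    dist.init_process_group(backend="nccl", init_method="env://")
    torch.cuda.set_device(local_rank)
    model = model.cuda(local_rank)
    # DDP synchronizes gradients across all workers during backward().
    return DDP(model, device_ids=[local_rank])
```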

…ytorch (state commit: 5a255eab032bcd821c2038c808b9682e485b3f1a)
@MicPie (Author) commented Jan 12, 2022

Items I am currently working on:

  • grad cache (gradient caching to enable larger contrastive batch sizes)
  • PyTorch AMP FP16 training (see the sketch after this list)
  • LR schedule
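
A minimal sketch of how the AMP FP16 training step and a per-step LR schedule could fit together (model, loader, optimizer, scheduler and loss_fn are placeholders, not the actual objects in this repo):

```python
import torch
from torch.cuda.amp import autocast, GradScaler

def train_one_epoch(model, loader, optimizer, scheduler, loss_fn, device):
    """One epoch of mixed-precision (FP16) training with a per-batch LR schedule."""
    scaler = GradScaler()  # dynamic loss scaling to avoid FP16 gradient underflow
    for images, texts in loader:
        images, texts = images.to(device), texts.to(device)
        optimizer.zero_grad(set_to_none=True)
        with autocast():                  # forward pass runs in mixed precision
            img_emb, txt_emb = model(images, texts)
            loss = loss_fn(img_emb, txt_emb)
        scaler.scale(loss).backward()     # backward on the scaled loss
        scaler.step(optimizer)            # unscales grads, skips the step on inf/NaN
        scaler.update()                   # adjust the loss scale for the next step
        scheduler.step()                  # e.g. a cosine schedule stepped per batch
```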

Other items that will be needed:

  • review and check the (web)dataset setup, incl. the text mask output and the validation dataset (see the loader sketch after this list)
  • add accuracy logging
  • add ImageNet eval
  • 8-bit Adam / ZeRO optimizer
  • test Horovod training if needed
  • test DeepSpeed training if needed
  • address the small TODOs in the code base
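
For the (web)dataset item above, a rough sketch of a loading pipeline (shards_url, preprocess_img and tokenize are placeholders; tokenize is assumed to return the token IDs together with the text mask):

```python
import webdataset as wds

def make_wds_loader(shards_url, batch_size, preprocess_img, tokenize, num_workers=4):
    """Sketch of a WebDataset pipeline yielding (image, (token_ids, text_mask)) batches."""
    dataset = (
        wds.WebDataset(shards_url)
        .shuffle(1000)                        # shuffle within a sample buffer
        .decode("pil")                        # decode images to PIL
        .to_tuple("jpg;png", "txt")           # (image, caption) pairs
        .map_tuple(preprocess_img, tokenize)  # tokenize also returns the text mask
        .batched(batch_size)
    )
    # Batching already happens in the pipeline, so the loader must not batch again.
    return wds.WebLoader(dataset, batch_size=None, num_workers=num_workers)
```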

Other stuff:

  • add the Hopfield network for CLOOB (InfoLOOB is already in place; see the loss sketch below)
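
Since InfoLOOB is mentioned above, a sketch of the InfoLOOB (leave-one-out) contrastive loss for reference, not necessarily identical to the implementation in this repo (embeddings are assumed to be L2-normalized; inv_tau is a placeholder inverse temperature):

```python
import torch

def info_loob_loss(img_emb, txt_emb, inv_tau=30.0):
    """InfoLOOB: like InfoNCE, but the positive pair is excluded from the denominator."""
    logits = inv_tau * img_emb @ txt_emb.t()               # (N, N) similarity matrix
    pos = torch.diagonal(logits)                           # positive-pair logits
    mask = torch.eye(logits.size(0), dtype=torch.bool, device=logits.device)
    neg = logits.masked_fill(mask, float("-inf"))          # keep negatives only
    loss_img = (torch.logsumexp(neg, dim=1) - pos).mean()  # image -> text direction
    loss_txt = (torch.logsumexp(neg, dim=0) - pos).mean()  # text -> image direction
    return (loss_img + loss_txt) / 2
```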
