CPUAdam fp16 and bf16 support #5409

Merged
merged 37 commits into microsoft:master from hab_cpu_adam on May 20, 2024

Conversation

@BacharL (Contributor) commented Apr 14, 2024

Hi.
Please review the following changes.
I added support for BF16 to CPU Adam. BF16, FP16, and float are all supported at compilation time; the correct template is called at runtime according to the input params' dtype.
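In rough terms, the runtime selection looks like the following sketch (the function names and the simplified optimizer math are placeholders, not the actual DeepSpeed kernels):

```cpp
#include <torch/extension.h>

// Illustrative sketch only: adam_step_impl and its update math are placeholders.
// param_t is c10::Half, c10::BFloat16, or float; the math itself runs in fp32.
template <typename param_t>
void adam_step_impl(torch::Tensor& params, torch::Tensor& grads, torch::Tensor& exp_avg)
{
    param_t* p = (param_t*)params.data_ptr();
    param_t* g = (param_t*)grads.data_ptr();
    float* m = (float*)exp_avg.data_ptr();
    for (int64_t i = 0; i < params.numel(); ++i) {
        float grad = (float)g[i];
        m[i] = 0.9f * m[i] + 0.1f * grad;              // simplified first moment
        p[i] = (param_t)((float)p[i] - 1e-3f * m[i]);  // simplified update
    }
}

// All three dtypes are instantiated at compile time; the input tensor's dtype
// picks the instantiation at runtime.
void adam_step(torch::Tensor& params, torch::Tensor& grads, torch::Tensor& exp_avg)
{
    switch (params.scalar_type()) {
        case at::ScalarType::Half:     adam_step_impl<c10::Half>(params, grads, exp_avg); break;
        case at::ScalarType::BFloat16: adam_step_impl<c10::BFloat16>(params, grads, exp_avg); break;
        case at::ScalarType::Float:    adam_step_impl<float>(params, grads, exp_avg); break;
        default: TORCH_CHECK(false, "unsupported dtype for CPU Adam");
    }
}
```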

@BacharL marked this pull request as draft on April 15, 2024 19:57
@BacharL force-pushed the hab_cpu_adam branch 2 times, most recently from a9d5b2c to 11ddda8 on April 17, 2024 11:58
@BacharL changed the title from "[SW-173858] CPUAdam fp16 and bf16 support" to "CPUAdam fp16 and bf16 support" on Apr 17, 2024
@BacharL marked this pull request as ready for review on May 2, 2024 07:06
csrc/includes/cpu_adagrad.h: outdated review comments, resolved (2 threads)
op_builder/cpu/cpu_adam.py: outdated review comment, resolved
#endif

typedef HALF_DTYPE ds_half_precision_t;
@tjruwase (Contributor) commented May 4, 2024

@BacharL, this amazing PR of yours has made me realize that ds_half_precision_t was not well thought out for optimizer offloading. It did not anticipate that device type could be anything other than fp16 and bf16. Consequently, we now require users to set a confusing compiler option: -DHALF_DTYPE=float.

In retrospect, ds_device_precision_t seems like a better name, and the compiler option could be -DDEVICE_DTYPE=float.
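A minimal sketch of how that rename could look in the header, assuming a default device dtype is still provided when the build does not override it (the macro default of __half here is an assumption, not the actual code):

```cpp
// Illustrative sketch of the proposed rename; DEVICE_DTYPE and its default are assumptions.
#ifndef DEVICE_DTYPE
#define DEVICE_DTYPE __half  // overridable at build time, e.g. -DDEVICE_DTYPE=float
#endif

typedef DEVICE_DTYPE ds_device_precision_t;
```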

Similarly, the half_precision bool variables scattered around the code are also redundant. Thanks for removing some of them in this PR.

@BacharL, what do you think?

@BacharL (Contributor, Author) replied:

I will fix the naming of these; however, I think ds_half_precision_t causes a discrepancy in buffer handling.
Step_AVX will handle the buffer on the CPU with FP16 or BF16 as implemented in AVX; the cast to fp32 is done in the simd.h SIMD_LOAD_XX macros.
But the remaining part of the buffer will be handled in Step_1 with the dtype from the device, which may be __half or __bfloat16. This dtype may not produce bit-exact results compared to the CPU implementation; there the cast to fp32 is done inside the main loop of Step_1.
I think even the AVX and native C++ casts may not be compatible. So if we already have this issue, why do we need to add device dtypes here? We can limit this code to c10::Half or c10::BFloat16 only.

We can also remove more half_precision variables by passing _params and dev_params with the correct type instead of float*, and fix their usage inside the CUDA and AVX code.
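Roughly, the split being described looks like this sketch (the helper names, the assumed vector width, and the simplified math are illustrative, not the actual Step_AVX/Step_1 code); note how passing a typed params pointer removes the need for a separate half_precision flag:

```cpp
#include <cstddef>

// Illustrative only: T would be c10::Half, c10::BFloat16, or float, and
// load_as_fp32/store_from_fp32 stand in for the SIMD_LOAD_XX/SIMD_STORE_XX style macros.
template <typename T>
void step(T* params, T* grads, float* exp_avg, size_t n)
{
    constexpr size_t SIMD_WIDTH = 8;  // assumed vector width
    const size_t vec_end = (n / SIMD_WIDTH) * SIMD_WIDTH;

    // "Step_AVX"-style body: the cast to fp32 happens inside the vector load helper.
    for (size_t i = 0; i < vec_end; i += SIMD_WIDTH) {
        // auto g = load_as_fp32(grads + i);  // SIMD load + widen to fp32
        // ... fp32 optimizer math on vector registers ...
        // store_from_fp32(params + i, p);    // narrow + SIMD store
    }

    // "Step_1"-style scalar tail: per-element cast to fp32 and back to T.
    for (size_t i = vec_end; i < n; ++i) {
        float g = static_cast<float>(grads[i]);
        exp_avg[i] = 0.9f * exp_avg[i] + 0.1f * g;  // simplified first moment
        params[i] = static_cast<T>(static_cast<float>(params[i]) - 1e-3f * exp_avg[i]);
    }
}
```

Whether the vector path and the scalar tail round identically for the same element is exactly the bit-exactness concern raised above.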

@tjruwase (Contributor) replied:

Thanks for raising questions about the discrepancies in this code base. I agree that the code is unduly complicated to the point of being bug-prone. I think this occurred because of the uncontrolled evolution of the code. Hopefully, some measure of cleanliness can be restored through careful usage of templates and torch data type support. In the meantime, I will answer your question below.

I think even AVX/Native C++ cast may not be compatible. So if we already have this issue, why do we need to add device dtypes here? we can limit this code to c10::Half or c10::BFloat16 only

I think adding the device dtype is useful for making the data type conversions of the offloading computation explicit (the full round trip is sketched in code after this list):

  1. Forward/backward computation on the device using device_dtype (currently fp16, bf16, or fp32) with output of gradients in device_dtype.
  2. Conversion step of gradients from device_dtype to host_dtype which is input for subsequent optimizer computation on host.
  3. Optimizer step computation on host using host_dtype (currently fp32) and returning param of host_dtype.
  4. Conversion of (updated) host_dtype param into device_dtype param which is input for subsequent forward/backward computation on device.
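A schematic sketch of that round trip, with placeholder names (offloaded_step, optimizer_step) used only to make the conversions explicit; host_dtype is fp32 today, and device_dtype is whatever the forward/backward ran in:

```cpp
#include <cstddef>

// Placeholder for the host-side optimizer math of step 3 (simplified update).
template <typename host_dtype>
void optimizer_step(host_dtype* params, const host_dtype* grads, size_t n)
{
    for (size_t i = 0; i < n; ++i) params[i] -= host_dtype(1e-3) * grads[i];
}

// Illustrative round trip for one offloaded optimizer step; all names are placeholders.
template <typename device_dtype, typename host_dtype>
void offloaded_step(const device_dtype* dev_grads, device_dtype* dev_params,
                    host_dtype* host_grads, host_dtype* host_params, size_t n)
{
    // (1) forward/backward already ran on the device, producing dev_grads in device_dtype.

    // (2) widen gradients from device_dtype to host_dtype for the host-side optimizer.
    for (size_t i = 0; i < n; ++i)
        host_grads[i] = static_cast<host_dtype>(static_cast<float>(dev_grads[i]));

    // (3) optimizer step on the host in host_dtype (currently fp32).
    optimizer_step(host_params, host_grads, n);

    // (4) narrow the updated params back to device_dtype for the next forward/backward.
    for (size_t i = 0; i < n; ++i)
        dev_params[i] = static_cast<device_dtype>(static_cast<float>(host_params[i]));
}
```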

@tjruwase (Contributor) commented May 4, 2024

@BacharL, thanks for this incredible improvement to the offloading optimizers and op builders. I left a few comments and questions, but overall looks good to me.

BacharL added 12 commits May 5, 2024 17:21
cpu adam will use dtype from input tensor
no need for HALF_DTYPE define during compilation

dispatch map to select template

@BacharL (Contributor, Author) commented May 8, 2024

Added a templated invoker to help select the implementation.
The map stores function pointers to templated functions, keyed by the dtype enum. At initialization, all supported dtypes are instantiated and inserted into the map.
I didn't clean up ds_adagrad_step_plus_copy and the related code under __ENABLE_CUDA__, but I also couldn't test it.
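A minimal sketch of that invoker pattern, assuming a std::unordered_map keyed by at::ScalarType holding function pointers to the instantiated step functions (the names are placeholders for the real entry points):

```cpp
#include <torch/extension.h>
#include <unordered_map>

// Illustrative only: adam_step_impl stands in for the real templated step function.
template <typename T>
void adam_step_impl(torch::Tensor& params, torch::Tensor& grads)
{
    // fp32 optimizer math over T-typed buffers goes here.
}

using step_fn_t = void (*)(torch::Tensor&, torch::Tensor&);

// Built once at initialization: every supported dtype has its instantiation registered.
static const std::unordered_map<at::ScalarType, step_fn_t> step_invokers = {
    {at::ScalarType::Half,     &adam_step_impl<c10::Half>},
    {at::ScalarType::BFloat16, &adam_step_impl<c10::BFloat16>},
    {at::ScalarType::Float,    &adam_step_impl<float>},
};

void adam_step(torch::Tensor& params, torch::Tensor& grads)
{
    auto it = step_invokers.find(params.scalar_type());
    TORCH_CHECK(it != step_invokers.end(), "unsupported dtype for CPU Adam");
    it->second(params, grads);  // runtime dispatch on the input tensor's dtype
}
```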

tjruwase and others added 11 commits May 8, 2024 11:43
This reverts commit f18eef3.
BacharL and others added 3 commits May 19, 2024 12:04
@tjruwase added this pull request to the merge queue on May 20, 2024
Merged via the merge queue into microsoft:master with commit 69af361 on May 20, 2024
13 checks passed

3 participants