
Adding DS Feature API in accelerator #5423

Open · duli2012 wants to merge 17 commits into base: master

Conversation

duli2012 (Member):

This PR is a prototype of adding an API for capabilities in accelerators, including:

  1. defining capabilities in abstract_accelerator
  2. setting capabilities in cuda_accelerator

Hardware vendors are welcome to define capabilities for their own hardware.
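As a hedged illustration, a vendor-side definition might look like the following sketch; the class name, capability keys, and defaults here are hypothetical, not part of this PR:

    # Hypothetical sketch of a vendor accelerator opting in to capabilities.
    # Assumes the _capabilities dict introduced in abstract_accelerator below.
    # (A real subclass must also implement the abstract methods of the base class.)
    from deepspeed.accelerator.abstract_accelerator import DeepSpeedAccelerator

    class MyVendorAccelerator(DeepSpeedAccelerator):

        def __init__(self):
            super().__init__()
            self._name = 'myvendor'
            # Flip on the features this hardware actually supports.
            self._capabilities.update({
                "zero1": True,
                "zero2": True,
            })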

@@ -12,7 +12,12 @@ class DeepSpeedAccelerator(ABC):
    def __init__(self):
        self._name = None
        self._communication_backend_name = None

        self._capabilities: dict[str, bool] = {
            "zero1": False,
Contributor:

These hardcoded constants, such as "zero1", should be defined as symbolic constants in a global location.

duli2012 (Member, Author):

@tjruwase As we discussed, I collected the names in *OpBuilder and put them in constants.py. Do we need to add other features, like optimizers, as DS features too?

tjruwase requested a review from mrwyattii on April 17, 2024.
delock (Contributor) commented Apr 18, 2024:

Hi @duli2012, thanks for adding this interface. I have been worried that the accelerator interface may grow too big as we propose more and more capabilities, and this interface is a good way to put all capabilities in one place. My comments below:

  1. zero1/2/3 sound like accelerator-agnostic features. Basically, they should be supported as long as the accelerator runtime and communication collectives are well implemented.
  2. sparse_attn falls into the category of whether a certain OpBuilder is implemented. The previous way of telling whether an op is implemented is through compatible ops, as in the following line. I think it is better to move this capability group into the capability interface, since the new interface looks more intuitive to use. Is it possible to make it auto-reflect the accelerator's OpBuilder implementation state, so there is less maintenance work (see the sketch after this comment)? A minor suggestion: name the capabilities in this group with a common prefix, i.e. op.sparse_attn.
    if not deepspeed.ops.__compatible_ops__[InferenceBuilder.NAME]:
  3. For 1-bit Adam, I agree this is a case that could be covered by this interface.

What should we do with the already existing accelerator interfaces that fall into the category of capabilities? Should they be added to the new interface or kept as they are? I think that is open for discussion.
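A minimal sketch of that auto-reflection idea, assuming deepspeed.ops.__compatible_ops__ is a mapping from op name to a compatibility boolean (the helper name here is hypothetical):

    # Hypothetical sketch: derive op capabilities directly from
    # __compatible_ops__, so a newly added OpBuilder shows up without
    # any manual bookkeeping.
    from deepspeed.ops import __compatible_ops__

    def collect_op_capabilities() -> dict:
        # Prefix each op with "op." as suggested above, e.g. "op.sparse_attn".
        return {f"op.{name}": supported for name, supported in __compatible_ops__.items()}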

duli2012 requested a review from tjruwase on April 18, 2024 (commits: used __compatible_ops__; added more op names to constants.py)

duli2012 changed the title from "Adding capability API in accelerator" to "Adding DS Feature API in accelerator" on Apr 20, 2024.
duli2012 (Member, Author):


Thanks @delock for your comments. I integrated most of them; please take a look.

OP_FUSED_LAMB = "fused_lamb"
OP_FUSED_LION = "fused_lion"
OP_INFERENCE_CORE_OPS = "inference_core_ops"
OP_CUTLASS_OPS = "cutlass_ops"
rogerxfeng8 (Contributor):

Is a general name needed to cover the non-CUDA devices?

Contributor:

Hi @rogerxfeng8 - is there a specific one you're referencing? I believe we call all devices (CUDA and non-CUDA) accelerators.

delock (Contributor) commented Apr 24, 2024:

Thanks @duli2012, my intuition is that ZeRO 1/2/3 should not be among the accelerator feature list. The ZeRO stage code is shared between different accelerators and there is no interface specific to a ZeRO stage, so I wonder what would make an accelerator not support the ZeRO features?

For op-related features, i.e. OP_ASYNC_IO, etc., I think there needs to be a mechanism to sync them with the OpBuilder implementation state automatically; otherwise there will be a manual maintenance cost each time a new op is introduced.


class DeepSpeedAccelerator(ABC):

    def __init__(self):
        self._name = None
        self._communication_backend_name = None
        self._ds_features: dict[str, bool] = {ZERO_1: False, ZERO_2: False, ZERO_3: False}
        self._ds_features.update({op: compatibility for op, compatibility in __compatible_ops__.items()})
delock (Contributor) commented Apr 25, 2024:

This reflection mechanism had better be lazily initialized; otherwise there might be a circular dependency, because this __init__ function can be called before __compatible_ops__ has been initialized.
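A minimal sketch of that lazy initialization, assuming __compatible_ops__ is a dict and the ZERO_* constants come from constants.py; the names here are illustrative, not the PR's final shape:

    # Hypothetical sketch: defer reading __compatible_ops__ until first
    # access, so constructing an accelerator no longer depends on
    # deepspeed.ops having finished importing.
    from abc import ABC

    ZERO_1, ZERO_2, ZERO_3 = "zero1", "zero2", "zero3"  # stand-ins for constants.py

    class DeepSpeedAccelerator(ABC):

        def __init__(self):
            self._name = None
            self._communication_backend_name = None
            self._ds_features = None  # populated lazily on first access

        @property
        def ds_features(self) -> dict:
            if self._ds_features is None:
                # Local import breaks the circular dependency at module load time.
                from deepspeed.ops import __compatible_ops__
                self._ds_features = {ZERO_1: False, ZERO_2: False, ZERO_3: False}
                self._ds_features.update(__compatible_ops__)
            return self._ds_features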
