
Adding DS Feature API in accelerator #5423

Open · duli2012 wants to merge 17 commits into base: master

Conversation

duli2012 (Member):

This PR is a prototype of adding an API for capabilities in accelerators, including:

  1. defining capabilities in abstract_accelerator
  2. setting capabilities in cuda_accelerator

Hardware vendors are welcome to define capabilities for their own hardware.
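As a hedged illustration, a vendor-side definition might look like the following sketch; the class name, capability keys, and defaults here are hypothetical, not part of this PR:

    # Hypothetical sketch of a vendor accelerator opting in to capabilities.
    # Assumes the _capabilities dict introduced in abstract_accelerator below.
    # (A real subclass must also implement the abstract methods of the base class.)
    from deepspeed.accelerator.abstract_accelerator import DeepSpeedAccelerator

    class MyVendorAccelerator(DeepSpeedAccelerator):

        def __init__(self):
            super().__init__()
            self._name = 'myvendor'
            # Flip on the features this hardware actually supports.
            self._capabilities.update({
                "zero1": True,
                "zero2": True,
            })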

@@ -12,7 +12,12 @@ class DeepSpeedAccelerator(ABC):
    def __init__(self):
        self._name = None
        self._communication_backend_name = None

        self._capabilities: dict[str, bool] = {
            "zero1": False,
Contributor:

These hardcoded constants, such as "zero1", should be defined as symbolic constants in a global location.

duli2012 (Member, Author):

@tjruwase As we discussed, I collected the names in *OpBuilder and put them in constants.py. Do we need to add other features, like optimizers, as DS features too?

tjruwase requested a review from mrwyattii on April 17, 2024.
delock (Contributor) commented Apr 18, 2024:

Hi @duli2012, thanks for adding this interface. I have been worried that the accelerator interface may grow too big as we propose more and more capabilities, and this interface is a good way to put all capabilities in one place. My comments below:

  1. zero1/2/3 sound like accelerator-agnostic features. Basically, they should be supported as long as the accelerator runtime and communication collectives are well implemented.
  2. sparse_attn falls into the category of whether a certain OpBuilder is implemented. The previous way of telling whether an op is implemented is through compatible ops, as in the following line. I think it is better to move this capability group into the capability interface, since the new interface looks more intuitive to use. Is it possible to make it auto-reflect the accelerator's OpBuilder implementation state, so there is less maintenance work (see the sketch after this comment)? A minor suggestion: name the capabilities in this group with a common prefix, i.e. op.sparse_attn.
    if not deepspeed.ops.__compatible_ops__[InferenceBuilder.NAME]:
  3. For 1-bit Adam, I agree this is a case that could be covered by this interface.

What should we do with the already existing accelerator interfaces that fall into the category of capabilities? Should they be added to the new interface or kept as they are? I think that is open for discussion.
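A minimal sketch of that auto-reflection idea, assuming deepspeed.ops.__compatible_ops__ is a mapping from op name to a compatibility boolean (the helper name here is hypothetical):

    # Hypothetical sketch: derive op capabilities directly from
    # __compatible_ops__, so a newly added OpBuilder shows up without
    # any manual bookkeeping.
    from deepspeed.ops import __compatible_ops__

    def collect_op_capabilities() -> dict:
        # Prefix each op with "op." as suggested above, e.g. "op.sparse_attn".
        return {f"op.{name}": supported for name, supported in __compatible_ops__.items()}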

duli2012 requested a review from tjruwase on April 18, 2024 (commits: used __compatible_ops__; added more op names to constants.py)

duli2012 changed the title from "Adding capability API in accelerator" to "Adding DS Feature API in accelerator" on Apr 20, 2024.
duli2012 (Member, Author):


Thanks @delock for your comments. I integrated most of them; please take a look.

OP_FUSED_LAMB = "fused_lamb"
OP_FUSED_LION = "fused_lion"
OP_INFERENCE_CORE_OPS = "inference_core_ops"
OP_CUTLASS_OPS = "cutlass_ops"
rogerxfeng8 (Contributor):

Is a general name needed to cover the non-CUDA devices?

Contributor:

Hi @rogerxfeng8 - is there a specific one you're referencing? I believe we call all devices (CUDA and non-CUDA) accelerators.

delock (Contributor) commented Apr 24, 2024:

Thanks @duli2012, my intuition is that ZeRO 1/2/3 should not be among the accelerator feature list. The ZeRO stage code is shared between different accelerators and there is no interface specific to a ZeRO stage, so I wonder what would make an accelerator not support the ZeRO features?

For op-related features, i.e. OP_ASYNC_IO, etc., I think there needs to be a mechanism to sync them with the OpBuilder implementation state automatically; otherwise there will be a manual maintenance cost each time a new op is introduced.


class DeepSpeedAccelerator(ABC):

    def __init__(self):
        self._name = None
        self._communication_backend_name = None
        self._ds_features: dict[str, bool] = {ZERO_1: False, ZERO_2: False, ZERO_3: False}
        self._ds_features.update({op: compatibility for op, compatibility in __compatible_ops__.items()})
delock (Contributor) commented Apr 25, 2024:

This reflection mechanism had better be lazily initialized; otherwise there might be a circular dependency, because this __init__ function can be called before __compatible_ops__ has been initialized.
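A minimal sketch of that lazy initialization, assuming __compatible_ops__ is a dict and the ZERO_* constants come from constants.py; the names here are illustrative, not the PR's final shape:

    # Hypothetical sketch: defer reading __compatible_ops__ until first
    # access, so constructing an accelerator no longer depends on
    # deepspeed.ops having finished importing.
    from abc import ABC

    ZERO_1, ZERO_2, ZERO_3 = "zero1", "zero2", "zero3"  # stand-ins for constants.py

    class DeepSpeedAccelerator(ABC):

        def __init__(self):
            self._name = None
            self._communication_backend_name = None
            self._ds_features = None  # populated lazily on first access

        @property
        def ds_features(self) -> dict:
            if self._ds_features is None:
                # Local import breaks the circular dependency at module load time.
                from deepspeed.ops import __compatible_ops__
                self._ds_features = {ZERO_1: False, ZERO_2: False, ZERO_3: False}
                self._ds_features.update(__compatible_ops__)
            return self._ds_features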
