[REP] AWS accelerators trn1_inf support #39

Open · wants to merge 1 commit into base: main

Conversation

@chappidim commented Jul 27, 2023


This enhancement proposal briefly describes AWS accelerator (Trainium/Inferentia) support in Ray.

Related: #33504

Signed-off-by: maheedhar reddy chappidi <chappidm@amazon.com>
@scv119 (Contributor) commented Jul 28, 2023

Will take a first round review this weekend!

@scv119 (Contributor) commented Jul 30, 2023

Looks like there are two main questions:

  • Should we use a different resource name other than GPU?

This is mainly because Ray Train code is expected to need no code changes if a GPU is available, and that doesn't seem to be the case for AWS's accelerators (they require code changes). If we want to stick with GPU, we should have a way to differentiate NVIDIA GPUs from other XLA accelerators.

  • Do all XLA devices follow the same PyTorch API?

This is somewhat related to the previous question. I assume that's the case, and we should make sure the API works out of the box for other XLA devices, like TPU.
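
For context on the second question, here is a minimal sketch of the device-agnostic surface documented by open-source torch-xla, which TPU and Neuron builds are both expected to expose; whether AWS's custom torch_xla build supports every such call is exactly the open question:

```python
# Minimal sketch, assuming the open-source torch-xla API; the calls below
# (xm.xla_device(), xm.mark_step()) are the device-agnostic surface that
# both TPU and a Neuron torch-xla build would be expected to share.
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()                  # resolves to whatever XLA device is present
model = torch.nn.Linear(10, 1).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)

x = torch.randn(8, 10).to(device)
loss = model(x).sum()
loss.backward()
optimizer.step()
xm.mark_step()                            # materialize the lazily traced XLA graph
```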

@chappidim (Author) commented:

Thanks for the initial review.

  1. +1 to that. I'm open to adding a new predefined resource type if aws_accelerators doesn't fit as GPU (a sketch of the custom-resource alternative is below). The only counterpoint is that, per the docs, we treat any other accelerator as GPU.
  2. AWS expects users to install a custom-built torch_xla. This implies torch_xla_neuron may not support some of the APIs, but the package follows all the API standards listed by open-source torch-xla.
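
As mentioned in item 1, here is a sketch of the custom-resource alternative using Ray's existing resources API; the "neuron_cores" label below is purely illustrative and not a name decided by this REP (a predefined num_neuron_cores resource would replace it):

```python
# Minimal sketch using Ray's generic custom-resources API.
# "neuron_cores" is an illustrative label, not a name agreed upon in this REP.
import ray

# Nodes would advertise the resource, e.g.:
#   ray start --resources='{"neuron_cores": 2}'
ray.init()

@ray.remote(resources={"neuron_cores": 1})
def train_step():
    # Neuron/torch-xla specific training code would go here.
    ...

ray.get(train_step.remote())
```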

Comment on lines +78 to +91
TorchBackend is the communication backend for TorchTrainer, and it supports a limited set of backends (nccl, gloo) today. In order to support NeuronCore, we would use the PyTorch XLA framework and configure the backend to XLA. This also requires additional configuration of torch-elastic (now called torchrun) environment variables so that the XLA devices are detected.

```python
class _TorchBackend(Backend):
    def on_start(self):
        # Support the XLA backend.
        # Configure the master env vars of the XLA device related to
        # torchrun/torch-elastic.
        ...

    def on_shutdown(self):
        # Clean up the NeuronCore cache if needed.
        ...

    def on_training_start(self):
        # Configure rank/world_size/node_rank based on the XLA device.
        ...
```
A contributor commented:

I'm a bit fuzzy on whether XLA is expected to sit at the same level in the stack as nccl and gloo, based on my knowledge of Torch backends.

As an alternative approach, would the following interface make logical sense? Is this the right layer of abstraction?

```python
class TorchXLAConfig(TorchConfig):

    @property
    def backend_cls(self):
        return _TorchXLABackend


class _TorchXLABackend(_TorchBackend):
    # XLA-specific logic here
    ...


# User-defined code
trainer = TorchTrainer(torch_config=TorchXLAConfig(...))
```

To better understand how to think about this, I'd love to learn more about how Torch XLA environments are typically set up and configured in practice. Do you have any pointers to best practices or other references I could take a look at?

@chappidim (Author) replied:

I'm a newbie to this space; as I learn more, it makes sense to have a separate XLAConfig with a Backend. Wondering if we want to be more explicit on the backend, as it can vary per XLA device? e.g. TorchXLAConfig/_TorchAwsNeuronXLABackend, where the basic setup (rank/world size, master port/addr) is already done by the Neuron SDK [1], plus anything related to torchrun [2].

I'm happy to ask some SMEs in this area, but here's the information I've gathered so far.

  1. Configure the TPU library [3][4]
  2. Configure PJRT (latest) or XRT [5]
  3. Configure world/local rank and size, and master addr/port (for torchrun), which is generic to torch (a rough sketch of this wiring follows the references below)

[1] https://sagemaker.readthedocs.io/en/v2.167.0/frameworks/pytorch/using_pytorch.html#distributed-training-with-pytorch-neuron-on-trn1-instances
[2] https://awsdocs-neuron.readthedocs-hosted.com/en/latest/frameworks/torch/torch-neuronx/tutorials/training/mlp.html?highlight=torchrun#multi-worker-data-parallel-mlp-training-using-torchrun
[3] https://github.com/pytorch/xla/blob/master/TROUBLESHOOTING.md#troubleshooting
[4] https://lightning.ai/docs/pytorch/stable/accelerators/tpu_faq.html
[5] https://github.com/pytorch/xla/blob/424d8c8eec4e7fd3af83749da2652b2fcd6d207a/configuration.yaml
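
A rough sketch of how items 1–3 above could be wired together in something like a _TorchAwsNeuronXLABackend; the env-var names come from torchrun and open-source torch-xla, while the PJRT_DEVICE value and the helper name are illustrative assumptions rather than anything specified in the REP or the Neuron SDK:

```python
# Illustrative sketch only: assumes torchrun-style env vars and the
# open-source torch-xla "xla" process-group backend; the PJRT_DEVICE value
# and the function name are assumptions, not part of the REP.
import os
import torch.distributed as dist
import torch_xla.core.xla_model as xm
import torch_xla.distributed.xla_backend  # noqa: F401  (registers the "xla" backend)


def setup_xla_worker(master_addr: str, master_port: int, rank: int, world_size: int):
    # (3) Generic torchrun/torch-elastic variables.
    os.environ.setdefault("MASTER_ADDR", master_addr)
    os.environ.setdefault("MASTER_PORT", str(master_port))
    os.environ.setdefault("RANK", str(rank))
    os.environ.setdefault("WORLD_SIZE", str(world_size))

    # (1)/(2) Runtime selection (PJRT vs. XRT) is device specific.
    os.environ.setdefault("PJRT_DEVICE", "NEURON")

    dist.init_process_group("xla", rank=rank, world_size=world_size)
    return xm.xla_device()
```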

@chappidim (Author) commented:
(Quoting @scv119's two questions above about the GPU resource name and whether all XLA devices share the same PyTorch API.)
@scv119 Checking if we have enough quorum on adding num_neuron_cores as a predefined resource? Thanks
