WIP: Support Tensorflow distributed training in kfp workflow #3000

Open
typhoonzero wants to merge 4 commits into main

Conversation

typhoonzero
Contributor

What changes were proposed in this pull request?

Support running distributed TensorFlow training in kfp workflows.

Only TensorFlow/Keras distributed training with MultiWorkerMirroredStrategy is supported. PyTorch support will be added in follow-up PRs. Distributed training using parameter servers is not supported for now.
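
For context, a training script launched by such a step would typically look like the minimal sketch below. The model, data, and worker port are illustrative and not part of this PR; with MultiWorkerMirroredStrategy, TF_CONFIG is expected to be injected by the runtime.

```python
import os

import tensorflow as tf

# In a real multi-worker run, TF_CONFIG describes the cluster and this
# worker's index; the workflow step is expected to inject it.
# Illustrative value:
# {"cluster": {"worker": ["10.0.0.1:2222", "10.0.0.2:2222"]},
#  "task": {"type": "worker", "index": 0}}
print("TF_CONFIG:", os.environ.get("TF_CONFIG"))

strategy = tf.distribute.MultiWorkerMirroredStrategy()

with strategy.scope():
    # Variables created inside the scope are replicated across workers.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(10,)),
        tf.keras.layers.Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")

# Dummy data stands in for a real, sharded dataset.
x = tf.random.normal((1024, 10))
y = tf.random.normal((1024, 1))
model.fit(x, y, epochs=2, batch_size=64)
```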

TODO:

  • Add dependency steps before or after the distributed training step.
  • PyTorch support.

How to use:

  1. Set the number of workers for the training step:

     [Screenshot 2022-11-07 11:05:24]

  2. Run the workflow:

     [Screenshot 2022-11-07 11:06:18]

How was this pull request tested?

TBD.

@elyra-bot

elyra-bot bot commented Nov 7, 2022

Thanks for making a pull request to Elyra!

To try out this branch on binder, follow this link: Binder

@akchinSTC added the status:Work in Progress, component:pipeline-editor, and platform: pipeline-Kubeflow labels on Nov 7, 2022
@akchinSTC
Member

Thanks @typhoonzero for another contribution!
A few thoughts:

  1. After building the PR, I can't seem to find the new fields in the UI (could just be my env, but I did a purge and fresh build).
     [screenshot]
  2. It looks like Argo-specific metadata labels are hardcoded; we will need to support kfp-tekton as well. Thanks @ptitzler.
  3. Lastly, these changes are very specific to TF. Displaying them dynamically will probably require describing which libraries the images contain in the metadata, which opens up a big can of worms. Maybe use an ENV var as a flag instead to trigger displaying these extra options... shrug... will definitely require more discussion.

@typhoonzero
Contributor Author

@akchinSTC Thanks for the information. Actually, I'm looking for a more generic implementation, like a workflow ParallelFor function, that covers not only distributed data-parallel training but also parallel data processing.

By setting a node as a parallel-for node (parallel count > 1), Elyra should pass the following environment variables to the user program:

  • rank
  • nranks
  • the runtime pod IP address of each rank

The user program can then set either TF_CONFIG for TensorFlow distributed training or MASTER_ADDR for PyTorch distributed training. In that case, the changes would roughly be:

  1. A parallel-count property on each node.
  2. Connecting inputs and outputs to dependency nodes.
  3. bootstrapper.py sets those environment variables at runtime.
  4. Examples for TensorFlow, PyTorch, and general data processing.
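
To make the TF_CONFIG / MASTER_ADDR point concrete, here is a minimal sketch of how a user program could consume those injected values. The variable names RANK, NRANKS, and POD_IPS are assumptions (the PR has not fixed the names yet), and the ports 2222 and 29500 are likewise illustrative.

```python
import json
import os

# Assumed names for the values listed above (rank, nranks, per-rank pod IPs);
# the actual names injected by bootstrapper.py are still to be decided.
rank = int(os.environ["RANK"])
nranks = int(os.environ["NRANKS"])
pod_ips = os.environ["POD_IPS"].split(",")  # one IP per rank, comma-separated

# TensorFlow: assemble TF_CONFIG for MultiWorkerMirroredStrategy.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": [f"{ip}:2222" for ip in pod_ips]},
    "task": {"type": "worker", "index": rank},
})

# PyTorch: point torch.distributed at the rank-0 pod.
os.environ["MASTER_ADDR"] = pod_ips[0]
os.environ["MASTER_PORT"] = "29500"
os.environ["WORLD_SIZE"] = str(nranks)
os.environ["RANK"] = str(rank)
```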
