[REQUEST] Launcher mode with SSH bypass #5510

Open
dogacancolak-kensho opened this issue May 8, 2024 · 1 comment
Labels
enhancement New feature or request

Comments

@dogacancolak-kensho

Is your feature request related to a problem? Please describe.
As previously mentioned in #2679, the existing launching mechanism requires passwordless SSH. We would prefer to avoid this at Kensho Technologies, because our current multi-node training framework uses a launching mechanism similar to torchrun.

Instead of having a launcher node SSH the command to the workers, torchrun works by giving each worker a master address/port and a node rank. By bypassing SSH and invoking deepspeed directly on each node, as torchrun does, we can seamlessly integrate DeepSpeed into our existing setup instead of maintaining two different launching topologies.
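To make the torchrun-style topology concrete, here is a minimal sketch of how each worker process could derive its rendezvous settings from nothing more than a master address/port and its node rank. The function name `worker_env` and its parameters are hypothetical and are not part of DeepSpeed or torchrun; the sketch also assumes every node runs the same number of processes. The environment variable names are the ones torchrun conventionally exports.

```python
from typing import Dict


def worker_env(master_addr: str, master_port: int, node_rank: int,
               num_nodes: int, procs_per_node: int, local_rank: int) -> Dict[str, str]:
    """Build the rendezvous environment for one process (hypothetical helper).

    Assumes a homogeneous cluster: every node runs `procs_per_node` processes.
    """
    if not 0 <= node_rank < num_nodes:
        raise ValueError(f"node_rank {node_rank} outside [0, {num_nodes})")
    if not 0 <= local_rank < procs_per_node:
        raise ValueError(f"local_rank {local_rank} outside [0, {procs_per_node})")
    return {
        "MASTER_ADDR": master_addr,
        "MASTER_PORT": str(master_port),
        # Total number of processes across all nodes.
        "WORLD_SIZE": str(num_nodes * procs_per_node),
        # Global rank: processes on earlier nodes come first.
        "RANK": str(node_rank * procs_per_node + local_rank),
        "LOCAL_RANK": str(local_rank),
    }


if __name__ == "__main__":
    # Example: 4th process (local_rank=3) on the second of two 8-GPU nodes.
    print(worker_env("10.0.0.1", 29500, node_rank=1, num_nodes=2,
                     procs_per_node=8, local_rank=3))
```

Because every node computes these values locally, no node needs to reach the others over SSH; the only shared inputs are the master address/port and the per-node rank.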

Describe the solution you'd like
In a private fork of DeepSpeed, we were able to get training working without using SSH. To do this, we added a --no_ssh flag to the launcher-runner; this flag also requires a --node_rank flag to be provided.

Then, in the runner, the command is run as if multi_node_exec were disabled. We have verified that this method works.
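As a rough illustration of the flag dependency described above, the sketch below uses plain `argparse`. It is not the code from our fork or from the DeepSpeed runner: the helper names `parse_launch_args` and `should_exec_over_ssh`, and the `-1` default for `--node_rank`, are assumptions made for the example. Only the `--no_ssh` and `--node_rank` flag names and the `multi_node_exec` idea come from the proposal itself.

```python
import argparse
from typing import List, Optional


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Hypothetical launcher sketch")
    parser.add_argument("--no_ssh", action="store_true",
                        help="Run the command on this node only; do not SSH to workers.")
    parser.add_argument("--node_rank", type=int, default=-1,
                        help="Rank of this node; required together with --no_ssh.")
    return parser


def parse_launch_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    parser = build_parser()
    args = parser.parse_args(argv)
    # --no_ssh only makes sense if the node knows its own rank.
    if args.no_ssh and args.node_rank < 0:
        parser.error("--no_ssh requires --node_rank to be set")
    return args


def should_exec_over_ssh(args: argparse.Namespace, multi_node_exec: bool) -> bool:
    """With --no_ssh, behave as if multi-node execution were disabled."""
    return multi_node_exec and not args.no_ssh


if __name__ == "__main__":
    args = parse_launch_args(["--no_ssh", "--node_rank", "1"])
    print(should_exec_over_ssh(args, multi_node_exec=True))
```

With this shape, the same command line can be started independently on every node, differing only in the value passed to `--node_rank`.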

Describe alternatives you've considered
As mentioned, we considered setting up two topologies based on the framework used. For example, GPT-NeoX uses the deepspeed launcher, so it would require the SSH setup. MosaicML's llm-foundry, however, works by running the command independently on each worker (similar to torchrun). We didn't want to maintain two architectures depending on which framework is used for training.

Additional context
If deemed useful by the project maintainers, we can make a PR, with S&P Global/Kensho Technologies as the contributing entity.

dogacancolak-kensho added the enhancement (New feature or request) label May 8, 2024
@tjruwase
Contributor

tjruwase commented May 8, 2024

@dogacancolak-kensho, thanks for offering a PR for this useful enhancement. Please submit the PR at your convenience. Thanks!
