Is there any plan to support deepspeed job plugin for distributed training? #3440

Closed
rockburning opened this issue Apr 25, 2024 · 7 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature.

Comments

@rockburning

DeepSpeed is now very popular for distributed training in AI scenarios. I hope Volcano can support it to enhance its capabilities.
Thanks.

@rockburning rockburning added the kind/feature Categorizes issue or PR as related to a new feature. label Apr 25, 2024
@hwdef
Member

hwdef commented Apr 25, 2024

Can you describe the specific scenario?

@rockburning
Author

rockburning commented Apr 25, 2024

> Can you describe the specific scenario?

Training on multiple nodes using DeepSpeed. In this case, two conditions need to be met:
1. Passwordless SSH between pods (the ssh plugin can probably be used for this).
2. A hostfile, since DeepSpeed needs one to be specified with --hostfile (the svc plugin, which creates a headless svc, can probably be used for this).
So the question is: I need to know the worker pods' names, generate a hostfile from them, and mount it into the pods.

I want to use the DeepSpeed framework to accelerate the training of my PyTorch job, but it seems Volcano doesn't support DeepSpeed directly, as DeepSpeed needs a hostfile listing the hosts of the job. Is there any solution that can use MPI directly, without a dedicated plugin? You can refer to https://www.deepspeed.ai/getting-started/, chapter: DeepSpeed Resource Configuration (multi-node).
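
For reference, here is a minimal sketch of the hostfile format described in the DeepSpeed guide linked above: one line per node, giving a hostname reachable over passwordless SSH and a slots count (the number of GPUs on that node). The hostnames myjob-worker-0 and myjob-worker-1 and the script name train.py are hypothetical placeholders, not names that Volcano or DeepSpeed will produce for you.

```shell
#!/bin/bash
# Write an example hostfile to a temporary location.
# ASSUMPTION: the hostnames below are made up; use your worker pods' names.
hostfile="$(mktemp)"
cat > "$hostfile" <<'EOF'
myjob-worker-0 slots=8
myjob-worker-1 slots=8
EOF

# The launch command is only printed here, not executed.
# ASSUMPTION: train.py stands in for your own training script.
echo "deepspeed --hostfile=$hostfile train.py"
```

The deepspeed launcher reads the file passed with --hostfile and starts processes on each listed host over SSH, which is why both conditions above are needed.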

@GitEasonXu

@rockburning

May I ask if your attempt was successful?

@rockburning
Author

> @rockburning
> May I ask if your attempt was successful?

Yes. Just use the svc plugin and rely on the DNS records of the headless svc.

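Here is sample shell code that collects all the hosts into a DeepSpeed hostfile:

# slots (GPUs) per host; taken from the first argument, default 8
slot_value="${1:-8}"

# start from an empty hostfile
: > /etc/deepspeed-hostfile
for file in /etc/volcano/*.host; do
    # each *.host file is expected to list one hostname per line
    while IFS= read -r host || [ -n "$host" ]; do
        # DeepSpeed expects "<hostname> slots=<n>" on each line
        [ -n "$host" ] && echo "$host slots=$slot_value" >> /etc/deepspeed-hostfile
    done < "$file"
done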
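
To try the idea outside a pod, the same logic can be wrapped in a function that takes the host directory and output path as arguments. This is only a sketch: build_hostfile is a hypothetical helper name (not part of Volcano or DeepSpeed), the job-master-0 / job-worker-N hostnames are invented for the dry run, and it assumes each *.host file lists one hostname per line. It writes one `<hostname> slots=<n>` line per host, which is the format the DeepSpeed getting-started guide describes.

```shell
#!/bin/bash
# Sketch: build a DeepSpeed hostfile from a directory of *.host files.
# ASSUMPTION: build_hostfile is a hypothetical helper, not a Volcano/DeepSpeed tool.
build_hostfile() {
    local host_dir="$1" out="$2" slots="${3:-8}"
    local file host
    : > "$out"                          # start with an empty output file
    for file in "$host_dir"/*.host; do
        [ -e "$file" ] || continue      # skip if the glob matched nothing
        # read each hostname, including a last line without a trailing newline
        while IFS= read -r host || [ -n "$host" ]; do
            if [ -n "$host" ]; then
                echo "$host slots=$slots" >> "$out"
            fi
        done < "$file"
    done
}

# Dry run against fake host files (hostnames are invented for illustration).
demo_dir="$(mktemp -d)"
printf 'job-master-0\n' > "$demo_dir/master.host"
printf 'job-worker-0\njob-worker-1\n' > "$demo_dir/worker.host"
build_hostfile "$demo_dir" "$demo_dir/hostfile" 4
cat "$demo_dir/hostfile"
```

Inside the pod this would be called as `build_hostfile /etc/volcano /etc/deepspeed-hostfile 8`, and the result passed to the launcher with `--hostfile=/etc/deepspeed-hostfile`.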

