This repository has been archived by the owner on May 28, 2024. It is now read-only.

[BUG] workers do not launch on g5.12xlarges for the latest image 0.5.0. #125

Open

JGSweets opened this issue Jan 22, 2024 · 6 comments

@JGSweets
I'm stuck in a repeating deployment loop when using the image anyscale/ray-llm:latest on a g5.12xlarge instance. The worker never seems to connect back to the head node, which leads me to believe the Docker image fails during deployment, but I didn't see any error logs reported to the head node while this was happening.

This causes the cluster to repeatedly deploy and shut down workers. Possibly due to the CUDA updates, but I'm not 100% sure.

anyscale/ray-llm:0.4.0 launches as expected with no configuration changes.
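For reference, a minimal sketch of pinning the working tag in the cluster config, assuming the standard Ray cluster launcher YAML (the `docker` section and its field names are the launcher's; the tag is the known-good version mentioned above):

```yaml
# Sketch: pin an explicit image tag in the `ray up` cluster config instead
# of :latest, so deployments don't silently pick up the broken 0.5.0 image.
docker:
  image: "anyscale/ray-llm:0.4.0"  # known-good tag; :latest currently resolves to 0.5.0
  container_name: "ray_container"
```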

@sihanwang41
Collaborator

Hi, please provide repro steps if possible, so that our team can help take a look!

@JGSweets
Author

  1. Update the cluster config to match the requirements of my AWS environment:
    • security groups (SGs)
    • region
    • updated gpu_worker_g5 to include CPU and GPU resource values (see the sketch after this list)
  2. Deploy via ray up.
  3. Attach via ray attach.
  4. Run rayllm run --model models/continuous_batching/amazon--LightGPT.yaml.
    • The deployment loops continuously.
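For concreteness, a sketch of the kind of edit made in step 1, assuming the stock AWS cluster YAML that ships with ray-llm (the gpu_worker_g5 node-type name comes from that config; min/max worker counts are illustrative; the resource values match a g5.12xlarge, which has 48 vCPUs and 4 A10G GPUs):

```yaml
# Illustrative sketch of the available_node_types entry edited in step 1.
available_node_types:
  gpu_worker_g5:
    node_config:
      InstanceType: g5.12xlarge
    # g5.12xlarge: 48 vCPUs, 4x NVIDIA A10G GPUs
    resources:
      CPU: 48
      GPU: 4
    min_workers: 0
    max_workers: 1
```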

@JGSweets
Author

I don't believe the AMI has drivers installed that support CUDA 12. Could that be the issue?
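One way to check this theory, assuming shell access to a node (the driver threshold below is my understanding: CUDA 12.x generally requires NVIDIA driver >= 525):

```bash
# Sketch: check whether the node's host driver supports CUDA 12.
# nvidia-smi reports the installed driver version and, in its banner,
# the highest CUDA version that driver supports.
nvidia-smi --query-gpu=driver_version --format=csv,noheader
nvidia-smi | head -n 4   # banner line shows "CUDA Version: X.Y"
# An older driver baked into the AMI would be consistent with the 0.5.0
# image failing while the 0.4.0 image still works.
```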

alanwguo pushed a commit that referenced this issue Jan 25, 2024
@JGSweets
Author

JGSweets commented Feb 2, 2024

@sihanwang41 any update on investigating this issue?

@JGSweets
Author

JGSweets commented Feb 6, 2024

FWIW, ray-llm is not deployable in its current state on images >= 0.5.0. This is not limited to g5.12xlarge instances.

@SamComber

+1 on this. I'm having to use 0.4.0; otherwise the deployment gets stuck in a DEPLOYING loop with 0.5.0. @JGSweets, thanks for your comment, it got me up and running.
