[Usage]: distributed inference with kuberay #4865
Comments
What type of distributed inference do you plan to do? Is it model parallel or data parallel?
I just want to provide online API serving based on a large LLM like Qwen1.5-110B-Chat.

2. I created a YAML file containing a `rayClusterConfig:` section.
3. I created a head node and a worker node with KubeRay, using the image I built, and ran this command on the head node: `python -m vllm.entrypoints.openai.api_server` (a hedged launch sketch follows this list).
4. I ran a benchmark script: `python benchmarks/benchmark_serving.py`

I watched the Ray cluster's dashboard and found that read/write throughput reaches up to 1.2 GB/s, but it does not use the InfiniBand network bandwidth. So my plan is to use multiple nodes for distributed inference of large models, serve an OpenAI-compatible API, and use the high-speed InfiniBand network for communication between the nodes.
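A minimal sketch of steps 3 and 4, assuming 2 nodes with 8 L40s each and the model sharded across all 16 GPUs with tensor parallelism; the model name comes from this thread, and exact flags vary by vLLM version:

```bash
# Launch the OpenAI-compatible server on the Ray head node. With a
# tensor-parallel size larger than one node's GPU count, vLLM schedules
# the remaining workers onto the other node through the existing Ray cluster.
python -m vllm.entrypoints.openai.api_server \
    --model Qwen/Qwen1.5-110B-Chat \
    --tensor-parallel-size 16 \
    --port 8000

# Benchmark from a client machine; in vLLM 0.4.x a --dataset argument
# pointing at a ShareGPT-style JSON is typically also required.
python benchmarks/benchmark_serving.py \
    --backend vllm \
    --model Qwen/Qwen1.5-110B-Chat \
    --host <head-node-ip> --port 8000
```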
I have similar use cases. I tested this in a DGX cluster, deliberately spreading the falcon-180b model across multiple nodes, and saw that read/write throughput per node is about 2-3 GB/s.
Your current environment
KubeRay, vLLM 0.4.0
2 L40 GPU servers, each with 8 L40 GPUs and 2 ConnectX-6 IB cards (200 Gb/s each)
How would you like to use vllm
I plan to use KubeRay to implement multi-node distributed inference based on the vLLM framework. In the current environment, each GPU server node is interconnected with an IB network. How can I achieve RDMA between multiple nodes?
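vLLM's tensor-parallel traffic goes through NCCL, which picks its transport from environment variables, so a hedged starting point is to set NCCL's InfiniBand knobs on every Ray pod. The HCA prefix `mlx5` and the interface name `eth0` below are assumptions (check `ibv_devices` on your nodes), and inside KubeRay the pods must also have the IB devices exposed to the container, e.g. via host networking or an RDMA device plugin:

```bash
# Sketch: environment for the head and worker pods so NCCL uses IB/RDMA.
export NCCL_IB_DISABLE=0        # permit the InfiniBand transport
export NCCL_IB_HCA=mlx5         # match the ConnectX-6 HCAs (mlx5_0, mlx5_1, ...)
export NCCL_SOCKET_IFNAME=eth0  # interface for NCCL's bootstrap (TCP) traffic
export NCCL_DEBUG=INFO          # startup logs should then report NET/IB, not NET/Socket
```

If the dashboard still shows only Ethernet-level throughput after this, a common cause is that the pods cannot see `/dev/infiniband`, which silently keeps NCCL on the socket transport.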