
[Usage]: distributed inference with kuberay #4865
Open
hetian127 opened this issue May 16, 2024 · 3 comments
Labels
usage How to use vllm

Comments

@hetian127

Your current environment

KubeRay, vLLM 0.4.0
2x L40 GPU servers, each with 8 L40 GPUs and 2x 200G ConnectX-6 (CX6) InfiniBand cards

How would you like to use vllm

I plan to use KubeRay to run multi-node distributed inference on the vLLM framework. In the current environment, the GPU server nodes are interconnected over an InfiniBand network. How can I enable RDMA communication between the nodes?

hetian127 added the usage label on May 16, 2024
@richardliaw
Collaborator

What type of distributed inferencing do you plan to do? Is it model parallel or data parallel?

@hetian127
Author

I just want to use online API serving with an LLM like Qwen1.5-110B-Chat.
My main steps are as follows:
1. I built a Docker image that includes the OFED driver; inside the container, "ibstat" can show my 200G InfiniBand cards.
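
For example, a check of this sort inside the running container (assuming infiniband-diags / rdma-core are installed in the image; the device name mlx5_0 is only an assumption):

    # List the HCAs and confirm the 200G ports are in the "Active" state
    ibstat

    # More detailed view of a single HCA (device name is an assumption; check the ibstat output)
    ibv_devinfo -d mlx5_0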

2. I created a YAML file like:

rayClusterConfig:
  rayVersion: '2.9.0' # should match the Ray version in the image of the containers
  ######################headGroupSpecs#################################
  # Ray head pod template.
  headGroupSpec:
    # The rayStartParams are used to configure the ray start command.
    # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of rayStartParams in KubeRay.
    # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in rayStartParams.
    rayStartParams:
      dashboard-host: '0.0.0.0'
    # pod template
    template:
      spec:
        containers:
          - name: ray-head
            image: repo:5000/harbor/rayvllm:v3
            resources:
              limits:
                nvidia.com/gpu: 8
                cpu: 8
                memory: 64Gi
              requests:
                nvidia.com/gpu: 8
                cpu: 8
                memory: 64Gi
            volumeMounts:
              - name: share
                mountPath: "/share"
              - name: shm
                mountPath: "/dev/shm"
            ports:
              - containerPort: 6379
                name: gcs-server
              - containerPort: 8265 # Ray dashboard
                name: dashboard
              - containerPort: 10001
                name: client
              - containerPort: 8000
                name: serve
            env:
              - name: USE_RDMA
                value: "true"
        volumes:
          - name: share
            hostPath:
              path: "/share"
              type: Directory
          - name: shm
            emptyDir:
              medium: Memory
              sizeLimit: "64Gi"
  workerGroupSpecs:
    # the pod replicas in this group typed worker
    - replicas: 1
      minReplicas: 1
      maxReplicas: 5
      # logical group name, for this called small-group, also can be functional
      groupName: small-group
      # The rayStartParams are used to configure the ray start command.
      # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of rayStartParams in KubeRay.
      # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in rayStartParams.
      rayStartParams: {}
      # pod template
      template:
        spec:
          containers:
            - name: ray-worker # must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name', or '123-abc')
              image: repo:5000/harbor/rayvllm:v3
              lifecycle:
                preStop:
                  exec:
                    command: ["/bin/sh","-c","ray stop"]
              resources:
                limits:
                  nvidia.com/gpu: 8
                  cpu: "8"
                  memory: "64Gi"
                requests:
                  nvidia.com/gpu: 8
                  cpu: "8"
                  memory: "64Gi"
              volumeMounts:
                - name: share
                  mountPath: "/share"
                - name: shm
                  mountPath: "/dev/shm"
              env:
                - name: USE_RDMA
                  value: "true"
          volumes:
            - name: share
              hostPath:
                path: "/share"
                type: Directory
            - name: shm
              emptyDir:
                medium: Memory
                sizeLimit: "64Gi"
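
For reference, a minimal way to bring the cluster up from this manifest and sanity-check the pods (the file name raycluster.yaml and the pod name placeholder are assumptions; this presumes the KubeRay operator is already installed):

    # Create the Ray cluster and wait for the head and worker pods to be Running
    kubectl apply -f raycluster.yaml
    kubectl get pods -o wide

    # Check whether the IB device is visible inside a Ray pod; if the HCA has not been
    # exposed to the pod (e.g. via host networking or an RDMA device plugin), ibstat
    # will not show it even though USE_RDMA is set
    kubectl exec -it <ray-head-pod> -- ibstat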

3. I created a head node and a worker node with KubeRay using the image I built, and ran the following command on the head node:

python -m vllm.entrypoints.openai.api_server \
    --model /path/Qwen1.5-110B-Chat \
    --tensor-parallel-size 16 \
    --host 0.0.0.0 \
    --trust-remote-code \
    --port 8000 \
    --worker-use-ray
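
One way to check (and steer) whether NCCL is actually using the IB fabric is to set NCCL's standard environment variables before launching the server, either in the shell or under env: in the pod specs above; a sketch, where the HCA and interface names are assumptions for this setup:

    # Print NCCL's transport selection (look for IB vs. plain socket transport in the logs)
    export NCCL_DEBUG=INFO
    # Restrict NCCL to the ConnectX-6 HCAs (device names are assumptions; check ibstat)
    export NCCL_IB_HCA=mlx5_0,mlx5_1
    # Interface for NCCL's bootstrap/fallback socket traffic (interface name is an assumption)
    export NCCL_SOCKET_IFNAME=eth0

    # then launch the api_server command above in the same shell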

4. I ran a benchmark script like:

python benchmarks/benchmark_serving.py \
    --backend vllm \
    --model /path/Qwen1.5-110B-Chat \
    --dataset benchmarks/ShareGPT_V3_unfiltered_cleaned_split.json \
    --request-rate 5 \
    --num-prompts 100 \
    --host xxxx \
    --port 8000 \
    --trust-remote-code
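
To see whether the benchmark traffic is actually flowing over InfiniBand rather than the Ethernet network, the IB port counters can be watched on one of the hosts while the benchmark runs; a sketch (device name and port number are assumptions, and port_xmit_data is reported in 4-byte units):

    # A steadily growing counter during the run indicates traffic on the IB fabric
    # (multiply the delta by 4 to get bytes)
    watch -n 1 cat /sys/class/infiniband/mlx5_0/ports/1/counters/port_xmit_data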

On the Ray cluster's dashboard I observed that the read/write throughput reaches up to 1.2 GB/s, but the InfiniBand network bandwidth is not being utilized.

In short, I want to run distributed inference for large models across multiple nodes, serve them through an OpenAI API server, and use the high-speed InfiniBand network for communication between the nodes.

@xiphl

xiphl commented May 20, 2024

I have a similar use case. I tested it on a DGX cluster, deliberately spreading the Falcon-180B model across multiple nodes (and saw that the read/write per node is about 2-3 GB/s).
I didn't set USE_RDMA though.
