
[Usage]: distributed inference with kuberay #4865
Open
hetian127 opened this issue May 16, 2024 · 3 comments
Labels
usage How to use vllm

Comments

@hetian127

Your current environment

KubeRay, vLLM 0.4.0
2x L40 GPU servers, each with 8 L40 GPUs and 2x 200G ConnectX-6 (CX6) InfiniBand cards

How would you like to use vllm

I plan to use KubeRay to run multi-node distributed inference on the vLLM framework. In the current environment, the GPU server nodes are interconnected over an InfiniBand network. How can I enable RDMA communication between the nodes?

hetian127 added the usage label on May 16, 2024
@richardliaw
Collaborator

What type of distributed inferencing do you plan to do? Is it model parallel or data parallel?

@hetian127
Author

I just want to use online API serving with an LLM like Qwen1.5-110B-Chat.
My main steps are as follows:
1. I built a Docker image that includes the OFED driver; inside the container, "ibstat" can show my 200G InfiniBand cards.
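
For example, a check of this sort inside the running container (assuming infiniband-diags / rdma-core are installed in the image; the device name mlx5_0 is only an assumption):

    # List the HCAs and confirm the 200G ports are in the "Active" state
    ibstat

    # More detailed view of a single HCA (device name is an assumption; check the ibstat output)
    ibv_devinfo -d mlx5_0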

2. I created a YAML file like:

rayClusterConfig:
  rayVersion: '2.9.0' # should match the Ray version in the image of the containers
  ######################headGroupSpecs#################################
  # Ray head pod template.
  headGroupSpec:
    # The rayStartParams are used to configure the ray start command.
    # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of rayStartParams in KubeRay.
    # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in rayStartParams.
    rayStartParams:
      dashboard-host: '0.0.0.0'
    # pod template
    template:
      spec:
        containers:
          - name: ray-head
            image: repo:5000/harbor/rayvllm:v3
            resources:
              limits:
                nvidia.com/gpu: 8
                cpu: 8
                memory: 64Gi
              requests:
                nvidia.com/gpu: 8
                cpu: 8
                memory: 64Gi
            volumeMounts:
              - name: share
                mountPath: "/share"
              - name: shm
                mountPath: "/dev/shm"
            ports:
              - containerPort: 6379
                name: gcs-server
              - containerPort: 8265 # Ray dashboard
                name: dashboard
              - containerPort: 10001
                name: client
              - containerPort: 8000
                name: serve
            env:
              - name: USE_RDMA
                value: "true"
        volumes:
          - name: share
            hostPath:
              path: "/share"
              type: Directory
          - name: shm
            emptyDir:
              medium: Memory
              sizeLimit: "64Gi"
  workerGroupSpecs:
    # the pod replicas in this group typed worker
    - replicas: 1
      minReplicas: 1
      maxReplicas: 5
      # logical group name, for this called small-group, also can be functional
      groupName: small-group
      # The rayStartParams are used to configure the ray start command.
      # See https://github.com/ray-project/kuberay/blob/master/docs/guidance/rayStartParams.md for the default settings of rayStartParams in KubeRay.
      # See https://docs.ray.io/en/latest/cluster/cli.html#ray-start for all available options in rayStartParams.
      rayStartParams: {}
      # pod template
      template:
        spec:
          containers:
            - name: ray-worker # must consist of lower case alphanumeric characters or '-', and must start and end with an alphanumeric character (e.g. 'my-name', or '123-abc')
              image: repo:5000/harbor/rayvllm:v3
              lifecycle:
                preStop:
                  exec:
                    command: ["/bin/sh","-c","ray stop"]
              resources:
                limits:
                  nvidia.com/gpu: 8
                  cpu: "8"
                  memory: "64Gi"
                requests:
                  nvidia.com/gpu: 8
                  cpu: "8"
                  memory: "64Gi"
              volumeMounts:
                - name: share
                  mountPath: "/share"
                - name: shm
                  mountPath: "/dev/shm"
              env:
                - name: USE_RDMA
                  value: "true"
          volumes:
            - name: share
              hostPath:
                path: "/share"
                type: Directory
            - name: shm
              emptyDir:
                medium: Memory
                sizeLimit: "64Gi"
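
For reference, a minimal way to bring the cluster up from this manifest and sanity-check the pods (the file name raycluster.yaml and the pod name placeholder are assumptions; this presumes the KubeRay operator is already installed):

    # Create the Ray cluster and wait for the head and worker pods to be Running
    kubectl apply -f raycluster.yaml
    kubectl get pods -o wide

    # Check whether the IB device is visible inside a Ray pod; if the HCA has not been
    # exposed to the pod (e.g. via host networking or an RDMA device plugin), ibstat
    # will not show it even though USE_RDMA is set
    kubectl exec -it <ray-head-pod> -- ibstat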

3. I created a head node and a worker node with KubeRay using the image I built, and ran the following command on the head node:

python -m vllm.entrypoints.openai.api_server \
    --model /path/Qwen1.5-110B-Chat \
    --tensor-parallel-size 16 \
    --host 0.0.0.0 \
    --trust-remote-code \
    --port 8000 \
    --worker-use-ray
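
One way to check (and steer) whether NCCL is actually using the IB fabric is to set NCCL's standard environment variables before launching the server, either in the shell or under env: in the pod specs above; a sketch, where the HCA and interface names are assumptions for this setup:

    # Print NCCL's transport selection (look for IB vs. plain socket transport in the logs)
    export NCCL_DEBUG=INFO
    # Restrict NCCL to the ConnectX-6 HCAs (device names are assumptions; check ibstat)
    export NCCL_IB_HCA=mlx5_0,mlx5_1
    # Interface for NCCL's bootstrap/fallback socket traffic (interface name is an assumption)
    export NCCL_SOCKET_IFNAME=eth0

    # then launch the api_server command above in the same shell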

4. I ran a benchmark script like:

python benchmarks/benchmark_serving.py \
    --backend vllm \
    --model /path/Qwen1.5-110B-Chat \
    --dataset benchmarks/ShareGPT_V3_unfiltered_cleaned_split.json \
    --request-rate 5 \
    --num-prompts 100 \
    --host xxxx \
    --port 8000 \
    --trust-remote-code
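
To see whether the benchmark traffic is actually flowing over InfiniBand rather than the Ethernet network, the IB port counters can be watched on one of the hosts while the benchmark runs; a sketch (device name and port number are assumptions, and port_xmit_data is reported in 4-byte units):

    # A steadily growing counter during the run indicates traffic on the IB fabric
    # (multiply the delta by 4 to get bytes)
    watch -n 1 cat /sys/class/infiniband/mlx5_0/ports/1/counters/port_xmit_data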

On the Ray cluster's dashboard I observed that the read/write throughput reaches up to 1.2 GB/s, but the InfiniBand network bandwidth is not being utilized.

In short, I want to run distributed inference for large models across multiple nodes, serve them through an OpenAI API server, and use the high-speed InfiniBand network for communication between the nodes.

@xiphl

xiphl commented May 20, 2024

I have a similar use case. I tested it on a DGX cluster, deliberately spreading the Falcon-180B model across multiple nodes (and saw that the read/write per node is about 2-3 GB/s).
I didn't set USE_RDMA though.
