
[BUG] Memory leak when benchmarking multiple generative models with multiple GPUs #413

Open
saattrupdan opened this issue Apr 23, 2024 · 0 comments
Labels
bug Something isn't working

Comments

@saattrupdan
Member

🐛 Describe the bug

When benchmarking multiple generative models with multiple GPUs, we use the Ray cluster that vLLM spins up under the hood, and between models we call ray.shutdown() to shut the cluster down before starting a new one for the next model. This works, but only one of the GPUs has its cache reset, so we run into an OOM error when we try to benchmark the next model.
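For context, the per-model flow is roughly as sketched below. This is an illustrative reconstruction rather than the actual ScandEval internals; the model IDs, prompt and sampling parameters are placeholders, and the teardown at the end is the step that does not fully release GPU memory:

```python
# Rough sketch of the current per-model flow (illustrative, not ScandEval's code).
import ray
from vllm import LLM, SamplingParams

for model_id in ["mhenrichsen/danskgpt-tiny", "mhenrichsen/danskgpt-tiny-chat"]:
    # With tensor_parallel_size > 1, vLLM starts a Ray cluster behind the scenes.
    llm = LLM(model=model_id, tensor_parallel_size=2)
    outputs = llm.generate(["Hej verden"], SamplingParams(max_tokens=32))

    # Tear the cluster down before loading the next model. In practice only one
    # of the GPUs releases its cache here, so the next LLM(...) call OOMs.
    del llm
    ray.shutdown()
```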

Minimal example in a multi-GPU setup:

scandeval -l da -l sentiment-classification -m mhenrichsen/danskgpt-tiny -m mhenrichsen/danskgpt-tiny-chat

Relevant vLLM issue: vllm-project/vllm#4241

We should thus either fix this memory leak or somehow reuse the same Ray cluster for the new model, without shutting it down at all.
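If we stick with shutting the cluster down between models, one direction worth trying is a more aggressive cleanup step. The sketch below is an untested assumption rather than a confirmed fix: the destroy_model_parallel import path differs between vLLM versions, and the function/parameter names here are only illustrative.

```python
# Rough sketch of a more aggressive per-model cleanup (untested assumption).
import contextlib
import gc

import ray
import torch


def cleanup_after_model(llm) -> None:
    """Try to release all GPU memory held by a finished vLLM model."""
    # Tear down vLLM's tensor-parallel process groups, if the helper exists
    # in the installed vLLM version (the import path has moved over time).
    with contextlib.suppress(ImportError):
        from vllm.distributed.parallel_state import destroy_model_parallel

        destroy_model_parallel()

    # Drop the Python-side reference and force a garbage collection pass.
    del llm
    gc.collect()

    # Release cached CUDA blocks on the current device. Note this only affects
    # the driver process, not the Ray workers on the other GPUs, which is
    # presumably where the leak reported above lives.
    torch.cuda.empty_cache()

    # Finally, shut down the Ray cluster so the next model can start fresh.
    ray.shutdown()
```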

Operating System

Linux

Device

CUDA GPU

Python version

3.11.x

ScandEval version

12.7.0

@saattrupdan saattrupdan added the bug Something isn't working label Apr 23, 2024