🐛 Describe the bug

When benchmarking multiple generative models with multiple GPUs, we use the underlying Ray cluster in vLLM, and between each model we call `ray.shutdown()` to shut down the cluster and open a new one for the next model. This works, but only one of the GPUs has its cache reset, meaning that we encounter an OOM error when we try to benchmark the next model.

Minimal example in a multi-GPU setup:

scandeval -l da -l sentiment-classification -m mhenrichsen/danskgpt-tiny -m mhenrichsen/danskgpt-tiny-chat
Relevant vLLM issue: vllm-project/vllm#4241
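For reference, the control flow that triggers this can be sketched as follows. This is a minimal, GPU-free sketch: `shutdown_cluster` stands in for `ray.shutdown`, and `run_benchmark` stands in for loading the model in vLLM and evaluating it; all names are illustrative, not ScandEval's actual API.

```python
import gc
from typing import Callable, Iterable


def benchmark_on_fresh_clusters(
    model_ids: Iterable[str],
    run_benchmark: Callable[[str], float],
    shutdown_cluster: Callable[[], None],
) -> list[float]:
    """Benchmark each model, tearing the Ray cluster down in between.

    `shutdown_cluster` is injected so the control flow runs without GPUs;
    in the real code it would be `ray.shutdown`. A fresh cluster is then
    started implicitly when the next model is loaded.
    """
    results = []
    for model_id in model_ids:
        results.append(run_benchmark(model_id))
        gc.collect()        # drop Python references to the old model
        shutdown_cluster()  # ray.shutdown() in the real code; this should
                            # free the cache on all GPUs, but only one is reset
    return results
```

The bug is that the `shutdown_cluster()` step only releases the cache on one of the GPUs, so the next `run_benchmark` call OOMs.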
We should therefore either fix this memory leak or somehow reuse the same Ray cluster for the new model, without shutting it down at all.
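One direction for the first option would be to explicitly run a cache-reset step on every worker GPU before shutting the cluster down, and fail loudly if any worker did not free its cache, rather than relying on `ray.shutdown()` alone. A minimal sketch: the per-worker cleanups are injected as callables so the check runs without GPUs; in the real code each one would be a Ray remote task calling e.g. `torch.cuda.empty_cache()` on its worker. All names here are hypothetical.

```python
from typing import Callable, Sequence


def reset_all_gpu_caches(worker_cleanups: Sequence[Callable[[], bool]]) -> None:
    """Run the cache-reset hook on every worker, not just one.

    Each callable returns True if its GPU cache was freed. Raising on
    partial failure surfaces the "only one GPU was reset" failure mode
    instead of deferring it to an OOM on the next model load.
    """
    freed = [cleanup() for cleanup in worker_cleanups]
    if not all(freed):
        stuck = [i for i, ok in enumerate(freed) if not ok]
        raise RuntimeError(f"GPU cache not freed on workers: {stuck}")
```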
Operating System
Linux
Device
CUDA GPU
Python version
3.11.x
ScandEval version
12.7.0