[Ray Autoscaler] Ray autoscaler does not scale up effectively and fast #45373
Labels
bug: Something that is supposed to be working; but isn't
core: Issues that should be addressed in Ray Core
core-autoscaler: autoscaler related issues
@external-author-action-required: Alternate tag for PRs where the author doesn't have labeling permission.
P1: Issue that should be fixed within a few weeks
performance
What happened + What you expected to happen
We run Ray jobs in production. Right after upgrading Ray from 2.3.0 to 2.20.0, we saw a significant increase in job latency. Upon investigation, we found that the autoscaler wasn't spinning up new nodes even when the majority of tasks were queued waiting to be scheduled, which increased overall latency. We schedule only by memory, and these jobs weren't using all of the memory they requested. Even so, we expect the autoscaler to spin up new nodes to serve the demand. However, this issue does not occur with the SPREAD scheduling strategy (not sure why!).
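For illustration, the kind of workload described above can be sketched roughly as follows. This is not the reproduction script from this report: the 4 GiB memory request, the task count, the 10-second sleep, the use of num_cpus=0 to express "schedule only by memory", and the helper names task_options and run are all assumptions made for the example. It relies only on the documented ray.remote options memory, num_cpus, and scheduling_strategy, and assumes an already running cluster reachable via address="auto".

```python
# Illustrative sketch only: NOT the reporter's reproduction script.
# The 4 GiB request, task count, sleep time, and helper names are hypothetical.
import time

GIB = 1024 ** 3


def task_options(strategy="DEFAULT", memory_bytes=4 * GIB):
    """Build Ray task options that reserve memory but no CPU."""
    return {
        "num_cpus": 0,                    # assumption: schedule by memory only
        "memory": memory_bytes,           # logical memory reservation, in bytes
        "scheduling_strategy": strategy,  # "DEFAULT" or "SPREAD"
    }


def run(strategy, num_tasks=1000):
    """Submit memory-only tasks and return the elapsed wall-clock seconds."""
    import ray  # imported lazily so task_options works without Ray installed

    # Connect to an existing cluster (assumed to have the autoscaler enabled).
    ray.init(address="auto", ignore_reinit_error=True)

    @ray.remote
    def work():
        time.sleep(10)  # stand-in for real work that uses little actual memory
        return 1

    start = time.monotonic()
    refs = [
        work.options(**task_options(strategy)).remote()
        for _ in range(num_tasks)
    ]
    ray.get(refs)  # block until every task has finished
    return time.monotonic() - start
```

Calling run("DEFAULT") and then run("SPREAD") against the same cluster, while watching the node count, would be one way to compare how quickly nodes are added under each strategy.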
Versions / Dependencies
Ray: 2.20.0
OS: Ubuntu 20.04
Python: 3.10
Reproduction script
You might need to install deltacat first:
pip3 install deltacat
Each worker has 31 CPUs, 220 GB of memory, and 10000 max_tasks.
Issue Severity
None