
[Ray Autoscaler] Ray autoscaler does not scale up effectively and fast #45373

Open

raghumdani opened this issue May 16, 2024 · 0 comments

Labels: bug (Something that is supposed to be working; but isn't), core (Issues that should be addressed in Ray Core), core-autoscaler (autoscaler related issues), @external-author-action-required (Alternate tag for PRs where the author doesn't have labeling permission), P1 (Issue that should be fixed within a few weeks), performance

Comments

raghumdani commented May 16, 2024

What happened + What you expected to happen

We run Ray jobs in production. Right after upgrading Ray from 2.3.0 to 2.20.0, we saw a significant increase in job latency. Upon investigation, we found that the autoscaler wasn't spinning up new nodes even while the majority of tasks sat queued waiting to be scheduled, which caused the overall latency increase. We schedule only by memory, and these tasks weren't using the full amount of memory they requested; even so, we expect the autoscaler to spin up new nodes to serve the pending demand. Notably, the issue does not occur with the SPREAD scheduling strategy (we are not sure why); a sketch of that workaround follows.
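
For reference, this is the kind of workaround we mean by the SPREAD strategy. It is a minimal sketch, not the production job: the task function and counts are made up for illustration, while the per-task resources match the reproduction script below.

import time
import ray

ray.init(address="auto")

@ray.remote
def sleeper(t):
    time.sleep(t)

# Explicitly spreading tasks across nodes sidesteps the scale-up problem
# in our setup; the resource values mirror the reproduction script below.
refs = [
    sleeper.options(
        num_cpus=0.01,
        memory=23387571131,
        resources={"max_tasks": 0.001},
        scheduling_strategy="SPREAD",
    ).remote(120)
    for _ in range(1000)
]
ray.get(refs)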

Versions / Dependencies

Ray: 2.20.0
OS: Ubuntu 20.04
Python: 3.10

Reproduction script

You might need to install deltacat first with pip3 install deltacat. Each worker node has 31 CPUs, 220 GB of memory, and a custom max_tasks resource of 10000.
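
For local experimentation, that worker shape can be approximated on a single node as below. This is an assumption for illustration only; in our cluster these values come from the node configuration, and memory is set when the node starts rather than through ray.init.

import ray

# Approximate one production worker locally: 31 CPUs and a custom
# "max_tasks" resource of 10000 (memory is omitted; it is normally
# configured when the node is launched).
ray.init(num_cpus=31, resources={"max_tasks": 10000})
print(ray.cluster_resources())

The actual reproduction script: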

from deltacat.utils.ray_utils.concurrency import (
    invoke_parallel,
)
import time

import numpy as np
import ray

ray.init(address="auto")


@ray.remote
def i_will_return_after_t_secs(t):
    # Allocate ~8 GB in total (10 arrays of 1e8 float64 values, ~0.8 GB each),
    # well below the ~21.8 GiB of memory each task requests.
    all_arrs = []
    for _ in range(10):
        all_arrs.append(np.random.rand(100_000_000))  # ~0.8 GB

    time.sleep(t)


def options_provider(*args, **kwargs):
    # Schedule essentially by memory alone: ~21.8 GiB (23387571131 bytes) per
    # task, a negligible CPU fraction, and a tiny slice of the custom
    # max_tasks resource (10000 per worker).
    return {"num_cpus": 0.01, "memory": 23387571131, "resources": {"max_tasks": 0.001}}


items_to_run = [120 for _ in range(14000)]  # 14000 tasks, each sleeping 120 s

start = time.monotonic()
print(f"Now starting: {start}")

pending = invoke_parallel(
    items_to_run,
    i_will_return_after_t_secs,
    max_parallelism=4096,
    options_provider=options_provider,
)

end_invoke = time.monotonic()
print(f"Now ended invoke: {end_invoke - start}")

ray.get(pending)

end = time.monotonic()
print(f"Complete: {end - start}")

# At peak, Ray 2.3.0 scaled to 460 nodes while 2.20.0 scaled to only 364.
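
One way to watch the node counts referenced in the comment above (not part of the original report; a sketch that assumes a second driver attached to the same cluster while the job runs):

import time
import ray

ray.init(address="auto")

# Poll the number of alive nodes; ray.nodes() returns per-node metadata
# including an "Alive" flag.
for _ in range(120):
    alive = sum(1 for node in ray.nodes() if node["Alive"])
    print(f"alive nodes: {alive}")
    time.sleep(30)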

Issue Severity

None

@raghumdani added the bug and triage labels on May 16, 2024
@anyscalesam added the performance, core-autoscaler, and core labels on May 16, 2024
@rynewang added the @external-author-action-required and P1 labels and removed the triage label on May 20, 2024