[Core] Change the source of the ray_tasks metric for finished or failed tasks to have a more accurate count. #45333

Open
alanwguo opened this issue May 14, 2024 · 1 comment
Labels
bug Something that is supposed to be working; but isn't core Issues that should be addressed in Ray Core P1 Issue that should be fixed within a few weeks

Comments

@alanwguo
Contributor

What happened + What you expected to happen

Collect a global counter of num_finished and num_failed tasks on the head node and export the metric from there.

The current distributed counter approach runs into problems when a node dies: that node's count of total finished or failed tasks gets wiped out.
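To make the failure mode concrete, here is a minimal sketch in plain Python. It is not Ray's actual implementation; the class names, node IDs, and task counts are all made up for illustration. It only shows why summing per-node totals undercounts once a node goes away, while a single counter kept in one place does not.

```python
from collections import defaultdict


class PerNodeCounters:
    """Hypothetical model: each node reports its own running total."""

    def __init__(self):
        self.finished = defaultdict(int)

    def record_finished(self, node_id, n=1):
        self.finished[node_id] += n

    def node_died(self, node_id):
        # The dead node's total disappears along with the node.
        self.finished.pop(node_id, None)

    def total(self):
        return sum(self.finished.values())


class HeadNodeCounter:
    """Hypothetical model: one global total kept on the head node."""

    def __init__(self):
        self.finished = 0

    def record_finished(self, node_id, n=1):
        self.finished += n

    def node_died(self, node_id):
        # Nothing to lose: the total does not live on the worker node.
        pass

    def total(self):
        return self.finished


def simulate(counter):
    counter.record_finished("node-a", 60)
    counter.record_finished("node-b", 40)
    counter.node_died("node-b")
    return counter.total()


per_node_total = simulate(PerNodeCounters())
head_total = simulate(HeadNodeCounter())
print(per_node_total, head_total)  # 60 100
```

In this toy run, 100 tasks finished, but the per-node model reports only 60 after node-b dies.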

We worked around this in the Grafana dashboard by taking a max_over_time of each of these counts, but that can be very slow since it scans the past 14 days of time-series data.
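A rough sketch of what the workaround does, again in plain Python rather than the real dashboard query. The sample data, the function names, and the short time window are assumptions for illustration. The idea is to take each node's peak value over the window, so a node that died still contributes its last known total, at the cost of reading every sample in the window.

```python
def max_over_time(samples, window_start, window_end):
    """Return the largest value among (timestamp, value) samples in the window."""
    in_window = [v for t, v in samples if window_start <= t <= window_end]
    return max(in_window) if in_window else None


def recovered_total(series, window_start, window_end):
    """Sum each node's peak over the window (the dashboard-style workaround)."""
    peaks = (max_over_time(s, window_start, window_end) for s in series.values())
    return sum(p for p in peaks if p is not None)


def latest_total(series, now):
    """Sum only the values reported at time `now`, as a naive query would."""
    return sum(v for s in series.values() for t, v in s if t == now)


# Hypothetical per-node "finished tasks" samples as (timestamp, value).
series = {
    "node-a": [(0, 10), (1, 35), (2, 60)],
    "node-b": [(0, 5), (1, 40)],  # node-b died after t=1, so no sample at t=2
}

naive = latest_total(series, now=2)
recovered = recovered_total(series, window_start=0, window_end=2)
samples_scanned = sum(len(s) for s in series.values())
print(naive, recovered, samples_scanned)  # 60 100 5
```

The recovered total is correct here, but the work grows with the number of samples in the window, which is why a 14-day window over many series gets slow.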

Versions / Dependencies

ray 2.21.0

Reproduction script

simple repro:

import ray

# Connect to an already-running cluster.
ray.init("auto")

@ray.remote
def foo():
    return "hi"

# Launch 100 tasks and wait for all of them to finish.
ray.get([foo.remote() for _ in range(100)])

Open the Grafana dashboard and go to the metrics page. See the tasks graph. If the number of tasks is very large and the cluster has been alive for a long time, this graph can be too slow to even load.

Issue Severity

Medium: It is a significant difficulty but I can work around it.

@alanwguo alanwguo added bug Something that is supposed to be working; but isn't triage Needs triage (eg: priority, bug/not-bug, and owning component) labels May 14, 2024
@alanwguo alanwguo changed the title [<Ray component: Core] [Core] Change the source of the ray_tasks metric for finished or failed tasks to have a more accurate count. May 14, 2024
@anyscalesam anyscalesam added the core Issues that should be addressed in Ray Core label May 20, 2024
@rynewang
Contributor

Do you have a PR for this?

@rynewang rynewang added P1 Issue that should be fixed within a few weeks and removed triage Needs triage (eg: priority, bug/not-bug, and owning component) labels May 20, 2024