Search before asking
I had searched in the issues and found no similar feature requirement.
Description
If I have a TorchTrainer running, doing some work, and I drain the node where my head pod is running, nothing ever seems to actually recover. I've enabled GCS fault tolerance (FT) with KubeRay Helm chart version v1.1.1, and I have an external Redis that holds all the state.
Is there truly no way to have my head pod, which is running on a spot node, survive being rescheduled? It doesn't seem like the head node can do any recovery whatsoever for jobs that were in the middle of training.
Use case
No response
Related issues
No response
Are you willing to submit a PR?
Yes I am willing to submit a PR!
Thanks for responding @kevin85421. My original question still stands, then: is there no way to have jobs that are in progress survive the Ray head being restarted? I see that there's a way to resume a trainer from its previous state, but do I always have to re-submit it if the Ray head is gone?
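For reference, the resume path I mean is roughly this pattern (just a sketch of my understanding; the bucket path, run name, and scaling numbers are made up, and the exact restore/storage_path API depends on the Ray version):

```python
from ray.train import RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer

# Hypothetical persistent location; something outside the cluster
# (e.g. S3 or NFS) so it survives the head pod being rescheduled.
experiment_path = "s3://my-bucket/experiments/my-torch-run"

if TorchTrainer.can_restore(experiment_path):
    # Re-attach to the previous run and continue from its latest checkpoint.
    trainer = TorchTrainer.restore(experiment_path)
else:
    trainer = TorchTrainer(
        train_loop_per_worker,  # the existing training function
        scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
        run_config=RunConfig(
            name="my-torch-run",
            storage_path="s3://my-bucket/experiments",
        ),
    )

result = trainer.fit()
```

But something still has to re-submit this after the head comes back, which is exactly my question.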
My current understanding is that Ray Train provides some degree of fault tolerance.
If a Ray worker Pod crashes, Ray Train will launch new Ray tasks or actors, allowing the job to continue running.
If the Ray head crashes, the driver process, which typically runs on the Ray head for a Ray job, will also crash. Because Ray tasks and actors are, in most cases, fate-sharing with the driver process, they will then be garbage collected automatically. Only detached actors do not share this fate with the driver.
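For example, a detached actor is created with an explicit name and lifetime, and a new driver can look it up again by name after a restart (a minimal sketch; the actor class, name, and namespace here are placeholders):

```python
import ray

@ray.remote
class ProgressTracker:
    def __init__(self):
        self.last_epoch = -1

    def update(self, epoch):
        self.last_epoch = epoch

    def get(self):
        return self.last_epoch

# "detached" means the actor's lifetime is not tied to the driver process.
tracker = ProgressTracker.options(
    name="progress_tracker", namespace="train", lifetime="detached"
).remote()

# A later driver (e.g. after the job is re-submitted) can reattach by name.
tracker = ray.get_actor("progress_tracker", namespace="train")
```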
In this case, users need to write the fault-tolerance logic at the application level, specifically in their Ray Python script, for example:
if checkpoint_exists():
    read_checkpoint()
else:
    start_from_scratch()
train_the_model()
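In Ray Train terms, that pattern inside the training loop might look roughly like this (a minimal sketch; build_model, train_one_epoch, and the file name are placeholders, and the exact checkpoint API depends on your Ray version):

```python
import os
import tempfile

import torch
import ray.train
from ray.train import Checkpoint

def train_loop_per_worker(config):
    model = build_model()  # placeholder for your model setup
    start_epoch = 0

    # If a checkpoint exists (e.g. the run was restored), resume from it.
    checkpoint = ray.train.get_checkpoint()
    if checkpoint:
        with checkpoint.as_directory() as ckpt_dir:
            state = torch.load(os.path.join(ckpt_dir, "state.pt"))
            model.load_state_dict(state["model"])
            start_epoch = state["epoch"] + 1

    # Otherwise start from scratch; either way, train the model and
    # report a new checkpoint every epoch.
    for epoch in range(start_epoch, config["num_epochs"]):
        train_one_epoch(model)  # placeholder for your epoch loop
        with tempfile.TemporaryDirectory() as tmp:
            torch.save(
                {"model": model.state_dict(), "epoch": epoch},
                os.path.join(tmp, "state.pt"),
            )
            ray.train.report(
                {"epoch": epoch},
                checkpoint=Checkpoint.from_directory(tmp),
            )
```

As long as checkpoints are written to persistent storage (e.g. S3), re-submitting the job after a head restart picks up from the last reported checkpoint rather than starting from scratch.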
Support for long-running Ray jobs is on our roadmap for the next release. We are currently working on:
A retry mechanism in both the RayJob CRD and ray job submit.
Best practices for checkpointing.
If you are interested in this topic, you can reach out to me on Ray Slack (my handle is 'Kai-Hsun Chen (ray team)'). We can discuss your requirements and ensure there are no feature gaps for your use cases.