
FT GCS should handle draining of node where head pod is scheduled #2153

Open

abatilo opened this issue May 17, 2024 · 3 comments

Comments

@abatilo

abatilo commented May 17, 2024

Search before asking

  • I searched the issues and found no similar feature request.

Description

If I have a TorchTrainer running and doing some work, and I drain the node where my head pod is running, nothing ever seems to recover. I've enabled FT GCS in KubeRay Helm chart version v1.1.1, and I have an external Redis that holds all of the cluster state.

Is there truly no way to have my head pod, which is running on a spot node, survive being rescheduled? It doesn't seem like the head node can do any recovery whatsoever for jobs that were in the middle of training.

Use case

No response

Related issues

No response

Are you willing to submit a PR?

  • Yes, I am willing to submit a PR!
abatilo added the enhancement (New feature or request) and triage labels on May 17, 2024
kevin85421 added the gcs ft label and removed the triage label on May 17, 2024
@kevin85421
Member

Hi @abatilo, thank you for opening the issue. You may have a misunderstanding of GCS FT. See https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/kuberay-gcs-ft.html#kuberay-gcs-ft for more details. Currently, the only supported use case for GCS FT is Ray Serve high availability.

kevin85421 self-assigned this on May 18, 2024
@abatilo
Author

abatilo commented May 18, 2024

Thanks for responding, @kevin85421. My original question still stands, then: is there no way for jobs that are in progress to survive the Ray head being restarted? I see that there's a way to resume a trainer from a previous state, but do I always have to re-submit the job if the Ray head is gone?

@kevin85421
Member

My current understanding is that Ray Train provides some degree of fault tolerance.

  • If a Ray worker Pod crashes, Ray Train will launch new Ray tasks or actors, allowing the job to continue running.
  • If the Ray head crashes, the driver process, which typically runs on the Ray head node for a Ray job, will also crash. Because Ray tasks and actors are, in most cases, fate-sharing with the driver process, they will be garbage collected automatically. Only detached actors do not share this fate with the driver.
    • In this case, users need to write the fault tolerance logic at the application level, specifically in their Ray Python script.
      # Placeholder helpers: check for and load a previous checkpoint,
      # otherwise start from scratch.
      if checkpoint_exists():
          state = load_checkpoint()
      else:
          state = None  # start from scratch

      train_model(state)
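
As a more concrete sketch of that pattern using Ray Train's checkpoint API (ray.train.get_checkpoint() and ray.train.report()): the build_model() helper and the "state.pt" file layout below are placeholders for your own code, not a prescribed implementation.

    import os
    import tempfile

    import torch
    from ray import train
    from ray.train import Checkpoint

    def train_loop_per_worker(config):
        model = build_model()  # placeholder: construct your model here
        start_epoch = 0

        # If the trainer was restarted, resume from the latest reported checkpoint.
        checkpoint = train.get_checkpoint()
        if checkpoint:
            with checkpoint.as_directory() as ckpt_dir:
                state = torch.load(os.path.join(ckpt_dir, "state.pt"))
                model.load_state_dict(state["model"])
                start_epoch = state["epoch"] + 1

        for epoch in range(start_epoch, config["num_epochs"]):
            ...  # one epoch of training

            # Report metrics and save a checkpoint so a later run can resume.
            with tempfile.TemporaryDirectory() as tmp_dir:
                torch.save(
                    {"model": model.state_dict(), "epoch": epoch},
                    os.path.join(tmp_dir, "state.pt"),
                )
                train.report(
                    {"epoch": epoch},
                    checkpoint=Checkpoint.from_directory(tmp_dir),
                )

Note that this does not keep the job alive across a head crash; the job still has to be re-submitted or retried. Provided the new run is restored from the same experiment path (for example via TorchTrainer.restore), the training loop picks up from the last checkpoint instead of starting from scratch.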

Support for long-running Ray jobs is on our roadmap for the next release. We are currently working on:

  • A retry mechanism in both the RayJob CRD and ray job submit.
  • Best practices for checkpointing.

If you are interested in this topic, you can reach out to me on Ray Slack (my handle is 'Kai-Hsun Chen (ray team)'). We can discuss your requirements and ensure there are no feature gaps for your use cases.
