Search before asking
I had searched in the issues and found no similar feature requirement.
Description
If I have a TorchTrainer running, doing some work, and I drain the node where my head pod is running, nothing ever seems to actually recover. I've enabled GCS fault tolerance (FT) with KubeRay Helm chart version v1.1.1, and I have an external Redis that holds all the state.
Is there truly no way to have my head pod, which is running on a spot node, survive being rescheduled? It doesn't seem like the head node can do any recovery whatsoever for jobs that were in the middle of training.
Use case
No response
Related issues
No response
Are you willing to submit a PR?
Yes I am willing to submit a PR!
Thanks for responding @kevin85421. My original question still stands, then: is there no way to have jobs that are in progress survive the Ray head being restarted? I see that there's a way to resume a trainer from its previous state, but do I always have to re-submit it if the Ray head is gone?
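For reference, the resume path I mean is roughly this pattern (just a sketch of my understanding; the bucket path, run name, and scaling numbers are made up, and the exact restore/storage_path API depends on the Ray version):

```python
from ray.train import RunConfig, ScalingConfig
from ray.train.torch import TorchTrainer

# Hypothetical persistent location; something outside the cluster
# (e.g. S3 or NFS) so it survives the head pod being rescheduled.
experiment_path = "s3://my-bucket/experiments/my-torch-run"

if TorchTrainer.can_restore(experiment_path):
    # Re-attach to the previous run and continue from its latest checkpoint.
    trainer = TorchTrainer.restore(experiment_path)
else:
    trainer = TorchTrainer(
        train_loop_per_worker,  # the existing training function
        scaling_config=ScalingConfig(num_workers=4, use_gpu=True),
        run_config=RunConfig(
            name="my-torch-run",
            storage_path="s3://my-bucket/experiments",
        ),
    )

result = trainer.fit()
```

But something still has to re-submit this after the head comes back, which is exactly my question.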
My current understanding is that Ray Train provides some degree of fault tolerance.
If a Ray worker Pod crashes, Ray Train will launch new Ray tasks or actors, allowing the job to continue running.
If the Ray head crashes, the driver process, which typically runs on the Ray head for a Ray job, will also crash. Because Ray tasks and actors are, in most cases, fate-sharing with the driver process, they will then be garbage collected automatically. Only detached actors do not share this fate with the driver.
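For example, a detached actor is created with an explicit name and lifetime, and a new driver can look it up again by name after a restart (a minimal sketch; the actor class, name, and namespace here are placeholders):

```python
import ray

@ray.remote
class ProgressTracker:
    def __init__(self):
        self.last_epoch = -1

    def update(self, epoch):
        self.last_epoch = epoch

    def get(self):
        return self.last_epoch

# "detached" means the actor's lifetime is not tied to the driver process.
tracker = ProgressTracker.options(
    name="progress_tracker", namespace="train", lifetime="detached"
).remote()

# A later driver (e.g. after the job is re-submitted) can reattach by name.
tracker = ray.get_actor("progress_tracker", namespace="train")
```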
In this case, users need to write the fault-tolerance logic at the application level, specifically in their Ray Python script, for example:
if checkpoint_exists():
    read_checkpoint()
else:
    start_from_scratch()
train_the_model()
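In Ray Train terms, that pattern inside the training loop might look roughly like this (a minimal sketch; build_model, train_one_epoch, and the file name are placeholders, and the exact checkpoint API depends on your Ray version):

```python
import os
import tempfile

import torch
import ray.train
from ray.train import Checkpoint

def train_loop_per_worker(config):
    model = build_model()  # placeholder for your model setup
    start_epoch = 0

    # If a checkpoint exists (e.g. the run was restored), resume from it.
    checkpoint = ray.train.get_checkpoint()
    if checkpoint:
        with checkpoint.as_directory() as ckpt_dir:
            state = torch.load(os.path.join(ckpt_dir, "state.pt"))
            model.load_state_dict(state["model"])
            start_epoch = state["epoch"] + 1

    # Otherwise start from scratch; either way, train the model and
    # report a new checkpoint every epoch.
    for epoch in range(start_epoch, config["num_epochs"]):
        train_one_epoch(model)  # placeholder for your epoch loop
        with tempfile.TemporaryDirectory() as tmp:
            torch.save(
                {"model": model.state_dict(), "epoch": epoch},
                os.path.join(tmp, "state.pt"),
            )
            ray.train.report(
                {"epoch": epoch},
                checkpoint=Checkpoint.from_directory(tmp),
            )
```

As long as checkpoints are written to persistent storage (e.g. S3), re-submitting the job after a head restart picks up from the last reported checkpoint rather than starting from scratch.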
Support for long-running Ray jobs is on our roadmap for the next release. We are currently working on:
A retry mechanism in both the RayJob CRD and ray job submit.
Best practices for checkpointing.
If you are interested in this topic, you can reach out to me on Ray Slack (my handle is 'Kai-Hsun Chen (ray team)'). We can discuss your requirements and ensure there are no feature gaps for your use cases.