Improve heartbeat failure messaging #358

Open
obgibson opened this issue Apr 13, 2023 · 0 comments

  1. Publish a document explaining how heartbeats are used to mark runs and tasks as failed. It could live on metaflow.org and in the READMEs of this repository, in addition to https://github.com/Netflix/metaflow-service/blob/master/services/ui_backend_service/docs/environment.md#heartbeat-intervals
  2. When a task or run is marked as failed because of a missing heartbeat, show that reason in MFGUI.
  3. Define default minimum and maximum heartbeat thresholds. If a task or run misses the minimum threshold, show it as "pending"; show it as "failed" only once it misses the maximum threshold. This functionality will have to account for resumes and multiple attempts.

The reason for this issue is twofold: some runs/tasks are marked as "failed" before they have started, and some runs/tasks are still shown as "running" after they have failed, because the heartbeat threshold has not been reached yet.
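
To make items 2 and 3 concrete, here is a minimal sketch of the two-threshold classification. It is not the ui_backend_service implementation. The function name `classify`, the `Status` enum, and the default threshold values are all hypothetical. It also ignores resumes and multiple attempts, which a real change would need to handle.

```python
from enum import Enum
from typing import Optional, Tuple

# Hypothetical defaults in seconds; not taken from metaflow-service.
MIN_HEARTBEAT_SECONDS = 60.0
MAX_HEARTBEAT_SECONDS = 300.0


class Status(Enum):
    RUNNING = "running"
    PENDING = "pending"
    FAILED = "failed"


def classify(
    now_ts: float,
    created_ts: float,
    last_heartbeat_ts: Optional[float] = None,
    min_threshold: float = MIN_HEARTBEAT_SECONDS,
    max_threshold: float = MAX_HEARTBEAT_SECONDS,
) -> Tuple[Status, Optional[str]]:
    """Return a (status, reason) pair based on how stale the heartbeat is."""
    if min_threshold > max_threshold:
        raise ValueError("min_threshold must not exceed max_threshold")

    # With no heartbeat yet, measure from creation time so a task that
    # has not started is not reported as failed right away.
    reference = created_ts if last_heartbeat_ts is None else last_heartbeat_ts
    age = now_ts - reference

    if age > max_threshold:
        # Item 2: carry the reason so the UI can display it.
        return Status.FAILED, f"no heartbeat for {age:.0f}s (limit {max_threshold:.0f}s)"
    if last_heartbeat_ts is None or age > min_threshold:
        return Status.PENDING, f"waiting for heartbeat ({age:.0f}s elapsed)"
    return Status.RUNNING, None
```

For example, `classify(now_ts=200, created_ts=0, last_heartbeat_ts=100)` gives a "pending" status with the default thresholds, because the heartbeat is 100 seconds old: past the minimum but within the maximum.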
