Heartbeat timeout on long running tasks #359

Open
nstott opened this issue Feb 13, 2019 · 0 comments


nstott commented Feb 13, 2019

Hi all, I'm relatively new to simpleflow and am having some trouble understanding what the best practice is for long-running jobs.

My workflow consists of a few tasks, one of which runs an external process to crunch some data and can take anywhere between 1 and 2 hours.

When this long task is running, the worker doesn't seem to be sending heartbeats, so I've set the heartbeat timeout to an unreasonably large value so that the SWF task doesn't fail due to a timeout.

The problem I'm having is that my worker processes periodically crash (OOM, or other general Kubernetes malfeasance), and because of the long heartbeat timeout, the workflow doesn't retry the failed task until the very end.

I'm looking for a way to keep sending heartbeats while the worker is occupied, or some other way to retry quickly when a worker fails. I'm not sure what the right pattern is here.

I'm not sure if this is related to #239.
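
To make the "keep sending heartbeats while the worker is occupied" idea concrete, here is a minimal sketch of one generic pattern: a background thread heartbeats on a short interval while the external process runs. This is not simpleflow's API. `run_with_heartbeat` and the `send_heartbeat` callable are hypothetical names for illustration, and whether simpleflow exposes a hook like this is exactly what I'm unsure about.

```python
import subprocess
import threading


def run_with_heartbeat(cmd, send_heartbeat, interval=30.0):
    """Run `cmd` to completion, calling `send_heartbeat()` every `interval` seconds.

    `send_heartbeat` is a hypothetical zero-argument callable supplied by the
    caller; this sketch does not assume any particular simpleflow hook.
    """
    stop = threading.Event()

    def beat():
        # Event.wait() returns False when it times out and True once the
        # event is set, so this loop fires once per interval until stopped.
        while not stop.wait(interval):
            send_heartbeat()

    thread = threading.Thread(target=beat, daemon=True)
    thread.start()
    try:
        # check=True raises CalledProcessError if the external process fails
        # (for example, if it is killed), so the task fails right away.
        return subprocess.run(cmd, check=True)
    finally:
        # Stop heartbeating whether the process succeeded or failed.
        stop.set()
        thread.join()
```

With something like this, the heartbeat timeout could stay short (a few multiples of `interval`): if the whole worker is OOM-killed, the heartbeat thread dies with it and SWF should time the task out quickly. When talking to SWF directly through boto3, `send_heartbeat` could presumably wrap `record_activity_task_heartbeat(taskToken=...)`, assuming the task token is available inside the task.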
