Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Only error tasks if the failure is within the task implementation #300

Open
nsaje opened this issue Sep 27, 2023 · 4 comments
Open

Only error tasks if the failure is within the task implementation #300

nsaje opened this issue Sep 27, 2023 · 4 comments

Comments

@nsaje
Copy link
Contributor

nsaje commented Sep 27, 2023

Errors like TaskImportError or errors stemming from a broken child context manager shouldn't move the task to the ERROR state - it should remain in QUEUED.

This way when we somehow break TaskTiger outside of the task implementation we don't need to worry about which tasks should be retried and make recovery far easier.

To implement this we'd need a way for the child process to signal an "abort". Currently we return exit code 0 in case of success and 1 in case of error. We could add exit code 2 to signify an abortion of task processing, in which case the parent process could move the task back into the QUEUED state.

WDYT @thomasst ?

@tsx
Copy link
Contributor

tsx commented Oct 2, 2023

Not all TaskImportErrors are equal. There are two kinds:

  • When we expect the task to become importable on its own - this typically happens as race conditions during deployments, for example when a deployment that schedules a new task was rolled out but the workers that handle that task are not aware of it just yet. These will recover automatically, so leaving them in Queued state would be fine.
  • When the task reference is actually broken, like it was removed or moved and there were in-flight tasks that reference the old (now not existing) name. These need manual intervention and leaving them in Queued means we'll ignore a real problem.

Unless we have a reliable way to distinguish between the two kinds of failures (honestly I don't see how a worker can know whether the task is expected to come in the future or not - we don't have the crystal balls necessary for that), I wouldn't recommend leaving those tasks as Queued.

That said, keeping a list of tasks that have failed with an import error, and retrying them on next worker restart (redeployment presumably) might make recovery (in both scenarios) easier by reducing the need for manually tracking down tasks and removing them.

As for the signalling mechanism - I don't think using exit codes is the best idea. I think the child process can just catch the appropriate ImportError exceptions and write some richer state/report to TT redis instead of trying to communicate through a 1-bit channel. It already stores the execution data from the forked child, so why not add it there.

@nsaje
Copy link
Contributor Author

nsaje commented Oct 2, 2023

I think both of those cases would be better served by leaving the task in Queued state but still loudly logging an error (rollbar etc.). This way the manual intervention needed would be reduced to just fixing the issue in the code, whereas right now you also have to figure out exactly which tasks failed because of this issue and retry them manually.

The parent process handles the state/queue transitions so we do need to get that info back from the child somehow. Passing it via a Redis entry sounds fine as well, I don't have a strong opinion on it.

@thomasst
Copy link
Member

thomasst commented Oct 4, 2023

Do you have an example where a broken child context manager caused a TaskImportError?

@nsaje
Copy link
Contributor Author

nsaje commented Oct 4, 2023

No, there's an "or" in there :-) e.g. either TaskImportError or any error caused by the context managers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants