Report workers presence #288

Open
jbbarth opened this issue Aug 18, 2017 · 1 comment
Comments

@jbbarth
Collaborator

jbbarth commented Aug 18, 2017

Simpleflow should be able to report worker usage to a given datastore (say Redis) so that it can be consumed by other systems (Webflow, an autoscaling mechanism). The idea is to fill the gaps in the SWF API from an operational perspective.

Implementation

Simpleflow has no server attached, and I believe it would be a huge waste of energy to introduce such a concept at this point. If we need complex things we can rely on Webflow, but here we can just share a datastore like Redis or DynamoDB. It has to support TTLs (or an equivalent) on the messages, so that the information stays up to date through a purely passive mechanism: an entry that is no longer refreshed simply expires.
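To make the TTL idea concrete, here is a minimal sketch. `TTLStore` and `heartbeat` are hypothetical names, not simpleflow APIs, and the in-memory store only stands in for a real datastore with per-key expiry (for example Redis `SET` with the `EX` option).

```python
import time


class TTLStore:
    """In-memory stand-in for a datastore with per-key expiry (hypothetical)."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}  # key -> (value, expires_at)

    def set(self, key, value, ttl):
        # Writing a key again pushes its expiry further into the future.
        self._data[key] = (value, self._clock() + ttl)

    def live_items(self):
        now = self._clock()
        # Drop expired entries lazily; a real TTL-capable store does this for us.
        self._data = {k: v for k, v in self._data.items() if v[1] > now}
        return {k: v[0] for k, v in self._data.items()}


def heartbeat(store, worker_id, status, ttl=30):
    """Refresh this worker's presence entry; it vanishes if refreshes stop."""
    store.set("presence:" + worker_id, status, ttl)
```

A worker would call `heartbeat` more often than the TTL, so a crashed or stopped worker disappears from the store without any cleanup step.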

Either each worker/decider could register against the datastore, or only the supervisor process. The latter is trivial to implement, but reporting anything beyond the number of subprocesses would be a bit hacky: their "status" (processing, polling, stopping) cannot be retrieved directly, only via ps output. Other information (which workflow they're working on, maybe the task token) cannot be accessed at all with the current design.

So we should do it at worker level, and optionally add something at the supervisor process level (for instance to get a reliable "running/stopping" status, which we don't manage to have on workers, see #283 #205).
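As a rough illustration of what a worker-level entry could carry, here is a sketch of a presence record. The field names, the key layout, and the `PresenceRecord` class are assumptions made for this example; only the three statuses come from the description above.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional

# Statuses mentioned in this issue.
STATUSES = ("polling", "processing", "stopping")


@dataclass
class PresenceRecord:
    hostname: str
    pid: int
    role: str  # "worker", "decider" or "supervisor" (hypothetical values)
    status: str
    task_list: Optional[str] = None
    workflow_id: Optional[str] = None
    task_token: Optional[str] = None

    def __post_init__(self):
        if self.status not in STATUSES:
            raise ValueError("unknown status: %r" % (self.status,))

    def key(self):
        # Hypothetical key layout: one entry per process.
        return "simpleflow:presence:%s:%d" % (self.hostname, self.pid)

    def to_json(self):
        return json.dumps(asdict(self), sort_keys=True)
```

Each process would write `record.to_json()` under `record.key()` with a TTL, and the supervisor could write a similar record with `role="supervisor"`.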

This feature should not be mandatory to make simpleflow work.

Use case 1: get a platform usage overview

It's very hard to know simpleflow worker usage well enough to scale effectively. The only global information available via SWF endpoints is the "backlog" for a given "task list", but 1/ you need to know the task lists in advance to monitor them (which is annoying when some are dynamic), and 2/ you only get an alert when it's too late, i.e. once 100% of the workers are busy and you already need more. In some cases this can be really annoying: if we knew the platform was 95% saturated and big workflow executions were coming, we could anticipate the scale-up.

Having real platform usage figures can help autoscaling tremendously, and maybe we could even build usage stats on top of them to understand where the possible cost gains are, and which worker types are over- or undersized.
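A minimal sketch of how a consumer could turn presence records into a saturation figure per task list. The record shape (dicts with `task_list` and `status` keys) and the 95% threshold are assumptions for illustration only.

```python
from collections import defaultdict


def usage_by_task_list(records):
    """Count busy and total workers per task list and derive a saturation ratio."""
    counts = defaultdict(lambda: {"busy": 0, "total": 0})
    for record in records:
        entry = counts[record["task_list"]]
        entry["total"] += 1
        if record["status"] == "processing":
            entry["busy"] += 1
    return {
        task_list: dict(c, saturation=c["busy"] / c["total"])
        for task_list, c in counts.items()
    }


def needs_scale_up(usage, threshold=0.95):
    """Return the task lists whose saturation reached the threshold."""
    return sorted(tl for tl, c in usage.items() if c["saturation"] >= threshold)
```

Unlike the SWF backlog, this reports saturation before tasks start queueing, and it works for task lists that were not known in advance.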

Use case 2: monitor worker deployments / instance stops / instance starts

We could get a map of all our processes, including the instances that are stopping and the version of the code being used. As a side effect we could even monitor some worker stop problems: if the supervisor process reports being in "stopping" state since time T1 and new workers boot after T1, there's a problem. (I doubt we will ever have the time/energy to implement this, just thinking out loud.)
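The T1 check described above could look roughly like this. The dict keys (`id`, `status`, `stopping_since`, `supervisor_id`, `started_at`) are hypothetical, and timestamps are assumed to be comparable numbers such as epoch seconds.

```python
def workers_booted_while_stopping(supervisor, workers):
    """Return the workers of this supervisor that started after it began stopping."""
    if supervisor["status"] != "stopping":
        return []
    stopping_since = supervisor["stopping_since"]  # the "T1" of the description
    return [
        worker
        for worker in workers
        if worker["supervisor_id"] == supervisor["id"]
        and worker["started_at"] > stopping_since
    ]
```

A non-empty result would be the signal that a supervisor is still spawning workers while it is supposed to be shutting down.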

Use case 3: pilot workers interactively (far future)

Reporting worker presence is nice, but the same mechanism could also be used to send orders to workers, for instance to tell them to stop: the next time they update the presence API, they can receive an order. It could also change the number of workers for a given task list, or kill the current task.
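One possible shape for this, sketched with an in-memory stand-in: orders are queued per worker and handed back as the return value of the next presence update. `PresenceChannel`, `send_order` and the order strings are invented for this example.

```python
from collections import defaultdict, deque


class PresenceChannel:
    """Hypothetical presence API that piggybacks pending orders on heartbeats."""

    def __init__(self):
        self._orders = defaultdict(deque)  # worker_id -> queued orders
        self._status = {}  # worker_id -> last reported status

    def send_order(self, worker_id, order):
        # Called by an operator or an autoscaler, not by the worker itself.
        self._orders[worker_id].append(order)

    def heartbeat(self, worker_id, status):
        """Record the worker's status and return (and clear) its pending orders."""
        self._status[worker_id] = status
        return list(self._orders.pop(worker_id, ()))
```

The worker never has to listen on anything: it acts on whatever its regular heartbeat returns, so orders are delivered at most one heartbeat interval late.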

Use case 4: side-channel for listing task lists

As SWF doesn't expose this, well, this could be useful.
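With presence records available, the listing is a one-liner. This sketch assumes each record is a dict with an optional `task_list` key, which is an illustrative shape rather than an existing format.

```python
def known_task_lists(records):
    """Return the sorted, de-duplicated task lists seen in live presence records."""
    return sorted({r["task_list"] for r in records if r.get("task_list")})
```

The result only covers task lists that currently have at least one live worker or decider, so it complements rather than replaces any static configuration.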

@jbbarth
Collaborator Author

jbbarth commented Aug 18, 2017

An inspiration for this is https://github.com/mperham/sidekiq. This is done at least partially here
