Report workers presence #288

jbbarth · 2017-08-18T16:11:33Z

Simpleflow should be able to report worker usage to a given datastore (say Redis) so that it can be consumed by other systems (webflow, an autoscale mechanism). The idea is to fill the gaps in the SWF API from an operational perspective.

Implementation

Simpleflow has no server attached, and I believe it would be a huge waste of energy to introduce such a concept at this point. If we need complex things we can rely on Webflow, but here we can just share a datastore like Redis or DynamoDB. The datastore has to support TTLs or an equivalent for the messages, so that we keep up-to-date information with a purely passive mechanism.
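To make the "purely passive" idea concrete, here is a minimal sketch of TTL-based presence. The `TTLStore` class is a hypothetical in-memory stand-in for Redis or DynamoDB, and the `presence:` key prefix, the 30-second TTL and the `report_presence` name are assumptions for illustration, not anything that exists in simpleflow.

```python
import json
import time


class TTLStore:
    """Hypothetical in-memory stand-in for a datastore with per-key expiry."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}  # key -> (expires_at, value)

    def set(self, key, value, ttl):
        # Each write refreshes the expiry, like a heartbeat.
        self._data[key] = (self._clock() + ttl, value)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None or entry[0] <= self._clock():
            self._data.pop(key, None)
            return None
        return entry[1]

    def keys(self, prefix=""):
        # Only keys that have not expired are visible.
        now = self._clock()
        return sorted(
            key
            for key, (expires_at, _) in self._data.items()
            if expires_at > now and key.startswith(prefix)
        )


def report_presence(store, worker_id, status, ttl=30):
    """Write (or refresh) one presence record; it vanishes if not refreshed."""
    store.set("presence:" + worker_id, json.dumps({"status": status}), ttl)
```

A worker that dies simply stops refreshing its key, so consumers never need an explicit "unregister" call. With Redis the same effect could probably be had with a `SET` using the `EX` option.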

Either each worker/decider could register against the datastore, or only the supervisor process. The latter is trivial to implement, but it would be a bit hacky to report anything beyond the number of subprocesses: their "status" (processing, polling, stopping) cannot be retrieved directly, only via ps output. Other information (which workflow they're working on, maybe the task token) cannot be accessed at all with the current design.

So we should do it at worker level, and optionally add something at the supervisor process level (for instance to get a reliable "running/stopping" status, which we don't manage to have on workers, see #283 #205).
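As a rough sketch of what a worker-level record could carry, assuming the fields mentioned above (status, workflow, task token). The `WorkerPresence` class, its field names and the key layout are all hypothetical.

```python
import json
import os
import socket
from dataclasses import asdict, dataclass
from typing import Optional

# Statuses taken from the discussion above.
VALID_STATUSES = ("polling", "processing", "stopping")


@dataclass
class WorkerPresence:
    """Hypothetical presence record written by each worker/decider."""

    hostname: str
    pid: int
    task_list: str
    status: str
    workflow_id: Optional[str] = None
    task_token: Optional[str] = None

    def __post_init__(self):
        if self.status not in VALID_STATUSES:
            raise ValueError("unknown status: %r" % self.status)

    @property
    def key(self):
        # One key per process, so records from different workers never collide.
        return "presence:%s:%d" % (self.hostname, self.pid)

    def to_json(self):
        return json.dumps(asdict(self), sort_keys=True)


def current_presence(task_list, status, **extra):
    """Build a record for the calling process."""
    return WorkerPresence(socket.gethostname(), os.getpid(), task_list, status, **extra)
```

Only the worker itself knows `workflow_id` and `task_token`, which is the main argument for reporting at worker level rather than from the supervisor.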

This feature should not be mandatory to make simpleflow work.

Use case 1: get a platform usage overview

It's very hard to know simpleflow worker usage well enough to scale effectively. The only global information available via SWF endpoints is the "backlog" for a given "task list", but 1/ you need to know the task lists in advance to monitor them (which is annoying when some are dynamic), and 2/ you only get an alert when it's too late, i.e. when 100% of workers are busy and you already need more. In some cases this can be really annoying, because if we knew the platform was 95% saturated and big workflow executions were coming, we could anticipate the scale up.

Having real platform usage figures can help autoscaling tremendously, and maybe we could even build usage stats on top of them to understand where the possible cost gains are, and which worker types are over/under sized.
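For illustration, a consumer such as an autoscaler could derive saturation per task list from the presence records along these lines. The record shape (`task_list` and `status` keys) and the 95% threshold are assumptions borrowed from the example above.

```python
from collections import defaultdict


def usage_by_task_list(records):
    """Return {task_list: fraction of workers currently processing}."""
    counts = defaultdict(lambda: {"busy": 0, "total": 0})
    for record in records:
        entry = counts[record["task_list"]]
        entry["total"] += 1
        if record["status"] == "processing":
            entry["busy"] += 1
    return {
        task_list: entry["busy"] / entry["total"]
        for task_list, entry in counts.items()
    }


def task_lists_to_scale(records, threshold=0.95):
    """Task lists whose saturation has reached the threshold."""
    usage = usage_by_task_list(records)
    return sorted(tl for tl, ratio in usage.items() if ratio >= threshold)
```

Unlike the SWF backlog, this gives a signal before the workers are 100% busy, and it does not require knowing the task lists in advance.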

Use case 2: monitor worker deployments / instances stop / instances start

We could get a map of all our processes with the instances stopping and the version of the code being used. As a side effect we could even monitor some worker stop problems: if the supervisor process reports being in "stopping" state from time T1 and new workers boot after T1, there's a problem. (I doubt we will ever have the time/energy to implement this, just thinking out loud.)
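The T1 check could look roughly like this, assuming the supervisor record carries a `stopping_since` timestamp and each worker record carries `booted_at` and `supervisor_id`. All of these field names are hypothetical.

```python
def workers_booted_after_stop(supervisor, workers):
    """Return ids of workers that started after their supervisor began stopping."""
    stopping_since = supervisor.get("stopping_since")
    if stopping_since is None:
        # The supervisor is not stopping, so nothing can be wrong here.
        return []
    return [
        worker["id"]
        for worker in workers
        if worker["supervisor_id"] == supervisor["id"]
        and worker["booted_at"] > stopping_since
    ]
```

A non-empty result would mean the supervisor kept spawning workers while it was supposed to be shutting down.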

Use case 3: pilot workers interactively (far future)

The presence of workers is nice, but the same mechanism could also be used to tell workers to stop for instance (next time they update the presence API, they can receive an order). Or change the number of workers for a given task list, whatever. Or kill the current task.
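A sketch of the "receive an order on the next presence update" idea, with an in-memory backend standing in for the shared datastore. The `InMemoryBackend` class, the `"stop"` order and `worker_loop` are made up for the example.

```python
class InMemoryBackend:
    """Hypothetical stand-in for the shared datastore."""

    def __init__(self):
        self.presence = {}  # worker_id -> last reported status
        self.orders = {}    # worker_id -> pending order

    def send_order(self, worker_id, order):
        self.orders[worker_id] = order

    def heartbeat(self, worker_id, status):
        # Updating presence doubles as the moment the worker picks up an order.
        self.presence[worker_id] = status
        return self.orders.pop(worker_id, None)


def worker_loop(backend, worker_id, max_iterations):
    """Run until told to stop; return how many iterations did real work."""
    handled = 0
    for _ in range(max_iterations):
        order = backend.heartbeat(worker_id, "polling")
        if order == "stop":
            backend.heartbeat(worker_id, "stopping")
            break
        handled += 1  # stand-in for polling and processing one task
    return handled
```

The worker never listens on a socket: orders are delivered with at most one heartbeat interval of delay, which keeps the "no server" constraint intact.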

Use case 4: side-channel for listing task lists

As SWF doesn't expose this, well, this could be useful.
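If each presence record includes its task list (an assumption, as is the `task_list` field name), the listing falls out almost for free:

```python
def known_task_lists(presence_records):
    """Distinct task lists that currently have at least one live worker."""
    return sorted(
        {
            record["task_list"]
            for record in presence_records
            if record.get("task_list")
        }
    )
```

Note this only sees task lists that some live worker is polling, not ones with a backlog and no workers.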

jbbarth · 2017-08-18T16:18:25Z

An inspiration for this is https://github.com/mperham/sidekiq. This is done at least partially here
