Latency tradeoff could be better when polling with a long timeout and count > 1 in MySQL and Postgres persistences #142

Open
Jiehong opened this issue Apr 26, 2024 · 0 comments


Describe the bug
If all workers poll for tasks with, for example, count=10 and timeout=5s, very few tasks get polled during periods of low activity.

This is not really a bug, but maybe a tradeoff to clarify / change.

Details
Conductor version: 3.19.0

To Reproduce

  1. Create a workflow with a single task
  2. Start a worker polling for that task with count=2 and timeout=5s
  3. Execute the workflow
  4. Check how long the task stays in the queue
  5. Repeat every 5 seconds

What happens is that the DAO implementations keep trying to collect at least "count" messages from the queue until the timeout has elapsed, even when some messages have already been collected.

MySQL:

while (messages.size() < count && ((System.currentTimeMillis() - start) < timeout)) {

Postgres:

if (messages.size() >= count || ((System.currentTimeMillis() - start) > timeout)) {

Dyno: https://github.com/conductor-oss/conductor/blob/main/redis-persistence/src/main/java/com/netflix/conductor/redis/dao/DynoQueueDAO.java#L96, which relies on https://github.com/Netflix/dyno-queues/blob/dev/dyno-queues-redis/src/main/java/com/netflix/dyno/queues/redis/RedisDynoQueue.java#L343
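
To make the effect concrete, here is a minimal, self-contained sketch of the loop condition quoted above. It is not the actual Conductor code: the class name `GreedyPopSketch`, the in-memory queue, and the 10 ms sleep between empty polls are assumptions made only for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical stand-in for a queue DAO; not Conductor's actual class.
class GreedyPopSketch {

    // Mirrors the quoted condition: keep polling until `count` messages
    // are collected or `timeoutMs` has elapsed, whichever comes first.
    static List<String> pop(ConcurrentLinkedQueue<String> queue, int count, long timeoutMs) {
        long start = System.currentTimeMillis();
        List<String> messages = new ArrayList<>();
        while (messages.size() < count && (System.currentTimeMillis() - start) < timeoutMs) {
            String message = queue.poll();
            if (message != null) {
                messages.add(message);
                continue;
            }
            try {
                Thread.sleep(10); // assumed back-off between empty polls
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return messages;
    }

    public static void main(String[] args) {
        ConcurrentLinkedQueue<String> queue = new ConcurrentLinkedQueue<>();
        queue.add("task-1"); // a single task is already waiting

        long start = System.currentTimeMillis();
        List<String> polled = pop(queue, 2, 500); // count=2, short timeout for the demo
        long elapsed = System.currentTimeMillis() - start;

        // The lone task is only handed back once the timeout has run out.
        System.out.println(polled + " returned after " + elapsed + " ms");
    }
}
```

With count=2 and only one task available, the call holds on to that task for the whole timeout before returning it, which is the latency described in this issue.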

Expected behavior
In case of low activity, the pop method should wait less time and return a task anyway, instead of trying to maximise the number of tasks polled.

Why?

I think we have multiple cases, depending on the number of tasks to poll:

  • high activity: very high number of tasks in the queue, so long polling is barely relevant, and it's easy to poll for count number of tasks. Count is important to achieve high throughput.
  • very low activity: tasks arrive in the queue at a slow rate, so long polling is relevant to improve latency, but waiting for count tasks to be there is waiting for nothing. Long polling is more relevant than count in this case, and a task should be returned earlier.
  • medium activity: tasks arrive in the queue so that maybe 50% of workers get their count filled, and 50% of them get fewer. In this case, some tasks are processed fast, while others wait much longer. With medium activity, a slower but snappier flow would be better, and thus count should matter a bit less than long polling.

As a consequence, I think count should not be such a hard constraint on polling: it matters most when long polling makes less sense (high activity), and should matter less otherwise.

I guess there are multiple ways to fix it:

  • pop until at least 1 task has been collected OR until the time has elapsed
  • pop until at least half of the requested count has been collected OR until the time has elapsed
  • maybe change the strategy depending on the value of count: if count is like 1000, then keep the current behaviour, as the server always expects high load and throughput is preferred. If count is low (like 3), then do it as a best effort, returning up to 3 within the time period (just like the first option).
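
A rough sketch of how these three options could share one loop, assuming a "minimum batch" parameter decides when an early return is allowed. Everything here is hypothetical and only for discussion: the class `EarlyReturnPopSketch`, the helper names, the in-memory queue, the 10 ms sleep, and the `HIGH_THROUGHPUT_COUNT` cut-off of 100 are not part of Conductor.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical sketch of the proposed behaviour; not Conductor's actual code.
class EarlyReturnPopSketch {

    // Assumed cut-off above which throughput is preferred over latency.
    static final int HIGH_THROUGHPUT_COUNT = 100;

    // Option 2: at least half of the requested count, rounded up.
    static int halfOf(int count) {
        return Math.max(1, (count + 1) / 2);
    }

    // Option 3: keep the current behaviour for large counts, best effort otherwise.
    static int adaptive(int count) {
        return count >= HIGH_THROUGHPUT_COUNT ? count : 1;
    }

    // Returns as soon as `minBatch` messages are collected (option 1: minBatch = 1),
    // or when `timeoutMs` has elapsed. Never returns more than `count` messages.
    static List<String> pop(ConcurrentLinkedQueue<String> queue, int count, int minBatch, long timeoutMs) {
        int required = Math.min(minBatch, count);
        long start = System.currentTimeMillis();
        List<String> messages = new ArrayList<>();
        while ((System.currentTimeMillis() - start) < timeoutMs) {
            // Drain whatever is available right now, up to `count`.
            String message;
            while (messages.size() < count && (message = queue.poll()) != null) {
                messages.add(message);
            }
            if (messages.size() >= required) {
                break; // enough for this strategy: stop long polling
            }
            try {
                Thread.sleep(10); // assumed back-off between empty polls
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return messages;
    }

    public static void main(String[] args) {
        ConcurrentLinkedQueue<String> queue = new ConcurrentLinkedQueue<>();
        queue.add("task-1");

        long start = System.currentTimeMillis();
        List<String> polled = pop(queue, 3, adaptive(3), 5000);
        long elapsed = System.currentTimeMillis() - start;
        System.out.println(polled + " returned after " + elapsed + " ms");
    }
}
```

With this shape, long polling still applies when the queue is empty, but a worker asking for count=3 gets the single available task right away instead of holding it for the full timeout.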

What do you think?
