Latency tradeoff could be better when polling with a long timeout and count > 1 in MySQL and Postgres persistences #142

Jiehong · 2024-04-26T08:30:25Z

Describe the bug
If all workers poll for tasks with, for example, count=10 and timeout=5s, tasks are picked up slowly during periods of low activity.

This is not really a bug, but maybe a tradeoff to clarify / change.

Details
Conductor version: 3.19.0

To Reproduce

create a workflow with 1 single task
start a worker polling for that task with count=2 and timeout=5s
execute the workflow
check how long the task stays in the queue
Repeat every 5 seconds

What happens is that the DAO implementations keep trying to pop at least "count" messages from the queue until the timeout has elapsed.

MySQL:

conductor/mysql-persistence/src/main/java/com/netflix/conductor/mysql/dao/MySQLQueueDAO.java

Line 349 in 55268f0

    
           while (messages.size() < count && ((System.currentTimeMillis() - start) < timeout)) {

Postgres:

conductor/postgres-persistence/src/main/java/com/netflix/conductor/postgres/dao/PostgresQueueDAO.java

Line 169 in 55268f0

    
           if (messages.size() >= count || ((System.currentTimeMillis() - start) > timeout)) {

Dyno: https://github.com/conductor-oss/conductor/blob/main/redis-persistence/src/main/java/com/netflix/conductor/redis/dao/DynoQueueDAO.java#L96, which relies on https://github.com/Netflix/dyno-queues/blob/dev/dyno-queues-redis/src/main/java/com/netflix/dyno/queues/redis/RedisDynoQueue.java#L343
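
To illustrate the effect of that loop condition, here is a minimal, self-contained sketch. It is not the Conductor code: the class name `CurrentPopSketch`, the in-memory queue, and the 10 ms back-off are assumptions made for illustration only. Only the loop condition mirrors the MySQL snippet quoted above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

public class CurrentPopSketch {

    /**
     * Hypothetical stand-in for the DAO pop loop: keeps polling until
     * `count` messages are collected or `timeoutMs` has elapsed.
     */
    static List<String> pop(Queue<String> queue, int count, long timeoutMs) {
        long start = System.currentTimeMillis();
        List<String> messages = new ArrayList<>();
        // Same shape as the condition quoted from MySQLQueueDAO.
        while (messages.size() < count && (System.currentTimeMillis() - start) < timeoutMs) {
            String message = queue.poll();
            if (message != null) {
                messages.add(message);
                continue;
            }
            try {
                Thread.sleep(10); // assumed back-off between polls
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
        }
        return messages;
    }

    public static void main(String[] args) {
        Queue<String> queue = new ConcurrentLinkedQueue<>();
        queue.add("task-1"); // only one task is available

        long start = System.currentTimeMillis();
        List<String> polled = pop(queue, 2, 500);
        long elapsed = System.currentTimeMillis() - start;

        System.out.println(polled.size() + " task(s) returned after ~" + elapsed + " ms");
    }
}
```

With a single task in the queue and count=2, the task is collected immediately but only handed back once the timeout has run out, which is the latency described in this issue.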

Expected behavior
In case of low activity, the pop method should wait less and return whatever tasks are available, instead of trying to maximise the number of tasks polled.

Why?

I think we have multiple cases, depending on the number of tasks to poll:

high activity: very high number of tasks in the queue, so long polling is barely relevant, and it's easy to poll for count number of tasks. Count is important to achieve high throughput.
very low activity: tasks arrive in the queue at a slow rate, so long polling is relevant to improve latency, but waiting for count tasks to be there is waiting for nothing. Long polling is more relevant than count in this case, and a task should be returned earlier.
medium activity: tasks arrive in the queue so that maybe 50% of workers get their count filled, and 50% of them get less. In this case, we get some tasks being processed fast, while others wait much longer. With medium activity, a slower but snappier flow would be better, and thus count should matter a bit less than long polling.

As a consequence, I think count should not be such a hard constraint on polling; it should matter most when long polling makes less sense.

I guess there are multiple ways to fix it:

pop until at least 1 message has been polled OR until the time has elapsed
pop until at least half of the requested count has been polled OR until the time has elapsed.
maybe change the strategy depending on the value of count: if count is something like 1000, then keep the current behaviour, as the server always expects high load and throughput is preferred. If count is low (like 3), then do it as a best effort, returning up to 3 within the time period (just like the first option).
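
The three options above can be sketched with a single "minimum count" parameter. This is only an illustration under stated assumptions: `MinCountPopSketch`, `minCountFor`, the threshold value, the in-memory queue, and the 10 ms back-off are all hypothetical and not part of Conductor.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

public class MinCountPopSketch {

    /**
     * Third option: keep the strict behaviour for large counts, otherwise
     * return as soon as one message is available. The threshold is an
     * assumed tuning knob, not an existing Conductor setting.
     */
    static int minCountFor(int count, int highThroughputThreshold) {
        return count >= highThroughputThreshold ? count : 1;
    }

    /**
     * Returns up to `count` messages, and stops waiting as soon as at least
     * `minCount` messages were collected or `timeoutMs` has elapsed.
     * minCount = 1 is the first option, minCount = (count + 1) / 2 the second.
     */
    static List<String> pop(Queue<String> queue, int count, int minCount, long timeoutMs) {
        long start = System.currentTimeMillis();
        List<String> messages = new ArrayList<>();
        while (true) {
            // Drain whatever is available right now, up to count.
            String message;
            while (messages.size() < count && (message = queue.poll()) != null) {
                messages.add(message);
            }
            boolean enough = messages.size() >= minCount;
            boolean timedOut = (System.currentTimeMillis() - start) >= timeoutMs;
            if (enough || timedOut) {
                return messages;
            }
            try {
                Thread.sleep(10); // assumed back-off between polls
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return messages;
            }
        }
    }

    public static void main(String[] args) {
        Queue<String> queue = new ConcurrentLinkedQueue<>();
        queue.add("task-1");

        long start = System.currentTimeMillis();
        List<String> polled = pop(queue, 3, minCountFor(3, 100), 5000);
        long elapsed = System.currentTimeMillis() - start;

        System.out.println(polled.size() + " task(s) returned after ~" + elapsed + " ms");
    }
}
```

With minCount=1 the long-poll timeout only applies while the queue is empty, so a lone task is returned right away, while a busy queue still fills up to count in a single call.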

What do you think?
