Expected behavior
In case of low activity, the pop method should wait less time and return a task anyway, instead of trying to maximise the number of tasks polled.
Why?
I think we have multiple cases, depending on the number of tasks to poll:
high activity: the queue holds a very large number of tasks, so long polling is barely relevant and it is easy to poll count tasks. Count is important to achieve high throughput.
very low activity: tasks arrive in the queue at a slow rate, so long polling is relevant to improve latency, but waiting for count tasks to arrive is waiting for nothing. Long polling matters more than count in this case, and a task should be returned earlier.
medium activity: tasks arrive at a rate where maybe 50% of workers get their count filled and the other 50% get fewer. In this case some tasks are processed fast while others wait much longer. With medium activity, a flow with lower throughput but lower latency would be better, so count should matter a bit less than long polling.
As a consequence, I think count should not be such a hard constraint on polling; its impact should matter more when long polling makes less sense.
I guess there are multiple ways to fix it:
1. pop until the number of polled tasks is 1 or more OR until the time has elapsed.
2. pop until the number of polled tasks is at least half of the requested count OR until the time has elapsed.
3. maybe change the strategy depending on the value of count: if count is something like 1000, keep the current behaviour, as the server expects high load and throughput is preferred. If count is low (like 3), do it as a best effort and return up to 3 tasks within the time period (just like option 1).
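The options above can be sketched with a single extra parameter. This is only an illustration, not Conductor's code: the class `MinCountPopSketch`, the `minCount` parameter, the `minCountFor` helper, its threshold, and the 10 ms retry interval are all hypothetical names and values chosen for the example. Option 1 corresponds to `minCount = 1`, option 2 to `minCount = count / 2` (rounded up), and option 3 picks between the current behaviour and option 1 based on `count`.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical sketch; none of these names exist in Conductor.
class MinCountPopSketch {

    // Returns as soon as at least minCount messages were polled,
    // or when the timeout has elapsed, whichever comes first.
    static List<String> pop(Queue<String> queue, int count, int minCount, long timeoutMs) {
        List<String> polled = new ArrayList<>();
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (true) {
            // Drain what is available right now, never more than count.
            while (polled.size() < count) {
                String message = queue.poll();
                if (message == null) {
                    break;
                }
                polled.add(message);
            }
            if (polled.size() >= minCount || System.currentTimeMillis() >= deadline) {
                return polled;
            }
            try {
                Thread.sleep(10); // assumed retry interval
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return polled;
            }
        }
    }

    // Option 3: large counts keep the "fill the batch" behaviour,
    // small counts fall back to best effort (return after the first task).
    static int minCountFor(int count, int highThroughputThreshold) {
        return count >= highThroughputThreshold ? count : 1;
    }

    public static void main(String[] args) {
        Queue<String> queue = new ConcurrentLinkedQueue<>(List.of("task-1"));
        // Option 1: count=10, but a single available task is returned immediately.
        System.out.println(pop(queue, 10, 1, 5000));
    }
}
```

With `minCount` equal to `count`, this sketch behaves like the "fill the batch or time out" loop described in this issue, so the parameter only changes behaviour when it is set lower.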
What do you think?
Describe the bug
If all workers poll for tasks with count=10, for example, and timeout=5s, very few tasks get polled during periods of low activity.
This is not really a bug, but maybe a tradeoff to clarify / change.
Details
Conductor version: 3.19.0
To Reproduce
What happens is that the DAO implementations keep trying to get at least "count" messages from the queue until the timeout has elapsed.
MySQL: conductor/mysql-persistence/src/main/java/com/netflix/conductor/mysql/dao/MySQLQueueDAO.java (line 349 in 55268f0)
Postgres: conductor/postgres-persistence/src/main/java/com/netflix/conductor/postgres/dao/PostgresQueueDAO.java (line 169 in 55268f0)
Dyno: https://github.com/conductor-oss/conductor/blob/main/redis-persistence/src/main/java/com/netflix/conductor/redis/dao/DynoQueueDAO.java#L96, which relies on https://github.com/Netflix/dyno-queues/blob/dev/dyno-queues-redis/src/main/java/com/netflix/dyno/queues/redis/RedisDynoQueue.java#L343
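To make the reported behaviour concrete, here is a minimal sketch of the loop as I understand it from the description above. It is not copied from any of the linked DAOs: the class `FillOrTimeoutPopSketch`, the in-memory queue, and the 10 ms retry interval are assumptions made for the example. The point is that with one task available and count=10, the caller only gets that task back once the whole timeout has elapsed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical stand-in for a queue DAO pop; not the actual Conductor implementation.
class FillOrTimeoutPopSketch {

    // Keeps polling until count messages were collected or the timeout has elapsed.
    static List<String> pop(Queue<String> queue, int count, long timeoutMs) {
        List<String> polled = new ArrayList<>();
        long deadline = System.currentTimeMillis() + timeoutMs;
        while (true) {
            // Take whatever is available, up to the requested count.
            while (polled.size() < count) {
                String message = queue.poll();
                if (message == null) {
                    break;
                }
                polled.add(message);
            }
            // A partially filled batch is held back until the deadline.
            if (polled.size() >= count || System.currentTimeMillis() >= deadline) {
                return polled;
            }
            try {
                Thread.sleep(10); // assumed retry interval
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return polled;
            }
        }
    }

    public static void main(String[] args) {
        Queue<String> queue = new ConcurrentLinkedQueue<>(List.of("task-1"));
        long start = System.currentTimeMillis();
        List<String> tasks = pop(queue, 10, 500);
        long elapsed = System.currentTimeMillis() - start;
        System.out.println(tasks.size() + " task(s) returned after " + elapsed + " ms");
    }
}
```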