two parts of scaling in happen at different times, leading to delays in task execution #855

benclifford · 2022-07-22T10:47:08Z

Describe the bug

I have noticed this happening at endpoint startup: one part of the code scales in an initially launched block, but another part of the code does not realise the block is gone until several minutes later, when heartbeat timeouts happen.

In the period between those two events, no new block is launched to run submitted tasks, and instead they sit delayed until the later realisation that the block is gone.
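The suspected interaction can be sketched with a toy model. This is not the funcx_endpoint code: `ToyInterchange`, its methods, and the 120 second timeout are hypothetical names and values chosen only to illustrate how a block can be cancelled at once while its manager record lingers until heartbeats time out, and it assumes the scaling decision looks at registered managers.

```python
from dataclasses import dataclass, field

# Assumed value, picked to resemble the roughly two minute gap seen in the logs.
HEARTBEAT_TIMEOUT_S = 120.0


@dataclass
class ToyInterchange:
    """Hypothetical stand-in for the interchange, for illustration only."""

    blocks: set = field(default_factory=set)      # blocks the provider is running
    managers: dict = field(default_factory=dict)  # manager id -> last heartbeat time
    pending_tasks: int = 0

    def scale_in(self, block_id):
        # First part: the block is cancelled at the provider immediately,
        # but the manager that lived in it stays registered.
        self.blocks.discard(block_id)

    def expire_managers(self, now):
        # Second part: a manager is only dropped once its heartbeats time out.
        for manager_id, last_seen in list(self.managers.items()):
            if now - last_seen > HEARTBEAT_TIMEOUT_S:
                del self.managers[manager_id]

    def wants_scale_out(self):
        # Assumption: scaling out is skipped while any manager looks alive.
        return self.pending_tasks > 0 and not self.managers


ix = ToyInterchange(blocks={"1"}, managers={"431fcad26ccc": 0.0})
ix.scale_in("1")       # block is gone at t=1
ix.pending_tasks = 1   # a task arrives right afterwards

ix.expire_managers(now=2.0)
stalled = ix.wants_scale_out()     # stale manager still registered

ix.expire_managers(now=121.0)
recovered = ix.wants_scale_out()   # manager finally unregistered

print(stalled, recovered)
```

In this model the task waits for the whole timeout window. Removing the manager record inside `scale_in` would close the gap in the toy model; whether the equivalent change is right for the real interchange is not established here.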

Here are some logs I added:


1658486467.576714 2022-07-22 12:41:07 INFO Executor-Interchange-61302 MainThread-139775182862144 funcx_endpoint.executors.high_throughput.interchange:930 start Got container switch count: {b'431fcad26ccc': 0}
1658486468.053865 2022-07-22 12:41:08 INFO Executor-Interchange-61302 Base-Strategy-139775095412480 funcx_endpoint.executors.high_throughput.interchange:1139 scale_in Scale in BENC
1658486468.054336 2022-07-22 12:41:08 INFO Executor-Interchange-61302 Base-Strategy-139775095412480 funcx_endpoint.executors.high_throughput.interchange:1168 scale_in BENC: scale in by count of 1 blocks
1658486468.054443 2022-07-22 12:41:08 INFO Executor-Interchange-61302 Base-Strategy-139775095412480 funcx_endpoint.executors.high_throughput.interchange:1174 scale_in BENC: sending hold block to block 1
1658486468.054546 2022-07-22 12:41:08 INFO Executor-Interchange-61302 Base-Strategy-139775095412480 funcx_endpoint.executors.high_throughput.interchange:564 hold_manager BENC: hold_manager that doesn't actually hold a manager
1658486468.054690 2022-07-22 12:41:08 WARNING Executor-Interchange-61302 Base-Strategy-139775095412480 funcx_endpoint.executors.high_throughput.interchange:1181 scale_in BENC: provider cancel 3 - forcibly killing block



1658486587.891700 2022-07-22 12:43:07 WARNING Executor-Interchange-61302 MainThread-139775182862144 funcx_endpoint.executors.high_throughput.interchange:998 start Too many heartbeats missed for manager b'431fcad26ccc'
1658486587.892121 2022-07-22 12:43:07 WARNING Executor-Interchange-61302 MainThread-139775182862144 funcx_endpoint.executors.high_throughput.interchange:1015 start Sent 0 failure reports, unregistering manager b'431fcad26ccc'

Note the two-minute delay, which I have indicated with blank lines.
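The size of that gap can be checked from the epoch timestamps at the start of each log line. The snippet below is a small sketch that copies the last scale-in line and the first heartbeat warning from the logs above; the helper name `log_epoch` is made up for this example.

```python
def log_epoch(line: str) -> float:
    """Return the leading epoch timestamp of a log line as a float."""
    return float(line.split(maxsplit=1)[0])


# Last line of the scale-in burst and first line after the gap, copied from the logs.
forced_kill = (
    "1658486468.054690 2022-07-22 12:41:08 WARNING Executor-Interchange-61302 "
    "Base-Strategy-139775095412480 funcx_endpoint.executors.high_throughput.interchange:1181 "
    "scale_in BENC: provider cancel 3 - forcibly killing block"
)
heartbeat_missed = (
    "1658486587.891700 2022-07-22 12:43:07 WARNING Executor-Interchange-61302 "
    "MainThread-139775182862144 funcx_endpoint.executors.high_throughput.interchange:998 "
    "start Too many heartbeats missed for manager b'431fcad26ccc'"
)

gap_seconds = log_epoch(heartbeat_missed) - log_epoch(forced_kill)
print(f"{gap_seconds:.1f} seconds between block kill and manager unregistration")
```

This gives a gap of about 119.8 seconds, which matches the two minutes between 12:41:08 and 12:43:07 in the human-readable timestamps.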

To Reproduce
Launch an endpoint, let the initial block be shut down, and then immediately send a task to that endpoint. You should see a delay of several minutes before a new block is launched and the task is run.

Expected behavior
Scaling up to run the submitted task should happen immediately.

Environment
Distributed Environment
my dev environment, hacked main a9d70f1

benclifford added the bug label Jul 22, 2022

benclifford changed the title ~~two parts of scaling in happen at different times, leading to delays~~ two parts of scaling in happen at different times, leading to delays in task execution Jul 22, 2022
