two parts of scaling in happen at different times, leading to delays in task execution #855

Open
benclifford opened this issue Jul 22, 2022 · 0 comments
Labels
bug Something isn't working

Comments

@benclifford
Contributor

Describe the bug

I have noticed this happening at endpoint startup: one part of the code scales in an initially launched block, but another part of the code does not realise it is gone until several minutes later, when timeouts happen.

In the period between those two events, no new block is launched to run submitted tasks, and instead they sit delayed until the later realisation that the block is gone.
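To make the gap concrete, here is a minimal toy model of the behaviour described above. It is only a sketch: `ToyInterchange`, its methods, and the 120-second timeout are hypothetical names and values chosen for illustration, not funcx_endpoint code, and the assumption that the scale-out decision looks only at the manager table is mine.

```python
# Toy model only: hypothetical classes, not the funcx_endpoint implementation.
from dataclasses import dataclass, field


@dataclass
class ToyInterchange:
    heartbeat_timeout: float = 120.0  # assumed value, close to the gap in the logs
    managers: dict = field(default_factory=dict)  # manager id -> last heartbeat time
    blocks: set = field(default_factory=set)

    def launch_block(self, block_id, manager_id, now):
        self.blocks.add(block_id)
        self.managers[manager_id] = now

    def scale_in(self, block_id):
        # Part 1: the block is cancelled, but the manager record is left untouched.
        self.blocks.discard(block_id)

    def expire_managers(self, now):
        # Part 2: a manager is dropped only once its heartbeats have timed out.
        for manager_id, last_seen in list(self.managers.items()):
            if now - last_seen > self.heartbeat_timeout:
                del self.managers[manager_id]

    def needs_scale_out(self, pending_tasks):
        # Assumption: the decision trusts the manager table, so a stale entry
        # looks like available capacity.
        return pending_tasks > 0 and not self.managers


ix = ToyInterchange()
ix.launch_block("block-1", "manager-a", now=0.0)
ix.scale_in("block-1")

ix.expire_managers(now=10.0)
stale_decision = ix.needs_scale_out(pending_tasks=1)  # manager still listed

ix.expire_managers(now=121.0)
late_decision = ix.needs_scale_out(pending_tasks=1)  # manager finally dropped

print(stale_decision, late_decision)
```

In this model the submitted task waits from the scale-in until the heartbeat timeout, even though the block has been gone the whole time.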

Here are some logs I added:


1658486467.576714 2022-07-22 12:41:07 INFO Executor-Interchange-61302 MainThread-139775182862144 funcx_endpoint.executors.high_throughput.interchange:930 start Got container switch count: {b'431fcad26ccc': 0}
1658486468.053865 2022-07-22 12:41:08 INFO Executor-Interchange-61302 Base-Strategy-139775095412480 funcx_endpoint.executors.high_throughput.interchange:1139 scale_in Scale in BENC
1658486468.054336 2022-07-22 12:41:08 INFO Executor-Interchange-61302 Base-Strategy-139775095412480 funcx_endpoint.executors.high_throughput.interchange:1168 scale_in BENC: scale in by count of 1 blocks
1658486468.054443 2022-07-22 12:41:08 INFO Executor-Interchange-61302 Base-Strategy-139775095412480 funcx_endpoint.executors.high_throughput.interchange:1174 scale_in BENC: sending hold block to block 1
1658486468.054546 2022-07-22 12:41:08 INFO Executor-Interchange-61302 Base-Strategy-139775095412480 funcx_endpoint.executors.high_throughput.interchange:564 hold_manager BENC: hold_manager that doesn't actually hold a manager
1658486468.054690 2022-07-22 12:41:08 WARNING Executor-Interchange-61302 Base-Strategy-139775095412480 funcx_endpoint.executors.high_throughput.interchange:1181 scale_in BENC: provider cancel 3 - forcibly killing block



1658486587.891700 2022-07-22 12:43:07 WARNING Executor-Interchange-61302 MainThread-139775182862144 funcx_endpoint.executors.high_throughput.interchange:998 start Too many heartbeats missed for manager b'431fcad26ccc'
1658486587.892121 2022-07-22 12:43:07 WARNING Executor-Interchange-61302 MainThread-139775182862144 funcx_endpoint.executors.high_throughput.interchange:1015 start Sent 0 failure reports, unregistering manager b'431fcad26ccc'

Note the two-minute delay, which I have indicated with blank lines.
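As a quick check of that gap, the epoch timestamps in the first field of the log lines above can be subtracted directly. The variable names are mine; the two values are copied from the "forcibly killing block" line and the "Too many heartbeats missed" line.

```python
# Epoch timestamps copied from the first field of two log lines above.
forced_kill = 1658486468.054690       # "provider cancel 3 - forcibly killing block"
heartbeat_expiry = 1658486587.891700  # "Too many heartbeats missed for manager"

gap_seconds = heartbeat_expiry - forced_kill
print(f"gap: {gap_seconds:.1f} s")  # gap: 119.8 s
```

So just under two minutes pass between the block being killed and the manager being unregistered.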

To Reproduce
Launch an endpoint, let the initial block be shut down, and then immediately send a task to that endpoint. You should see a delay of several minutes before a new block is launched and the task is run.

Expected behavior
Scaling up to run the submitted task should happen immediately.

Environment
Distributed Environment
my dev environment, hacked main a9d70f1

@benclifford benclifford added the bug Something isn't working label Jul 22, 2022
@benclifford benclifford changed the title from "two parts of scaling in happen at different times, leading to delays" to "two parts of scaling in happen at different times, leading to delays in task execution" Jul 22, 2022