~30s of request queuing when promoting canary to production #768

Open
victorlin opened this issue Dec 18, 2023 · 2 comments
Labels
bug Something isn't working

Comments


victorlin commented Dec 18, 2023

Recently, I've noticed that promoting canary to production prevents nextstrain.org from loading for a short but noticeable amount of time.

With the latest promotion of 24ba9ee (nextstrain-server v894 → v895), I paid extra attention to this. Here is a breakdown of the time it took to load https://nextstrain.org on a web browser in two scenarios. The requests took ~30 seconds and were initiated about 10 seconds after the promotion completed successfully, meaning the total downtime was around 40 seconds:

[Screenshot: load times for the two scenarios]

The issue title says "local" downtime because I'm not sure if it's just my connection or if this can be observed by everyone.

victorlin added the bug label on Dec 18, 2023
victorlin commented:

The automated build of 24ba9ee on canary showed this warning (GitHub, Heroku), which may be related:

Warning: Your slug size (313 MB) exceeds our soft limit (300 MB) which may affect boot time.
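If slug size does turn out to affect boot time, one possible mitigation (a sketch only; the paths below are hypothetical and not taken from the nextstrain.org repository) is a .slugignore file at the repo root, which Heroku applies after the build to exclude matching files from the compiled slug:

```
docs/
test/
*.md
```

Because .slugignore is applied after the buildpack runs, it only helps with files that aren't needed at runtime, and it doesn't support negated patterns.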


tsibley commented Jan 10, 2024

I've noticed this and believe it's due to how Heroku's routing layer switches things over a bit early when cutting between the old dynos and new dynos. I wouldn't call it downtime, though. There's a short period of time when new requests will queue up waiting for the new dyno to be ready and take longer to get a response, but no requests should fail.

I haven't looked into minimizing that time; slug size might be implicated, or our code's own startup time. I also wonder if we could have Heroku's routing layer hold off on directing requests to the new dyno until an app-level health check passes (as opposed to the dyno-level check it seems to use now).
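On the app-level readiness idea: Heroku's router starts sending requests to a web dyno as soon as the process binds to $PORT, so one way to approximate an app-level check (a minimal sketch, not the nextstrain.org server's actual code; loadResources and /healthz are hypothetical names) is to finish startup work before calling listen(), and expose a lightweight endpoint that a deploy-time smoke test could poll:

```typescript
// Minimal sketch, assuming an Express-based Node.js web process.
import express from "express";

const app = express();

// Hypothetical startup work (warming caches, loading route tables, etc.).
async function loadResources(): Promise<void> {
  // ...expensive initialization here...
}

// Lightweight readiness endpoint that a smoke test could poll after deploy.
app.get("/healthz", (_req, res) => res.status(200).send("ok"));

async function main() {
  // Defer binding to $PORT until startup work is done, since Heroku's router
  // begins sending requests as soon as the port is bound.
  await loadResources();
  app.listen(Number(process.env.PORT ?? 3000), () => {
    console.log("ready to serve requests");
  });
}

main().catch((err) => {
  console.error(err);
  process.exit(1);
});
```

Separately, Heroku's preboot feature (heroku features:enable preboot) keeps the old dynos serving while the new ones start, which might shrink the queuing window, though the switchover is still based on dyno readiness and a fixed delay rather than an app-level health check.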

tsibley changed the title from "~30s local downtime when promoting canary to production" to "~30s of request queuing when promoting canary to production" on Jan 10, 2024