ApiListener#AddConnection(): start connect timeout after DNS resolution #9982
base: master
Conversation
ASIO DNS resolution can't be cancelled. If it takes longer than ApiListener#connect_timeout, the latter becomes a no-op: it cancels socket I/O, which doesn't even start until DNS resolution has finished.
This PR improves that: if the timeout triggers, the cancellation now happens more reliably. The trade-off is that it becomes less predictable when the timeout is supposed to trigger.
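Roughly, the change boils down to the shape below (a minimal sketch with illustrative names and simplified lifetime handling, not the actual patch): resolve first, and only then arm the connect timeout, so the timer measures connecting rather than name resolution.

```cpp
#include <boost/asio.hpp>
#include <boost/asio/spawn.hpp>
#include <string>

namespace asio = boost::asio;
using asio::ip::tcp;

// Illustrative only: resolve first, then start the timeout, then connect.
void ConnectWithTimeout(asio::io_context& io, const std::string& host,
	const std::string& port, long connectTimeoutMs, asio::yield_context yc)
{
	tcp::resolver resolver(io);
	// Deliberately not covered by the timeout: cancel() wouldn't stop it anyway.
	auto endpoints = resolver.async_resolve(host, port, yc);

	tcp::socket socket(io);
	asio::deadline_timer timer(io);
	timer.expires_from_now(boost::posix_time::milliseconds(connectTimeoutMs));
	timer.async_wait([&socket](boost::system::error_code ec) {
		// Cancelling socket I/O works, unlike cancelling name resolution.
		if (!ec)
			socket.close();
	});

	boost::system::error_code ec;
	asio::async_connect(socket, endpoints, yc[ec]);
	timer.cancel();
	// ec now reflects either a real connect error or the timeout-triggered
	// close. Real code must keep the socket alive until the timer handler has
	// run (e.g. via shared_ptr), which this sketch glosses over.
}
```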
Please also give some context and references for this on GitHub, so that it's easier to follow for anyone just looking at this PR (which might even be one of us in a year again). Yes, we will likely have to work around Asio limitations here in some way or another. I'm not sure if this is the best approach, but if we keep it, there should at least be some indication in the logs that something about name resolution is messed up if it takes too long.
To say it with Churchill: it's a horrible step, but all others found so far are even more horrible. At least it's better than resolving in a fork(2)ed process and killing the latter on timeout. https://lists.boost.org/Archives/boost/2021/08/251914.php
Not entirely: the connect_timeout now restricts how long it can take to connect to Y, not how long it can take to resolve X to Y. (A bit of eyewash, I know, but technically the truth.)
https://www.boost.org/doc/libs/1_66_0/doc/html/boost_asio/reference/ip__basic_resolver/cancel.html says cancel() "forces the completion of any pending asynchronous operations", but what does pending mean? As f00d131 showed, it doesn't mean ongoing, as such operations aren't cancelled. Right now, async_resolve() starts a new background thread and effectively runs resolve() there, via libc. The latter was never designed to be cancelled.
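For illustration, a small standalone program along these lines shows that behavior (a sketch; the exact outcome depends on timing, and example.com is just a placeholder):

```cpp
#include <boost/asio.hpp>
#include <chrono>
#include <iostream>
#include <thread>

int main()
{
	boost::asio::io_context io;
	boost::asio::ip::tcp::resolver resolver(io);

	resolver.async_resolve("example.com", "443",
		[](boost::system::error_code ec,
		   boost::asio::ip::tcp::resolver::results_type) {
			// Typically completes with the real result (or a libc timeout
			// error), not operation_aborted, despite the cancel() below.
			std::cout << "resolve finished: " << ec.message() << "\n";
		});

	// Give the resolver's worker thread time to enter getaddrinfo(), so the
	// query is ongoing rather than merely pending.
	std::this_thread::sleep_for(std::chrono::milliseconds(100));
	resolver.cancel();

	io.run();
}
```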
ASIO uses libc under the hood, which times out by itself. That results in an error, which is logged, as shown by my tests (OP).
That one suggests there's at most one name resolution going on at any given time. Is this indeed the case, and wouldn't that mean that the timeouts just add up if there are multiple requests? So what happens if there are many endpoints being connected to? So calling …
Another option to look into could be to perform the name resolution in a separate coroutine that the current connect coroutine waits for, with a timeout we can control. If that new coroutine receives a response/error at a later time, it would discard it and probably log something about name resolution being funky. That should at least give consistent timeout behavior for the reconnect attempts. The challenging part would be dealing with the case where a timed-out resolution could block another resolution from even being attempted within its timeout, and providing helpful logging in that case.
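One way that pattern could look (assumed names and a single-threaded io_context; not actual Icinga 2 code): the resolver coroutine cancels a timer when it finishes, and the waiting coroutine treats an expired timer as its own timeout.

```cpp
#include <boost/asio.hpp>
#include <boost/asio/spawn.hpp>
#include <iostream>
#include <memory>
#include <string>

namespace asio = boost::asio;
using asio::ip::tcp;

struct ResolveState {
	asio::deadline_timer Timer;
	boost::system::error_code Ec;
	tcp::resolver::results_type Results;
	bool Done = false, Abandoned = false;

	explicit ResolveState(asio::io_context& io) : Timer(io) {}
};

void ResolveWithOwnTimeout(asio::io_context& io, const std::string& host,
	const std::string& port, asio::yield_context yc)
{
	auto state = std::make_shared<ResolveState>(io);
	state->Timer.expires_from_now(boost::posix_time::seconds(15));

	// The resolution runs in its own coroutine...
	asio::spawn(io, [&io, state, host, port](asio::yield_context yc) {
		tcp::resolver resolver(io);
		boost::system::error_code ec;
		auto results = resolver.async_resolve(host, port, yc[ec]);

		if (state->Abandoned) {
			// The waiter gave up; discard the late response and log it.
			std::cerr << "Late DNS response for " << host << ": "
				<< ec.message() << "\n";
			return;
		}

		state->Ec = ec;
		state->Results = std::move(results);
		state->Done = true;
		state->Timer.cancel(); // wake the waiting coroutine early
	});

	// ... while this coroutine waits on a timer we fully control.
	boost::system::error_code timerEc;
	state->Timer.async_wait(yc[timerEc]);

	if (!state->Done) {
		// Our timeout fired. The resolver coroutine keeps running in the
		// background (it can't be cancelled), but we move on without it.
		state->Abandoned = true;
		return;
	}

	// state->Ec and state->Results are usable here.
}
```

Note that with a single-threaded io_context the inner coroutine can only start once the outer one suspends in async_wait(), so the cancel() can't be lost.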
Even though we create a new resolver each time, the resolutions happen strictly sequentially, even for 1000+ endpoints:
Wouldn't help with anything. There's one query per resolver, which is not cancelled as it's ongoing.
Exactly! Timing-out ASIO resolves always occupy resources for whole slots of time. Unless we actually follow the advice from https://lists.boost.org/Archives/boost/2021/08/251914.php and build something like our process spawner, but for DNS: forked from the umbrella/main process ASAP, forking by itself for each query (except pure IPs, which the main process "resolves" by itself?). Timed-out queries would be cleaned up by kill(2). OK?
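In the rough shape below (all names hypothetical, result transport omitted), the child blocks in getaddrinfo() while the parent enforces the deadline and kill(2)s the child on timeout; such a helper would have to be forked before the main process spawns threads.

```cpp
#include <netdb.h>
#include <signal.h>
#include <sys/wait.h>
#include <unistd.h>

// Returns true if the child resolved the name within timeoutSeconds.
// A real implementation would ship the addresses back over a pipe.
bool ResolveInChild(const char* host, const char* port, int timeoutSeconds)
{
	pid_t child = fork();

	if (child < 0)
		return false;

	if (child == 0) {
		// Child: block in libc's resolver; nobody needs to cancel it.
		addrinfo* result = nullptr;
		int rc = getaddrinfo(host, port, nullptr, &result);
		_exit(rc == 0 ? 0 : 1);
	}

	// Parent: poll for completion until the deadline.
	for (int waited = 0; waited < timeoutSeconds * 10; ++waited) {
		int status;
		if (waitpid(child, &status, WNOHANG) == child)
			return WIFEXITED(status) && WEXITSTATUS(status) == 0;
		usleep(100 * 1000); // real code would use SIGCHLD or a pipe instead
	}

	// Timeout: SIGKILL frees everything immediately, no lingering thread.
	kill(child, SIGKILL);
	waitpid(child, nullptr, 0);
	return false;
}
```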
Yes, it should help with something: logging. In your example logs below, if you search for …
That would force the cancellation behavior one would expect, but apart from that, would this be much better than abandoning connection attempts after …
No, I don't think so. We make a new resolver for every query and then start that query. As my tests have shown, f00d131 (i.e. cancelling the resolver) does literally nothing to that query. And the others use their own resolvers.
If it's that broken, sure. But at least the log messages would arrive in time, not one every 20s. I mean, just look at my logs above. This is insane! Also, if it's not completely broken, queries would take their X seconds to succeed without blocking or timing out each other. In addition, processes can be kill(2)ed with SIGKILL and disappear immediately, wasting no resources. If we also resolve IPs in the main process with AI_NUMERICHOST|AI_NUMERICSERV (which never actually blocks, I guess), customers with Endpoint#host=IP won't even notice anything. This is our new best option.
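For the numeric fast path, something like the sketch below would do (the function name is made up): with AI_NUMERICHOST|AI_NUMERICSERV, getaddrinfo() fails immediately with EAI_NONAME for anything that would need real DNS, so IP literals can be "resolved" in the main process without ever blocking.

```cpp
#include <netdb.h>
#include <sys/socket.h>
#include <cstring>

// Returns true (and fills *result) iff host/port are already numeric.
bool TryResolveNumeric(const char* host, const char* port, addrinfo** result)
{
	addrinfo hints;
	std::memset(&hints, 0, sizeof(hints));
	hints.ai_family = AF_UNSPEC;
	hints.ai_socktype = SOCK_STREAM;
	hints.ai_flags = AI_NUMERICHOST | AI_NUMERICSERV;

	// Never queries DNS: either host is an IP literal, or this fails at once.
	return getaddrinfo(host, port, &hints, result) == 0;
}
```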
What exactly have you tested? If you do …
But that's exactly the logging you would get at the moment if the error was something like "connection refused". If a connection is delayed for a long time with absolutely no indication why, that's only confusing.
Nothing. OK, technically it cancels "any asynchronous operations that are waiting on the resolver". But there are none, as there's only one query per resolver.
I've separated performing the resolution from the connection; I can also separate their logging to reduce confusion if you wish: "Resolving 'kli.mov:5665'. (May take a long time.)"
Before
👎 Timeout "fires" after 15s without any effect; DNS times out by itself after another 5s.
After
👍 The timeout doesn't even start if DNS fails.
👍 The timeout starts if DNS succeeds and fires immediately once due.