ApiListener#AddConnection(): start connect timeout after DNS resolution #9982

Open · Al2Klimov wants to merge 4 commits into master

Conversation

Al2Klimov (Member)

ASIO DNS resolution can't be cancelled. If it takes longer than ApiListener#connect_timeout, the latter becomes a no-op: it only cancels socket I/O, which doesn't even start until DNS resolution has finished.

Before

[2024-01-25 11:50:57 +0000] information/ApiListener: Reconnecting to endpoint 'aklimov-i2all-master-2' via host 'al2klimov.hopto.org' and port '5665'
[2024-01-25 11:51:12 +0000] critical/ApiListener: Timeout while reconnecting to endpoint 'aklimov-i2all-master-2' via host 'al2klimov.hopto.org' and port '5665', cancelling attempt
[2024-01-25 11:51:17 +0000] critical/ApiListener: Cannot connect to host 'al2klimov.hopto.org' on port '5665': Host not found (authoritative)

👎 The timeout "fires" after 15s without any effect; DNS times out by itself after another 5s.

After

[2024-01-26 11:56:16 +0000] information/ApiListener: Reconnecting to endpoint 'aklimov-i2all-master-2' via host 'al2klimov.hopto.org' and port '5665'
[2024-01-26 11:56:16 +0000] information/IcingaDB: Pending queries: 0 (Input: 1/s; Output: 1/s)
[2024-01-26 11:56:26 +0000] information/IdoMysqlConnection: 'ido-mysql' resumed.
[2024-01-26 11:56:26 +0000] information/DbConnection: Resuming IDO connection: ido-mysql
[2024-01-26 11:56:26 +0000] information/InfluxdbWriter: 'influxdb' resumed.
[2024-01-26 11:56:26 +0000] information/IdoMysqlConnection: Last update by endpoint 'aklimov-i2all-master-2' was 8.0741s ago (< failover timeout of 30s). Retrying.
[2024-01-26 11:56:36 +0000] critical/ApiListener: Cannot connect to host 'al2klimov.hopto.org' on port '5665': Host not found (authoritative)
[2024-01-26 11:56:36 +0000] information/WorkQueue: #7 (ApiListener, RelayQueue) items: 5, rate: 196.333/s (11780/min 11780/5min 11780/15min);
[2024-01-26 11:56:37 +0000] information/ApiListener: Reconnecting to endpoint 'aklimov-i2all-master-2' via host 'al2klimov.hopto.org' and port '5665'
[2024-01-26 11:56:37 +0000] information/IdoMysqlConnection: Last update by endpoint 'aklimov-i2all-master-2' was 9.16835s ago (< failover timeout of 30s). Retrying.
[2024-01-26 11:56:47 +0000] information/IdoMysqlConnection: Last update by endpoint 'aklimov-i2all-master-2' was 9.16004s ago (< failover timeout of 30s). Retrying.
[2024-01-26 11:56:57 +0000] information/IdoMysqlConnection: Last update by endpoint 'aklimov-i2all-master-2' was 9.15969s ago (< failover timeout of 30s). Retrying.
[2024-01-26 11:56:57 +0000] critical/ApiListener: Cannot connect to host 'al2klimov.hopto.org' on port '5665': Host not found (authoritative)

👍 The timeout doesn't even start if DNS fails.

[2024-01-26 13:08:52 +0100] information/ApiListener: Reconnecting to endpoint 'al2klimov.de' via host 'al2klimov.de' and port '5665'
[2024-01-26 13:08:52 +0100] information/CheckerComponent: 'checker' started.
[2024-01-26 13:08:52 +0100] information/ConfigItem: Activated all objects.
[2024-01-26 13:09:02 +0100] information/WorkQueue: #7 (ApiListener, SyncQueue) items: 0, rate:  0/s (0/min 0/5min 0/15min);
[2024-01-26 13:09:02 +0100] information/WorkQueue: #6 (ApiListener, RelayQueue) items: 0, rate:  0/s (0/min 0/5min 0/15min);
[2024-01-26 13:09:07 +0100] critical/ApiListener: Timeout while reconnecting to endpoint 'al2klimov.de' via host 'al2klimov.de' and port '5665', cancelling attempt
[2024-01-26 13:09:07 +0100] critical/ApiListener: Cannot connect to host 'al2klimov.de' on port '5665': Operation canceled

👍 The timeout starts if DNS succeeds and fires immediately once due.
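
For illustration only, here is a minimal sketch of the approach (not the actual patch; the helper and its parameters are made up), assuming Boost.Asio with stackful coroutines as used in Icinga 2: the deadline timer is armed only after async_resolve() has returned, so connect_timeout covers the TCP connect and everything after it, but not the uncancellable DNS lookup.

```cpp
#include <boost/asio.hpp>
#include <boost/asio/spawn.hpp>
#include <iostream>
#include <string>

namespace asio = boost::asio;
using asio::ip::tcp;

// Hypothetical helper illustrating where the timeout starts; not the Icinga 2 code.
void ConnectWithTimeout(asio::io_context& io, const std::string& host, const std::string& port,
    long connectTimeoutSeconds, asio::yield_context yc)
{
    tcp::resolver resolver (io);

    // The DNS lookup runs first and is deliberately not covered by the timer:
    // it can't be cancelled anyway, and libc applies its own timeouts.
    auto endpoints (resolver.async_resolve(host, port, yc));

    tcp::socket socket (io);
    asio::deadline_timer timer (io);

    // Only now, after resolution succeeded, start the connect timeout.
    timer.expires_from_now(boost::posix_time::seconds(connectTimeoutSeconds));
    timer.async_wait([&socket](boost::system::error_code ec) {
        if (!ec) {
            // Timer fired: abort the pending socket I/O.
            boost::system::error_code ignored;
            socket.cancel(ignored);
        }
    });

    asio::async_connect(socket, endpoints, yc);
    timer.cancel();

    // Real code needs shared ownership of socket/timer plus exception handling;
    // this sketch only shows when the timer is armed.
    std::cout << "Connected to " << host << ":" << port << "\n";
}
```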

@Al2Klimov added labels: bug (Something isn't working), area/distributed (Distributed monitoring: master, satellites, clients), consider backporting (Should be considered for inclusion in a bugfix release) on Jan 26, 2024
@cla-bot added the cla/signed label on Jan 26, 2024
@julianbrost (Contributor)

My idea for connect_timeout was to have a timeout covering everything from the moment Icinga decides to connect to an endpoint, or accepts an incoming connection, until the connection is ready for use. #8769 made the change that there's a timeout covering more than just the TLS handshake (as was the case before); obviously it didn't do a perfect job and missed the host lookup. But now simply exempting that lookup from the timeout doesn't feel like a step in the right direction.

This PR improves things in that, if the timeout triggers, the cancellation happens more reliably; but it also makes it less predictable when the timeout is supposed to trigger.

ASIO DNS resolution can't be cancelled.

Please also give some context and references for this on GitHub so that this is easier to follow for anyone just looking at this PR (which might even be one of us in a year again).

Yes, we will likely have to work around Asio limitations here in some way or another. I'm not sure if this is the best approach, but if we keep it, there should at least be some indication in the logs that something with name resolution is messed up if it takes too long.

@Al2Klimov (Member, Author)

simply exempting that lookup from the timeout doesn't feel like a step in the right direction.

To paraphrase Churchill: it's a horrible step, but all the others found so far are even more horrible.

At least it's better than resolving in a fork(2)ed process and killing the latter on timeout. https://lists.boost.org/Archives/boost/2021/08/251914.php

This PR improves things in that, if the timeout triggers, the cancellation happens more reliably; but it also makes it less predictable when the timeout is supposed to trigger.

Not entirely: connect_timeout now restricts how long it may take to connect to Y, not how long it may take to resolve X to Y. (A bit of window dressing, I know, but technically the truth.)

ASIO DNS resolution can't be cancelled.

Please also give some context and references for this on GitHub so that this is easier to follow for anyone just looking at this PR (which might even be one of us in a year again).

chriskohlhoff/asio#449

https://www.boost.org/doc/libs/1_66_0/doc/html/boost_asio/reference/ip__basic_resolver/cancel.html says it "forces the completion of any pending asynchronous operations", but what does pending mean? As f00d131 showed, it doesn't mean ongoing, as such operations aren't cancelled. Right now async_resolve() starts a new background thread and effectively runs resolve() there via libc, which was never designed to be cancelled.
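
For reference, a standalone repro sketch of that behavior (assumptions: a lookup that stalls, e.g. an unreachable DNS server; timings taken from the logs above):

```cpp
#include <boost/asio.hpp>
#include <chrono>
#include <iostream>

// cancel() is documented to "force the completion of any pending asynchronous
// operations", but per the f00d131 test a lookup already running on the
// resolver's background thread is not aborted; its handler only runs once
// libc itself gives up.
int main()
{
    boost::asio::io_context io;
    boost::asio::ip::tcp::resolver resolver (io);

    auto start (std::chrono::steady_clock::now());

    resolver.async_resolve("al2klimov.hopto.org", "5665",
        [start](boost::system::error_code ec, boost::asio::ip::tcp::resolver::results_type) {
            auto elapsed (std::chrono::duration_cast<std::chrono::seconds>(
                std::chrono::steady_clock::now() - start).count());

            // With a stalled lookup this prints only after the libc timeout
            // (~20s in the logs above), not ~1s after the cancel() below.
            std::cout << "resolve finished after " << elapsed << "s: " << ec.message() << "\n";
        });

    boost::asio::deadline_timer timer (io);
    timer.expires_from_now(boost::posix_time::seconds(1));
    timer.async_wait([&resolver](boost::system::error_code) {
        resolver.cancel(); // has no effect on the lookup already in flight
    });

    io.run();
}
```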

there should at least be some indication in the logs that something with name resolution is messed up if it takes too long.

ASIO uses libc under the hood, which times out by itself. That results in an error which gets logged, as shown by my tests (OP).

@julianbrost (Contributor)

chriskohlhoff/asio#449

That one suggests there's at most one name resolution going on at any given time. Is this indeed the case and wouldn't that mean that the timeouts just add up if there are multiple requests? So what happens if there are many endpoints being connected to? So calling cancel() on the resolver might not be a bad idea after all.

To paraphrase Churchill: it's a horrible step, but all the others found so far are even more horrible.

Another option to look into could be to perform the name resolution in a separate coroutine that the current connect coroutine waits for with a timeout we can control. If that new coroutine receives a response/error at a later time, it would discard it and probably log that name resolution is funky. That should at least give consistent timeout behavior for the reconnect attempts. The challenging part would be dealing with the case that a timed-out resolution could block another resolution from even being attempted within its own timeout, and providing helpful logging in that case.
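
A rough sketch of that shape (all names hypothetical; Boost.Asio stackful coroutines as in Icinga 2; a single io_context thread is assumed for simplicity, real code must guard against the completion racing the wait): the lookup runs in a detached coroutine, and the connect coroutine waits at most the chosen timeout on a timer that doubles as a wakeup signal.

```cpp
#include <boost/asio.hpp>
#include <boost/asio/spawn.hpp>
#include <memory>
#include <optional>
#include <string>

namespace asio = boost::asio;
using asio::ip::tcp;

struct PendingResolve
{
    std::optional<boost::system::error_code> Result;
    std::optional<tcp::resolver::results_type> Endpoints;
    asio::deadline_timer Waiter; // used purely as a wakeup signal

    explicit PendingResolve(asio::io_context& io) : Waiter(io)
    {
        Waiter.expires_at(boost::posix_time::pos_infin);
    }
};

// Starts the lookup in its own detached coroutine and returns immediately.
// (Pre-1.80 Boost spawn() overload used for brevity.)
std::shared_ptr<PendingResolve> StartResolve(asio::io_context& io, std::string host, std::string port)
{
    auto state (std::make_shared<PendingResolve>(io));

    asio::spawn(io, [&io, state, host, port](asio::yield_context yc) {
        tcp::resolver resolver (io);
        boost::system::error_code ec;

        auto results (resolver.async_resolve(host, port, yc[ec]));

        state->Result = ec;

        if (!ec) {
            state->Endpoints = std::move(results);
        }

        // Wake up whoever is (still) waiting. If the waiter has already given
        // up, this is where a "name resolution took suspiciously long" log
        // line would go.
        state->Waiter.cancel();
    });

    return state;
}

// In the connect coroutine (sketch):
//
//   auto pending (StartResolve(io, host, port));
//
//   if (!pending->Result) {
//       pending->Waiter.expires_from_now(boost::posix_time::seconds(timeout));
//       boost::system::error_code ec;
//       pending->Waiter.async_wait(yc[ec]); // returns on resolve completion OR timeout
//   }
//
//   if (!pending->Result) {
//       // Timed out: log and abandon this attempt; the lookup itself keeps
//       // running in the background and can't be stopped.
//   }
```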

@Al2Klimov (Member, Author)

chriskohlhoff/asio#449

That one suggests there's at most one name resolution going on at any given time. Is this indeed the case and wouldn't that mean that the timeouts just add up if there are multiple requests? So what happens if there are many endpoints being connected to?

Even though we create a new resolver each time, the resolutions happen strictly sequentially, even for 1000+ endpoints:

[2024-01-29 10:58:57 +0000] critical/ApiListener: Cannot connect to host '821.example.com' on port '5665': Host not found (authoritative)
[2024-01-29 10:59:17 +0000] critical/ApiListener: Cannot connect to host '10.example.com' on port '5665': Host not found (authoritative)
[2024-01-29 10:59:37 +0000] critical/ApiListener: Cannot connect to host '883.example.com' on port '5665': Host not found (authoritative)
[2024-01-29 10:59:57 +0000] critical/ApiListener: Cannot connect to host '528.example.com' on port '5665': Host not found (authoritative)
[2024-01-29 11:00:17 +0000] critical/ApiListener: Cannot connect to host '364.example.com' on port '5665': Host not found (authoritative)
[2024-01-29 11:00:37 +0000] critical/ApiListener: Cannot connect to host '642.example.com' on port '5665': Host not found (authoritative)
[2024-01-29 11:00:57 +0000] critical/ApiListener: Cannot connect to host '717.example.com' on port '5665': Host not found (authoritative)
[2024-01-29 11:01:17 +0000] critical/ApiListener: Cannot connect to host '837.example.com' on port '5665': Host not found (authoritative)
[2024-01-29 11:01:38 +0000] critical/ApiListener: Cannot connect to host '570.example.com' on port '5665': Host not found (authoritative)
[2024-01-29 11:01:58 +0000] critical/ApiListener: Cannot connect to host '242.example.com' on port '5665': Host not found (authoritative)
[2024-01-29 11:02:18 +0000] critical/ApiListener: Cannot connect to host '247.example.com' on port '5665': Host not found (authoritative)
[2024-01-29 11:02:38 +0000] critical/ApiListener: Cannot connect to host '848.example.com' on port '5665': Host not found (authoritative)
[2024-01-29 11:02:58 +0000] critical/ApiListener: Cannot connect to host '538.example.com' on port '5665': Host not found (authoritative)
[2024-01-29 11:03:18 +0000] critical/ApiListener: Cannot connect to host '496.example.com' on port '5665': Host not found (authoritative)
[2024-01-29 11:03:38 +0000] critical/ApiListener: Cannot connect to host '793.example.com' on port '5665': Host not found (authoritative)
[2024-01-29 11:03:58 +0000] critical/ApiListener: Cannot connect to host '110.example.com' on port '5665': Host not found (authoritative)
[2024-01-29 11:04:18 +0000] critical/ApiListener: Cannot connect to host '904.example.com' on port '5665': Host not found (authoritative)
[2024-01-29 11:04:38 +0000] critical/ApiListener: Cannot connect to host '807.example.com' on port '5665': Host not found (authoritative)
[2024-01-29 11:04:58 +0000] critical/ApiListener: Cannot connect to host '578.example.com' on port '5665': Host not found (authoritative)
[2024-01-29 11:05:18 +0000] critical/ApiListener: Cannot connect to host '757.example.com' on port '5665': Host not found (authoritative)
[2024-01-29 11:05:38 +0000] critical/ApiListener: Cannot connect to host '187.example.com' on port '5665': Host not found (authoritative)
[2024-01-29 11:05:58 +0000] critical/ApiListener: Cannot connect to host '734.example.com' on port '5665': Host not found (authoritative)
[2024-01-29 11:06:18 +0000] critical/ApiListener: Cannot connect to host '152.example.com' on port '5665': Host not found (authoritative)

So calling cancel() on the resolver might not be a bad idea after all.

Wouldn't help with anything. There's one query per resolver which is not cancelled as it's ongoing.

To paraphrase Churchill: it's a horrible step, but all the others found so far are even more horrible.

Another option to look into could be to perform the name resolution in a separate coroutine that the current connect coroutine waits for with a timeout we can control. If that new coroutine receives a response/error at a later time, it would discard it and probably log that name resolution is funky. That should at least give consistent timeout behavior for the reconnect attempts. The challenging part would be dealing with the case that a timed-out resolution could block another resolution from even being attempted within its own timeout, and providing helpful logging in that case.

Exactly! Timed-out ASIO resolutions keep occupying resources for a while. Unless we actually follow the advice from https://lists.boost.org/Archives/boost/2021/08/251914.php and build something like our process spawner, but for DNS: forked from the umbrella/main process as early as possible, and forking by itself for each query (except pure IPs, which the main process would "resolve" by itself?). Timed-out lookups would be cleaned up via kill(2). OK?
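
A sketch of that fork/kill pattern, stripped down to the essentials (hypothetical helper; a real design would use a dedicated resolver process forked before any threads exist, because calling getaddrinfo() in a child forked from a multi-threaded process is not generally safe, and would ship the results back over a pipe):

```cpp
#include <netdb.h>
#include <signal.h>
#include <sys/socket.h>
#include <sys/wait.h>
#include <unistd.h>
#include <chrono>
#include <thread>

// Returns true if the lookup finished (successfully or not) within timeoutSec,
// false if the child had to be SIGKILLed.
bool ResolveInChild(const char* host, const char* service, int timeoutSec)
{
    pid_t child = fork();

    if (child == -1) {
        return false; // fork failed; real code would log this
    }

    if (child == 0) {
        addrinfo hints {};
        hints.ai_socktype = SOCK_STREAM;
        addrinfo* result = nullptr;

        // A real implementation would serialize the results back to the parent
        // over a pipe; this child only reports success/failure via its exit code.
        _exit(getaddrinfo(host, service, &hints, &result) == 0 ? 0 : 1);
    }

    auto deadline = std::chrono::steady_clock::now() + std::chrono::seconds(timeoutSec);

    for (;;) {
        int status;

        if (waitpid(child, &status, WNOHANG) == child) {
            return true; // lookup finished in time
        }

        if (std::chrono::steady_clock::now() >= deadline) {
            kill(child, SIGKILL);       // the timed-out lookup disappears immediately
            waitpid(child, &status, 0); // reap the zombie
            return false;
        }

        std::this_thread::sleep_for(std::chrono::milliseconds(50));
    }
}
```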

@julianbrost (Contributor)

So calling cancel() on the resolver might not be a bad idea after all.

Wouldn't help with anything. There's one query per resolver which is not cancelled as it's ongoing.

Yes, it should help with something: logging. In your example logs above, if you search for 152.example.com, there will be a big gap between the start of the reconnect attempt and some kind of result. It doesn't seem implausible for a backlog on the order of an hour to build up if things are seriously messed up. If the resolver was cancelled instead, there should at least be some log message for that connection within a reasonable amount of time (either connect_timeout or the getaddrinfo()-internal timeouts).

Unless we actually follow the advice from lists.boost.org/Archives/boost/2021/08/251914.php and build something like our process spawner, but for DNS: forked from the umbrella/main process as early as possible, and forking by itself for each query (except pure IPs, which the main process would "resolve" by itself?). Timed-out lookups would be cleaned up via kill(2). OK?

That would force the cancellation behavior one would expect. But apart from that, would this be much better than abandoning connection attempts after connect_timeout with some new error messages saying something like "couldn't connect, name resolution timed out" or "couldn't connect, name resolution blocked by a previous query"? If your DNS resolver is that broken, moving things into another process won't magically make it work again.

@Al2Klimov (Member, Author)

So calling cancel() on the resolver might not be a bad idea after all.

Wouldn't help with anything. There's one query per resolver which is not cancelled as it's ongoing.

Yes, it should help with something: logging. In your example logs above, if you search for 152.example.com, there will be a big gap between the start of the reconnect attempt and some kind of result. It doesn't seem implausible for a backlog on the order of an hour to build up if things are seriously messed up. If the resolver was cancelled instead, there should at least be some log message for that connection within a reasonable amount of time (either connect_timeout or the getaddrinfo()-internal timeouts).

No, I don't think so. We make a new resolver for every query, then we start that query. As my tests have shown, f00d131 (i.e. cancelling the resolver) does literally nothing to that query. And other queries use their own resolvers.

Unless we actually follow the advice from lists.boost.org/Archives/boost/2021/08/251914.php and build something like our process spawner, but for DNS: forked from the umbrella/main process as early as possible, and forking by itself for each query (except pure IPs, which the main process would "resolve" by itself?). Timed-out lookups would be cleaned up via kill(2). OK?

That would force the cancellation behavior one would expect. But apart from that, would this be much better than abandoning connection attempts after connect_timeout with some new error messages saying something like "couldn't connect, name resolution timed out" or "couldn't connect, name resolution blocked by a previous query"? If your DNS resolver is that broken, moving things into another process won't magically make it work again.

If it's that broken, sure. But at least the log messages would arrive in time, not one every 20s. I mean, just look at my logs above. This is insane! Also, if it's not completely broken, queries would take their X seconds to succeed without blocking each other or timing each other out. In addition, processes can be kill(2)ed with SIGKILL and disappear immediately, wasting no resources. If we also resolve IPs in the main process with AI_NUMERICHOST|AI_NUMERICSERV (which never actually blocks, I guess), customers with Endpoint#host=IP won't even notice anything. This is our new best option.
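
For the AI_NUMERICHOST|AI_NUMERICSERV part, a small sketch of how the main process could "resolve" literal IPs without ever touching DNS (hypothetical helper; non-literal hosts fail with EAI_NONAME and would be handed to the resolver process instead):

```cpp
#include <netdb.h>
#include <sys/socket.h>
#include <optional>
#include <string>

// Tries to interpret host/port as numeric strings only. With these flags,
// getaddrinfo() merely parses its inputs and never queries DNS, so it can't block.
std::optional<addrinfo*> TryNumericLookup(const std::string& host, const std::string& port)
{
    addrinfo hints {};
    hints.ai_family = AF_UNSPEC;
    hints.ai_socktype = SOCK_STREAM;
    hints.ai_flags = AI_NUMERICHOST | AI_NUMERICSERV;

    addrinfo* result = nullptr;

    if (getaddrinfo(host.c_str(), port.c_str(), &hints, &result) == 0) {
        return result; // caller owns the list and must freeaddrinfo() it
    }

    return std::nullopt; // not a literal IP/port: delegate to the resolver process
}
```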

@julianbrost (Contributor)

No, I don't think so. We make a new resolver for every query, then we start that query. As my tests have shown, f00d131 (i.e. cancelling the resolver) does literally nothing to that query. And other queries use their own resolvers.

What exactly have you tested? If you do r1.async_resolve() and r2.async_resolve() in short succession and r1 has actually started performing the lookup and r2 is now waiting for r1 to finish, what does r2.cancel() do?

But at least the log messages would arrive in time, not one every 20s. I mean, just look at my logs above. This is insane!

But that's exactly the logging you would get at the moment if the error was something like "connection refused". If a connection is delayed by a long time with absolutely no indication why, that's only confusing.

@Al2Klimov (Member, Author)

No, I don't think so. We make a new resolver for every query, then we start that query. As my tests have shown, f00d131 (i.e. cancelling the resolver) does literally nothing to that query. And other queries use their own resolvers.

What exactly have you tested? If you do r1.async_resolve() and r2.async_resolve() in short succession and r1 has actually started performing the lookup and r2 is now waiting for r1 to finish, what does r2.cancel() do?

Nothing. OK, technically it cancels "any asynchronous operations that are waiting on the resolver". But there are none as there's only one query per resolver.

But at least the log messages would arrive in time, not one every 20s. I mean, just look at my logs above. This is insane!

But that's exactly the logging you would get at the moment if the error was something like "connection refused". If a connection is delayed by a long time with absolutely no indication why, that's only confusing.

I've separated performing the resolution from the connection; I can also separate their logging to reduce confusion, if you wish: "Resolving 'kli.mov:5665'. (May take a long time.)"
