Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

SpectatorD Does Not Clean Up Sockets After Timeouts #68

Open
copperlight opened this issue Aug 30, 2023 · 3 comments
Open

SpectatorD Does Not Clean Up Sockets After Timeouts #68

copperlight opened this issue Aug 30, 2023 · 3 comments

Comments

@copperlight
Copy link
Collaborator

We saw an issue where one node had thousands of sockets in a CLOSE_WAIT state, created by the latest release version of spectatord. This particular node was in a state where networking reliability was affected - SSH connections were flaky and there were many timeouts when spectatord was attempting to POST to the Atlas backend.

[2023-08-30 14:20:55.101] [spectator] [error] Failed to POST https://$HOST/api/v4/update: Timeout was reached
[2023-08-30 14:20:55.101] [spectator] [info] HTTP timeout to https://$HOST/api/v4/update: 1000ms elapsed - connect_to=1000 read_to=2000

It looks like spectatord is not cleaning up sockets after a timeout is reached.

This bug has probably been around for awhile.

@copperlight
Copy link
Collaborator Author

I cannot reproduce this condition on an Ubuntu Jammy system with a secondary spectatord process pointing to an invalid URL, so that connect timeouts are triggered.

sudo /usr/local/bin/spectatord --admin_port 1235 --enable_socket false --port 1235 --uri http://example.com:81
[2023-09-01 15:57:25.936] [spectator] [info] Starting to send metrics to http://example.com:81 every 5s.
[2023-09-01 15:57:25.936] [spectatord] [info] Starting admin server on port 1235/tcp
[2023-09-01 15:57:25.936] [spectator] [info] Publishing metrics with the following common tags: {}
[2023-09-01 15:57:25.936] [spectatord] [info] Starting janitorial tasks
[2023-09-01 15:57:25.936] [spectatord] [info] Using receive buffer size = 16777216
[2023-09-01 15:57:25.936] [spectatord] [info] Starting spectatord server on port 1235/udp
[2023-09-01 15:57:25.936] [spectatord] [info] statsd support is not enabled
[2023-09-01 15:57:25.936] [spectatord] [info] Starting local server (dgram) on socket /run/spectatord/spectatord.unix
[2023-09-01 15:57:31.639] [spectator] [error] Failed to POST http://example.com:81: Timeout was reached
[2023-09-01 15:57:31.639] [spectator] [info] HTTP timeout to http://example.com:81: 703ms elapsed - connect_to=1000 read_to=2000
[2023-09-01 15:57:32.341] [spectator] [error] Failed to POST http://example.com:81: Timeout was reached
[2023-09-01 15:57:32.341] [spectator] [info] HTTP timeout to http://example.com:81: 701ms elapsed - connect_to=1000 read_to=2000
[2023-09-01 15:57:33.042] [spectator] [error] Failed to POST http://example.com:81: Timeout was reached
[2023-09-01 15:57:33.042] [spectator] [info] HTTP timeout to http://example.com:81: 701ms elapsed - connect_to=1000 read_to=2000
[2023-09-01 15:57:33.042] [spectator] [info] Sent: 0 Dropped: 9 Total: 9. Elapsed 2.108s
[2023-09-01 15:57:36.636] [spectator] [error] Failed to POST http://example.com:81: Timeout was reached
[2023-09-01 15:57:36.636] [spectator] [info] HTTP timeout to http://example.com:81: 701ms elapsed - connect_to=1000 read_to=2000
[2023-09-01 15:57:37.338] [spectator] [error] Failed to POST http://example.com:81: Timeout was reached
[2023-09-01 15:57:37.338] [spectator] [info] HTTP timeout to http://example.com:81: 701ms elapsed - connect_to=1000 read_to=2000
[2023-09-01 15:57:38.040] [spectator] [error] Failed to POST http://example.com:81: Timeout was reached
[2023-09-01 15:57:38.040] [spectator] [info] HTTP timeout to http://example.com:81: 701ms elapsed - connect_to=1000 read_to=2000
[2023-09-01 15:57:38.040] [spectator] [info] Sent: 0 Dropped: 25 Total: 25. Elapsed 2.105s

@copperlight
Copy link
Collaborator Author

copperlight commented Sep 20, 2023

We had another two instances demonstrate this problem again, even if we cannot recreate it locally.

The error pair log lines look like, as reported by sudo journalctl -u spectatord |grep -i timeout:

Sep 12 19:16:19 INSTANCE_ID start-spectatord[3967]: [2023-09-12 19:16:19.673] [spectator] [error] Failed to POST https://HOSTNAME/api/v4/update: Timeout was reached
Sep 12 19:16:19 INSTANCE_ID start-spectatord[3967]: [2023-09-12 19:16:19.673] [spectator] [info] HTTP timeout to https://HOSTNAME/api/v4/update: 1597ms elapsed - connect_to=1000 read_to=2000

The slight difference here versus the repro attempt is that the actual connect duration is longer than the timeout, versus... not.

In one of the cases, the CLOSE_WAIT sockets appear to be using IP addresses that are no longer in use by either the load balancer or the aggregator clusters, which is strange. The sockets are clearly owned by the SpectatorD process, as reported by sudo netstat -tonp |grep CLOSE_WAIT.

Some commentary on cleaning up half-closed connections in libcurl:

@copperlight
Copy link
Collaborator Author

A few attempts to dig into this further:

  • Placing a curl_easy_cleanup() call after http_client.cc#L253 triggers failures in the test suite.
  • Connecting to a simple Python socket server which prematurely closes connections does not increase the number of CLOSE_WAIT sockets, and yields the following log messages:
[2023-09-21 21:25:51.705] [spectator] [error] Failed to POST http://localhost:2345: Server returned nothing (no headers, no data)
[2023-09-21 21:25:51.705] [spectator] [info] HTTP timeout to http://localhost:2345: 8ms elapsed - connect_to=1000 read_to=2000
[2023-09-21 21:25:51.707] [spectator] [error] Failed to POST http://localhost:2345: Server returned nothing (no headers, no data)
[2023-09-21 21:25:51.707] [spectator] [info] HTTP timeout to http://localhost:2345: 1ms elapsed - connect_to=1000 read_to=2000
[2023-09-21 21:25:51.708] [spectator] [error] Failed to POST http://localhost:2345: Server returned nothing (no headers, no data)
[2023-09-21 21:25:51.708] [spectator] [info] HTTP timeout to http://localhost:2345: 0ms elapsed - connect_to=1000 read_to=2000
[2023-09-21 21:25:51.708] [spectator] [info] Sent: 0 Dropped: 9 Total: 9. Elapsed 0.013s

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant