Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

investigation: hanging TCPSocketWrap in js-libp2p-amino-dht-bootstrapper #2537

Open
SgtPooki opened this issue May 9, 2024 · 1 comment
Open
Labels
need/triage Needs initial labeling and prioritization

Comments

@SgtPooki
Copy link
Member

SgtPooki commented May 9, 2024

related to libp2p/js-libp2p-amino-dht-bootstrapper#18 (comment)

I will be adding various observations here while looking around at connection open and close code in js-libp2p

Destroying connections on MultiaddrConnection.close

1. timeout events not handled

socket.once('close', () => {
// socket completely closed
log('%s socket closed', lOptsStr)
resolve()
})
socket.once('error', (err: Error) => {
log('%s socket error', lOptsStr, err)
// error closing socket
if (maConn.timeline.close == null) {
maConn.timeline.close = Date.now()
}
if (!socket.destroyed) {
reject(err)
}
// if socket is destroyed, 'closed' event will be emitted later to resolve the promise
})
// shorten inactivity timeout
socket.setTimeout(closeTimeout)

When maConn.close is called, we try to explicitly close a socket, handling close and error events, but not timeout events which can definitely occur.

From https://nodejs.org/docs/latest-v20.x/api/net.html#socketsettimeouttimeout-callback

When an idle timeout is triggered the socket will receive a 'timeout' event but the connection will not be severed. The user must manually call socket.end() or socket.destroy() to end the connection.

This seems like it could easily lead to TCPSocketWrap continuing to increase.

Questions
  1. By calling setTimeout again, does it still allow the callback defined at https://github.com/libp2p/js-libp2p/blob/4ad63bb79c0c2e5c670b32aa534906b923dcf150/packages/transport-tcp/src/socket-to-conn.ts#L62C1-L75C5 to trigger? (it seems like that's the intention).
  2. If the original timeout event handler is called, and then we call maConn.close, will that same handler be called again or is it removed?

2. Manual socket.destroySoon logic

if (socket.writableLength > 0) {
// there are outgoing bytes waiting to be sent
socket.once('drain', () => {
log('%s socket drained', lOptsStr)
// all bytes have been sent we can destroy the socket (maybe) before the timeout
socket.destroy()
})
} else {
// nothing to send, destroy immediately, no need for the timeout
socket.destroy()
}

I'm not sure if we're intentionally not using socket.destroySoon but AFAIK, the only thing we would be missing by switching this code to socket.destroySoon is the log('%s socket drained', lOptsStr) log line, which we could easily add up above.

3. Socket 'drop' event

// With server.maxConnections the TCP socket is created and the initial handshake is completed
// Then in the server handler NodeJS javascript code will call socket.emit('drop') if over the limit
// https://github.com/nodejs/node/blob/fddc701d3c0eb4520f2af570876cc987ae6b4ba2/lib/net.js#L1706

We have some tests that test our maxConnections, but we have no code that handles the 'drop' event in https://github.com/libp2p/js-libp2p/blob/4ad63bb79c0c2e5c670b32aa534906b923dcf150/packages/transport-tcp/src/socket-to-conn.ts.

Questions
  1. Is it possible that when reaching maxConnections & experiencing socket aborts/errors/timeouts, some of those dropped Sockets are not closed out properly?

4. Socket.unref

I don't see that we're calling socket.unref anywhere, and adding it kind of feels like a hack, but I think there are a few locations maConn.close and others where we could call socket.unref() to make sure we're explicit about when the socket is no longer needed.

e.g. once we call maConn.close and 'drain' is emitted, socket.unref() could be safely called?

Holistic Socket management

1. Transport class vs toMaConn socket event cleanup

const done = (err?: any): void => {
rawSocket.removeListener('error', onError)
rawSocket.removeListener('timeout', onTimeout)
rawSocket.removeListener('connect', onConnect)

TCP Transport class cleans up events it adds to sockets, why doesn't socket-to-conn.ts or listener.ts?

Questions
  1. Should we clean up allEventListeners in the TCP transport class?
  2. should socket-to-conn.ts be removing event handlers it adds, and if so, when?
  3. listener.ts doesn't remove event handlers either, should it?
@SgtPooki
Copy link
Member Author

After running js-libp2p-amino-dht-bootstrapper over the weekend, starting with heapsnapshot->gc->heapsnapshot, then taking snapshots every 6 hours, and finishing this morning with heap->gc->heap, memlab gives the following result for analyze unbound-object:

╰─ ✔ ❯ ls -lhatr snapshot-dir
total 4.7G
-rw-r--r--  1 sgtpooki    0 May 10 19:00 .gitkeep
drwxr-xr-x 35 sgtpooki 1.1K May 10 19:07 ..
-rw-------  1 sgtpooki  37M May 10 19:20 2024-05-11T02:20:05.589Z.heapsnapshot
-rw-------  1 sgtpooki  42M May 10 19:20 2024-05-11T02:20:14.463Z.heapsnapshot
-rw-------  1 sgtpooki 129M May 10 19:22 2024-05-11T02:22:16.103Z.heapsnapshot
-rw-------  1 sgtpooki 120M May 11 00:00 2024-05-11T07:00:00.864Z.heapsnapshot
-rw-------  1 sgtpooki 197M May 11 06:00 2024-05-11T13:00:00.809Z.heapsnapshot
-rw-------  1 sgtpooki 314M May 11 12:00 2024-05-11T19:00:00.832Z.heapsnapshot
-rw-------  1 sgtpooki 337M May 11 18:00 2024-05-12T01:00:01.187Z.heapsnapshot
-rw-------  1 sgtpooki 391M May 12 00:00 2024-05-12T07:00:00.614Z.heapsnapshot
-rw-------  1 sgtpooki 483M May 12 06:00 2024-05-12T13:00:00.576Z.heapsnapshot
-rw-------  1 sgtpooki 438M May 12 12:00 2024-05-12T19:00:00.633Z.heapsnapshot
-rw-------  1 sgtpooki 468M May 12 18:00 2024-05-13T01:00:00.297Z.heapsnapshot
-rw-------  1 sgtpooki 497M May 13 00:00 2024-05-13T07:00:00.871Z.heapsnapshot
-rw-------  1 sgtpooki 466M May 13 06:00 2024-05-13T13:00:00.403Z.heapsnapshot
-rw-------  1 sgtpooki 473M May 13 07:50 2024-05-13T14:50:13.471Z.heapsnapshot
drwxr-xr-x 18 sgtpooki  576 May 13 07:51 .
-rw-------  1 sgtpooki 235M May 13 07:51 2024-05-13T14:51:12.831Z.heapsnapshot

╰─ ✘ INT ❯ npx -y memlab analyze unbound-object --snapshot-dir snapshot-dir
Top growing objects in sizes:
 (Use `memlab trace --node-id=@ID` to get trace)

· system / Context [object](@387349):  3.3MB > 3.3MB > 3.3MB > 3.3MB > 3.3MB > 3.3MB > 3.3MB > 3.3MB > 3.3MB > 3.3MB > 3.3MB > 3.3MB > 3.3MB > 3.3MB > 3.3MB
· MplexStreamMuxer [object](@347089):  32KB > 47KB > 47.2KB > 49.2KB > 150.7KB > 168.4KB > 231.4KB > 318KB > 423KB > 510KB > 527.4KB > 562.1KB > 596.7KB > 614.2KB > 614.2KB
· Object { initiators, receivers } [object](@347195):  21.1KB > 38KB > 38KB > 31.3KB > 132.8KB > 150.4KB > 213.4KB > 300KB > 405.1KB > 492.1KB > 509.5KB > 544.2KB > 578.7KB > 596.2KB > 596.3KB
· Map [object](@347355):  20.7KB > 37.7KB > 37.7KB > 31KB > 132.6KB > 150.2KB > 213.2KB > 299.8KB > 404.9KB > 491.9KB > 509.3KB > 544KB > 578.5KB > 596KB > 596.1KB
· system / Context [object](@122957):  531.8KB > 532KB > 532.1KB > 532.1KB > 532.1KB > 532.1KB > 532.1KB > 532.1KB > 532.1KB > 532.1KB > 532.1KB > 532.1KB > 532.1KB > 532.1KB > 532.1KB
· ChaCha20Poly1305 [object](@122961):  530KB > 530.1KB > 530.1KB > 530.1KB > 530.1KB > 530.1KB > 530.1KB > 530.1KB > 530.1KB > 530.1KB > 530.1KB > 530.1KB > 530.1KB > 530.1KB > 530.1KB
· system / Context [object](@443697):  3.7KB > 3.7KB > 3.7KB > 352.1KB > 352.1KB > 352.1KB > 352.1KB > 352.1KB > 352.1KB > 352.1KB > 352.1KB > 352.1KB > 352.1KB > 352.1KB > 352.1KB
· Map [object](@709657):  184 bytes > 184 bytes > 184 bytes > 348.3KB > 348.3KB > 348.3KB > 348.3KB > 348.3KB > 348.3KB > 348.3KB > 348.3KB > 348.3KB > 348.3KB > 348.3KB > 348.3KB
· Providers [object](@382375):  5.9KB > 27.6KB > 89.9KB > 390KB > 362.5KB > 374.8KB > 278.4KB > 227.4KB > 214.7KB > 226.7KB > 282.6KB > 259.7KB > 275.1KB > 206.3KB > 217.6KB
· RoutingTable [object](@350995):  5.6KB > 52.6KB > 6.2KB > 6.6KB > 6.6KB > 6.6KB > 297.6KB > 274KB > 6.6KB > 6.6KB > 303.9KB > 275.4KB > 284.9KB > 235.8KB > 217.5KB
· Object { clear, get, has, remove, set } [object](@382403):  1.8KB > 23.3KB > 85.7KB > 385.5KB > 358.4KB > 370.3KB > 273.9KB > 222.9KB > 210.2KB > 222.2KB > 278KB > 255KB > 270.4KB > 201.7KB > 212.9KB
· KBucket [object](@351629):  33.3KB > 46.7KB > 54.3KB > 249.1KB > 254KB > 146.7KB > 291KB > 267.3KB > 262KB > 262KB > 297.2KB > 268.8KB > 278.3KB > 229.1KB > 210.9KB
· system / Context [object](@382739):  1.4KB > 22.2KB > 84.5KB > 384KB > 356.5KB > 368.3KB > 271.8KB > 220.7KB > 207.9KB > 220KB > 275.8KB > 252.8KB > 268.2KB > 199.4KB > 210.7KB
· Object { contacts, dontSplit, left, right } [object](@382903):  28.2KB > 41.2KB > 49KB > 243.2KB > 247.6KB > 141.1KB > 284.1KB > 260.6KB > 255.6KB > 255.6KB > 290.2KB > 261.9KB > 271.2KB > 222.5KB > 203.8KB
· Object { contacts, dontSplit, left, right } [object](@382943):  21.9KB > 34.6KB > 39KB > 205.9KB > 225.7KB > 132.5KB > 245.1KB > 223.2KB > 215.1KB > 206.4KB > 256.2KB > 221.8KB > 232.4KB > 194.8KB > 177.2KB
· AsyncGenerator [object](@59297):  33.8KB > 33.9KB > 113.3KB > 174.9KB > 174.9KB > 174.9KB > 174.9KB > 174.9KB > 174.9KB > 174.9KB > 174.9KB > 174.9KB > 174.9KB > 174.9KB > 174.9KB
· BuiltinModule [closure](@399391):  159.4KB > 163.9KB > 163.9KB > 163.9KB > 163.9KB > 163.9KB > 163.9KB > 163.9KB > 163.9KB > 159.4KB > 163.9KB > 163.9KB > 163.9KB > 159.4KB > 163.9KB
· Map [object](@433067):  158.1KB > 162.7KB > 162.7KB > 162.7KB > 162.7KB > 162.7KB > 162.7KB > 162.7KB > 162.7KB > 158.1KB > 162.7KB > 162.7KB > 162.7KB > 158.1KB > 162.7KB
· Object { contacts, dontSplit, left, right } [object](@272937):  17.3KB > 27.6KB > 27.6KB > 173.5KB > 208.6KB > 117.2KB > 214.7KB > 204.2KB > 190.8KB > 194.4KB > 218.4KB > 183.9KB > 191.7KB > 168.9KB > 150.5KB
· Object { contacts, dontSplit, left, right } [object](@272941):  5.3KB > 7.7KB > 22KB > 123.7KB > 153.8KB > 105.9KB > 159.5KB > 173.4KB > 185.1KB > 173.6KB > 161.4KB > 153.3KB > 157.3KB > 133.1KB > 138.6KB

I don't see anything obvious here.

@dhuseby dhuseby added the need/triage Needs initial labeling and prioritization label May 21, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
need/triage Needs initial labeling and prioritization
Projects
Status: 🤨Needs Investigation
Development

No branches or pull requests

2 participants