file descriptor leak #3729

Open
deevope opened this issue Jul 11, 2022 · 9 comments

@deevope
Contributor

deevope commented Jul 11, 2022

20220710 06:47:18.596 ERROR grin_util::logger - 
thread 'peer_connect' panicked at 'clone conn for reader failed: Os { code: 24, kind: Uncategorized, message: "Too many open files" }': p2p/src/conn.rs:224
   0: <unknown>
   1: <unknown>
   2: <unknown>
   3: <unknown>
   4: <unknown>
   5: <unknown>
   6: <unknown>
   7: <unknown>
   8: <unknown>
   9: <unknown>
  10: <unknown>
  11: <unknown>
  12: <unknown>
  13: <unknown>
  14: <unknown>
  15: <unknown>
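
For context, the panic is raised where the peer connection is cloned for the reader thread (p2p/src/conn.rs:224 in the trace above). A rough, hypothetical sketch of that failure mode, not the actual Grin code:

```rust
use std::net::TcpStream;

// Illustrative only: once the process has exhausted its file-descriptor
// limit, try_clone() returns Err(Os { code: 24, .. }) ("Too many open
// files") and the expect panics, killing the peer_connect thread.
fn clone_for_reader(conn: &TcpStream) -> TcpStream {
    conn.try_clone().expect("clone conn for reader failed")
}
```
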
@yeastplume
Member

[screenshots: open file descriptors listed under /proc/[pid]/fd]

This image shows file descriptors being left open in /proc/[pid]/fd, which, when counted, turn out to be right at the user's FD limit.

To reproduce, I'd try running a node on a closed network and keeping an eye on the number of socket descriptors in /proc/[pid]/fd while connecting and disconnecting peers. Whatever fix is needed can then be tested the same way.
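
A quick way to watch that count (a hedged sketch; `fd_count` is a hypothetical helper, not part of Grin):

```rust
use std::fs;
use std::io;

// Count open file descriptors for a process by listing /proc/<pid>/fd.
fn fd_count(pid: u32) -> io::Result<usize> {
    Ok(fs::read_dir(format!("/proc/{}/fd", pid))?.count())
}

fn main() -> io::Result<()> {
    // Replace with the grin node's pid; using our own pid here so the
    // example runs standalone.
    let pid = std::process::id();
    println!("open fds for {}: {}", pid, fd_count(pid)?);
    Ok(())
}
```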

I would focus on trying to reproduce this in the pibd_impl branch, since it's due to be merged (and won't be merged until this is addressed).

@s7500

s7500 commented Aug 2, 2022

I set out to reproduce this on my own server and decided to follow the note above.
However, I apparently don't fully understand how peer connections work.
In the config I have the following settings:

[server.p2p_config]
host = "0.0.0.0"
port = 3414

But when I run:
./grin client unban -p 0.0.0.0:3414
or
./grin client unban -p 127.0.0.1:3414
no new peers appear on my grin server.

I'm running the grin server on usernet.

@yeastplume
Member

yeastplume commented Aug 24, 2022

I have reproduced this after running a fresh node for a couple of weeks; we can see the file handles being left open:

[screenshot: leaked file handles]

The last message in the logs before this starts happening has to do with header sync being enabled. I've seen this on mainnet and on a testnet node that's starting to leak threads.

[screenshot: log excerpt]

In any case, debugging the process reveals plenty of stuck threads that correspond to the open file handles:

[screenshot: stuck threads]

With backtrace as follows:

[screenshot: thread backtrace]

@yeastplume
Member

yeastplume commented Aug 24, 2022

Threads seem to be stuck trying to obtain a read lock on the header pmmr:

Chain::process_block_header, thread peer_read, stuck waiting for an exclusive lock on either header_pmmr or txhashset

[screenshot: backtrace]

The peer_write thread appears to be parked and waiting. I think this is the only instance of a write lock on the header pmmr, so there's a good possibility this is what's blocking all the other threads

[screenshot: backtrace]

chain::Chain::get_header_by_height, thread 'compactor', also appears to be awaiting a read lock on header_pmmr

[screenshot: backtrace]

Chain::get_locator_hashes, thread sync, also appears to be waiting for a write lock on header_pmmr

[screenshot: backtrace]

NetToChainAdapter, thread peer_read, waiting for a read lock on header_pmmr

[screenshot: backtrace]
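
Putting the above together, here is a minimal, self-contained sketch of the pattern these backtraces suggest. It is illustrative only, not Grin code: one reader never releases the lock, a writer parks behind it, and whether later readers then block depends on the lock's fairness policy (writer-preferring locks such as parking_lot's queue them behind the parked writer).

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::{Arc, RwLock};
use std::thread;
use std::time::Duration;

fn main() {
    // Stand-in for the header_pmmr behind an RwLock.
    let header_pmmr = Arc::new(RwLock::new(0u64));
    let reader_in = Arc::new(AtomicBool::new(false));

    // 1. A "stuck" reader (e.g. a peer_read thread that never returns)
    //    holds the read lock indefinitely.
    let (l, f) = (Arc::clone(&header_pmmr), Arc::clone(&reader_in));
    thread::spawn(move || {
        let _guard = l.read().unwrap();
        f.store(true, Ordering::SeqCst);
        thread::sleep(Duration::from_secs(3600)); // never releases in practice
    });
    while !reader_in.load(Ordering::SeqCst) {
        thread::sleep(Duration::from_millis(5));
    }

    // 2. A writer (e.g. the sync thread's exclusive lock) parks behind it.
    let l = Arc::clone(&header_pmmr);
    thread::spawn(move || {
        let mut guard = l.write().unwrap(); // parks until all readers finish
        *guard += 1;
    });
    thread::sleep(Duration::from_millis(100));

    // 3. Every later reader (new peer_read / compactor threads) may then
    //    queue behind the parked writer, so threads pile up and their
    //    sockets are never closed. std's policy is platform-dependent;
    //    writer-preferring locks block new readers here.
    let l = Arc::clone(&header_pmmr);
    let late_reader = thread::spawn(move || {
        let _guard = l.read().unwrap();
    });
    thread::sleep(Duration::from_millis(500));
    println!("late reader finished: {}", late_reader.is_finished());
}
```
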

@yeastplume
Member

There is also one instance of an exclusive lock waiting for all readers to complete, in grin_chain::chain::Chain::process_block_header

[screenshot: backtrace]

@yeastplume
Member

In normal operation, when a node falls 5 blocks behind, header sync kicks off, followed by a message that synchronization was successful.

20220824 12:49:20.586 INFO grin_servers::grin::sync::syncer - sync: total_difficulty 1965578938893570, peer_difficulty 1965580349437861, threshold 872616680 (last 5 blocks), enabling sync
20220824 12:49:20.590 INFO grin_servers::common::adapters - Received 8 block headers from 172.26.14.223:3414
20220824 12:49:30.595 INFO grin_servers::common::adapters - Received 10 block headers from 172.26.12.78:3414
20220824 12:49:31.898 INFO grin_servers::grin::sync::syncer - synchronized at 1965580705767519 @ 1889250 [000379751ed9]
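
For reference, the numbers in the first log line work out as expected: the peer is ahead by 1965580349437861 - 1965578938893570 = 1410544291, which exceeds the 5-block threshold of 872616680, so sync is enabled. A hedged sketch of that check (names are illustrative, not the actual syncer code):

```rust
// Illustrative only: the sync decision implied by the log line above.
fn needs_header_sync(total_difficulty: u64, peer_difficulty: u64, threshold: u64) -> bool {
    peer_difficulty.saturating_sub(total_difficulty) > threshold
}

fn main() {
    let total = 1_965_578_938_893_570u64; // our total_difficulty
    let peer = 1_965_580_349_437_861u64;  // peer_difficulty
    let threshold = 872_616_680u64;       // work of the last 5 blocks
    // 1_410_544_291 > 872_616_680, so header sync kicks in.
    assert!(needs_header_sync(total, peer, threshold));
    println!("sync enabled");
}
```
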

On nodes where the issue occurs, we see sync being enabled, followed by a 'no outbound peers, considering inbound' message, and then nothing until we start seeing 'too many open files' errors. This would likely indicate that the header_pmmr is write-locked in the sync thread, which is then unable to complete header syncing.

@droid192

same issue, more log data:

20230215 16:09:58.375 ERROR grin_util::logger - 
thread 'peer_connect' panicked at 'clone conn for reader failed: Os { code: 24, kind: Uncategorized, message: "Too many open files" }': p2p/src/conn.rs:224
   0: grin_util::logger::send_panic_to_log::{{closure}}
   1: std::panicking::rust_panic_with_hook
             at /rustc/abc...abc/library/std/src/panicking.rs:610
   2: std::panicking::begin_panic_handler::{{closure}}
             at /rustc/abc...abc/library/std/src/panicking.rs:502
   3: std::sys_common::backtrace::__rust_end_short_backtrace
             at /rustc/abc...abc/library/std/src/sys_common/backtrace.rs:139
   4: rust_begin_unwind
             at /rustc/abc...abc/library/std/src/panicking.rs:498
   5: core::panicking::panic_fmt
             at /rustc/abc...abc/library/core/src/panicking.rs:116
   6: core::result::unwrap_failed
             at /rustc/abc...abc/library/core/src/result.rs:1690
   7: grin_p2p::conn::listen
   8: grin_p2p::peer::Peer::new
   9: grin_p2p::peer::Peer::connect
  10: grin_p2p::serv::Server::connect
  11: std::sys_common::backtrace::__rust_begin_short_backtrace
  12: core::ops::function::FnOnce::call_once{{vtable.shim}}
  13: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
             at /rustc/abc...abc/library/alloc/src/boxed.rs:1854
      <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
             at /rustc/abc...abc/library/alloc/src/boxed.rs:1854
      std::sys::unix::thread::Thread::new::thread_start
             at /rustc/abc...abc/library/std/src/sys/unix/thread.rs:108
  14: start_thread
  15: clone

@phyro
Member

phyro commented Feb 24, 2023

@droid192 I believe the fix for this was the last commit in #3695, which is 3524b70. Are you using the latest version, https://github.com/mimblewimble/grin/releases/tag/v5.2.0-alpha.2 ?

@droid192
Copy link

Updated from 5.1 to 5.2.0-alpha.2. So far so good.
