Possible memory leak on new_server_connection #499

Open
dinosaure opened this issue Nov 3, 2022 · 3 comments

@dinosaure
Member

Hi,

I'm currently trying to stress-test a unikernel that uses the mirage-tcpip stack to see whether we have a memory leak. I audited my application to check that new connections are properly closed - that way, if we do have a memory leak, it is probably due to the underlying layer, in my case mirage-tcpip.

The stress test targets a simple relay (compiled for Solo5/hvt and available here: bob-hvt). You can check the source code here: https://github.com/dinosaure/bob. The situation is: I launch 512 clients and check whether they can find their peers from a shared password. On the other side, I monitor the relay.

The script (bash + parallel) used to launch the stress test is here:

#!/bin/bash

MAX=256
LEN=16
IP="10.0.0.2" # IP of the relay

# Generate MAX unique random passwords.
PASSWORDS=()
for i in $(seq $MAX); do
	P=$(echo $RANDOM | base64 | head -c $LEN);
	while [[ " ${PASSWORDS[*]} " =~ " ${P} " ]]; do
		P=$(echo $RANDOM | base64 | head -c $LEN);
	done;
	PASSWORDS+=( $P )
done

PASSWORDS=($(printf "%s\n" "${PASSWORDS[@]}" | sort -u))

# Build one sender and one receiver per password (2 * MAX clients in total).
IFS=$'\n'
COMMANDS=()
FILES=()
for i in ${!PASSWORDS[@]}; do
	SEND="./bob.com send -r $IP file.txt -p ${PASSWORDS[$i]} -vvv 2> output.send.$i > /dev/null";
	RECV="./bob.com recv -r $IP ${PASSWORDS[$i]} -vvv -o file.txt.$i 2> output.recv.$i > /dev/null";
	COMMANDS+=( $SEND )
	COMMANDS+=( $RECV )
	FILES+=( "file.txt.$i" )
done

# Shuffle the commands and run them in parallel against the relay.
COMMANDS=( $(shuf -e "${COMMANDS[@]}") )

parallel -j $MAX ::: ${COMMANDS[@]}

# Hash the received files so they can be compared with file.txt.
HASHES=($(sha256sum ${FILES[@]}))
rm ${FILES[@]}

for hash in ${HASHES[@]}; do
	echo $hash
done

It requires bob.com as the client, a random file.txt and the relay deployed somewhere (by default, we use one deployed at osau.re, my server). If you launch the stress test, you can see that the relay lets clients solve the initial handshake and allocates a secure room so that associated peers can share something (in my example, a simple file).

If I launch the relay with 40M of memory, you can run such a script only once. The next run will raise an Out_of_memory exception on the relay side. I tried to introspect this memory leak. bob is quite simple (no TLS, just a bunch of packets between peers). I used memtrace-mirage to introspect the memory used by my relay. You must change config.ml and unikernel.ml as follows:

diff --git a/mirage/config.ml b/mirage/config.ml
index 01037e1..03874db 100644
--- a/mirage/config.ml
+++ b/mirage/config.ml
@@ -21,10 +21,12 @@ let bob =
         package "bob" ~pin:"git+https://github.com/dinosaure/bob.git";
         package "spoke" ~sublibs:[ "core" ];
         package "psq";
+        package "memtrace-mirage";
       ]
     ~keys:[ Key.v port; Key.v secure_port ]
-    (time @-> stackv4v6 @-> job)
+    (pclock @-> time @-> stackv4v6 @-> job)
 
+let pclock = default_posix_clock
 let time = default_time
 let stackv4v6 = generic_stackv4v6 default_network
-let () = register "bob" [ bob $ time $ stackv4v6 ]
+let () = register "bob" [ bob $ pclock $ time $ stackv4v6 ]
diff --git a/mirage/unikernel.ml b/mirage/unikernel.ml
index 1f34edf..bea018b 100644
--- a/mirage/unikernel.ml
+++ b/mirage/unikernel.ml
@@ -52,7 +52,7 @@ let serve_when_ready :
 
 let always v _ = v
 
-module Make (Time : Mirage_time.S) (Stack : Tcpip.Stack.V4V6) = struct
+module Make (Pclock : Mirage_clock.PCLOCK) (Time : Mirage_time.S) (Stack : Tcpip.Stack.V4V6) = struct
   let service ~port tcp =
     let queue = Queue.create () in
     let condition = Lwt_condition.create () in
@@ -437,8 +437,13 @@ module Make (Time : Mirage_time.S) (Stack : Tcpip.Stack.V4V6) = struct
   end
 
   module Bob_clear = Make (Stack.TCP)
+  module Memtrace = Memtrace.Make(Pclock)(Stack.TCP)
 
-  let start _time stack =
+  let start _pclock _time stack =
+    ( Stack.TCP.create_connection (Stack.tcp stack) (Ipaddr.of_string_exn "10.0.0.1", 5678) >|= function
+    | Ok flow -> ignore (Memtrace.start_tracing ~context:None ~sampling_rate:1e-4 flow)
+    | Error e -> Logs.warn (fun m -> m "couldn't connect to tracing sink %a"
+                              Stack.TCP.pp_error e)) >>= fun () ->
     let rooms = Bob.Secured.make () in
     let handshake socket = Lwt.return (Ok (socket, Stack.TCP.dst socket)) in
     Lwt.join

Then, just before launching the unikernel with solo5-hvt, you can retrieve the trace via: netcat -l -p 5678 > bob.trace. After that, you can open the file with memtrace-viewer (on OCaml 4.13.1). From memtrace-viewer, I tried to find what is allocated and still alive just before the Out_of_memory. The nice thing is: if you launch stress.sh once, wait a few seconds, and relaunch it, you can see a "plateau" where almost everything should be garbage. The second launch of stress.sh triggers the GC, but it seems clear to me that the GC did not collect everything.

In this picture, we can see the GC in action; a few objects allocated earlier are collected:
Screenshot 2022-11-03 133944

If we introspect further, we can see that some allocations are kept. Let's take an area where we have such allocations:
Screenshot 2022-11-03 134101

Screenshot 2022-11-03 134241

If we look at what is allocated, memtrace-viewer mentions new_server_connection. Moreover, it seems that a "thread" is kept alive forever, specifically the send_empty_ack "thread":
Screenshot 2022-11-03 134344

let thread t pcb ~send_ack ~rx_ack =
  let { wnd; ack; _ } = pcb in
  (* Transmit an empty ack when prompted by the Ack thread *)
  let rec send_empty_ack () =
    Lwt_mvar.take send_ack >>= fun _ ->
    let ack_number = Window.rx_nxt wnd in
    let flags = Segment.No_flags in
    let options = [] in
    let seq = Window.tx_nxt wnd in
    ACK.transmit ack ack_number >>= fun () ->
    xmit_pcb t.ip pcb.id ~flags ~wnd ~options ~seq (Cstruct.create 0) >>= fun _ ->
    (* TODO: what to do if sending failed. Ignoring
     * errors gives us the same behavior as if the packet
     * was lost in transit *)
    send_empty_ack ()
  in
  (* When something transmits an ACK, tell the delayed ACK thread *)
  let rec notify () =
    Lwt_mvar.take rx_ack >>= fun ack_number ->
    ACK.transmit ack ack_number >>= fun () ->
    notify ()
  in
  send_empty_ack () <&> (notify ())

As you can see, this thread is an infinite loop. This is where the control flow becomes complex. It seems that such a thread can be cancelled if another thread raises an exception:

let catch_and_cancel = function
  | Lwt.Canceled -> ()
  | ex ->
    (* cancel the other threads *)
    List.iter Lwt.cancel threads;
    Log.err (fun f ->
        f "thread failure: [%s]. Terminating threads and closing connection"
          (Printexc.to_string ex));
    on_close ();
    !Lwt.async_exception_hook ex
in
List.iter (fun t -> Lwt.on_failure t catch_and_cancel) threads;

But it's clear that, in my situation, the cancellation never happens. It's really hard to follow what is going on because it's a mix of the Lwt control flow (bind & async), exceptions and cancellation... I probably missed something about the cancellation of the send_empty_ack thread, but the reality (memtrace-viewer) shows me that such a thread is never cancelled. Because this thread requires the pcb value, it keeps multiple values alive, in particular the initial Cstruct.t given by Netif. So, of course, if this problem appears several times, we just accumulate Cstruct.ts and a fiber under the hood.
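
To illustrate the retention pattern I suspect, here is a minimal sketch (not mirage-tcpip code; leak_demo and big are names I made up): a recursive thread blocked on an Lwt_mvar keeps its closure alive, including any buffer it captures, until someone cancels it.

open Lwt.Infix

(* Minimal sketch of the suspected retention: the suspended [loop] promise
   holds a closure that captures [big], so [big] stays reachable until the
   thread is resolved or cancelled with Lwt.cancel. *)
let leak_demo () =
  let mvar = Lwt_mvar.create_empty () in
  let big = Cstruct.create 4096 in        (* stands in for the pcb / Netif buffer *)
  let rec loop () =
    Lwt_mvar.take mvar >>= fun () ->
    let _len = Cstruct.length big in      (* the closure keeps [big] alive *)
    loop ()
  in
  loop ()                                 (* never cancelled => never collected *)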

On the other side (the application side), I added a few logs which allow me to "track" connections. If you launch stress.sh the first time and introspect the logs, you can see that we have 512 new connections and that Flow.close is called 512 times. That mostly means that, at the application level, we should not have a leak in our usage of the TCP/IP stack (however, if you look at memtrace-viewer, you can see a few references to what Bob does, and that's normal: we want to keep a global state of the relay).

Finally, I don't expect a fix because I know that this project has become more and more complex - and it's difficult to really understand what is going on. But if anyone has time to look at this leak in more detail, it would be very nice.

@haesbaert
Member

I was more or less with you until the send_empty_ack kicked in, how did you reach that? Is it only because it's an infinite loop?

If you look at t.channels, t.connects and t.listens before and after, are they what you expect?

Can't you use a hammer? Cut all references to the TCP stack as close to the root as possible and see if the leak goes away, maybe disconnect the whole TCP stack? We're just trying to verify that we're not chasing some red herring.

The second idea is to count the number of waiters in Lwt: if your assumption is correct and send_empty_ack is never reaped, then it should be a waiter, so printing the number of Lwt waiters before and after the first step should be enough.
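
Lwt does not expose a public count of pending promises, so one way to approximate this (a sketch with hypothetical names; live_threads and tracked are not an existing Lwt or mirage-tcpip API) would be to wrap the creation of the per-connection threads:

(* Hypothetical instrumentation: count how many per-connection threads are
   currently alive by wrapping their creation. *)
let live_threads = ref 0

let tracked name th =
  incr live_threads;
  (* fires when the promise is resolved, rejected or cancelled *)
  Lwt.on_termination th (fun () -> decr live_threads);
  Logs.debug (fun m -> m "%s started, %d thread(s) alive" name !live_threads);
  th

Wrapping send_empty_ack () <&> (notify ()) with such a helper and printing the counter before and after a stress run should tell whether those threads are ever reaped.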

Or also, just forcibly cancel the send_empty_ack on Flow.close; I'd be trying to verify that send_empty_ack is the one hanging around, and then pick it up from there.
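
To make that concrete, here is a rough sketch of reaping the per-connection thread on close, mirroring the process_reset branch quoted later in this thread (the conn and t types are simplified stand-ins, not the real mirage-tcpip structures):

(* Hypothetical helper: tearing down a connection should also cancel its
   per-connection thread. *)
type conn = { th : unit Lwt.t }
type t = { channels : (int, conn) Hashtbl.t }

let close_and_reap t id =
  match Hashtbl.find_opt t.channels id with
  | None -> Lwt.return_unit
  | Some conn ->
    Hashtbl.remove t.channels id;
    Lwt.cancel conn.th;                   (* reaps send_empty_ack <&> notify *)
    Lwt.return_unit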

@dinosaure
Member Author

dinosaure commented Nov 3, 2022

I was more or less with you until the send_empty_ack kicked in, how did you reach that? Is it only because it's an infinite loop?

From what I can tell, this thread appears for all new server connections (via new_server_connection). So this thread is started every time I get a new connection. Again, I haven't looked into it deeply, but it seems to me that this thread is only cancelled by Lwt.on_failure (so when one of the 3 threads raises an exception) or by process_reset.

match hashtbl_find t.listens id with
| Some (_, (_, (pcb, th))) ->
  Hashtbl.remove t.listens id;
  Stats.decr_listen ();
  STATE.tick pcb.state State.Recv_rst;
  Lwt.cancel th;
  Lwt.return_unit

EDIT: about send_empty_ack, I found it because at the top of the call graph (above new_server_connection), you can find (but I did not show that) Lwt.join, called specifically here:

send_empty_ack () <&> (notify ())

If you look at t.channels, t.connects and t.listens before and after, are they what you expect?

I did not count. In our case, only t.listens really counts. The relay does not initiate any other connections by itself (as a client).

Can't you use a hammer? Cut all references to the TCP stack as close to the root as possible and see if the leak goes away, maybe disconnect the whole TCP stack? We're just trying to verify that we're not chasing some red herring.

I don't really understand what you are asking. I only noticed the leak when I apply my stress test. If we introspect the beginning of the stress test, everything seems to work fine; it is only at a certain load level that the leak appears. Again, I have checked, in the application area, that:

  1. there are no leaks in the relay management (global relay state), and memtrace-viewer confirms this to me.
  2. the resources (the TCP "socket") used by the application domain are "morally" released: I counted the number of new connections and the number of times I close these connections, and I have a one-to-one correspondence (a rough sketch of this bookkeeping follows below).

Bob is not doing anything other than handling incoming connections.
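
For reference, the bookkeeping mentioned in point 2 looks roughly like this (hypothetical names; serve_client and close_flow stand in for Bob's actual handler and for Stack.TCP.close):

(* Hypothetical bookkeeping on the application side: count accepted and
   closed flows so the two numbers can be compared after a stress run. *)
let accepted = ref 0
let closed = ref 0

let handle ~serve_client ~close_flow flow =
  incr accepted;
  Lwt.finalize
    (fun () -> serve_client flow)
    (fun () -> incr closed; close_flow flow)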

The second idea is to count the number of waiters in Lwt: if your assumption is correct and send_empty_ack is never reaped, then it should be a waiter, so printing the number of Lwt waiters before and after the first step should be enough.

Or also, just forcibly cancel the send_empty_ack on Flow.close; I'd be trying to verify that send_empty_ack is the one hanging around, and then pick it up from there.

As I said, it is difficult for me to go any further; this issue is mainly a notification of a behaviour that I can reproduce. Unfortunately, I don't really have time to look further 😕.

@haesbaert
Member

👍 It's nice that it's written here, hopefully someone picks up :)
