
Anomalous behavior of TCP cloak (also UDP but less important) when using Android Hotspot, unlikely to be TTL / operator filtering #229

LindaFerum opened this issue Aug 6, 2023 · 32 comments


@LindaFerum

LindaFerum commented Aug 6, 2023

Symptoms are basically that after a while the connection just "dies" with no ability to reconnect (eventually a Failed to establish new connections to remote error) unless the Cloak client is restarted.

This also affects UDP (it's more obvious there because you can see the healthy "read from stream 1 with err" messages stop, with no further events), but it's the first time for me that TCP Cloak got completely disabled.

NOTABLE FACT:
connecting the phone to my laptop over USB solves the problem utterly and completely; both TCP and UDP Cloak start working flawlessly.

I initially suspected:

  • infamous TTL-based throttling
  • local filtering shenanigans (the environment I'm currently in is hostile)
  • some particular phone anomaly

I think it is not TTL-based throttling because the issue has been replicated by a colleague in a decent EU country, on a connection where mobile hotspot use is explicitly enabled and billed separately.
Also, mobile operators in most cases detect USB tethering too, yet Cloak works perfectly for me when the connection is shared over USB.

I doubt it is local hostiles somehow detecting Cloak, because of the aforementioned replication by a colleague in the EU and because the connection from here works over USB.

It has been replicated on three different Android phones (a Pixel 5 with CalyxOS, some noname brick with LineageOS, and a Xiaomi unit).

The connection looks like this when it breaks hard:

Laptop with Cloak (linux)
|
|wifi (android hotspot)
|
Android Phone
|
internet

And it looks like this when it works perfectly:
Laptop with Cloak (linux)
|
|USB wire
|
Android Phone
|
internet

@notsure2
Contributor

notsure2 commented Aug 7, 2023

It could be an issue in the Android wifi driver when operating in hotspot mode. Cloak's transmit/receive code is very simple (just a goroutine loop copying both sides). If there are latency issues or lags, it could be an issue in, or tuning needed for, the Go scheduler itself. It could also be a TCP stall/breakdown: in a proxying situation a phenomenon similar to bufferbloat appears, where the two TCP connections on either side of Cloak operate under different network conditions and don't "couple" their flow rates correctly.

Sometimes when there's a sudden network lag or throughput drop on one side, the other side doesn't feel it, and the congestion on one side of Cloak is not properly transmitted as backpressure to the other side to slow its flow rate accordingly. This is due to the buffers used in Cloak (at the TCP level, or the Cloak level) being too large while copying between the connections.
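
As a rough, hypothetical sketch of the copy loop being described (the names below are made up, not Cloak's actual code), the size of the copy buffer is what decides how quickly a stalled writer on one side stops reads on the other:

// Hypothetical sketch of the goroutine copy loop described above; not Cloak's real code.
package relayexample

import (
	"io"
	"log"
	"net"
)

// relay copies src to dst through a fixed-size buffer. io.CopyBuffer blocks on
// dst.Write, so the smaller the buffer, the sooner a slow or stalled dst stops
// reads from src and backpressure reaches the other TCP connection.
func relay(src, dst net.Conn, bufSize int) {
	buf := make([]byte, bufSize) // e.g. 4 KiB rather than megabytes
	if _, err := io.CopyBuffer(dst, src, buf); err != nil {
		log.Printf("relay stopped: %v", err)
	}
}

// pipe wires two connections together, one goroutine per direction.
func pipe(a, b net.Conn) {
	go relay(a, b, 4096)
	go relay(b, a, 4096)
}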

I tried to tackle this in my fork by adding an extra configuration option or two to set the buffer sizes, but with limited or no effect. I suspect I haven't done something correctly, which is why I haven't made a PR on this repo with these changes yet.

Try my fork and see if playing with the extra config options related to receive buffer and send buffer changes anything in the effect you're seeing.

@notsure2
Contributor

notsure2 commented Aug 7, 2023

Another symptom of this breakdown is that the connection doesn't achieve the maximum download rate: the TCP congestion algorithm depends on sensing packet loss and backpressure to optimize the rate, so when the coupling is desynced it can't reach the maximum possible throughput either.

@LindaFerum
Author

LindaFerum commented Aug 7, 2023

Well @notsure2, I tried the small-tcp-buffer branch, and playing with buffer sizes has not improved anything, alas.

However, I did find a workaround (a completely insane one but it works!)

I have installed QubesOS and separated Cloak into its own proxy VM, between the VM running OpenVPN and the VM containing the passthrough for the network card.

So now it looks like

Laptop with Cloak (QubesOS)
[Laptop's OpenVPN VM]
+
[Laptop's Cloak VM]
+
[Laptop's Network card VM]

|
|wifi (android hotspot)
|
Android Phone
|
internet

And this boondoggle works.

Cloak works very well, both TCP and UDP, almost without flaw (well, the hiccups I described in another issue persist, but IMO they got better).

Also, ping to my test remote improved by about 10 +/- 2 ms.

I think this suggests that Cloak is somehow interacting poorly with my laptop's network card driver. Weird.

P.S.:
I do think that fine-tuning the buffers is an excellent idea, though.
You should PR it after a check-up.
In the meantime, what would be a good process for determining good values for the buffer sizes? I'm thinking I might play around with it more during my downtime.

@notsure2
Contributor

notsure2 commented Aug 7, 2023

Start it at 4MB and halve it each time, and see how this changes the behavior. Ideally you want it really, really small, so that if the connection on one side seizes up or stalls, Cloak also stalls the other side by stopping the copy (so that the other side couples its congestion algorithm to the clogged side's congestion algorithm).
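
For reference, a minimal sketch of what per-connection buffer tuning looks like at the socket level (an assumption about what the fork's send/receive buffer options roughly map to, not code taken from it):

// Assumed illustration only; not the fork's actual implementation.
package buftune

import "net"

// shrinkBuffers caps the kernel send and receive buffers on a TCP connection.
// Start at something like 4 MiB and halve it per test run to observe the effect.
func shrinkBuffers(c *net.TCPConn, size int) error {
	if err := c.SetReadBuffer(size); err != nil {
		return err
	}
	return c.SetWriteBuffer(size)
}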

Edit: also use the notsure2 branch, as I may have made some bugfixes since merging it. The binaries I have on my fork are built from the notsure2 branch, so you don't need to build it yourself.

@LindaFerum
Author

LindaFerum commented Aug 7, 2023

@notsure2 Cool! I'll play with your notsure2 branch from now on. On an unrelated note, what's up with the support-cloudflare branch? Does it work, and how do I use it?

P.S.:
I spoke too early about "everything works" after the Qubes "hack".

UDP still suffers from a "hard stall" problem, but TCP appears to be immune (or at least the stall takes a long while to develop).

I used that to try and capture something in a log server-side, with what might be success.

The procedure I followed (everything restarted, server included):

Started the Cloak clients in their own VM (networking configured appropriately), with different UIDs for the UDP and TCP clients to make sure they can be differentiated in the server's log.

Started an OpenVPN UDP session, plus a ping to get an idea of when the connection stalls (also visible in the Cloak client's trace output; the "reads" just stop).

As soon as the UDP "irrecoverable stall" was detected, I restarted OpenVPN with the TCP config and connected to the Cloak server to grab some logs.

The resulting logs:

Aug 07 22:03:21 i-007d6123cd8a90283 ck-server[8742]: time="2023-08-07T22:03:21Z" level=info msg="Listening on :443"

Aug 07 22:03:21 i-007d6123cd8a90283 ck-server[8742]: time="2023-08-07T22:03:21Z" level=info msg="Listening on :80"

Aug 07 22:03:23 i-007d6123cd8a90283 ck-server[8742]: time="2023-08-07T22:03:23Z" level=info msg="New session" UID="UDP-CLIENT-UID==" sessionID=1248497831

Aug 07 22:06:39 i-007d6123cd8a90283 ck-server[8742]: time="2023-08-07T22:04:39Z" level=info msg="New session" UID="!!!!!!TCP-CLIENT-UID==" sessionID=3365958504

Aug 07 22:06:58 i-007d6123cd8a90283 ck-server[8742]: time="2023-08-07T22:04:58Z" level=warning msg="decryption/authentication faliure: cipher: message authentication failed" UID= encryptionMethod=0 proxyMethod= remoteAddr="CENSORED:36268" sessionId=0

Aug 07 22:06:59 i-007d6123cd8a90283 ck-server[8742]: time="2023-08-07T22:04:59Z" level=warning msg="decryption/authentication faliure: cipher: message authentication failed" UID= encryptionMethod=0 proxyMethod= remoteAddr="CENSORED:37902" sessionId=0

Aug 07 22:07:07 i-007d6123cd8a90283 ck-server[8742]: time="2023-08-07T22:05:07Z" level=info msg="Session closed" UID="UDP-CLIENT-UID==" reason=timeout sessionID=1248497831

Aug 07 22:07:07 i-007d6123cd8a90283 ck-server[8742]: time="2023-08-07T22:05:07Z" level=info msg="Terminating active user" UID="UDP-CLIENT-UID==" reason="no session left"

Aug 07 22:07:41 i-007d6123cd8a90283 ck-server[8742]: time="2023-08-07T22:05:41Z" level=info msg="New session" UID="UDP-CLIENT-UID==" sessionID=202665511

Aug 07 22:07:54 i-007d6123cd8a90283 ck-server[8742]: time="2023-08-07T22:05:54Z" level=warning msg="error reading first packet: read error after connection is established: EOF" remoteAddr="CENSORED:50530"

P.P.S.:
AAAAAAND AS SOON AS I TYPED IN "TCP APPEARS IMMUNE" THE TCP CONNECTION ALSO HAD A STALL

@LindaFerum
Author

Okay, it seems like limiting the buffers to exactly the same size (currently all set to the same 5 MB) for both send and receive on all ends AND raising the nofile limit for the client and server to something big (30k or so) massively improves things.

I will let the test rig run overnight and see whether Cloak UDP seizes up or not.

@LindaFerum
Author

Okay, @notsure2, after playing around with the buffers a fair lot, I have the following to report:
Having the buffers configured to around 5 MB (same on server and client) massively improved connection throughput on my current link (I can push 20 MB/s both up and down on this crappy, finicky connection) but does not resolve the stalls.

NOFILE is sadly likely a placebo effect (which makes sense; lsof shows very few file descriptors open by the Cloak client).

Stalls become harder to reproduce but very much remain.
UDP Cloak stalls once every several hours or so, but when it stalls it stalls for good (even a full client restart doesn't always clear it, for some mystery reason).

TCP Cloak stalls less frequently still (I'd say I could push a full workday in without a stall in a realistic setting),

but unlike with UDP, a peculiar phenomenon emerged with the TCP Cloak stall and 5 MB buffers:

it became possible to semi-reliably trigger a TCP Cloak stall by pushing the connection as hard as possible (two 720p 60 fps videos + a speedof.me test on loop + a ping running in the background do it for my link); after mere minutes of such abuse TCP Cloak would stall, and the connection would not recover until a client restart.

The Cloak client logs remain somewhat unenlightening, IMHO:

INFO[2023-08-08T14:39:07****] Attempting to start a new session
TRAC[2023-08-08T14:39:07****] client hello sent successfully
TRAC[2023-08-08T14:39:07****] waiting for ServerHello
TRAC[2023-08-08T14:39:07****] client hello sent successfully
TRAC[2023-08-08T14:39:07****] waiting for ServerHello
TRAC[2023-08-08T14:39:07****] client hello sent successfully
TRAC[2023-08-08T14:39:07****] waiting for ServerHello
TRAC[2023-08-08T14:39:07****] client hello sent successfully
TRAC[2023-08-08T14:39:07****] waiting for ServerHello
DEBU[2023-08-08T14:39:07****] All underlying connections established
INFO[2023-08-08T14:39:07****] Session 227931397 established
TRAC[2023-08-08T14:39:07****] stream 1 of session 227931397 opened
TRAC[2023-08-08T14:39:09****] stream 1 actively closed.
DEBU[2023-08-08T14:39:09****] session 227931397 has no active stream left
TRAC[2023-08-08T14:39:09****] copying proxy client to stream: EOF
TRAC[2023-08-08T14:39:09****] 0 read from stream 1 with err EOF
TRAC[2023-08-08T14:39:09****] copying stream to proxy client: broken stream
DEBU[2023-08-08T14:39:23****] Setting loopback connection tcp send buffer: 5120
DEBU[2023-08-08T14:39:23****] Setting loopback connection tcp receive buffer: 5120
TRAC[2023-08-08T14:39:23****] stream 2 of session 227931397 opened
TRAC[2023-08-08T14:48:35****] stream 2 actively closed. <--- trouble starts approximately here
DEBU[2023-08-08T14:48:35****] session 227931397 has no active stream left
TRAC[2023-08-08T14:48:35****] copying proxy client to stream: EOF
TRAC[2023-08-08T14:48:35****] 0 read from stream 2 with err EOF
TRAC[2023-08-08T14:48:35****] copying stream to proxy client: broken stream
DEBU[2023-08-08T14:48:36****] Setting loopback connection tcp send buffer: 5120
DEBU[2023-08-08T14:48:36****] Setting loopback connection tcp receive buffer: 5120
TRAC[2023-08-08T14:48:36****] stream 3 of session 227931397 opened
TRAC[2023-08-08T14:49:50****] stream 3 actively closed.
DEBU[2023-08-08T14:49:50****] session 227931397 has no active stream left
TRAC[2023-08-08T14:49:50****] copying proxy client to stream: EOF
TRAC[2023-08-08T14:49:50****] 0 read from stream 3 with err EOF
TRAC[2023-08-08T14:49:50****] copying stream to proxy client: broken stream
DEBU[2023-08-08T14:49:52****] Setting loopback connection tcp send buffer: 5120
DEBU[2023-08-08T14:49:52****] Setting loopback connection tcp receive buffer: 5120
TRAC[2023-08-08T14:49:52****] stream 4 of session 227931397 opened

My non-programmer bellyfeel here is that Cloak (likely the Cloak client, because the server seems to be doing just fine) is silently running out of some resource and, instead of raising an error, simply stops sending/receiving.

@notsure2
Contributor

notsure2 commented Aug 8, 2023

@LindaFerum I can't remember the exact details, as I ultimately found it useless since Cloudflare throttles websocket throughput on the free tier.

  1. you add your domain to Cloudflare
  2. you change your domain's nameservers to Cloudflare
  3. you change the A record of your domain to proxied
  4. on your real server, you set up nginx on port 80
  5. use acme.sh to get a real SSL certificate for your domain
  6. change the nginx SSL port to something else, put Cloak on port 443, and set the redir addr to the localhost nginx port
  7. use CDNWsUrlPath on the client to set the secret URL path that will go to Cloak, let's say /cloak
  8. add to the nginx config: location /cloak { proxy_pass http://localhost:443/; }
  9. make sure websocket support is enabled in the Cloudflare settings
  10. on the client, set the mode to cdn, set ServerName to your real domain, and set CdnWsUrlPath=/cloak
  11. It should work.

@LindaFerum
Author

@notsure2 thanks! I'll try it.
There's no way to have the "ServerName" be something relatively inconspicuous with Cloudflare, right?
P.S.:
Any ideas on the stalls (see my report above on playing around with the buffers and provoking a stall)?

@notsure2
Contributor

notsure2 commented Aug 8, 2023

No. Cloudflare shut off domain fronting, so a hostile ISP can zero in on your domain, sniff the SNI and TCP-reset it.

I have experienced these stalls also, but I don't know exactly why they happen. My suspicion points to some kind of goroutine or scheduler issue. Cloak's logging is not good and it is hard to interpret flows from it; it needs some improvement to become more legible.

From your log there might be some kind of dynamic blocking done by the ISP; the session that opens and immediately closes looks like it got a TCP reset from the ISP. Try to collect a Wireshark capture while the stalls happen and analyze the TCP flows to see if there's any abnormal behavior.

Also try to check, by matching timestamps, whether any error is reported on the server side. When an ISP sends a TCP reset, it sends it to both sides to cut the connection from both ends, and the server will report this in its log. I think a TCP reset shows up as EOF.

@notsure2
Contributor

notsure2 commented Aug 8, 2023

I want to find some time to tackle this issue very deeply, but sadly I don't have it these days.

@LindaFerum
Author

I did consider this being an "ISP asshole thing", since I'm currently in a relatively hostile environment.

I'm not good with Wireshark, but I may try using a "guaranteed no-asshole" satellite connection to see if Cloak TCP will stall there.

@notsure2
Contributor

notsure2 commented Aug 8, 2023

Try NumConn=0. I implemented this specifically to avoid stalls, by removing the multiplexing and having each connection map to one short-lived session. This works reasonably OK for me.

@LindaFerum
Author

Further reporting:
Inducing a stall over satellite proved much harder, but after some connection abuse Cloak TCP over satellite gave up and stalled.
Of note, it is a high-latency, low-bandwidth link that doesn't reach 1 Mb/s under any circumstances, so maybe that is what makes inducing a stall harder over sat
(race-condition-like behavior?)

Will try NumConn=0 later.

@LindaFerum
Author

Well, @notsure2, for what it's worth, NumConn=0 does seem to help with stalls a lot, even on the "bad/sus" connection (which I think substantially rules out the previous stalls being ISP meddling), but it also affects performance a lot (compared to the blast I can get out of this connection with multiplexing and 5 MB buffers). So if something race-like is happening here, the improvement may simply be due to a lower probability of triggering the stall event.

Overall I think I'll stick with multiplexing, 5 MB buffers and TCP mode for now.
Under normal conditions stalls appear quite rare, and I think I can jerry-rig something that restarts the Cloak client when the connection times out, as a stopgap mitigation.

@cbeuw
Owner

cbeuw commented Oct 8, 2023

I think I managed to get a reliable reproduction of this issue locally by creating artificial latency with https://github.com/Shopify/toxiproxy and making 10 parallel connections.

@LindaFerum
Author

@cbeuw Awesome! Glad to know I caught an actual real bug and not just some mystery weirdness :)

@cbeuw cbeuw closed this as completed in eca5f13 Nov 12, 2023
@cbeuw
Owner

cbeuw commented Nov 12, 2023

For those who are interested, this is the bug:

Cloak has a concept of Streams. Each stream represents a TCP connection between the proxied application client and server (there is only one stream per session in UDP mode). As there are usually far fewer Cloak-client-to-Cloak-server TLS connections than there are Streams, data from each stream is interleaved across the TLS connections.

Upon receiving some data, the Cloak server must figure out which stream the data needs to go to, using metadata attached by the Cloak client, then send it to the corresponding TCP connection to the application server. This is done in session.recvDataFromRemote, which is called on the thread that manages a Cloak-client-to-server TLS connection, so this function must not block. If it blocked, all data sent on that thread's TLS connection would queue up, which may include data from other streams. If one server-side application has issues, this would then spread to the rest of the session.

To prevent blocking, each stream has a buffer (recvBuffer) that's effectively a channel - data can be concurrently written to and read from it. If the application isn't reading fast enough, its data will be buffered, allowing the TLS thread to move on quickly.

However, a while ago I made an optimisation by adding a WriteTo method to recvBuffer. WriteTo reduces the amount of memory copying by writing data directly into a Writer interface rather than into a buffer. This made recvBuffer's non-blocking property dependent on the Writer being non-blocking - in this case a TCP connection that goes to an application server. This is wrong, because TCP sockets in Cloak are not guaranteed to be non-blocking.

This issue is hard to reproduce because most of the time, writing to a TCP connection to an application appears non-blocking. Most apps read data as it comes, plus there's also a buffer in the OS if the app isn't consuming data fast enough. Writing to a TCP socket only blocks when this OS buffer is full, which doesn't happen often.

But a stall will happen when one application has two incoming connections A and B, where data coming from A is only consumed when something is received on B. Data through A and B may be routed through the same TLS connection, with packets for A at the front and the packet for B at the very back. Cloak may fill up A's receiving buffer, causing recvDataFromRemote to block before the packet for B is processed. This is a deadlock, as A's receiving buffer can only be cleared once the app sees data on B, but the app can't see data on B until A's receiving buffer is cleared.
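
To illustrate the two paths in a contrived form (hypothetical names, not Cloak's actual recvBuffer implementation): the buffered path lets the goroutine serving the TLS connection move on, while the direct-write path ties that goroutine to the readiness of the application-side socket.

// Contrived sketch of the difference described above; names are invented.
package demuxsketch

import "io"

// bufferedDeliver hands data to the stream through the stream's own buffer.
// The goroutine serving the TLS connection does not wait on the
// application-side socket, so one slow stream cannot hold up the others.
func bufferedDeliver(streamBuf chan []byte, data []byte) {
	cp := make([]byte, len(data))
	copy(cp, data)
	streamBuf <- cp
}

// directDeliver is the "optimised" path: it writes straight into the
// application-side connection. If that socket's send buffer is full, Write
// blocks, and so does the goroutine serving the whole TLS connection: exactly
// the deadlock described above, since the data that would unblock the
// application sits behind this very write.
func directDeliver(appConn io.Writer, data []byte) error {
	_, err := appConn.Write(data)
	return err
}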

Simply removing the faulty optimisation should fix the issue.

@LindaFerum if you're happy to, could you let me know if the problem is fixed for you? I'll make a new release if you don't see any more problems.

Pre-release with the fix here: https://github.com/cbeuw/Cloak/releases/tag/v2.7.1-pre

@LindaFerum
Author

@cbeuw awesome news, thanks! I'll try to test it out around next week, or maybe even this weekend if I get lucky.

@notsure2
Contributor

I think I'm still seeing this issue but in UDP. Will try to figure it out.

@notsure2
Contributor

I'm still seeing this issue on TCP also, but it's more rare than before. I think this is not completely solved yet.

@LindaFerum
Author

@cbeuw @notsure2 I can confirm it's still happening on TCP, albeit reproducing it became trickier (not sure what a good process would be; "lots of traffic and wait" seems to work, the waiting just got longer).

@LindaFerum
Author

Any news on this one? (No rush, just checking back :) )

@notsure2
Contributor

notsure2 commented Feb 6, 2024 via email

@LindaFerum
Author

Well, @notsure2, I guess it does feel like it.
Observation: when redirecting traffic to Cloak via shadowsocks, I get stalls far more easily when I use encryption:none on the shadowsocks side (shadowsocks-rust allows that out of the box).
Switching the encryption to aes-128-gcm or whatever reduces the number of Cloak hiccups.

It would appear that shadowsocks spending extra cycles encrypting/decrypting stuff introduces additional latency that makes inducing a scheduling fault somewhere in a goroutine harder. Meanwhile, with encryption:none and high load, shadowsocks fires a veritable firehose of packets, and that does something bad scheduling-wise on the Cloak/goroutine side.

Makes sense?

@notsure2
Contributor

notsure2 commented Feb 8, 2024 via email

@LindaFerum
Author

Maybe, given that the thing still happens, the issue should be reopened (for proper recordkeeping and tracking and such)? @cbeuw

@cbeuw cbeuw reopened this Feb 9, 2024
@cbeuw
Owner

cbeuw commented Feb 9, 2024

Reopened. I can still reproduce this issue locally if I fully load a large number of streams down a small number of TLS connections. I suspect it has something to do with goroutine scheduling when a TCP connection's buffer is full.
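
A standalone toy (not Cloak code, only an illustration of the suspected condition) showing that a write to a net.TCPConn blocks once the kernel buffers fill up because the peer has stopped reading; the write deadline is only there to make the block observable:

// Toy program: the peer accepts but never reads, so buffers fill and Write blocks.
package main

import (
	"log"
	"net"
	"time"
)

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		log.Fatal(err)
	}
	go func() {
		conn, err := ln.Accept()
		if err != nil {
			log.Fatal(err)
		}
		_ = conn   // accept but never read, so the sender's buffers fill up
		select {}
	}()

	c, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		log.Fatal(err)
	}
	_ = c.SetWriteDeadline(time.Now().Add(2 * time.Second))

	chunk := make([]byte, 64*1024)
	total := 0
	for {
		n, err := c.Write(chunk)
		total += n
		if err != nil {
			// Without the deadline this Write would block indefinitely,
			// which is what stalls the goroutine driving a TLS connection.
			log.Printf("write blocked after %d bytes buffered: %v", total, err)
			return
		}
	}
}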

@LindaFerum
Author

Hey @cbeuw @notsure2, sorry to be a useless nagging needy user :-) but is there any progress on this one?
I've had some success working around it by non-Cloak means (basically, moving some of the load to an inferior anti-detection mechanism, and leveraging the fact that Amazon EC2 can shed IPv4 addresses like a lizard sheds its tail, lol, so when a server gets detected it's a stop-start plus a DNS update), but every once in a while this bug starts kicking my ass.

@notsure2
Contributor

@cbeuw maybe you need to take the scheduling into your own hands... say, have infinite-loop goroutines and queue functions onto them to control the order of execution.

@cbeuw
Owner

cbeuw commented Apr 14, 2024

maybe you need to take the scheduling into your own hands

The blocking happens inside IO calls to net.TCPConn. Only the Go runtime can schedule goroutines blocked on IO, so I can't really do much about it.

Though, @LindaFerum, I did figure out that this is part of the problem:

case fixedConnMapping:
	// FIXME: this strategy has a tendency to cause a TLS conn socket buffer to fill up,
	// which is a problem when multiple streams are mapped to the same conn, resulting
	// in all such streams being blocked.
	conn = *assignedConn
	if conn == nil {
		conn, err = sb.pickRandConn()
		if err != nil {
			sb.session.SetTerminalMsg("failed to pick a connection " + err.Error())
			sb.session.passiveClose()
			return 0, err
		}
		*assignedConn = conn
	}
	n, err = conn.Write(data)
	if err != nil {
		sb.session.SetTerminalMsg("failed to send to remote " + err.Error())
		sb.session.passiveClose()
		return n, err
	}

I made an update to always spread the load out evenly across all TLS conns instead. From local testing it appears more resilient. I could still break the connection (with an error, not silently) if I open, say, 100 connections at once, though this should be unlikely in real scenarios. This is released in https://github.com/cbeuw/Cloak/releases/tag/v2.9.0
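
A hypothetical sketch of the "spread the load" idea (names invented; not the actual v2.9.0 change): rather than pinning a stream to one TLS connection, each write picks a connection, so one clogged socket no longer blocks every stream mapped to it.

// Invented names; illustrates the per-write spreading idea only.
package spreadsketch

import (
	"math/rand"
	"net"
)

type switchboard struct {
	conns []net.Conn // established client-to-server TLS connections
}

// send picks a connection per call instead of using a fixed stream-to-conn mapping.
func (sb *switchboard) send(data []byte) (int, error) {
	conn := sb.conns[rand.Intn(len(sb.conns))]
	return conn.Write(data)
}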

@LindaFerum
Author

FWIW, it seems I can still trigger it with UDP, and it stalls pretty heavily when that happens.
I haven't been able to do that to TCP (good news :-) ).

Modifying the keepalive (as suggested in the mention above) does not seem to help much.

@cbeuw would it perhaps be possible to add some more aggressive detection of the stall event and a stall-clearing "under the hood" process, ideally separate from keepalives proper (not that keepalives are helping much, like I said), so that Cloak could recover somewhat gracefully and automatically from this stall issue?
