Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Concurrent calls to nng_dial_start() for a "tls+tcp://" REQ socket are returning NNG_ECONNSHUT on Windows #1668

Open
jritts opened this issue Jun 27, 2023 · 18 comments

Comments

@jritts
Copy link

jritts commented Jun 27, 2023

I am opening multiple REQ sockets from different threads to the same REP socket, and observing nng_dial_start() returning NNG_ECONNSHUT. This is on Windows.

Strangely, nng_dial_start() only returns that error most of the time; once in a while, the tls+tcp connection succeeds.

If I switch both nodes to "tcp://" (and bypass the TLS config calls), dialing always succeeds.

With the identical REP code, but simply switching the listener to the wss:// schema, browser websocket connections always succeed.

No such issues with tls+tcp PAIR1 connections.

@jritts
Copy link
Author

jritts commented Jun 27, 2023

Here's a self-contained test for vs2022, with the output of one run on my machine. I highlighted the case where dialing a tls+tcp listener from multiple REQ sockets on different threads failed. Most of the time, all 4 dials fail, but sometimes one works.

The prebuilt libs are this version of NNG:
8e1836f

And mbed SHA: 981743d

nng_version() returns "1.6.0pre", mbedtls_version_get_string "2.28.3".

NNGReqRepTest.zip

240078641-2693266d-5b51-455f-af89-443f1957c7a4

@jritts
Copy link
Author

jritts commented Jun 27, 2023

I looped the test, and after 1-2 minutes I saw this hang:

240081213-98b3b6e6-2980-4d7e-962c-d24efb14b5ff

@jritts jritts changed the title Calls to nng_dial_start() for a "tls+tcp://" REQ socket are returning NNG_ECONNSHUT on Windows Concurrent calls to nng_dial_start() for a "tls+tcp://" REQ socket are returning NNG_ECONNSHUT on Windows Jun 27, 2023
@AnasASK
Copy link

AnasASK commented Jun 27, 2023

No such issues with tls+tcp PAIR1 connections.

What do you mean by that? Do you mean Pair1 polyamorous mode? Because normal pair1 mode is just for 1 to 1 connections.

@jritts
Copy link
Author

jritts commented Jun 27, 2023

Sorry, yes, I think I had enabled polyamorous mode. You can ignore that comment, it was just another configuration I tried to see if a multithreaded TLS client would fail for multiple protocols.

@AnasASK
Copy link

AnasASK commented Jun 28, 2023

I am also having the same issue with pair1 polyamorous mode thats why I was surprised with you reporting that it works.

@jritts
Copy link
Author

jritts commented Jun 28, 2023

Ah, I see. Sorry for the poor detail on the PAIR1 experiment - I'm not 100% sure how I configured it.

@gdamore
Copy link
Contributor

gdamore commented Jun 28, 2023

The peer has to be polyamorous if you're dialing multiple times. Generally it's not a good idea to keep redialing like this -- if you want to be resilient against a server that isn't listening, us NNG_FLAG_NONBLOCK with your dialing -- it will keep retrying in the background (but you don't get a synchronous notification of the connection)

I'll take a look at your test code.

@gdamore
Copy link
Contributor

gdamore commented Jun 28, 2023

The example is REQ/REP. I don't see anything fundamentally wrong. Perhaps there is something weird going on in mbedTLS.

@gdamore
Copy link
Contributor

gdamore commented Jun 28, 2023

So one thing that may matter is peer authentication options for TLS+TCP. I haven't looked at your specific certs and options, but that can make a big difference. Having said that, I see you're using NNG_OPT_TLS_AUTH_MODE_NONE, so none of that should matter.

@jritts
Copy link
Author

jritts commented Jun 28, 2023

Yep, this issue is specifically about req/rep. Never mind the pair1 detail.

@gdamore
Copy link
Contributor

gdamore commented Sep 14, 2023

I still haven't had a chance to look at this. Can you tell me, are you using the current main/tip from NNG, and what version of mbedTLS are you using?

@AnasASK
Copy link

AnasASK commented Sep 17, 2023

I have the same issue and these are my environment details

NNG version: I have tested with 1.5.2 and master branch
mbedtls version: I have tested with 3.0, 3.1, and 3.3
Operating system and version: Ubuntu:22.04
Compiler and language used: gcc, tested with c and c++
Shared or static library: static

@gdamore
Copy link
Contributor

gdamore commented Dec 17, 2023

SO I'm starting to think that there is something different in the mbedTLS 3.0 API -- perhaps some part of it that I expect to be concurrent is not.

@gdamore
Copy link
Contributor

gdamore commented Dec 17, 2023

Btw, using mbedTLS 2.4.0 works fine.

@jritts
Copy link
Author

jritts commented Dec 17, 2023

Okay good info, I'll put 2.4 through our testing.

@gdamore
Copy link
Contributor

gdamore commented Dec 17, 2023

@jritts can you adjust your application and see if it still is an issue? I'm thinking we had conflated several issues here, and the 50 msec timeouts might have been conflating the picture somewhat.

@gdamore
Copy link
Contributor

gdamore commented Dec 17, 2023

I suspect mbedTLS 2.28 (which is the LTS release of it) probably works fine too. That is the latest supported LTS version I think. Version 3.x of mbedTLS (there are many) may not be TLS... and I need to explore it further.

@jritts
Copy link
Author

jritts commented Dec 17, 2023

Will double check but I do think this is a distinct issue - I uploaded a standalone VS project in an earlier comment which had reproduced it for me. We only see it in windows.

I haven't yet tested with the newer version of mbedtls - hope to do so later today.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants