New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sync_tcp_connect() implements a blocking connect #97
Comments
I would be surprised if the real issue is actually the blocking connect. Sure, you might dip under 60 FPS for that one frame where reconnection happens, but probably not far below. But yeah, hostname resolution can at times be slow, and I've seen this be pretty bad on some setups in the past. So perhaps the middle-ground of just adding a Another option would be to do the reconnect on a different thread in the application code. Rocket is apartment threaded, so it should be safe to move the |
I actually thought the same thing, but observed the long delay from connect() while testing my changes in WINE here. It only happened in WIN32 and not Linux. During development of the initial changes I saw it was the first connect() that was timing out on the first entry before iterating to the next one. I should have logged and noted what it was connecting to, it was surprising to me there were multiple entries at all when using "localhost" But it's no secret that TCP connect() can take substantial time to timeout... In *NIX there are special socket options for making connects fail more aggressively (at the expense of robustness against unreliable/slow networks), but it's definitely not the default behavior and we'd be getting even deeper into portability headaches there. It seemed prudent to just go the non-blocking connect route as that's pretty well supported everywhere in a sane manner, and we're already inherently in a polling style situation via sync_update() so it's very amenable to non-blocking connects. I've just pushed another commit I hope address your concerns to the same PR. WRT threading, I'm strongly averse to using threads for this. Introducing threads in a little library like librocket just to deal with non-blocking connects is plain gross. Turning on pthreads in a *NIX program is such an escalation, even errno accesses become complicated (behind the scenes). stdio operations now involve mutexes, formally fast stdio macros become functions, so much pain is brought into the picture to make things Just Work via -D_REENTRANT (now -pthread) just because the program is built with threading support. Nope. |
Just another thing I just noticed; Windows is the only only "mainstream" platform where we don't set |
Nah, and that doesn't get set until after the connection is established anyways. |
Oh, I didn't mean to suggest using threads in librocket. I meant that applications that cares about the speed of connect / reconnect could use threads. |
FWIW, I think I've seen something similar in the past. I don't remember what the issue with that was, but I believe it was something more like a configuration issue on the machine. I'm not sure I believe that we need to change rocket unless we know for sure that this is something that should be expected to happen... |
Ok, since I doubt we'll come to agreement on doing this better within Rocket, how about we change tack altogether: Simply add an alternative to Something like That way the caller can implement non-blocking connects, use threads, do whatever makes the most sense for them, and ignore WDYT? I'm happy to slap together a PR in that vein, closing this one. |
I didn't mean to say that I can't be convinced that it's a good idea, but there's some more complications:
So with that in mind, I would say that I (currently) think:
I believe this should get rid of substantial stalls in the main loop, while keeping things fairly similar to how it is. I think moving to non-blocking sockets is a bit like throwing out the baby with the bath-water. As for chopping up |
I hacked something up in #101... How does this look? |
It's still a blocking connect so it doesn't address all of my issues. It's a step in the right direction just by removing the resolution from every reconnect now that that's a distinct operation. But personally I'd still prefer an escape hatch like BTW another advantage to a |
But is that a real-world problem, or a theoretical one? As I wrote above, I don't think this should happen in reality, and if it does, I would rather get to the bottom of why rather than to change rocket more than needed. So I would rather we diagnose and find out exactly what the problem is, rather than jump the gun and change things here. So, could you perhaps elaborate a bit on what those issues are?
In its current form, I don't really like the details of
Weeeell, yes and no. Yeah, it allows using UNIX domain sockets, and maybe that's nice. But TBH, I'm not sure I see the big benefit of UNIX domain sockets over using a network socket. You can set the editor to only bind locally (don't remember if the UI patches for that ever landed or not) to avoid exposing the port to external machines, and in my experience, local connections are plenty performant for the use-cases for rocket, even on really old computers. But there's other options that it doesn't really work for, like serial connections. And those are something I care a lot more about than UNIX domain sockets, because they can be used to connect older hardware with no network adapter. In fact, I have had a prototype up for that a while back, and IIRC the problem is handshaking; for serial connections you don't get any connect / disconnect notifications, so you can't really handshake like that. You'll just kinda have to read / write commands right off the bat. And that means we'd either have to build something "more clever" than our current handshake on top, or just not hand-shake when using serial connections. So I'm not convinced that exposing sockets directly is as useful as it sounds like... |
It's a real-world problem that my musician experienced while testing our Rocket-using demo under Windows. The connects were blocking long enough for him to think his machine had become overloaded, because the demo's frame rate had reduced to something like 1FPS, because the Editor wasn't running, so not listening/accepting the Demo's connects via librocket. This all goes away the moment he starts the Editor. It's all running on localhost, the demo was using "localhost" as the Editor destination. The moment we start having to understand why his setup is having long connect() delays we're already completely off course. librocket as-is implements name resolution, it supports attempting an arbitrary number of connects across the returned results of the name resolution. It's already signed up for this, so it should just do it robustly and without introducing potential delays in the main loop. However, I think we can completely ignore changing this in librocket if you really prefer the simplicity of the blocking connect implementation.
It's not a peformance/throughput issue, it's a latency issue. UNIX domain sockets don't implement any sort of flow/congestion/unreliable-transport controls, TCP is a completely different beast. For instance there's no need for a TCP_NODELAY equivalent on UNIX domain sockets. And it's not just a security issue about listening locally vs. publicly, that's wasn't really even on my radar until you mentioned the Demo party malicious FORGET_TRACK scenario. It's an annoying functional limitation because the Editor is listening on an internet port in a global per-host name space (ignoring containers for obvious reasons). If we want to run multiple instances of the Editor at the same time, while working with distinct Demo instances, we're now in the business of assigning unique unnamed port numbers to these things and keeping track of which is which. We should be able to use named UNIX domain sockets in the filesystem if we want to. It's totally orthogonal to librocket's protocol handling, its API should expose a way to provide the established connection with the library ignorant of and staying out of the way of how that connection was established. I view
Ok, the serial use case sounds worth considering and makes me think what actually makes sense is being even more generalized than "sockets". Similar to the already present
So instead of the proposed
Then I've deliberately omitted connect()-related stuff altogether from sync_sockio_cb, treating that as something handled as part of getting the ctxt ready. When any of the send/recv methods fail, it just gets closed and ctxt invalidated - expecting a new The above approach conflicts with the proposed |
I suppose you're ignoring the error code from the initial TBH, the more time that is passing, the more I get annoyed by the inconsistency that we're exiting if the editor isn't running on start-up, but we survive if it's taken down and brought up again later on. That seems inconsistent and sometimes confusing. But the reasons are kinda good IMO on their own:
But the result of those two are kinda confusing. Perhaps That would of course not fix your use-case, so we need something better as well, I think...
I'm sorry, but I'm not a fan of papering over issues I don't properly understand. That leads to cargo-cult programming, and needless complexity. But, I think I understand what's going on... I was only considering the successful code-path, not the one where the connection fails, and times out! I think it all boils down to this phrasing in the POSIX spec:
... Yeah, the real problem is that "unspecified timeout"-part. I bet that is really low on Linux, and really high on Windows. And it seems the only way out of that is This reminds me, I find it a bit annoying that we currently fail to start up if the editor isn't yet
IMO, latency is another side of the "performance coin", so this is exactly what I meant above. Now that we have
...But you need to do something similar with Unix domain sockets, they also share a namespace. If you want to have multiple instances of rockets, you need some side-channel to agree on the connection details regardless of TCP bs Unix domain sockets...
That's certainly not how it is right now, but yeah, maybe that should be the future?
I think I like this direction! This could also make it possible to disable the TCP code entirely for platforms where only a serial port is available (or the user just wants to use something else, like Unix domain sockets, or just wants to work around a blocking reconnect like the discussion above)! Just a small nit: since we already have But this also makes me regret not naming it
I made sure to name it I would still like to somehow fix |
If utilized like in the example @ https://github.com/rocket/rocket/blob/main/examples/example_bass.cpp#L191 sync_tcp_connect() introduces potentially substantial stalls in the main loop.
For whatever reason, I've never noticed this in my Linux usage of librocket, the connect attempts seem to immediately fail if the editor isn't listening.. but these are over localhost.
Lately I've been producing win32 builds of a Rocket-utilizing project for my musician collaborator. He's observing significant stalls when the editor isn't running and our demo is attempting reconnects via sync_tcp_connect().
Looking at the code for sync_tcp_connect(), it's obvious why this is happening. (hostname resolution, blocking connect())
What I wonder is if it makes sense for us to just refactor how this works, so instead of repeatedly calling sync_tcp_connect() you would just do that once to put the sync_device into a mere logically "connected to host+port" state. Then transparently manage the actual connection within every sync_update() call, with maybe the sync_update() return value encoding the connect status (connection established vs. connection-in-progress vs. never told to be connected)
Then change the connect style to a non-blocking connect, with the expectation that sync_update() polls the status of the connection in progress... and restarts the connect if needed etc. But we'd never change the host+port mid-demo, it's just something you register into the device when you say it's a connected device, with the blocking name resolution happening only once at that initial make-connected-device call. That way it's a trivial matter of restarting the non-blocking-connect() process reusing the same resolved-host+port, whenever the non-blocking connect times out or whatever, from within the sync_update(), if a connected device.
I haven't given this a whole lot of thought, but it seems like something needs to be fixed here.
The text was updated successfully, but these errors were encountered: