Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sync_tcp_connect() implements a blocking connect #97

Open
vcaputo opened this issue Jul 22, 2023 · 13 comments
Open

sync_tcp_connect() implements a blocking connect #97

vcaputo opened this issue Jul 22, 2023 · 13 comments

Comments

@vcaputo
Copy link
Contributor

vcaputo commented Jul 22, 2023

If utilized like in the example @ https://github.com/rocket/rocket/blob/main/examples/example_bass.cpp#L191 sync_tcp_connect() introduces potentially substantial stalls in the main loop.

For whatever reason, I've never noticed this in my Linux usage of librocket, the connect attempts seem to immediately fail if the editor isn't listening.. but these are over localhost.

Lately I've been producing win32 builds of a Rocket-utilizing project for my musician collaborator. He's observing significant stalls when the editor isn't running and our demo is attempting reconnects via sync_tcp_connect().

Looking at the code for sync_tcp_connect(), it's obvious why this is happening. (hostname resolution, blocking connect())

What I wonder is if it makes sense for us to just refactor how this works, so instead of repeatedly calling sync_tcp_connect() you would just do that once to put the sync_device into a mere logically "connected to host+port" state. Then transparently manage the actual connection within every sync_update() call, with maybe the sync_update() return value encoding the connect status (connection established vs. connection-in-progress vs. never told to be connected)

Then change the connect style to a non-blocking connect, with the expectation that sync_update() polls the status of the connection in progress... and restarts the connect if needed etc. But we'd never change the host+port mid-demo, it's just something you register into the device when you say it's a connected device, with the blocking name resolution happening only once at that initial make-connected-device call. That way it's a trivial matter of restarting the non-blocking-connect() process reusing the same resolved-host+port, whenever the non-blocking connect times out or whatever, from within the sync_update(), if a connected device.

I haven't given this a whole lot of thought, but it seems like something needs to be fixed here.

@kusma
Copy link
Member

kusma commented Jul 25, 2023

I would be surprised if the real issue is actually the blocking connect. Sure, you might dip under 60 FPS for that one frame where reconnection happens, but probably not far below.

But yeah, hostname resolution can at times be slow, and I've seen this be pretty bad on some setups in the past.

So perhaps the middle-ground of just adding a sync_reconnect() I already suggested in #98 would be sufficient? Then we could omit the hostname resolution.

Another option would be to do the reconnect on a different thread in the application code. Rocket is apartment threaded, so it should be safe to move the sync_connect()-call (or sync_reconnect() if we want to do both, might be a good idea) to a different thread without too many worries.

@vcaputo
Copy link
Contributor Author

vcaputo commented Aug 11, 2023

I actually thought the same thing, but observed the long delay from connect() while testing my changes in WINE here. It only happened in WIN32 and not Linux. During development of the initial changes I saw it was the first connect() that was timing out on the first entry before iterating to the next one. I should have logged and noted what it was connecting to, it was surprising to me there were multiple entries at all when using "localhost"

But it's no secret that TCP connect() can take substantial time to timeout... In *NIX there are special socket options for making connects fail more aggressively (at the expense of robustness against unreliable/slow networks), but it's definitely not the default behavior and we'd be getting even deeper into portability headaches there. It seemed prudent to just go the non-blocking connect route as that's pretty well supported everywhere in a sane manner, and we're already inherently in a polling style situation via sync_update() so it's very amenable to non-blocking connects.

I've just pushed another commit I hope address your concerns to the same PR.

WRT threading, I'm strongly averse to using threads for this. Introducing threads in a little library like librocket just to deal with non-blocking connects is plain gross. Turning on pthreads in a *NIX program is such an escalation, even errno accesses become complicated (behind the scenes). stdio operations now involve mutexes, formally fast stdio macros become functions, so much pain is brought into the picture to make things Just Work via -D_REENTRANT (now -pthread) just because the program is built with threading support. Nope.

@kusma
Copy link
Member

kusma commented Aug 14, 2023

Just another thing I just noticed; Windows is the only only "mainstream" platform where we don't set USE_NODELAY to disable Nagle's algorithm. Do you think that could be the cause for the delay on Windows?

@vcaputo
Copy link
Contributor Author

vcaputo commented Aug 15, 2023

Nah, and that doesn't get set until after the connection is established anyways.

@kusma
Copy link
Member

kusma commented Aug 15, 2023

WRT threading, I'm strongly averse to using threads for this. Introducing threads in a little library like librocket just to deal with non-blocking connects is plain gross.

Oh, I didn't mean to suggest using threads in librocket. I meant that applications that cares about the speed of connect / reconnect could use threads.

@kusma
Copy link
Member

kusma commented Aug 15, 2023

I actually thought the same thing, but observed the long delay from connect() while testing my changes in WINE here. It only happened in WIN32 and not Linux. During development of the initial changes I saw it was the first connect() that was timing out on the first entry before iterating to the next one. I should have logged and noted what it was connecting to, it was surprising to me there were multiple entries at all when using "localhost"

FWIW, I think I've seen something similar in the past. I don't remember what the issue with that was, but I believe it was something more like a configuration issue on the machine. I'm not sure I believe that we need to change rocket unless we know for sure that this is something that should be expected to happen...

@vcaputo
Copy link
Contributor Author

vcaputo commented Aug 16, 2023

Ok, since I doubt we'll come to agreement on doing this better within Rocket, how about we change tack altogether:

Simply add an alternative to sync_tcp_connect() for callers who want to manage the connect process on their own.

Something like sync_use_socket(), which just assigns the supplied socket to d->sock, expects it to be an established blocking socket just like sync_tcp_connect() would have put there, then resumes the whole handshake/greeting process shared with the second half of sync_tcp_connect(). The device owns the socket once it's "used" by it, closing it on error all the same, etc.

That way the caller can implement non-blocking connects, use threads, do whatever makes the most sense for them, and ignore sync_tcp_connect() if it's a poor fit for their use case.

WDYT? I'm happy to slap together a PR in that vein, closing this one.

@kusma
Copy link
Member

kusma commented Aug 16, 2023

I didn't mean to say that I can't be convinced that it's a good idea, but there's some more complications:

  1. A lot of people have integrated rocket into their work-flow. Deprecating/changing APIs needs a very good justification IMO, and I'm not sure I agree that the justification is good enough here.
  2. Making sync_tcp_connect() not actually finishing setting up the connection makes it more likely that a demo could start rendering incomplete data at first. Really, we should probably even wait the server ack that it has sent all initial data before returning from sync_tcp_connect(), but it's unclear to me how to do this without breaking the protocol, so 🤷.
  3. Keeping Rocket as simple as possible matters, especially on low-memory systems. Rocket is being used on systems like Amigas and GBAs where you only have a couple of hundred kilobytes of memory. Keeping the code-size down matters to these users.

So with that in mind, I would say that I (currently) think:

  1. Unless there's a good reason why not, we shouldn't require sync_tcp_connect() to be real-time-ish (as in taking max a few milliseconds on any realistic system). It shouldn't take seconds, but somewhere in between. I believe this should currently be the case, and if you're seeing long stalls here, that's probably either a bug or a configuration problem.
  2. sync_update() should be real-time-ish. I believe this is currently the case.
  3. We should probably introduce a sync_tcp_reconnect() that omits looking up the host, but just attempts to reconnect to what it just lost the connection to. That function should probably be close to real-time-ish, but dropping a frame or two here probably isn't the biggest deal. If connect() takes more than that, something else is up; on a local area network, we should be able to connect almost immediately. Yes, connect() can be somewhat slow, but not slow like this, at least not in the usecases Rocket was made for (it's not made for dial-up connections or ethernet over carrier pigeons 😆)...
  4. We should update the examples to use sync_tcp_reconnect() instead of sync_tcp_connect() in the main-loop.

I believe this should get rid of substantial stalls in the main loop, while keeping things fairly similar to how it is.

I think moving to non-blocking sockets is a bit like throwing out the baby with the bath-water.

As for chopping up sync_tcp_connect() like you suggest above, I don't think I understand what that solves that the sync_tcp_reconnect() proposal doesn't already solve... And it makes the API more confusing IMO.

@kusma
Copy link
Member

kusma commented Aug 16, 2023

I hacked something up in #101... How does this look?

@vcaputo
Copy link
Contributor Author

vcaputo commented Aug 16, 2023

I hacked something up in #101... How does this look?

It's still a blocking connect so it doesn't address all of my issues.

It's a step in the right direction just by removing the resolution from every reconnect now that that's a distinct operation.

But personally I'd still prefer an escape hatch like sync_use_socket() for callers that won't use any of the included resolution/connect API. That way I can use non-blocking connects and my own name resolution, while still utilizing the Rocket library for just the protocol handling - with an expectation that we'll evolve the protocol a bit over time.

BTW another advantage to a sync_use_socket() escape hatch is you're no longer limited to TCP sockets. I'd like to leave UNIX domain sockets on the table in my use case, which will often have the Editor, Demo, and experimentally right now SchismTracker too, all running on the same host. local-only sockets like UNIX sockets have less headaches like connect timeouts and latency sources TCP_NODELAY exist to help defeat. We don't have to add support for such variants in Rocket if it just offers a way to say "here, use this established socket descriptor, do your handshake and run the protocol for me, thanks."

@kusma
Copy link
Member

kusma commented Aug 17, 2023

It's still a blocking connect so it doesn't address all of my issues.

But is that a real-world problem, or a theoretical one? As I wrote above, I don't think this should happen in reality, and if it does, I would rather get to the bottom of why rather than to change rocket more than needed.

So I would rather we diagnose and find out exactly what the problem is, rather than jump the gun and change things here.

So, could you perhaps elaborate a bit on what those issues are?

But personally I'd still prefer an escape hatch like sync_use_socket() for callers that won't use any of the included resolution/connect API.

In its current form, I don't really like the details of sync_use_rocket(); it breaks the abstraction of rocket. SOCKET is a type that is only defined in the internal headers, and not something that applications would know about. And because sockets aren't as "universally file descriptors" on all platforms as they are on UNIX-y systems, I'm skeptical to exposing them in the API surface.

BTW another advantage to a sync_use_socket() escape hatch is you're no longer limited to TCP sockets. I'd like to leave UNIX domain sockets on the table in my use case, which will often have the Editor, Demo, and experimentally right now SchismTracker too, all running on the same host.

Weeeell, yes and no. Yeah, it allows using UNIX domain sockets, and maybe that's nice. But TBH, I'm not sure I see the big benefit of UNIX domain sockets over using a network socket. You can set the editor to only bind locally (don't remember if the UI patches for that ever landed or not) to avoid exposing the port to external machines, and in my experience, local connections are plenty performant for the use-cases for rocket, even on really old computers.

But there's other options that it doesn't really work for, like serial connections. And those are something I care a lot more about than UNIX domain sockets, because they can be used to connect older hardware with no network adapter.

In fact, I have had a prototype up for that a while back, and IIRC the problem is handshaking; for serial connections you don't get any connect / disconnect notifications, so you can't really handshake like that. You'll just kinda have to read / write commands right off the bat. And that means we'd either have to build something "more clever" than our current handshake on top, or just not hand-shake when using serial connections.

So I'm not convinced that exposing sockets directly is as useful as it sounds like...

@vcaputo
Copy link
Contributor Author

vcaputo commented Aug 17, 2023

It's still a blocking connect so it doesn't address all of my issues.

But is that a real-world problem, or a theoretical one? As I wrote above, I don't think this should happen in reality, and if it does, I would rather get to the bottom of why rather than to change rocket more than needed.

So I would rather we diagnose and find out exactly what the problem is, rather than jump the gun and change things here.

So, could you perhaps elaborate a bit on what those issues are?

It's a real-world problem that my musician experienced while testing our Rocket-using demo under Windows.

The connects were blocking long enough for him to think his machine had become overloaded, because the demo's frame rate had reduced to something like 1FPS, because the Editor wasn't running, so not listening/accepting the Demo's connects via librocket.

This all goes away the moment he starts the Editor. It's all running on localhost, the demo was using "localhost" as the Editor destination.

The moment we start having to understand why his setup is having long connect() delays we're already completely off course.

librocket as-is implements name resolution, it supports attempting an arbitrary number of connects across the returned results of the name resolution. It's already signed up for this, so it should just do it robustly and without introducing potential delays in the main loop.

However, I think we can completely ignore changing this in librocket if you really prefer the simplicity of the blocking connect implementation.

But personally I'd still prefer an escape hatch like sync_use_socket() for callers that won't use any of the included resolution/connect API.

In its current form, I don't really like the details of sync_use_rocket(); it breaks the abstraction of rocket. SOCKET is a type that is only defined in the internal headers, and not something that applications would know about. And because sockets aren't as "universally file descriptors" on all platforms as they are on UNIX-y systems, I'm skeptical to exposing them in the API surface.

BTW another advantage to a sync_use_socket() escape hatch is you're no longer limited to TCP sockets. I'd like to leave UNIX domain sockets on the table in my use case, which will often have the Editor, Demo, and experimentally right now SchismTracker too, all running on the same host.

Weeeell, yes and no. Yeah, it allows using UNIX domain sockets, and maybe that's nice. But TBH, I'm not sure I see the big benefit of UNIX domain sockets over using a network socket. You can set the editor to only bind locally (don't remember if the UI patches for that ever landed or not) to avoid exposing the port to external machines, and in my experience, local connections are plenty performant for the use-cases for rocket, even on really old computers.

It's not a peformance/throughput issue, it's a latency issue. UNIX domain sockets don't implement any sort of flow/congestion/unreliable-transport controls, TCP is a completely different beast. For instance there's no need for a TCP_NODELAY equivalent on UNIX domain sockets.

And it's not just a security issue about listening locally vs. publicly, that's wasn't really even on my radar until you mentioned the Demo party malicious FORGET_TRACK scenario.

It's an annoying functional limitation because the Editor is listening on an internet port in a global per-host name space (ignoring containers for obvious reasons).

If we want to run multiple instances of the Editor at the same time, while working with distinct Demo instances, we're now in the business of assigning unique unnamed port numbers to these things and keeping track of which is which. We should be able to use named UNIX domain sockets in the filesystem if we want to. It's totally orthogonal to librocket's protocol handling, its API should expose a way to provide the established connection with the library ignorant of and staying out of the way of how that connection was established. I view sync_tcp_connect() as purely a convenience function which works for many use cases, but it should be decoupled from the core protocol implementation for more advanced use cases that can wire things up differently.

But there's other options that it doesn't really work for, like serial connections. And those are something I care a lot more about than UNIX domain sockets, because they can be used to connect older hardware with no network adapter.

In fact, I have had a prototype up for that a while back, and IIRC the problem is handshaking; for serial connections you don't get any connect / disconnect notifications, so you can't really handshake like that. You'll just kinda have to read / write commands right off the bat. And that means we'd either have to build something "more clever" than our current handshake on top, or just not hand-shake when using serial connections.

So I'm not convinced that exposing sockets directly is as useful as it sounds like...

Ok, the serial use case sounds worth considering and makes me think what actually makes sense is being even more generalized than "sockets".

Similar to the already present struct sync_io_cb abstraction, why not just add another, something like:

struct sync_sockio_cb {
        int send(void *ctxt, void *buf, int len); // send size bytes to buf, returns -1 on error or +n_sent
        int recv(void *ctxt, void *buf, int size); // recv up to size bytes into buf, returns -1 on error or +n_received
        int close(void *ctxt); // tears down ctxt, never to be used again.
};

So instead of the proposed sync_use_socket() it'd be something like sync_use_sockio(), with callers doing their own heavy lifting:

struct sync_sockio_cb sockio_cb = {
        .send = mysend,
        .recv = myrecv,
        .close = myclose,
};

mysioctxt = my_sio_allocated_and_connected_however_i_want();
sync_use_sockio(device, &mysio, mysioctxt);

Then sync_tcp_connect() would internally just use sync_use_sockio() with its own stuff filled in reusing the existing TCP-oriented send/recv stuff already in librocket. It would perform its own resolution and connect(), allocating a ctxt of its liking for containing that junk, which its send/recv/close would know how to make use of.

I've deliberately omitted connect()-related stuff altogether from sync_sockio_cb, treating that as something handled as part of getting the ctxt ready. When any of the send/recv methods fail, it just gets closed and ctxt invalidated - expecting a new sync_use_sockio() to get things rolling again after whatever calls it gets that ctxt ready.

The above approach conflicts with the proposed sync_reconnect(), which IMO probably shouldn't exist except maybe as sync_tcp_reconnect(), with sync_tcp_connect() and sync_tcp_reconnect() both making use of some tcp-specific state stowed in struct sync_device that nothing else uses outside of those two methods... make the whole connect/reconnect conceptually bound to sync_tcp convenience-helper land.

@kusma
Copy link
Member

kusma commented Aug 18, 2023

It's still a blocking connect so it doesn't address all of my issues.

But is that a real-world problem, or a theoretical one? As I wrote above, I don't think this should happen in reality, and if it does, I would rather get to the bottom of why rather than to change rocket more than needed.
So I would rather we diagnose and find out exactly what the problem is, rather than jump the gun and change things here.
So, could you perhaps elaborate a bit on what those issues are?

It's a real-world problem that my musician experienced while testing our Rocket-using demo under Windows.

The connects were blocking long enough for him to think his machine had become overloaded, because the demo's frame rate had reduced to something like 1FPS, because the Editor wasn't running, so not listening/accepting the Demo's connects via librocket.

I suppose you're ignoring the error code from the initial sync_tcp_connect() unlike what example_bass is doing, then? Because if not, it's a result of closing the editor, and the cause should be pretty obvious IMO...

TBH, the more time that is passing, the more I get annoyed by the inconsistency that we're exiting if the editor isn't running on start-up, but we survive if it's taken down and brought up again later on. That seems inconsistent and sometimes confusing.

But the reasons are kinda good IMO on their own:

  • If the editor isn't running on start-up, the user probably made a mistake, and we'd like them to know...
  • If the editor goes down, it's often because we're in the process of fixing bugs or adding features. Having to re-start the demo and do all loading all over seems wasteful.

But the result of those two are kinda confusing. Perhaps example_bass shouldn't reconnect unless it's a debug build or something like that?

That would of course not fix your use-case, so we need something better as well, I think...

This all goes away the moment he starts the Editor. It's all running on localhost, the demo was using "localhost" as the Editor destination.

The moment we start having to understand why his setup is having long connect() delays we're already completely off course.

I'm sorry, but I'm not a fan of papering over issues I don't properly understand. That leads to cargo-cult programming, and needless complexity.

But, I think I understand what's going on... I was only considering the successful code-path, not the one where the connection fails, and times out!

I think it all boils down to this phrasing in the POSIX spec:

If the connection cannot be established immediately and O_NONBLOCK is not set for the file descriptor for the socket, connect() shall block for up to an unspecified timeout interval until the connection is established.

... Yeah, the real problem is that "unspecified timeout"-part. I bet that is really low on Linux, and really high on Windows.

And it seems the only way out of that is O_NONBLOCK, indeed. Yeah, that's not great...

This reminds me, I find it a bit annoying that we currently fail to start up if the editor isn't yet

Weeeell, yes and no. Yeah, it allows using UNIX domain sockets, and maybe that's nice. But TBH, I'm not sure I see the big benefit of UNIX domain sockets over using a network socket. You can set the editor to only bind locally (don't remember if the UI patches for that ever landed or not) to avoid exposing the port to external machines, and in my experience, local connections are plenty performant for the use-cases for rocket, even on really old computers.

It's not a peformance/throughput issue, it's a latency issue. UNIX domain sockets don't implement any sort of flow/congestion/unreliable-transport controls, TCP is a completely different beast. For instance there's no need for a TCP_NODELAY equivalent on UNIX domain sockets.

IMO, latency is another side of the "performance coin", so this is exactly what I meant above.

Now that we have TCP_NODELAY, is latency really still an issue? I certainly don't feel it is, but YMMV of course...

It's an annoying functional limitation because the Editor is listening on an internet port in a global per-host name space (ignoring containers for obvious reasons).

If we want to run multiple instances of the Editor at the same time, while working with distinct Demo instances, we're now in the business of assigning unique unnamed port numbers to these things and keeping track of which is which. We should be able to use named UNIX domain sockets in the filesystem if we want to.

...But you need to do something similar with Unix domain sockets, they also share a namespace.

If you want to have multiple instances of rockets, you need some side-channel to agree on the connection details regardless of TCP bs Unix domain sockets...

It's totally orthogonal to librocket's protocol handling, its API should expose a way to provide the established connection with the library ignorant of and staying out of the way of how that connection was established. I view sync_tcp_connect() as purely a convenience function which works for many use cases, but it should be decoupled from the core protocol implementation for more advanced use cases that can wire things up differently.

That's certainly not how it is right now, but yeah, maybe that should be the future?

But there's other options that it doesn't really work for, like serial connections. And those are something I care a lot more about than UNIX domain sockets, because they can be used to connect older hardware with no network adapter.
In fact, I have had a prototype up for that a while back, and IIRC the problem is handshaking; for serial connections you don't get any connect / disconnect notifications, so you can't really handshake like that. You'll just kinda have to read / write commands right off the bat. And that means we'd either have to build something "more clever" than our current handshake on top, or just not hand-shake when using serial connections.
So I'm not convinced that exposing sockets directly is as useful as it sounds like...

Ok, the serial use case sounds worth considering and makes me think what actually makes sense is being even more generalized than "sockets".

Similar to the already present struct sync_io_cb abstraction, why not just add another, something like:

struct sync_sockio_cb {
        int send(void *ctxt, void *buf, int len); // send size bytes to buf, returns -1 on error or +n_sent
        int recv(void *ctxt, void *buf, int size); // recv up to size bytes into buf, returns -1 on error or +n_received
        int close(void *ctxt); // tears down ctxt, never to be used again.
};

So instead of the proposed sync_use_socket() it'd be something like sync_use_sockio(), with callers doing their own heavy lifting:

struct sync_sockio_cb sockio_cb = {
        .send = mysend,
        .recv = myrecv,
        .close = myclose,
};

mysioctxt = my_sio_allocated_and_connected_however_i_want();
sync_use_sockio(device, &mysio, mysioctxt);

Then sync_tcp_connect() would internally just use sync_use_sockio() with its own stuff filled in reusing the existing TCP-oriented send/recv stuff already in librocket. It would perform its own resolution and connect(), allocating a ctxt of its liking for containing that junk, which its send/recv/close would know how to make use of.

I've deliberately omitted connect()-related stuff altogether from sync_sockio_cb, treating that as something handled as part of getting the ctxt ready. When any of the send/recv methods fail, it just gets closed and ctxt invalidated - expecting a new sync_use_sockio() to get things rolling again after whatever calls it gets that ctxt ready.

I think I like this direction! This could also make it possible to disable the TCP code entirely for platforms where only a serial port is available (or the user just wants to use something else, like Unix domain sockets, or just wants to work around a blocking reconnect like the discussion above)!

Just a small nit: since we already have sync_set_io_cb, we should probably call it sync_set_sockio_cb() instead of sync_use_sockio ;)

But this also makes me regret not naming it sync_fileio_cb instead of sync_io_cb, oh well 😆

The above approach conflicts with the proposed sync_reconnect(), which IMO probably shouldn't exist except maybe as sync_tcp_reconnect(), with sync_tcp_connect() and sync_tcp_reconnect() both making use of some tcp-specific state stowed in struct sync_device that nothing else uses outside of those two methods... make the whole connect/reconnect conceptually bound to sync_tcp convenience-helper land.

I made sure to name it sync_tcp_reconnect() in #101, which IMO would already place it in the TCP convenience land 👍

I would still like to somehow fix sync_tcp_reconnect() to timeout fast, if possible. Because I do think that's a bad user experience as well. Perhaps a more targeted solution can be done for that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants