Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fallback transport mode #441

Open
wants to merge 1 commit into
base: master
Choose a base branch
from
Open

Conversation

jkarneges
Copy link
Contributor

This patch introduces a new option called transportMode, which can be used like this:

var client = new Faye.Client('http://example.com/bayeux', {
  transportMode: 'fallback'
});

When set to fallback, messages after the handshake will wait for the first available preferred transport.

@jkarneges
Copy link
Contributor Author

I've reworked the patch against master so it cleanly applies, however it doesn't quite work yet.

How do I build the packed JS file properly? Right now I'm doing it like this:

node_modules/.bin/webpack ./webpack.config.js src/faye_browser.js faye-browser.tmp.js
echo "var Faye =" > faye-browser.js && cat faye-browser.tmp.js >> faye-browser.js

@jcoglan
Copy link
Collaborator

jcoglan commented Jun 28, 2016

To build the client package, you run make in the project root.

@jkarneges
Copy link
Contributor Author

jkarneges commented Jun 28, 2016

Thanks. Ok so I think my code might be fine.

What I'm noticing is that if WebSocket connectivity becomes unavailable while the websocket transport is in use, then Faye isn't falling back to long-polling, but I'm seeing this on master as well without my modifications. Is that a bug? Edit: I suspect this might be intended behavior. selectTransport() is only ever called before and after a handshake, so Faye becomes locked on to the most recently selected transport. So you can ignore this or we can possibly discuss it in a different issue. Please review my patch then. :)

@jcoglan
Copy link
Collaborator

jcoglan commented Jun 28, 2016

I should add I have significant reservations about this patch. It adds a lot of complexity to Dispatcher that makes it harder for Dispatcher to do its job, and I'm not sold on the benefit of delaying messages until a transport is established. It isn't really about 'upgrade' vs 'fallback'; the Faye client already selects transports in preference order and in that sense uses a 'fallback' strategy, it just uses a transport it knows will work while that's happening. All this patch does is delay messages that would be sent sooner, if you kept using the initial transport.

I'll try to explain my worries about complexity. Dispatcher was introduced to solve a serious problem with error handling, which has been one of the hardest things in Faye to get right. (I would say I didn't get it right until 1.1, when this class was introduced.) The class exists to explicitly track the state of every in-flight message -- whether it's been sent, we're waiting for a response, we got a response, we got an explicit failure, we got a timeout, we're waiting to send the message again.

Before this class was introduced, all that state was implicit: it was hidden in the call stack, in timers, inside promise callback lists, etc. This resulted in serious bugs that were very hard to fix, and manifested in many different ways, prompting people to put all sorts of sticking plasters over the symptoms rather than fixing the actual problem. The worst of these was that in 1.0, the client could be triggered into resending all its previous messages over and over in a tight loop, regardless of whether they succeeded, flooding the server with junk duplicate messages.

Now that this class exists, there is one place where the state of every message is known. It means that any report of an error or timeout can be dealt with correctly based on that explicit state. It means users can plug in their own retry strategies without breaking anything. And it means that error recovery has unit tests, which we never managed before. The tests demonstrate certain things like calling sendMessage() twice with the same message will only result in it being sent once, unless it's explicitly failed or timed out, at which point you can try again, once.

The state machine around each message currently assumes there will always be a transport you can use, which is currently true. It can assume that once you call sendMessage(), the message is immediately sent to the transport layer. Under this patch, that's no longer true; the message might need to first wait for a transport to be available, causing numerous issues.

First: this state is not tracked anywhere. When a message is given to the dispatcher, it should be recorded in this._envelopes. This is now deferred until _sendMessage() is called. If you call the public sendMessage() twice with the same message before a transport is selected, it puts two callbacks in the queue. This might not be a problem, but it depends very strongly on how the queue is implemented (more on that later) and is liable to be broken easily. At the very least it's impossible to test because it captures state in callbacks, the source of all the old bugs.

Second, the message is in an indeterminate state while we're waiting for a transport, which is subtly different from the previous problem. A message should always be in one of the states I listed earlier, explicitly, so we know what to do if an error/timeout happens. Now that there's an extra async component to sending a message (waiting for a transport), we need a state for that and more cases to handle. What if waiting for a transport times out? What if we treat a message as 'sending' while we're waiting for a transport, and it times out, but then after that the transport resolves? In this case we should have cancelled the first send and put the message in a retry queue, but with this implementation the pending callbacks cannot be cancelled -- this was another source of bugs before. If a message changes state, any pending actions/timeouts tied to that message must be cancelled to avoid things accidentally happening twice or more. We could implement such logic, but I don't believe the additional complexity in the state machine and possible strange interaction with user-defined schedulers is worth it.

Third, what you've essentially implemented here is a promise, except you've not used an promise, so you've had to resolve some of the same problems, such as re-entrant callbacks. That complexity really doesn't belong here, and fixing it will likely make the unit tests more complicated. For example, where all the tests are currently synchronous and rely on interaction with a synchronous state machine, some would need to become async, and reasoning about an async state machine is harder.

(In any case, a promise would be the wrong thing too, because it would capture state in functions rather than in _envelopes. A better approach would be to put a message in _envelopes immediately to track its state, and when a transport is selected, find all messages waiting to send and send them.)

Fourth, it breaks an abstraction boundary between Client, Transport and user extensions. By inspecting message.channel and setting message.connectionType in this class, you're making it aware of the Bayeux protocol in a way it wasn't before: it only really knew about "messages" as objects with an id field. You've spread the responsibility for implementing the protocol between two classes, and this has practical knock-on effects for users. When a user writes an extension, it is assumed that the message won't be modified once it enters the transport layer. This is no longer true, the message might be modified multiple times if it's retried while changing transports.

In summary, removing the assumption that there is a transport introduces a lot of new possible scenarios that need explicit handling, and I am not convinced yet that the benefit of the feature is worth the additional complexity. This is especially true when you consider that this transport deselection does not only happen on first connection, but any time the client's session is timed out and it has to re-handshake. There's even the possibility of multiple selection processes racing each other. (This is true of the current design, but it doesn't matter because message state is not attached to it. As long as some transport exists at all times, there's no problem.)

In order to accept this PR, I'd need a very strong case for the feature's benefits, and a more robust implementation that addresses the above problems with a reasonable amount of complexity.

@jcoglan
Copy link
Collaborator

jcoglan commented Jun 28, 2016

What I'm noticing is that if WebSocket connectivity becomes unavailable while the websocket transport is in use, then Faye isn't falling back to long-polling, but I'm seeing this on master as well without my modifications. Is that a bug?

Once Faye has successfully made a WebSocket connection to an endpoint, it will keep using WebSocket and not downgrade its transport. Usually, WebSocket disconnections are transient issues with the server or network, which tend to make other transports unusable as well. In the vast majority of cases this is a workable assumption.

@jkarneges
Copy link
Contributor Author

Thank you so much for the detailed explanation of how things are supposed to work. I'll look at reworking the patch to address all these issues.

As for the feature's benefits, it fixes the "waiting for server" cosmetic issue that can occur on complex websites that have concurrent network interactions while Faye is initializing. For certain users whose audiences primarily use recent browsers, fixing this issue may be worth whatever drawback might be inflicted on people using older browsers.

In fact one user has indicated they would be fine with completely breaking old browsers in order to fix this issue. That may seem like an extreme stance to have but web developers do this kind of thing all the time. Any website that says "requires browser XYZ or later" has basically done this. My latest patch is a compromise. It'll still work on old browsers after a delay waiting for WebSocket to fail.

@jcoglan
Copy link
Collaborator

jcoglan commented Jun 28, 2016

It's possible there is another solution to the cosmetic issue that won't require accumulating complexity in such a core component. Maybe it could involve setting which transports you want to accept.

I assume the spinner comes from callback-polling since you're loading a <script> tag? Or do you get this with long-polling (XMLHttpRequest) too?

Do you know what would happen for users unable to connect via WebSocket? Would you rather they saw a spinner, or just had a broken connection? My stance is that I'd much rather fix things for all users out of the box, than provide switches to deliberately break things in special cases.

@jkarneges
Copy link
Contributor Author

jkarneges commented Jun 28, 2016

Spinner still happens with long-polling (XHR), unfortunately.

WebSocket does not seem to cause a spinner if the connection takes a long time. Even if a spinner did appear in this case, I think it would be acceptable behavior for anyone who is desiring the changes we are discussing.

@faye faye added the Transport label Aug 2, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants