Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ruby orphaned client when server returns error response on disconnect #433

Open
JanStevens opened this issue Apr 14, 2016 · 9 comments
Open

Comments

@JanStevens
Copy link

Hello,

I have the following setup, the faye server runs in the cloud and a raspberry pi runs the Faye::Client. For some odd reason the client loses its server connection (for example unplug ethernet cable in another subnet => pi new address but code is still using old connect) anyways shit hits the fan, but thats not my issue.

What I do now is detect this (using ping/pong) and then disconnect the client client.disconnect and create a new client (on the new API) and subscribe to the desired channels.

Now this is the log output from when I'm trying to create a new client and disconnect the old one.

FAYE: [Faye::Client] Disconnecting "nx930ik4ldxtfpxxm6zatrg88i4ghi3"
FAYE: [Faye::Client] Passing through outgoing extensions: {"channel":"/meta/disconnect","clientId":"nx930ik4ldxtfpxxm6zatrg88i4ghi3","id":"8"}
FAYE: [Faye::Transport::WebSocket] Client "nx930ik4ldxtfpxxm6zatrg88i4ghi3" sending message to http://getin.dev:9393/ws_print via "websocket": {"channel":"/meta/disconnect","clientId":"nx930ik4ldxtfpxxm6zatrg88i4ghi3","id":"8"}
FAYE: [Faye::Client] Clearing channel listeners for "nx930ik4ldxtfpxxm6zatrg88i4ghi3"
FAYE: [Faye::Client] New client created for "http://faye.dev:9393/ws_print"
FAYE: [Faye::Client] Initiating handshake with "http://faye.dev:9393/ws_print"
FAYE: [Faye::Dispatcher] Selected "long-polling" transport for http://faye.dev:9393/ws_print
FAYE: [Faye::Client] Passing through outgoing extensions: {"channel":"/meta/handshake","version":"1.0","supportedConnectionTypes":["in-process","websocket","long-polling"],"id":"1"}
FAYE: [Faye::Transport::Http] Client nil sending message to http://faye.dev:9393/ws_print via "long-polling": {"channel":"/meta/handshake","version":"1.0","supportedConnectionTypes":["in-process","websocket","long-polling"],"id":"1"}
FAYE: [Faye::Transport::WebSocket] Client "nx930ik4ldxtfpxxm6zatrg88i4ghi3" failed to send to http://faye.dev:9393/ws_print via "websocket": [{"channel":"/meta/disconnect","clientId":"nx930ik4ldxtfpxxm6zatrg88i4ghi3","id":"8"}]

Notice the last line, the disconnect attempt from the Client fails. Continuing with my horror story, for some reason the server restarts, next I see this:

FAYE: [Faye::Transport::WebSocket] Client "nx930ik4ldxtfpxxm6zatrg88i4ghi3" sending message to http://faye.dev:9393/ws_print via "websocket": {"channel":"/meta/disconnect","clientId":"nx930ik4ldxtfpxxm6zatrg88i4ghi3","id":"8"}
FAYE: [Faye::Transport::WebSocket] Client "nx930ik4ldxtfpxxm6zatrg88i4ghi3" sending message to http://getin.dev:9393/ws_print via "websocket": {"channel":"/meta/disconnect","clientId":"nx930ik4ldxtfpxxm6zatrg88i4ghi3","id":"8"}
FAYE: [Faye::Transport::WebSocket] Client "nx930ik4ldxtfpxxm6zatrg88i4ghi3" received from http://faye.dev:9393/ws_print via "websocket": [{"id":"8","channel":"/meta/disconnect","successful":false,"error":"401:nx930ik4ldxtfpxxm6zatrg88i4ghi3:Unknown client"}]
FAYE: [Faye::Client] Passing through incoming extensions: {"id":"8","channel":"/meta/disconnect","successful":false,"error":"401:nx930ik4ldxtfpxxm6zatrg88i4ghi3:Unknown client"}
FAYE: [Faye::Transport::WebSocket] Client "nx930ik4ldxtfpxxm6zatrg88i4ghi3" received from http://getin.dev:9393/ws_print via "websocket": []

And suddenly I have 2 clients, a newly created client that I actively used and an orphaned client (which is not even referenced in a ruby variable I might add). When I send a message from the server -> client BOTH clients received it.

What I don't understand is that the client stays alive even after the Server said Hey I don't know you at all, for me this should mean that the client should die and close.

I know that this flow might seems very exceptional but it happens a lot, to sum up:

  1. Client 1 connects to server
  2. Client 1 is forced disconnect using: Faye::Client#disconnect
  3. New client 2 is created (@faye = Faye::Client.new)
  4. Server dies/restart
  5. Client 2 exists (addressable using @faye) Client 1 is a ghost client but receives the messages and executes the callbacks

I'm also wondering how long it takes till a server deletes a Client that it has not seen in the past x minutes? Is there a clean up and how do clients handel this?

Regards,

@jcoglan
Copy link
Collaborator

jcoglan commented May 1, 2016

I suspect this is happening because the client only shuts down the dispatcher (the object that deals with messages being sent) if it gets a successful /meta/disconnect response:

@dispatcher.close

This means any other messages that were in flight when the client disconnected could be retried, and this might include polling requests. However, the client should not reconnect after calling disconnect, since that puts it in the DISCONNECTED state immediately (

@state = DISCONNECTED
), and the client should ignore instructions to reconnect when in that state (
if @advice['reconnect'] == HANDSHAKE and @state != DISCONNECTED
@state = UNCONNECTED
).

The bigger question is why you're doing this yourself. The Faye client is supposed to deal with going offline transparently, and managing this process yourself is just likely to make it confused. Is there a reason the client's own reconnection handling is not working for you, and you had to implement this yourself?

@JanStevens
Copy link
Author

Thanks for the reply,

We deploy Faye Client on small Raspberry Pi computers that are placed in remote locations. The internet connection is not always that great. What we noticed is that for some reason the internet connection is lost, the DHCP lease gets expired and then the Raspberry Pi comes back online.

The Pi gets a new IP address and is again connected with the internet (we can reverse SSH to the pi and see it up and running). Faye on the other hand doesn't seem to pick up this change and continues to stay on the "old connection", it loses it connection to the server and does not recover.

The only real solution we have in place now is a restart of our software (fail 'Server lost'). We detect a server lost event by sending ping/pong messages from Faye Server to Faye Client, if we didn't get a response in the last minute or two we fail. The initial plan was to disconnect the faye client and then reconnected, but that didn't work as expected.

Reading your explanation I still feel there is a bug, when the client cannot disconnect from the server and the server doesn't know the client anymore then the client continues the live were one expects it to die.

Regards

@jcoglan
Copy link
Collaborator

jcoglan commented May 30, 2016

It just occurred to me it might help if I explain the client architecture a little; it might help you find out where the problem is. The client is essentially made of three objects:

+--------+        +------------+        +-----------+
| Client o------->| Dispatcher o------->| Transport |
+--------+        +------------+        +-----------+

Client is the object you interact with, the others are internal. They have distinct responsibilities:

  • Client implements the Bayeux protocol in full. It sends messages by giving them to the Dispatcher, and receives them from there too
  • Dispatcher tracks the state of in-flight messages. All it knows about them is they have an id. When told to send something, it will give the message to Transport. When a response arrives (response messages have the same ID), the dispatcher marks the message as sent, and forgets about it. If the transport reports an error during the request, or there's a timeout, the message is queued for retrying.
  • Transport knows how to send messages to/from the server using HTTP/WebSocket. It just knows how to take JSON messages, send them, and hand the response back to the transport.

In light of that let's examine the disconnect method:

    def disconnect
      return unless @state == CONNECTED
      @state = DISCONNECTED

      info('Disconnecting ?', @dispatcher.client_id)
      promise = Publication.new

      send_message({
        'channel'  => Channel::DISCONNECT,
        'clientId' => @dispatcher.client_id

      }, {}) do |response|
        if response['successful']
          @dispatcher.close
          promise.set_deferred_status(:succeeded)
        else
          promise.set_deferred_status(:failed, Error.parse(response['error']))
        end
      end

      info('Clearing channel listeners for ?', @dispatcher.client_id)
      @channels = Channel::Set.new

      promise
    end

send_message passes the message off to Dispatcher, as happens in all the other client methods. The do |response| block will execute when the dispatcher receives a response (any response) from the server. The dispatcher might retry the message multiple times (as well as other client messages) if it doesn't get a reply within sufficient time, but the client isn't aware that's what's happening.

When your client goes offline, there's a bunch of messages in the dispatcher in flight and waiting to be retried. You will see Client X sending message to in the logs whenever dispatcher tries sending each message. If you're offline long enough, the dispatcher will retry any messages it's not had a response to. One of those messages will probably be a /meta/connect which the client uses to poll for messages. So let's say there's a /meta/connect and /meta/disconnect in the dispatcher.

Eventually you come back online. The dispatcher's messages will eventually be retried (in unpredictable order) and this time they'll reach the server. If the server recognises the client (i.e. its session has not expired), then both will get normal successful responses and the if responses['successful'] block above will execute, closing the dispatcher so no more messages can be sent. If the /meta/connect is processed first then you still might receive messages that were queued up for the client while it was offline.

If the server does not recognise the client, its advize method will attach "advice": {"reconnect": "handshake"} to the /meta/connect response.

The client handles that advice only if @state != DISCONNECTED, but calling client.disconnect sets @state = DISCONNECTED as the first thing it does. Likewise, client.connect exits immediately if the client is disconnected.

It seems to me that messages that are in the dispatcher might be retried once you come back online, but if they're unsuccessful (i.e. the session timed out and the server doesn't recognise the clientId as in your logs), the client should still not send or receive any more messages as connect does nothing after disconnect has been called.

Does the above make sense? I realise it might not explain why you're seeing a bug but I thought it might help to explain what I think should be going on in this situation.

@jcoglan
Copy link
Collaborator

jcoglan commented May 30, 2016

I should task: what timeout setting are you using on the server? The checks you're doing with ping/pong are essentially the same as what Dispatcher does: it automatically retries messages if they're not responded to quickly enough, and it bases its timeout on the value advertised by the server. If you think your client is not detecting errors quickly enough, it might be worth lowering your server timeout.

As it is, your application and the Faye client are both trying to do the same thing but without any co-ordination, which might explain the weird effects you're seeing.

@jcoglan
Copy link
Collaborator

jcoglan commented May 30, 2016

Some further thoughts: the unsuccessful /meta/disconnect response means @dispatcher.close is not called, and so the client might be leaving a WebSocket open. On the server (in the RackAdapter class), WebSockets are identified by the last clientId the server saw arrive via that connection, and so it might still be able to receive messages.

However, once a client's session expires (this happens if the server does not hear from the client for 2 * N seconds, where N is the server's timeout setting, which defaults to 60), the server deletes all its subscriptions, so no messages should be being routed to it anyway.

@JanStevens
Copy link
Author

Hello,

Thanks for your explanation this clears up some bits.

Server Timings

@faye = Faye::RackAdapter.new({
      mount: "/#{Printer.config.faye[:path]}",
      timeout: 20,
      ping: 5,
})

Client

Faye::Client.new(Printer.config.faye[:url], interval: 2, retry: 2, timeout: 10)

I'm going to step a little bit back and talk a bit more about why we started with our implementation of Ping/Pong and the manual disconnect (maybe this helps explaining why we got this far down the rabbit hole)

Setup

So we have one faye server instance running on our cloud server. The server will receive jobs and forward them to the correct channel. Every Raspberry Pi in the field has a Faye client that will connect to multiple channels (depending on its environment). When it receives a job from the Faye Server it will execute the job and log a response locally. The Raspberry Pi's are deployed fro 2-3 months in the field and are always on.

Initial problems

For some complete unknown reason we notice that the Faye Client doesn't receive any jobs from the Faye Server. We can trigger manual jobs, they are received on the Faye Server, transmitted to the Client but the Client never receives them. In our client we don't see any indication that it lost it's connection to the Sever.

Our initial idea was that the client has a flaky internet connection so it's normal that they sometimes don't receive a job. But investigating the network we found that the network connection is not always an issue. For example we have run a continues ping on the Raspberry Pi and during a connection lost moment the ping times didn't change at all. We were also able to ssh to the Raspberry Pi's without any problems.

Our measures

We noticed that everything was back oke when we restarted our ruby software. We then implemented the following stuff:

  • Ping/Pong from server to client: This was pure informational but this would give us an indication that something might go wrong. If the client didn't receive a ping message from the server it would log a message
  • Ping/Pong disconnect: When we didn't receive a ping we would manually close the faye connection and force a new one, this didn't work as the Dispatcher will retry the messages in random order and it's possible that a disconnect/connect happen in the wrong order (as you explained).
  • Ping/Pong restart: Our current "solution" we now fail when we don't receive any ping from the server in the last 60 seconds, this will crash our ruby and systemd will restart it

Conclusion

I'll enable full faye logging on a client, maybe this will help pin down our initial problem

@JanStevens
Copy link
Author

Can you send me your email to send the logs too?

They are quite big and contain some sensitive data

You can email me at jan[@]playpass[.]be (remove the brackets ofcourse)

Regards,

@jcoglan
Copy link
Collaborator

jcoglan commented Jun 21, 2016

I don't have an awful lot of time to work on issues at the moment and I suspect that staring at huge log files without access to your system won't get me very far.

The ping/pong mechanism you describe under 'Our measures' is exactly what the client already does using /meta/connect: if it doesn't receive a response to this message within a given time interval, it retries. (This logic applies to all messages, but /meta/connect also serves to tell the server the client is still alive.) If the client is offline for a long time, then it will receive a /meta/connect response telling it to handshake again, which it does and re-registers its subscriptions.

The best thing for you to do is find out what goes on in the transport layer when the Pi goes offline -- does Faye::WebSocket::Client detect a disconnection and emit an event that prompts Faye to reconnect it? If WebSocket does not emit a close event, then Faye will not know it's offline and could keep sending messages to a closed socket without realising or re-opening.

What's happening in the Dispatcher? Is it being told explicitly by the transport that messages are failing, or are they just timing out?

@dhulke
Copy link

dhulke commented Aug 7, 2019

Are are messages replayed in unpredictable order? Is that a behavior of the protocol?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants