
connection.go:265 - unable to queue heartbeat. fail now. TODO: FIXME #4

justinfx opened this issue Jul 21, 2011 · 7 comments

@justinfx
Contributor

Let me just start by saying thank you so much for this library. I'm learning Go right now by porting a message server over from Python (Tornado, TornadIO, ZMQ), and I'm really excited by these bindings so far!

I'm hitting one of your FIXMEs in connection.go:265:
c.sio.Log("sio/keepalive: unable to queue heartbeat. fail now. TODO: FIXME", c)

What I am doing is rapid-fire throughput tests with a command-line client. When I use only a single client, I don't hit this wall. When I run 2 concurrent connections, the messaging hangs on this for a moment and ultimately finishes, returning these errors after all the queued messages have been delivered.

My question is mainly about what is really happening here. Does it have anything to do with the heartbeats being queued on the same channel as the messages, so that the messages saturate the buffer to the point where heartbeats can't get through? Or is it a matter of raising the queue length value?
Is it a situation where ideally the heartbeat needs to be suspended while there is actual activity?

Thanks!
-- justin

@madari
Owner

madari commented Jul 22, 2011

I'm glad to hear that go-socket.io has been useful for you so far!

You are quite correct with your diagnosis. Heartbeats use the same queue as all the other messages.
What's happening here is that the client is reading too slowly from the server, or the server is pumping out too
many messages (it might be better to test the throughput using a ping-pong model). In other words,
the flusher ends up spending too much time writing the messages out, so the queue does indeed get fully saturated.

The current solution (= the FIXME) is far from optimal. I think the mechanism should give the
flusher more time to write the messages instead of just giving up immediately if the queue isn't writable.
Also, as you said, it might be a good idea to put heartbeats on hold when there's activity going on anyway.

I'm going to look into this later today.

The queue length in the default configuration (= 10) might be ridiculously low; I haven't given it much thought yet.
Tip of the day: try raising the queue length to something big (~1000?) and give it another try.
This is what I always do when benchmarking, and it seems to make things run more smoothly.
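Roughly along these lines (a minimal sketch; the exact constructor and config names may differ from what's in the tree right now, but QueueLength is the knob in question):

```go
// Sketch only: bump the outgoing queue length well above the default of 10
// when benchmarking. DefaultConfig/NewSocketIO are written from memory here.
config := socketio.DefaultConfig
config.QueueLength = 1000
sio := socketio.NewSocketIO(&config)
```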

Thanks!

@justinfx
Contributor Author

I tried increasing the QueueLength to a ridiculously high number like 100k, and the issue does seem to stop. So my guess is that it IS just too many messages saturating the queue and blocking out the heartbeats. What do you think about separating messages into one channel, and heartbeat/disconnect/handshake into a priority channel that comes first in the select statement in the flusher?
And for the heartbeats, maybe they could be suspended while the client is sending messages, since we know it's alive. TornadIO does this and waits 10 seconds after the last known message before resuming heartbeats.
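To illustrate the priority-channel idea (names are made up; this is not the actual flusher in connection.go): since Go's select picks randomly among ready cases, the priority has to be made explicit with a two-stage select.

```go
// Sketch only: drain a control frame (heartbeat/disconnect/handshake) first
// if one is pending, otherwise block on whichever queue delivers next.
func flusher(control, messages <-chan []byte, write func([]byte) error) {
	for {
		select {
		case frame := <-control:
			if write(frame) != nil {
				return
			}
			continue
		default:
		}
		select {
		case frame := <-control:
			if write(frame) != nil {
				return
			}
		case frame := <-messages:
			if write(frame) != nil {
				return
			}
		}
	}
}
```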

@justinfx
Contributor Author

When you get back into this project again, can you take a look at my work on connection.go?
justinfx@16ef3a4#connection.go

It seems to work, in that it's not backing up the heartbeats anymore. But I wanted you to code-review it, since my Go experience is still pretty limited :-) I don't want to send a pull request until I'm confident that it's good code.
What I'm seeing now when I run my benchmark is that with concurrent clients (usually around 4 clients in goroutines) I get broken pipe errors in Client.Send(). I haven't tracked it down yet, but it seems the server is disconnecting them for whatever reason. It never happens with only a single client connection running. I'm probably doing something stupid.

@justinfx
Contributor Author

Never mind on the broken pipe issue. It was something with the Client struct and me sending messages as strings: it was getting an "invalid argument" somehow during the Encode of the string. When I switched to passing a struct for JSON marshalling, it didn't have a single issue. With the fix I made, I can now run 10 concurrent clients firing at high speed with no hiccups or disconnects.
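In case it helps anyone else hitting the same thing, the change was basically this (the message type is made up, and the Send calls in the comments just stand in for the real client API):

```go
// Sketch only: a struct with json tags marshals cleanly via encoding/json,
// where the bare string was tripping up the encoder under load.
type benchMsg struct {
	Seq  int    `json:"seq"`
	Body string `json:"body"`
}

// before: client.Send("some raw string")            // "invalid argument" under load
// after:  client.Send(benchMsg{Seq: 1, Body: "hi"}) // JSON-encoded, no errors
```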

@justinfx justinfx reopened this Jul 31, 2011
@justinfx
Contributor Author

I have submitted a pull request that addresses this, along with some other features.

@madari
Owner

madari commented Aug 26, 2011

Can you post your test client somewhere?

@justinfx justinfx reopened this Aug 26, 2011
@justinfx
Contributor Author

Here is a stress test client I threw together:
https://gist.github.com/1174699

It's hard-coded to send 10k messages per client, but the rest of the options are exposed as arguments.
I was actually getting the following Send errors with too many messages, or messages that were too large:
broken pipe & short read

You can play with the numbers.
I was just noticing how the system never reclaims the memory once clients disconnect. Evan Shaw informed me that apparently Go does not currently release RAM back to the system.
Using this test, you can see whether the RAM gradually keeps growing over time with constant high-traffic throughput. I also noticed how much config.QueueLength affects the RAM: if I turn it up to a very high number, you get instant RAM saturation.

Usage:

gotest -test.run="StressTest" host:port intLoops intClients intBytes

Example:

gotest -test.run="StressTest" localhost:9999 5 10 1000

This would loop the whole process 5 times, connecting 10 clients each time, each of which sends 1000-byte messages.
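For reference, the gist boils down to roughly this shape (heavily simplified: this sketch dials a plain TCP connection instead of doing the real socket.io handshake, and all names are placeholders):

```go
// Sketch of the stress-test structure: clients × msgs concurrent writers.
package main

import (
	"bytes"
	"net"
	"sync"
)

func stress(addr string, clients, msgs, size int) {
	payload := bytes.Repeat([]byte("x"), size)
	var wg sync.WaitGroup
	for i := 0; i < clients; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			conn, err := net.Dial("tcp", addr)
			if err != nil {
				return
			}
			defer conn.Close()
			for j := 0; j < msgs; j++ {
				if _, err := conn.Write(payload); err != nil {
					return // broken pipe shows up here when the server drops the client
				}
			}
		}()
	}
	wg.Wait()
}

func main() {
	for loop := 0; loop < 5; loop++ {
		stress("localhost:9999", 10, 10000, 1000) // 5 loops, 10 clients, 1000-byte messages
	}
}
```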
