This repository has been archived by the owner on Feb 18, 2021. It is now read-only.

statsrelay faster performance #69

Open

simonhf opened this issue Mar 8, 2017 · 0 comments

simonhf commented Mar 8, 2017

Hello! I wanted to use statsrelay to feed multiple statsite instances, but for various reasons I can currently only run one statsrelay process per box.

I forked udpreplay [1] on GitHub and hacked it so that I can replay a large pcap file containing very many statsite UDP packets at various speeds on a test / staging system. I noticed that because statsrelay is single threaded, does not do a greedy UDP read, and does not increase its UDP socket read buffer, I can only reliably read a few tens of thousands of packets per second with 1,500-byte MTU-sized packets without hitting 100% CPU and/or having UDP packet drops as reported by netstat -anus.

So I hacked statsrelay [2] to add those features, plus the ability to show in real time the packets per second read. It's now able to handle a lot more traffic (see below).
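As a rough illustration of the real-time packets-per-second display (not the exact counters added in the linked commit, just a minimal sketch in C):

    /* Sketch: count packets in the read path and print a rate once per second.
     * count_packet() is an illustrative name, not a statsrelay identifier. */
    #include <stdio.h>
    #include <time.h>

    static unsigned long pkt_count;
    static time_t last_report;

    static void count_packet(void) {
        time_t now = time(NULL);
        pkt_count++;
        if (now != last_report) {
            fprintf(stderr, "pps: %lu\n", pkt_count);
            pkt_count = 0;
            last_report = now;
        }
    }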

What is a 'greedy' read? Normally ev / epoll calls your callback function saying that there's a UDP packet to be read; statsrelay dutifully reads one packet and then returns. A greedy read instead continues to read packets until read() says there are no more packets to read :-) This saves a bit of CPU because there no longer needs to be the overhead of one callback per packet read.
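A minimal sketch of what such a greedy read loop looks like, assuming a non-blocking UDP socket registered with the event loop (the function and buffer names here are illustrative, not statsrelay's actual identifiers):

    #include <errno.h>
    #include <sys/types.h>
    #include <sys/socket.h>

    #define BUF_SIZE 65536

    static void on_udp_readable(int fd) {
        char buf[BUF_SIZE];

        for (;;) {
            ssize_t n = recv(fd, buf, sizeof(buf), 0);
            if (n < 0) {
                /* EAGAIN / EWOULDBLOCK: the socket queue is drained,
                 * so return to the event loop until the next callback. */
                if (errno == EAGAIN || errno == EWOULDBLOCK)
                    break;
                /* any other error would be handled here */
                break;
            }
            /* process_packet(buf, n);  -- hand the datagram to the parser */
        }
    }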

Why would packets be dropped by statsrelay? The network stack receives the UDP packets and places them on a queue, the same queue that statsrelay reads from. However, that queue is not infinite, and if it is full then the network stack simply 'drops' the packet because it has nowhere to put it. netstat -anus shows the number of dropped packets as errors. Therefore, it is critical that statsrelay reads and processes the incoming packets faster than they are being sent... otherwise the queue fills up and packets get dropped. So by fork()ing statsrelay, more processes are available for reading, and by increasing the queue size from the network stack default of 124 KB to 32 MB, a lot more scheduling bumps can happen before packets get dropped :-)
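Growing the per-socket receive queue is just a setsockopt() call; a minimal sketch, assuming net.core.rmem_max has been raised to allow the requested size (the 32 MB figure is the one quoted above, and set_rcvbuf() is an illustrative helper name):

    #include <stdio.h>
    #include <sys/socket.h>

    static int set_rcvbuf(int fd, int bytes) {
        /* Note: Linux caps the effective size at net.core.rmem_max
         * unless SO_RCVBUFFORCE is used with CAP_NET_ADMIN. */
        if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes)) < 0) {
            perror("setsockopt(SO_RCVBUF)");
            return -1;
        }
        return 0;
    }

    /* e.g. set_rcvbuf(sock_fd, 32 * 1024 * 1024); */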

I added an option so that it can fork() itself a number of times after starting to listen on port 8125. This means that all forked instances listen on the same port... so that if one is busy processing, another instance can read the packet.
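The idea, sketched below, is that forking after the socket is bound lets every child inherit the same file descriptor, so the kernel hands each incoming packet to whichever process reads first (fork_count and fork_workers() are illustrative names, not the actual option added in the commit):

    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    static void fork_workers(int fork_count) {
        for (int i = 0; i < fork_count; i++) {
            pid_t pid = fork();
            if (pid < 0) {
                perror("fork");
                return;
            }
            if (pid == 0) {
                /* child: fall through into the normal event loop,
                 * reading from the already-bound UDP socket */
                return;
            }
            /* parent: keep forking, then run its own event loop too */
        }
    }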

FYI the Python tests still work but only if the fork count is zero. If somebody finds this interesting, maybe they can hack the tests to test with a fork count > 0 ? Thanks!

Using 4 statsrelay instances pointing at 10 statsite instances on an Amazon c3.8xl box, it can process about 84k PPS @ 9k MTU with no dropped packets, with each statsrelay using about 79% CPU. That's about 720 MB/s, well on the way to maxing out the 10 Gbit NIC.

Posting this here in case it's useful for somebody else :-)

[1] simonhf/udpreplay@40e026a
[2] simonhf@a0b34ac

theatrus referenced this issue in lyft/statsrelay Oct 29, 2017
theatrus referenced this issue in lyft/statsrelay May 1, 2020