This repository has been archived by the owner on Feb 18, 2021. It is now read-only.

statsrelay faster performance #69

Open

simonhf opened this issue Mar 8, 2017 · 0 comments

simonhf commented Mar 8, 2017

Hello! I wanted to use statsrelay to feed multiple statsite instances, but for various reasons I can currently only run one statsrelay process per box.

I forked udpreplay [1] on GitHub and hacked it so that I can replay a large pcap file containing very many statsite UDP packets at various speeds on a test / staging system. I noticed that because statsrelay is single threaded, does not do a greedy UDP read, and does not increase its UDP socket read buffer, I can only reliably read a few tens of thousands of packets per second with 1,500-byte MTU-sized packets without hitting 100% CPU and/or having UDP packet drops as reported by netstat -anus.

So I hacked statsrelay [2] to add those features, plus the ability to show in real time the packets per second read. It's now able to handle a lot more traffic (see below).
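As a rough illustration of the real-time packets-per-second display (not the exact counters added in the linked commit, just a minimal sketch in C):

    /* Sketch: count packets in the read path and print a rate once per second.
     * count_packet() is an illustrative name, not a statsrelay identifier. */
    #include <stdio.h>
    #include <time.h>

    static unsigned long pkt_count;
    static time_t last_report;

    static void count_packet(void) {
        time_t now = time(NULL);
        pkt_count++;
        if (now != last_report) {
            fprintf(stderr, "pps: %lu\n", pkt_count);
            pkt_count = 0;
            last_report = now;
        }
    }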

What is a 'greedy' read? Normally ev / epoll calls your callback function saying that there's a UDP packet to be read; statsrelay dutifully reads one packet and then returns. A greedy read instead continues to read packets until read() says there are no more packets to read :-) This saves a bit of CPU because there no longer needs to be the overhead of one callback per packet read.
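A minimal sketch of what such a greedy read loop looks like, assuming a non-blocking UDP socket registered with the event loop (the function and buffer names here are illustrative, not statsrelay's actual identifiers):

    #include <errno.h>
    #include <sys/types.h>
    #include <sys/socket.h>

    #define BUF_SIZE 65536

    static void on_udp_readable(int fd) {
        char buf[BUF_SIZE];

        for (;;) {
            ssize_t n = recv(fd, buf, sizeof(buf), 0);
            if (n < 0) {
                /* EAGAIN / EWOULDBLOCK: the socket queue is drained,
                 * so return to the event loop until the next callback. */
                if (errno == EAGAIN || errno == EWOULDBLOCK)
                    break;
                /* any other error would be handled here */
                break;
            }
            /* process_packet(buf, n);  -- hand the datagram to the parser */
        }
    }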

Why would packets be dropped by statsrelay? The network stack receives the UDP packets and places them on a queue, the same queue that statsrelay reads from. However, that queue is not infinite, and if it is full then the network stack simply 'drops' the packet because it has nowhere to put it. netstat -anus shows the number of dropped packets as errors. Therefore, it is critical that statsrelay reads and processes the incoming packets faster than they are being sent... otherwise the queue fills up and packets get dropped. So by fork()ing statsrelay, more processes are available for reading, and by increasing the queue size from the network stack default of 124 KB to 32 MB, a lot more scheduling bumps can happen before packets get dropped :-)
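Growing the per-socket receive queue is just a setsockopt() call; a minimal sketch, assuming net.core.rmem_max has been raised to allow the requested size (the 32 MB figure is the one quoted above, and set_rcvbuf() is an illustrative helper name):

    #include <stdio.h>
    #include <sys/socket.h>

    static int set_rcvbuf(int fd, int bytes) {
        /* Note: Linux caps the effective size at net.core.rmem_max
         * unless SO_RCVBUFFORCE is used with CAP_NET_ADMIN. */
        if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes)) < 0) {
            perror("setsockopt(SO_RCVBUF)");
            return -1;
        }
        return 0;
    }

    /* e.g. set_rcvbuf(sock_fd, 32 * 1024 * 1024); */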

I added an option so that it can fork() itself a number of times after starting to listen on port 8125. This means that all forked instances listen on the same port... so that if one is busy processing, another instance can read the packet.
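The idea, sketched below, is that forking after the socket is bound lets every child inherit the same file descriptor, so the kernel hands each incoming packet to whichever process reads first (fork_count and fork_workers() are illustrative names, not the actual option added in the commit):

    #include <stdio.h>
    #include <sys/types.h>
    #include <unistd.h>

    static void fork_workers(int fork_count) {
        for (int i = 0; i < fork_count; i++) {
            pid_t pid = fork();
            if (pid < 0) {
                perror("fork");
                return;
            }
            if (pid == 0) {
                /* child: fall through into the normal event loop,
                 * reading from the already-bound UDP socket */
                return;
            }
            /* parent: keep forking, then run its own event loop too */
        }
    }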

FYI the Python tests still work but only if the fork count is zero. If somebody finds this interesting, maybe they can hack the tests to test with a fork count > 0 ? Thanks!

Using 4 statsrelay instances pointing at 10 statsite instances on an Amazon c3.8xl box, it can process about 84k PPS @ 9k MTU with no dropped packets, with each statsrelay using about 79% CPU. That's about 720 MB/s, well on the way to maxing out the 10 Gbit NIC.

Posting this here in case it's useful for somebody else :-)

[1] simonhf/udpreplay@40e026a
[2] simonhf@a0b34ac

theatrus referenced this issue in lyft/statsrelay Oct 29, 2017
theatrus referenced this issue in lyft/statsrelay May 1, 2020