Large `flush_interval` causing odd synchronization behavior across all clients #215

ghost · 2016-10-15T02:14:32Z

After running statsite for a few days with a flush interval of one hour, all clients seem to be synchronizing their flush times after being somewhat randomly distributed to begin with.

After two days of uptime, roughly 80+% of clients are logging at exactly the same time, exactly 30m into the hour. They were somewhat evenly distributed in [0m, 40m] to begin with. Is statsite doing anything that inherently causes this synchronization? It is a problem because, when logging from statsite to a 3rd-party service, all requests hit the service at the same time, when it would be significantly better to have them more evenly distributed. I could combat it with a random sleep on my end.

The image below was taken over roughly 48 hours. Each row in the image has its own scale, so the bar heights don't mean much; what's interesting is the distribution of the clients' log attempts during the hour interval. As time goes on (downwards), we can see the clustering around the 30m mark on the hour. If one were to zoom in on the cluster, it is incredibly tight: within 30s of exactly the 30m mark.


luca3m · 2016-10-17T15:33:17Z

I think it's somewhat related to what I reported here. Statsite calculates the flush time relative to the first run rather than at absolute multiples of the configured period, so the execution time of the flush function can accumulate as drift over the long run.

My suggestion is to have an absolute timer.

ghost · 2016-10-17T20:31:18Z

@luca3m Reading through your ticket I also would prefer an absolute timer interval as opposed to a relative one. I believe that this behavior is being caused by something in the ae code as well, and have spent a bit of time looking through it attempting to find a rounding error or decimation or similar that could explain this behavior. At the moment, though, I'm at a bit of a loss.

luca3m · 2016-10-17T21:22:08Z

ae leaves interval management to the user. I think it's possible to have an absolute timer by returning the correct next timestamp from the callback here.
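A minimal sketch of that idea, assuming the timer callback's return value is taken as the delay in milliseconds until the next firing. `ms_until_next_flush` is a hypothetical helper, not something that exists in statsite or ae; it only shows the arithmetic for landing on the next multiple of the interval instead of "now plus interval".

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper (not part of statsite or ae).
 * Given the current wall-clock time and the flush interval, both in
 * milliseconds, return how long to wait until the next multiple of the
 * interval. The result is always in (0, interval_ms]. */
static int64_t ms_until_next_flush(int64_t now_ms, int64_t interval_ms)
{
    assert(now_ms >= 0 && interval_ms > 0);
    return interval_ms - (now_ms % interval_ms);
}
```

Because the delay is recomputed from the clock on every firing, time spent inside the flush function does not accumulate as drift. It also means every instance with the same interval fires on the same boundary, so this fixes drift rather than spreading clients apart.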

ghost · 2016-10-17T21:46:19Z

I am not too sure about portability, but would it not be easier to just implement the timer system using the POSIX timer_create function? It seems like we're already using pthreads for the flush callback function, and timer_create (when used with SIGEV_THREAD) runs each callback in a new thread, which would be a similar amount of overhead. I guess it would be a pain to use ae on one set of systems and timer_create on others, but I think if you really want real-time, synchronized logging it could be easy to implement.

timer_create and timer_settime also have a nifty trick: they make it really simple to synchronize the first firing of the timer to the system clock, as can be seen in the man pages here. If you use TIMER_ABSTIME, you can set it_value to an absolute time on the clock, such as the next flush_interval-aligned monotonic clock time, and then set the interval to flush_interval. This would save us from doing the math to get the alignment right ourselves, which probably won't be as accurate.
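A rough sketch of what that could look like, under some assumptions: `aligned_timer_spec`, `on_flush` and `arm_aligned_timer` are hypothetical names, the interval is in whole seconds, and the sketch uses CLOCK_REALTIME rather than a monotonic clock so that the boundaries line up with wall-clock time.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper: fill `spec` so that the first expiry is the next
 * interval-aligned second strictly after `now`, repeating every
 * `interval_s` seconds. */
static void aligned_timer_spec(const struct timespec *now, time_t interval_s,
                               struct itimerspec *spec)
{
    assert(interval_s > 0);
    memset(spec, 0, sizeof(*spec));
    spec->it_value.tv_sec = now->tv_sec - (now->tv_sec % interval_s) + interval_s;
    spec->it_interval.tv_sec = interval_s;
}

/* Hypothetical flush entry point; with SIGEV_THREAD it runs in a new
 * thread on each expiry. */
static void on_flush(union sigval sv)
{
    (void)sv; /* the actual flush work would go here */
}

/* Create and arm a periodic, clock-aligned timer.
 * Returns 0 on success, -1 on error (errno is set by the failing call). */
static int arm_aligned_timer(time_t interval_s, timer_t *out)
{
    struct sigevent sev;
    struct timespec now;
    struct itimerspec spec;

    memset(&sev, 0, sizeof(sev));
    sev.sigev_notify = SIGEV_THREAD;
    sev.sigev_notify_function = on_flush;

    if (timer_create(CLOCK_REALTIME, &sev, out) != 0)
        return -1;
    if (clock_gettime(CLOCK_REALTIME, &now) != 0)
        return -1;
    aligned_timer_spec(&now, interval_s, &spec);
    /* TIMER_ABSTIME: it_value is an absolute time on the timer's clock. */
    return timer_settime(*out, TIMER_ABSTIME, &spec, NULL);
}
```

Calling `arm_aligned_timer(3600, &t)` would request an hourly flush on the hour. On older glibc versions this needs linking with `-lrt`.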

sleepybishop mentioned this issue May 22, 2017

align flush interval to clock #251

Merged
