Large flush_interval causing odd synchronization behavior across all clients #215

Open
ghost opened this issue Oct 15, 2016 · 4 comments

@ghost

ghost commented Oct 15, 2016

After running statsite for a few days with a flush interval of one hour, all clients seem to be synchronizing their flush times after being somewhat randomly distributed to begin with.

After two days of uptime, roughly 80+% of clients are logging at exactly the same time, exactly 30m into the hour. They were somewhat evenly distributed in [0m, 40m] to begin with. Is anything being done that inherently causes this synchronization? It is a problem when logging to a 3rd-party service from statsite: all requests hit the service at the same time, when it would be significantly better to have them more evenly distributed. I could combat it with a random sleep on my end.

The image below is taken over roughly 48 hours. Each row in the image has its own scale, so the bar heights don't mean much; what's interesting is the distribution of the clients' log attempts during the hour interval. As time goes on (downwards), we can see the clustering around the 30m mark on the hour. If one were to zoom in on the cluster at the 30m mark, it is incredibly tight (within 30s of the 30m mark).

[Image: request distribution]

@luca3m
Contributor

luca3m commented Oct 17, 2016

I think it's somewhat related to what I reported here. Statsite calculates the flush time relative to the first run, not at absolute multiples of the defined period, so the execution time of the flush function may cause drift in the long run.

My suggestion is to have an absolute timer.

@ghost
Author

ghost commented Oct 17, 2016

@luca3m Reading through your ticket, I would also prefer an absolute timer interval over a relative one. I believe that this behavior is being caused by something in the ae code as well, and I have spent a bit of time looking through it for a rounding error, decimation, or something similar that could explain it. At the moment, though, I'm at a bit of a loss.

@luca3m
Contributor

luca3m commented Oct 17, 2016

ae leaves interval management to the user. I think it's possible to have an absolute timer by returning the correct next timestamp from the callback here.
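As a minimal sketch of that idea: assuming the timer callback's return value is interpreted as the number of milliseconds until the next firing (as in the ae event loop that ships with Redis; whether statsite's copy behaves identically is an assumption here), the callback could return the distance to the next interval boundary instead of the fixed interval. The helper name `ms_until_next_boundary` is hypothetical and is not part of statsite or ae.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper (not part of statsite or ae).
 *
 * Returns how many milliseconds remain from now_ms until the next
 * absolute multiple of interval_ms. If now_ms is exactly on a boundary,
 * the full interval is returned so the timer does not fire again at once.
 */
int64_t ms_until_next_boundary(int64_t now_ms, int64_t interval_ms)
{
    assert(interval_ms > 0 && now_ms >= 0);

    /* Time already spent inside the current interval. */
    int64_t elapsed = now_ms % interval_ms;

    return interval_ms - elapsed;
}
```

A flush callback that ends with `return (int)ms_until_next_boundary(now_ms, interval_ms);`, where `now_ms` is read after the flush finishes, would absorb the flush's own execution time rather than adding it to every subsequent period. The cast is safe only while the interval fits in an `int`, which holds for a one-hour interval (3,600,000 ms).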

@ghost
Author

ghost commented Oct 17, 2016

I am not too sure about portability, but would it not be easier to just implement the timer system using the POSIX timer_create function? It seems like we're already using pthreads for the flush callback function, and timer_create creates a thread on each callback, which would be a similar amount of overhead. I guess it would be a pain to use ae for one set of systems and timer_create for others, but I think if you really want real-time, synchronized logging it could be easy to implement.

timer_create and timer_settime also have a nifty trick: they make it really simple to synchronize the first firing of the timer to the system clock, as can be seen in the man pages here. If you use TIMER_ABSTIME, you can set it_value to an absolute time on the clock, such as the next flush_interval-aligned monotonic clock time, and then set the interval to flush_interval. This would save us from having to do the math to get the alignment right, which probably wouldn't be as accurate.
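A rough sketch of what this could look like, under a few assumptions: the helper names `aligned_schedule` and `arm_flush_timer` are hypothetical, notification uses `SIGEV_THREAD` (so the callback runs in a new thread), and the clock is `CLOCK_REALTIME` so that boundaries land on wall-clock multiples of the interval. `CLOCK_MONOTONIC` could be armed the same way, but its starting point is unspecified, so its multiples would not line up with the hour.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <signal.h>
#include <string.h>
#include <time.h>

/*
 * Hypothetical helper: build a schedule whose first expiry is the next
 * absolute multiple of interval_sec after now_sec, repeating every
 * interval_sec after that.
 */
struct itimerspec aligned_schedule(time_t now_sec, time_t interval_sec)
{
    struct itimerspec its;

    assert(interval_sec > 0 && now_sec >= 0);
    memset(&its, 0, sizeof its);
    its.it_value.tv_sec = (now_sec / interval_sec + 1) * interval_sec;
    its.it_interval.tv_sec = interval_sec;
    return its;
}

/*
 * Hypothetical wrapper: create a timer that calls on_flush in a new
 * thread at every aligned boundary. Returns 0 on success, -1 on error.
 */
int arm_flush_timer(timer_t *timer, time_t interval_sec,
                    void (*on_flush)(union sigval))
{
    struct sigevent sev;
    struct timespec now;
    struct itimerspec its;

    if (interval_sec <= 0)
        return -1;

    memset(&sev, 0, sizeof sev);
    sev.sigev_notify = SIGEV_THREAD;
    sev.sigev_notify_function = on_flush;

    if (clock_gettime(CLOCK_REALTIME, &now) != 0)
        return -1;
    if (timer_create(CLOCK_REALTIME, &sev, timer) != 0)
        return -1;

    /* TIMER_ABSTIME: it_value is an absolute time on the timer's clock. */
    its = aligned_schedule(now.tv_sec, interval_sec);
    if (timer_settime(*timer, TIMER_ABSTIME, &its, NULL) != 0) {
        timer_delete(*timer);
        return -1;
    }
    return 0;
}
```

With a one-hour interval this would fire every client on the hour, so it fixes the drift but not the clustering described in the original report. On older glibc versions the timer functions require linking with `-lrt`.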
