This repository has been archived by the owner on Feb 18, 2021. It is now read-only.

Question: Highly available statsd #56

Open
SegFaultAX opened this issue Apr 22, 2015 · 5 comments

@SegFaultAX

Hello statsrelay team!

I'm curious about how Uber runs statsd in production, in particular:

  1. How many statsrelays do you have in front of your statsd cluster?
  2. How are statsrelays+statsds configured for high availability?
    • 1 primary, N standbys, keepalived/etc. to coordinate automatic failover
    • N primaries, tcp load balancer (haproxy/etc.), coordinated virtual shard configuration updates
    • something else?
  3. If you're running multiple relays that can accept writes, how do you coordinate configuration updates so all relays get the updated shard configuration at the same time? How do you handle network partitions between relays and statsd? What if only a subset of the relays see the partition?
  4. How do you scale your cluster? If the cluster is elastic, how do you autoscale it?
  5. How are you monitoring statsrelay? statsd? How do you detect unhealthy statsds and rebalance the shard configuration to compensate?

Unfortunately it seems that not very many people are talking about running HA statsd, at least relative to the number of people ostensibly using statsd. I'm encouraged to see that Uber has dedicated a significant amount of time to making this possible, so any insight on how your architecture has worked out in practice would be hugely valuable.

Thanks for your time and awesome work!

  • Michael-Keith
@JeremyGrosser
Contributor

Thanks for reaching out! I was the original author of statsrelay and can answer some of these questions, but I'm no longer at Uber so things might've changed a bit since I left. @eklitzke or @sloppyfocus can correct me where I'm wrong.

Every production server runs an instance of statsrelay, which applications connect to locally (using either TCP or UDP; it doesn't really matter, because loopback never drops packets).
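
For reference, the client side of that is just the plain statsd line protocol against localhost. A minimal sketch, assuming statsrelay is listening on the conventional statsd port 8125 (the real port is whatever your config binds):

```python
import socket

# Emit one statsd counter over UDP to the statsrelay running on this host.
# Port 8125 is an assumption (the conventional statsd port); the real port
# is whatever your local statsrelay is configured to listen on.
sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.sendto(b"myapp.requests:1|c", ("127.0.0.1", 8125))
sock.close()
```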

Those statsrelays are all configured with 4096 shards assigned evenly across 8 physical hosts, with 12 instances of statsite running on each host (one for each physical CPU), each listening on a different port.
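
To make the shard math concrete, here is a purely illustrative sketch of that layout; the host names and base port are made up, and the real mapping lives in the statsrelay config rather than in code like this:

```python
# Illustrative only: spread 4096 logical shards evenly over 8 hosts x 12
# statsite instances (96 backends total). Host names and base port 8200
# are hypothetical.
HOSTS = ["statsite%02d.example.com" % i for i in range(8)]
PORTS = [8200 + n for n in range(12)]  # one statsite per physical CPU
BACKENDS = ["%s:%d" % (h, p) for h in HOSTS for p in PORTS]

NUM_SHARDS = 4096
shard_map = {shard: BACKENDS[shard % len(BACKENDS)] for shard in range(NUM_SHARDS)}
# 4096 shards over 96 backends means each statsite owns roughly 42-43 shards.
```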

All of those statsite instances are configured to send carbon metrics to haproxy running on localhost, which round robins across those same statsrelay instances on their carbon relay port. We use statsrelay's carbon relaying here simply because it's more performant than carbon-relay... It also means one less daemon to worry about.

Finally, those carbon statsrelays use 4096 logical shards and 8 (maybe more?) hosts running carbon-cache for the actual storage.

This whole system isn't foolproof and most of the changes to scale it out or replace a dead server are still relatively manual. A lot of it is aided by existing automation we had in place for haproxy and the like.

We never made a concerted effort to coordinate updates because we were willing to accept a few minutes where relaying was inconsistent. By design, statsrelay will only reload its config on SIGHUP, so it wouldn't be terribly difficult to build a system to update the files on all of your relay hosts, then SIGHUP them roughly all at once. Worst case scenario, you'll have a conflict for a single metric interval (10 seconds by default in statsite).
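
A minimal sketch of that kind of coordinated reload, assuming SSH access to the relay hosts and a statsrelay process that pkill can find by name; the host names and config path are placeholders:

```python
import subprocess

RELAY_HOSTS = ["relay01.example.com", "relay02.example.com"]  # placeholders
NEW_CONFIG = "statsrelay.conf"

# Phase 1: push the updated shard config everywhere, reloading nothing yet.
for host in RELAY_HOSTS:
    subprocess.run(["scp", NEW_CONFIG, "%s:/etc/statsrelay.conf" % host], check=True)

# Phase 2: fire all the SIGHUPs without waiting on each other, then reap.
procs = [subprocess.Popen(["ssh", host, "pkill", "-HUP", "statsrelay"])
         for host in RELAY_HOSTS]
for proc in procs:
    proc.wait()
```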

If you connect to a running statsrelay and send "stats\n" it will respond with some counters showing the number of relayed/dropped/buffered lines. We ingest these with the same scripts we use to collect server cpu/disk/memory stats every 60 seconds and feed them back into the graphite system and trigger nagios alerts if anything exceeds a threshold... You monitor statsrelay pretty much the same way you'd monitor memcache.
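
Scraping those counters is just a TCP connect plus a one-line command; a rough sketch, assuming port 8125 (the actual port and the response format depend on your config and statsrelay version):

```python
import socket

# Connect to the local statsrelay, send the "stats" command, and print
# whatever counters come back. Port 8125 is an assumption.
with socket.create_connection(("127.0.0.1", 8125), timeout=5) as sock:
    sock.sendall(b"stats\n")
    try:
        while True:
            data = sock.recv(4096)
            if not data:
                break
            print(data.decode(), end="")
    except socket.timeout:
        pass  # server kept the connection open; we have what we need
```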

As you pointed out, netsplits are a thing that happen. statsrelay doesn't really try to address this problem other than by using its reconnect and queuing logic to try to minimize the impact of a network outage. Depending on the volume of metrics you're sending, statsrelay may hit its queue size limit and drop data... This seemed like the most sensible behavior that didn't involve trying to buffer things to disk. You can adjust the queue size through a config option.

There's still a lot of work to be done to make it easier to get this all set up and scale it out, but the building blocks are mostly there. Personally, I'd love to see someone try to reproduce and autoscale this setup on top of Kubernetes. After that, carbon-cache would be the next part of the stack to try to gut and make more scalable. Migrating data between hosts is really a pain in the ass. But that's another project entirely...

@SegFaultAX
Author

@JeremyGrosser Thank you for the fantastic reply! Would you mind clarifying further the following points:

> All of those statsite instances are configured to send carbon metrics to haproxy running on localhost, which round robins across those same statsrelay instances on their carbon relay port.

Which statsrelay instances are you referring to? From what you've described so far, the packet flow looks something like:

application -> local statsrelay -> statsite cluster -> haproxy -> ???

When you say "those same statsrelay instances", do you mean the instances of statsrelay running on the statsite boxes, the ones back on the application servers, or somewhere else?

> We never made a concerted effort to coordinate updates because we were willing to accept a few minutes where relaying was inconsistent.

What are you/they using to roll out changes to the configuration? With a large-ish cluster of application servers, it seems like the upper bound on how long the relays can have incorrect configuration is quite long without coordination. While they're out of sync, it's very possible that you're clobbering data in graphite since it can only accept a single write to a given {series, interval} bucket.

> If you connect to a running statsrelay and send "stats\n" it will respond with some counters showing the number of relayed/dropped/buffered lines. We ingest these with the same scripts we use to collect server cpu/disk/memory stats every 60 seconds and feed them back into the graphite system and trigger nagios alerts if anything exceeds a threshold...

This seems like a potentially huge number of checks given that statsrelay is running on every server. What kinds of checks are in place and what kinds of alerts on statsrelay do operators typically need to respond to?

Again, thank you very much for your time and the detailed reply! It would be awesome to capture this discussion in the documentation as "best practices" for running statsrelay/statsite/graphite at scale.

Cheers,
Michael-Keith

Edit: Spelling is hard :), Edit2: More typos.

@drawks

drawks commented Jun 2, 2015

This is a very interesting discussion of your configuration, thank you for that. I'm curious though how/if you handle replication. Is that just farmed out at the whisper layer using statsite sink logic? Could/should/would it be a good idea to add the notion of replica shards to statsrelay?

@drawks

drawks commented Jun 2, 2015

@eklitzke @sloppyfocus any chance either of you can shine some light on this?

@JeremyGrosser
Contributor

@SegFaultAX sorry, I mixed up my words... I was referring to a separate set of statsrelay instances acting as carbon relays. We used Puppet to roll out changes to the configuration, and you're right, that leads to some period of time where the hashing is inconsistent. That was generally something we considered acceptable until we got around to building something that could update all the configs and then SIGHUP all the statsrelay instances at once. It'll never be perfect, but it should be possible to get them all to reload within a few hundred ms of each other.

We ended up with two nagios checks for statsrelay on each host... One would try to connect to statsrelay, get the output of the stats command, then return OK unless statsrelay reported queuing/dropping a lot of lines to backends, with some reasonable thresholds (e.g. dropped_lines < 10, queues for all backends < 100). The other nagios check looked at RSS memory usage for statsrelay and alerted if it was getting close to the 1GB ulimit set by the init script... This was to protect/warn against memory leaks that occurred as a regression in a couple of older releases of statsrelay.
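
For illustration, the first of those checks might look roughly like the following as a Nagios-style plugin. The thresholds come from the comment above; the port and the counter names matched here ("dropped", "queue") are assumptions about the stats output, not a documented format:

```python
#!/usr/bin/env python
"""Rough sketch of a Nagios-style check against statsrelay's stats output.

Thresholds follow the ones mentioned above (dropped lines < 10, per-backend
queue < 100). The TCP port and the counter-name matching are assumptions.
"""
import socket
import sys

OK, CRITICAL = 0, 2

def fetch_stats(host="127.0.0.1", port=8125, timeout=5):
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"stats\n")
        chunks = []
        try:
            while True:
                data = sock.recv(4096)
                if not data:
                    break
                chunks.append(data)
        except socket.timeout:
            pass
    return b"".join(chunks).decode()

def main():
    try:
        raw = fetch_stats()
    except OSError as exc:
        print("CRITICAL: cannot reach statsrelay: %s" % exc)
        return CRITICAL

    problems = []
    for line in raw.splitlines():
        parts = line.split()
        if len(parts) < 2 or not parts[-1].isdigit():
            continue  # skip anything that doesn't look like "name value"
        name, value = parts[0], int(parts[-1])
        if "dropped" in name and value >= 10:
            problems.append("%s=%d" % (name, value))
        elif "queue" in name and value >= 100:
            problems.append("%s=%d" % (name, value))

    if problems:
        print("CRITICAL: " + ", ".join(problems))
        return CRITICAL
    print("OK: statsrelay counters within thresholds")
    return OK

if __name__ == "__main__":
    sys.exit(main())
```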

@drawks Uber's setup (at least, as far as I know) does not do any automated replication... @sloppyfocus wrote some scripts that use the stathasher binary to figure out which files need to be copied around when reorganizing the ring, but it still feels like a very manual process.
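
To make that process a bit more concrete, here is a rough sketch of the planning step. The stathasher call is deliberately left as a placeholder since its exact command line and output aren't covered in this thread, and the whisper path layout is an assumption:

```python
# Illustrative rebalance planner: figure out which metrics (and therefore
# which whisper files) map to a different carbon-cache host under a new
# ring config and would need to be copied before cutover.

def backend_for(metric, config_path):
    """Placeholder for statsrelay's stathasher binary: given a metric name
    and a ring config, return the backend it hashes to. The real invocation
    and output format are not shown here."""
    raise NotImplementedError

def plan_moves(metrics, old_config, new_config):
    moves = []
    for metric in metrics:
        old = backend_for(metric, old_config)
        new = backend_for(metric, new_config)
        if old != new:
            # Assumes the usual whisper layout where dots become directories.
            wsp_path = metric.replace(".", "/") + ".wsp"
            moves.append((wsp_path, old, new))
    return moves
```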

I like the idea of replica shards. It's not something we'd really talked about, as we expected to eventually solve this by replacing carbon with something better (probably by writing something new).
