Controlled Deployment

Overview of Controlled Deployment

Your etch configuration repository (/etc/etchserver) is typically a checkout from a version control system (Subversion, CVS, etc.) For convenience, you'd generally like to have a regular cron job update that working copy, so that anyone authorized to make changes to your etch configuration can commit their changes to your version control system and have them propagate to your etch clients.
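
For example, assuming /etc/etchserver is a Subversion working copy, a crontab entry along these lines (the interval and paths are illustrative) would keep it current:

    # Hypothetical cron entry on the etch server: refresh the working
    # copy from version control every 15 minutes.
    */15 * * * * cd /etc/etchserver && svn update -q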

However, that has some risks. If, despite your extensive code review, QA, and change management processes, someone manages to commit a bad change, you'd rather not have it take effect everywhere right away. (The extensive code review, QA, and change management processes are probably still on your wish list, which just makes controlled deployment all the more important.)

What we suggest is that you tag your repository on a regular basis (generally hourly), identify sections of your environment that are less critical than others, and use the nodetagger script to assign incoming clients to appropriate tags, such that changes committed to your version control system are deployed across your environment in a controlled fashion over the course of several hours. This lets you use portions of your environment as a last-ditch QA system, hopefully giving you time to notice a bad change, stop it, and revert. And if that level of nonchalance sounds crazy to you, you can use this same mechanism to implement a much more formal release process.

If you have a reasonably large environment with development, integration, QA, staging, and production systems you could use a schedule like the following (a sketch of the corresponding tag arithmetic appears after the list):

  • 0 hours: Development and Integration
  • 1 hour: QA
  • 2 hours: Staging
  • 3 hours: A portion of production
  • 4 hours: Remaining hosts
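
As a concrete sketch of the tag arithmetic this schedule implies (GNU date syntax; the tag naming follows the etchautotag-YYYYMMDD-HH00 convention described under Implementation):

    # Illustrative: a QA host (1 hour delay) connecting at 14:05 on
    # 2014-02-11 should receive the 13:00 tag.
    delay=1
    date -d "2014-02-11 14:05 ${delay} hours ago" +etchautotag-%Y%m%d-%H00
    # -> etchautotag-20140211-1300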

The set of production hosts at hour 3 should be chosen such that the loss of those systems is an anticipated failure in your failover/BCP strategy. For example, if you have production hosts in multiple data centers and do global load balancing across those data centers, then you might use one whole production data center. The loss of a single data center is probably expected under your failover strategy, with your other data center(s) sized to take over that traffic. So the loss of serving from that data center due to a bad system configuration, while something you'd really like to avoid, is a good "canary in the coal mine".

Obviously the right schedule depends greatly on your environment. For example, if you don't have a 24x7 operations center this schedule might not give you enough time to notice problems: if someone commits a change from home at 9pm it might not start having an effect until 10 or 11pm, and if nobody is awake to notice then this whole setup won't do you any good. In that case you might need to spread the schedule out to 10 or 12 hours from start to finish. Or, if your operation has a change management process, you might need to require a human review before changes are released to production, or even from one pre-production environment to the next.

Implementation

First, you need something tagging your repository hourly so that you have distinct copies of the repository that you can assign to clients. These can be real tags in your version control system, or just copies of the repository on the etch server(s). We actually recommend the latter, as making real tags in your version control system introduces a lot of change data with very little value, obscuring real changes by real people. The server/configs/repo_update script does this, automatically making hourly copies of the repository in /etc/etchserver/tags named like etchautotag-YYYYMMDD-HH00.
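
A minimal sketch of such an hourly job, assuming the working copy lives at /etc/etchserver/trunk and rsync is available (the shipped server/configs/repo_update script is the real implementation; this just illustrates the idea):

    #!/bin/sh
    # Illustrative hourly tagging job, not the shipped repo_update script.
    set -e
    cd /etc/etchserver/trunk
    svn update -q
    tag="etchautotag-$(date +%Y%m%d-%H00)"
    mkdir -p /etc/etchserver/tags
    rsync -a --exclude .svn /etc/etchserver/trunk/ "/etc/etchserver/tags/${tag}/"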

Then you need to assign incoming clients to one of these tags. The hook etch provides for this is the /etc/etchserver/nodetagger script. When a client connects it can supply a tag that it would like to receive; otherwise the server runs the nodetagger script. The script is passed the client's hostname as the first (and only) command line argument, and is expected to return a relative path under /etc/etchserver containing the configuration repository to use for that client.

The distribution comes with a nodetagger script in the etchserver-samples/nventory/ directory which implements this scheme. That script calls out to nVentory to determine which environment a node is in and to implement other features, so if you are not using nVentory you'll need to modify it to work in your environment. It checks whether the client has been assigned a static tag via the config_mgmt_tag field in nVentory, and otherwise calculates the autotag for the client based on the above schedule. Tags are assumed to be good, but can be marked bad or badfwd in the tagstate file that lives at the root of the repository. (An example tagstate file comes in the same directory.)
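
If you're not using nVentory, a nodetagger can be quite small. Here is a hedged sketch of the contract, with a placeholder hostname-based environment lookup and a simplified tagstate check (the tagstate path and its one-entry-per-line format are assumptions; consult the shipped sample and example file for the real behavior):

    #!/bin/sh
    # Minimal illustrative nodetagger, not the shipped nVentory sample.
    # The client's hostname arrives as $1; the path printed on stdout is
    # taken relative to /etc/etchserver.
    hostname="$1"

    # Placeholder environment-to-delay mapping based on hostname patterns;
    # replace with a lookup against your inventory system.
    case "$hostname" in
      dev-*|int-*)  delay=0 ;;
      qa-*)         delay=1 ;;
      stage-*)      delay=2 ;;
      prodcanary-*) delay=3 ;;
      *)            delay=4 ;;
    esac

    tag="etchautotag-$(date -d "${delay} hours ago" +%Y%m%d-%H00)"

    # Roll back past tags marked bad (tagstate format assumed; badfwd,
    # which rolls clients forward instead, is omitted for brevity).
    while grep -q "^${tag} bad$" /etc/etchserver/tagstate 2>/dev/null; do
      delay=$((delay + 1))
      tag="etchautotag-$(date -d "${delay} hours ago" +%Y%m%d-%H00)"
    done

    echo "tags/${tag}"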

If you notice that you or someone else has made a bad commit, take the following steps:

  • Activate the killswitch (see Killswitch below)
  • Mark the associated tag(s) as bad or badfwd in tagstate
  • Commit a fix
  • Remove the killswitch file

Clients that would have gotten the bad tag will be rolled back to the last non-bad tag until a new non-bad tag comes along, or forward to the next non-bad tag in the case of badfwd.

An explanation of 'bad' vs. 'badfwd' is probably in order. The rollback behavior associated with marking a tag as 'bad' doesn't work well when the bad change is a new file: the last non-bad tag doesn't contain any configuration for that file, so the bad configuration remains in place until the fixed configuration catches up, which could take 5 hours. In those cases you can mark the bad tags with 'badfwd' to have clients roll forward to the next non-bad tag, so they pick up the fixed configuration right away.

Here's a scenario to explain why you want to be able to mark tags as bad. Imagine you check in a bad change at 0800. Around 1000 you notice that your dev and QA environments are broken and commit a fix. That fix ends up in the 1100 tag. However, staging and production are still going to get the 0800, 0900, and 1000 tags before they get to your 1100 tag with the fix. You need a way to tell the system to skip over those bad tags. If you mark 0800, 0900, and 1000 as bad then dev and QA will revert to 0700 (the last non-bad tag), and staging and production will hold at 0700. Then the 1100 tag will work its way through the environments as usual. Disaster averted.
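
Assuming the one-entry-per-line tagstate format sketched above (the example tagstate file in etchserver-samples/nventory/ shows the real syntax), the fix in this scenario would be to commit entries like:

    # Hypothetical tagstate entries skipping the three tags that carry
    # the bad change (use badfwd instead if rolling back would leave a
    # bad new file in place):
    etchautotag-20140211-0800 bad
    etchautotag-20140211-0900 bad
    etchautotag-20140211-1000 bad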

Killswitch

If you need to stop all etch activity you can create /etc/etchserver/killswitch on your etch server(s). If that file exists then the etch server will send its contents to the client and abort. The contents are optional, but can be used to specify a message about why the killswitch has been activated.
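
For example, on the etch server (the message text is up to you):

    # Freeze all etch activity, telling clients why.
    echo "etch disabled: reverting a bad commit, contact ops" > /etc/etchserver/killswitch

    # Resume normal operation once the fix is committed.
    rm /etc/etchserver/killswitch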