
Managing Amazon CloudFront distribution

pkra edited this page May 15, 2012 · 2 revisions

This is now archival.

Key Concepts

CloudFront is the part of Amazon Web Services for CDN hosting. One defines a distribution, which consists of an origin: the source of the data that is to be mirrored through the network of CloudFront servers. The origin can either be data hosted by the AWS S3 service, or it can be your own web server, which is a "custom origin" for CloudFront purposes. Each distribution is assigned a domain name like d3eoax9i5htok0.cloudfront.net (our current distribution), but a DNS CNAME can be assigned so that it answers to something nice like cdn.mathjax.org.

When a request to cdn.mathjax.org comes in from a user agent, CloudFront routes it to a nearby server in the CloudFront network. The mirror servers in the network periodically check back with the data origin, and if the content cached on a mirror is out of date, the mirror refreshes it from the origin. By default it takes 1 day for a change on the origin server to propagate through the CDN, but this can be increased or decreased using Expires and Cache-Control HTTP headers.

Logs for CloudFront distributions must be stored on S3. Consequently, to set up the MathJax CDN, we also had to set up an S3 account. S3 accounts work by defining "buckets", which are like virtual file systems. We have a single mathjax-logs bucket, and the logs go there, one per hour. TODO: devise a process to process and delete old logs, so they don't pile up and start costing real money.
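
One possible starting point for the log-cleanup TODO is sketched below in Python (not the language of the tooling on this page). It only decides which log objects are old enough to delete; listing and deleting objects on S3 is left out. It assumes log object names embed a date and hour in the form <prefix><distribution-id>.YYYY-MM-DD-HH.<unique-id>.gz, which should be checked against the real bucket contents. The function name expired_log_keys and the 30-day default are hypothetical.

```python
import re
from datetime import datetime, timedelta

# Assumed key layout: <prefix><distribution-id>.YYYY-MM-DD-HH.<unique-id>.gz
LOG_KEY = re.compile(r"\.(\d{4}-\d{2}-\d{2}-\d{2})\.[^.]+\.gz$")

def expired_log_keys(keys, now, keep_days=30):
    """Return the keys whose embedded timestamp is older than keep_days."""
    cutoff = now - timedelta(days=keep_days)
    expired = []
    for key in keys:
        match = LOG_KEY.search(key)
        if match is None:
            continue  # not a log name we recognise; leave it alone
        stamp = datetime.strptime(match.group(1), "%Y-%m-%d-%H")
        if stamp < cutoff:
            expired.append(key)
    return expired
```

Anything that does not match the assumed naming pattern is never selected, so an unexpected file in the bucket is kept rather than deleted.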

Each CloudFront distribution is set up either through the AWS Management Console (a web interface) or via a management script. When we set ours up, the AWS web console didn't yet facilitate creating distributions with custom origins.

This page documents how we set up our current distribution. We used cfcurl.pl - a wrapper for curl that handles authentication for Amazon Web Services.

Other management tools are available, including:

  • cfcmd - a Java based command-line tool
  • CloudBerry S3 Explorer - A freeware Windows program with GUI
  • Cyberduck - An open-source program for OS X or Windows. Managing CloudFront is possible, but a bit obscure.

Installing cfcurl.pl

Instructions for the utility are found at http://aws.amazon.com/code/developertools/1878.

Download

There is a Download link on the page. If you click on this it will probably display the script in the browser. Save the script to a file called cfcurl.pl. Make sure the file is executable (chmod a+x, etc).

Prerequisites

cfcurl.pl has several prerequisite Perl modules, which are listed on the instruction page. If these are not already on your system they will need to be installed.

Authentication

You will need your Access Key ID and Secret Access Key. These can be found by signing into your account and navigating to Security Credentials. Create a file in your home directory called .aws-secrets. This file must be readable only by you (chmod 600). The contents of the file should be something like:

%awsSecretAccessKeys = (
    # primary account
    'primary' => { # this is the key-friendly-name
        id => '1ME55KNV6SBTR7EXG0R2', # change to the Access Key ID
        key => 'zyMrlZUKeG9UcYpwzlPko/+Ciu0K2co0duRM3fhi', # change to the Secret Access Key
    },
);
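
Since the secrets file must be readable only by you, a quick check like the following can confirm the permissions before running any commands. This is a minimal Python sketch for POSIX systems; the helper name is_private is hypothetical and is not part of cfcurl.pl.

```python
import os
import stat

def is_private(path):
    """True if no group/other permission bits are set (e.g. mode 600)."""
    mode = stat.S_IMODE(os.stat(path).st_mode)
    return mode & 0o077 == 0
```

For example, is_private(os.path.expanduser("~/.aws-secrets")) should return True after chmod 600.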

You are now ready to manage distributions. A few cfcurl.pl usage examples are printed when you call the script with no arguments.

Creating the Distribution

Note: This can now be done from the AWS Web Console. Follow the instructions in the announcement.

These instructions are based on the Cloudfront POST Distribution page. We created a file called create_request.xml, with this configuration info:

<?xml version="1.0" encoding="UTF-8"?>
<DistributionConfig xmlns="http://cloudfront.amazonaws.com/doc/2010-11-01/">
   <CustomOrigin>
      <DNSName>dist.mathjax.org</DNSName>
      <HTTPPort>80</HTTPPort>
      <OriginProtocolPolicy>http-only</OriginProtocolPolicy>
   </CustomOrigin>
   <CallerReference>your unique caller reference</CallerReference>
   <CNAME>cdn.mathjax.org</CNAME>
   <Comment>My comments</Comment>
   <Enabled>true</Enabled>
   <DefaultRootObject>index.html</DefaultRootObject>
   <Logging>
      <Bucket>mathjax-log.s3.amazonaws.com</Bucket>
      <Prefix>mathjax/</Prefix>
   </Logging>
</DistributionConfig>

Run the following from the command-line:

cfcurl.pl --keyname primary -- -X POST -H "Content-Type: text/xml; charset=UTF-8" --upload-file create_request.xml https://cloudfront.amazonaws.com/2010-11-01/distribution > create_response.xml

Check that create_response.xml indicates successful creation, or use the web console. Both of these will also indicate the DomainName that can be used to access the distribution.
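
To pull the interesting fields out of create_response.xml without reading the XML by eye, something like the Python sketch below may help. It assumes the response contains Id, Status and DomainName elements somewhere in the tree and ignores XML namespaces; the function names are hypothetical, and the exact response layout should be checked against the CloudFront documentation.

```python
import xml.etree.ElementTree as ET

def local(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]

def distribution_summary(xml_text):
    """Return the first Id, Status and DomainName found in the response."""
    wanted = {"Id", "Status", "DomainName"}
    summary = {}
    for element in ET.fromstring(xml_text).iter():
        name = local(element.tag)
        if name in wanted and name not in summary:
            summary[name] = (element.text or "").strip()
    return summary
```

If the returned dictionary has no DomainName entry, the request most likely failed and the raw response should be inspected.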

DNS Entries

To set up DNS entries that forward requests from cdn.mathjax.org to the CloudFront distribution, we added a CNAME record for our mathjax.org host.

cdn.mathjax.org CNAME d3eoax9i5htok0.cloudfront.net

It was also necessary to use the AWS console to edit the distribution info, to include the cdn.mathjax.org CNAME there as well.

Content Distribution

For more details see the Cloudfront documentation on Distribution of New Content, Object Expiration and Object Eviction.

TODO: info on HTTP headers - which ones impact on Cloudfront behavior; which ones are set by CF.

CloudFront relies on a pull paradigm to distribute files to edge-locations (proxies). Files will not be requested from the origin server until a client-request has been received by the proxy.

Additionally, proxies do not cache the responses from conditional requests (requests that contain headers such as If-Modified-Since, If-Unmodified-Since, If-Match, If-None-Match, or If-Range), although they will respond to conditional requests with cached files if available. If a conditional request arrives for an uncached file, the proxy will forward the request to the origin server and deliver the response to the client without caching it.

Proxies honor the HTTP Expires and Cache-Control headers. When a file in the cache is stale (and a client-request has been received) the proxy will check with the origin server whether the file is up-to-date. If the file has not been modified then the proxy will mark its cached version as fresh. In this case no HTTP headers are updated, even if the origin server specifies different headers. (Don't know if this behavior conforms to the standard for HTTP.)

Proxies also honor the s-maxage setting in Cache-Control headers. This allows a different max-age to be set for proxy caches.
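
The precedence between the two directives can be illustrated with a small Python sketch. It models only what is described here: s-maxage overrides max-age for a shared cache, and the one-day default applies when neither is present. The Expires header and other directives are deliberately ignored, and the function name shared_cache_ttl is hypothetical.

```python
DEFAULT_TTL = 24 * 60 * 60  # the one-day default mentioned under Key Concepts

def shared_cache_ttl(cache_control, default=DEFAULT_TTL):
    """Seconds a shared cache would treat a response as fresh."""
    directives = {}
    for part in cache_control.split(","):
        name, _, value = part.strip().partition("=")
        directives[name.lower()] = value.strip()
    for name in ("s-maxage", "max-age"):  # s-maxage wins for shared caches
        if directives.get(name, "").isdigit():
            return int(directives[name])
    return default
```

With "public, max-age=3600, s-maxage=600" a browser would keep the file for an hour while the proxies would recheck it after ten minutes.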

NOTE: Undocumented behavior (this explanation could be wrong). When a client-request arrives for a file that is in the cache but is stale, proxies respond to the request with the stale file before going back to the origin server to check if the file is modified. The response to the client is tagged with the HTTP header X-Cache: Refresh-Hit from Cloudfront.

Infrequently requested files may be evicted from the proxy-cache to make space for more popular files.

Implications

  • After a directory has been updated on the origin-server, it is possible for there to be a mix of old and new content at the proxies. If the max-age of content is one day then this mix will last for up to one day, and potentially longer if there are hardly any client-requests to trigger a file refresh. (This condition also applies to files in browser-cache.)
  • When HTTP headers are updated, the proxies will not pick up the modifications unless the files also have their Last-Modified time-stamp updated (as though the file contents had been updated).
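
For the second point, one way to force a new Last-Modified value is to bump the modification time of the files on the origin server. The Python sketch below does this for a whole directory tree. It assumes the origin web server derives Last-Modified from the file modification time, which is common for static file serving but should be verified for the server in use; the function name touch_tree is hypothetical.

```python
import os
import time

def touch_tree(root, when=None):
    """Set the modification time of every file under root; return the count."""
    when = time.time() if when is None else when
    count = 0
    for dirpath, dirnames, filenames in os.walk(root):
        for filename in filenames:
            os.utime(os.path.join(dirpath, filename), (when, when))
            count += 1
    return count
```

Calling touch_tree("/path/to/mathjax") after changing the header configuration would make every file look modified, so the proxies fetch it again, headers included, once their cached copy goes stale.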

Purging files

Note: Purging is only recommended for unexpected circumstances.

These instructions are based on the Cloudfront Object Invalidation and POST Invalidation pages.

To purge the latest release from edge servers we create a file called purge_request.xml, with contents like:

<?xml version="1.0" encoding="UTF-8"?>
<InvalidationBatch>
        <CallerReference>20110316141307</CallerReference>
        <Path>/mathjax/latest/COPYING.txt</Path>
        <Path>/mathjax/latest/MathJax.js</Path>
        ...
</InvalidationBatch>

This has a <Path> entry for every file that needs to be purged, and requires a unique <CallerReference> each time a purge request is made.

Run the following from the command-line:

cfcurl.pl --keyname primary -- -X POST -H "Content-Type: text/xml; charset=UTF-8" --upload-file purge_request.xml https://cloudfront.amazonaws.com/2010-11-01/distribution/EYNHAAPB3O40Q/invalidation > purge_response.xml

Check that purge_response.xml indicates a successful request. If the keyname or distribution-id is wrong this might result in an Access Denied error. If the CallerReference is a duplicate of a previous purge request, the response will report the status of that earlier request as completed rather than starting a new purge. For this reason it is suggested to use a time-stamp for this field.
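
The check can be scripted along the following lines. This Python sketch assumes that a failed request contains an Error element with a Code child, and that a successful one contains a Status element; both assumptions, and the function name purge_outcome, are illustrative and should be compared with real responses.

```python
import xml.etree.ElementTree as ET

def purge_outcome(xml_text):
    """Return ('error', code) or ('ok', status) for an invalidation response."""
    fields = {}
    for element in ET.fromstring(xml_text).iter():
        name = element.tag.rsplit("}", 1)[-1]  # drop any '{namespace}' prefix
        fields.setdefault(name, (element.text or "").strip())
    if "Error" in fields:
        return ("error", fields.get("Code", "unknown"))
    return ("ok", fields.get("Status", "unknown"))
```

A status of completed immediately after submitting a request is a hint that the CallerReference was reused.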

Generating a Purge Request file

The MathJax distribution has around 30,000 files, so creating a purge request by hand would be tedious. Additionally, purge requests are limited to 1000 objects. However the fonts/ directory changes very rarely, so excluding it from purge requests would seem reasonable and brings the file count to just below 1000.
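
If the file list ever exceeds the limit (for example when fonts/ does need purging), the paths would have to be split across several requests, each with its own unique CallerReference. A minimal Python sketch of the splitting step, with a hypothetical function name:

```python
def batches(paths, size=1000):
    """Split paths into consecutive lists of at most `size` entries."""
    paths = list(paths)
    return [paths[i:i + size] for i in range(0, len(paths), size)]
```

Each returned list would then become the set of <Path> entries in one purge request file.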

The following Perl script can be used to create a purge-request file when given the mathjax release name (latest, 1.1-latest, etc) and the associated source directory on the local machine. The script excludes the fonts/ and .git/ directories if they exist.

#!/usr/bin/perl

use File::Basename;
use File::Find;
use POSIX qw(strftime);

$usage = <<"";
Usage:
$0 mathjax-release-name srcdir
mathjax-release-name is the mathjax sub-directory on cloudfront
srcdir is the associated mathjax source directory on this machine

$VERSION = shift or die $usage;
$SRCDIR = shift or die $usage;
$CALLER_REF = strftime("%Y%m%d%H%M%S", localtime);

($fname, $abs_srcdir) = fileparse("$SRCDIR/");
chop $abs_srcdir;

sub handler {
 my $path = shift;
 print <<"";
 <Path>/mathjax/${VERSION}/${path}</Path>

}

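sub wanted { # Reject non-files, and anything in .git or fonts dirs
 -f or return; # $_ is the file name within the directory find() is visiting
 my $path = $File::Find::name;
 $path =~ s/^\Q${abs_srcdir}\E\///; # make the path relative to srcdir
 $path =~ /^\.git\// and return; # dot escaped so only ".git/" matches
 $path =~ /^fonts\// and return;
 handler($path);
}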

# Generate output
print <<"";
<?xml version="1.0" encoding="UTF-8"?>
<InvalidationBatch>
 <CallerReference>${CALLER_REF}</CallerReference>

find(\&wanted, $abs_srcdir);

print <<"";
</InvalidationBatch>

Call this script with:

./make_purge_request.pl latest /path/to/mathjax > purge_request.xml

to create the purge_request.xml file used previously.

HTTPS usage

Cloudfront supports access via HTTPS (Secure HTTP).

For HTTPS requests Cloudfront also uses HTTPS to request the file from the origin server. This means that the origin server must support HTTPS requests. NOTE: The Cloudfront documentation must be wrong on this point, because the MathJax origin server is dist.mathjax.org, which isn't properly configured for HTTPS (e.g. https://dist.mathjax.org/mathjax/latest/test/sample.html), yet accessing the Cloudfront site seems okay (e.g. https://d3eoax9i5htok0.cloudfront.net/mathjax/latest/test/sample.html).

Cloudfront does NOT support CNAME with HTTPS (e.g. https://cdn.mathjax.org/mathjax/latest/test/sample.html). Perhaps Amazon is waiting for Windows XP usage to decline, as Internet Explorer doesn't support this on platforms older than Vista. The MathJax CDN may be accessed via HTTPS using the Cloudfront domain-name (see preceding paragraph).
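
Pages that need HTTPS therefore have to reference the Cloudfront domain-name instead of the CNAME. The Python sketch below shows the rewrite; the two host names are taken from this page, while the function name https_url is hypothetical.

```python
from urllib.parse import urlsplit, urlunsplit

CNAME_HOST = "cdn.mathjax.org"
CLOUDFRONT_HOST = "d3eoax9i5htok0.cloudfront.net"

def https_url(url):
    """Rewrite a CDN URL so that it can be fetched over HTTPS."""
    parts = urlsplit(url)
    # Only the CNAME is swapped; any other host is kept as it is.
    host = CLOUDFRONT_HOST if parts.netloc == CNAME_HOST else parts.netloc
    return urlunsplit(("https", host, parts.path, parts.query, parts.fragment))
```

URLs on other hosts only have their scheme changed to https.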
