
Energy Sparks data pipeline

Data can be imported into Energy Sparks via files sent to specific email addresses. Files attached to, or linked from, emails can be csv files, xls or xlsx spreadsheets, or zip files containing one or more files of these types.

The purpose of the pipeline is to convert these files to csv format, to be ultimately added to the AMR data bucket, prefixed by the local part of the email address they were sent to. An overnight batch job takes data from the AMR data bucket and imports it into Energy Sparks.

Buckets and functions

The functions process files from the inbox bucket until they have been converted to csv in the AMR data bucket:

bucket: es-[env]-data-inbox
    |
    v
function: unpack attachments
    |
    v
bucket: es-[env]-data-process <---------------------------------+
    |                                                            |
    v                                                            |
function: process file                                           |
    |                                                            |
    +-- csv files ------->  bucket: es-[env]-data-amr-data       |
    |                                                            |
    +-- zip files ------->  bucket: es-[env]-data-uncompressed   |
    |                           |                                |
    |                           v                                |
    |                       function: uncompress file -----------+
    |                                                            |
    +-- xls/xlsx files -->  bucket: es-[env]-data-spreadsheet    |
                                |                                |
                                v                                |
                            function: convert file --------------+

There are a series of buckets that files move through during processing. An AWS Lambda function is triggered by the addition of a file to a bucket; it processes the file and moves it on to another bucket, which in turn triggers further lambdas where required.
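For orientation, an S3-triggered handler typically looks something like the sketch below. This is illustrative only, not the pipeline's actual code; the destination bucket name is a stand-in and the aws-sdk-s3 gem is assumed:

require 'aws-sdk-s3'
require 'cgi'

# Illustrative sketch of an S3-triggered Lambda handler: copy each new
# object to the next bucket in the pipeline, then delete the original.
def handler(event:, context:)
  s3 = Aws::S3::Client.new

  event['Records'].each do |record|
    bucket = record['s3']['bucket']['name']
    # object keys arrive URL-encoded in S3 event notifications
    key = CGI.unescape(record['s3']['object']['key'])

    s3.copy_object(
      bucket: 'es-development-data-process', # stand-in destination
      copy_source: "#{bucket}/#{key}",
      key: key
    )
    s3.delete_object(bucket: bucket, key: key)
  end
end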

The buckets are as follows; there is a set of these for each of the development, test and production environments:

es-[env]-data-inbox

Written to by the SES email ruleset. Contains MIME-formatted files. Triggers running the "unpack attachments" function, which will take attachments and put them in the process bucket.

es-[env]-data-process

Contains files from emails that have been unpacked by the previous function. Triggers running the "process file" function, which will put csv files in the AMR data bucket, zip files in the uncompressed bucket or spreadsheets in the spreadsheet bucket. Any unrecognised files are put in the unprocessable bucket.
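Conceptually the routing is by file type, along the lines of the sketch below (bucket names assumed; the real function may inspect file contents rather than just extensions):

# Map file extensions to their destination buckets (names assumed)
DESTINATIONS = {
  '.csv'  => 'es-development-data-amr-data',
  '.zip'  => 'es-development-data-uncompressed',
  '.xls'  => 'es-development-data-spreadsheet',
  '.xlsx' => 'es-development-data-spreadsheet'
}.freeze

# Unrecognised file types fall through to the unprocessable bucket
def destination_bucket(key)
  DESTINATIONS.fetch(File.extname(key).downcase, 'es-development-data-unprocessable')
end

destination_bucket('sheffield/import.csv') # => "es-development-data-amr-data"
destination_bucket('sheffield/photo.png')  # => "es-development-data-unprocessable"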

es-[env]-data-uncompressed

Contains files that need to be unzipped. Triggers the "uncompress file" function, which unzips files and puts them in the process bucket. Unrecognised files are put in the unprocessable bucket.
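A sketch of the unzip step, assuming the rubyzip and aws-sdk-s3 gems (the bucket name is a stand-in, not the actual implementation):

require 'zip'
require 'aws-sdk-s3'

# Illustrative only: read a zip object from S3 and write each entry
# back to the process bucket, keeping the email-address prefix.
def uncompress(bucket:, key:)
  s3 = Aws::S3::Client.new
  body = s3.get_object(bucket: bucket, key: key).body.read

  Zip::File.open_buffer(body) do |zip|
    zip.each do |entry|
      next if entry.directory?
      s3.put_object(
        bucket: 'es-development-data-process', # stand-in destination
        key: "#{File.dirname(key)}/#{entry.name}",
        body: entry.get_input_stream.read
      )
    end
  end
end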

es-[env]-data-spreadsheet

Contains spreadsheets that need to be converted to csv. Triggers the "convert file" function which converts xls and xlsx spreadsheet files to csv and puts them in the process bucket. Unrecognised files are put in the unprocessable bucket.
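The conversion could be done with something like the roo gem; this is an assumption about the approach, not the actual implementation (legacy .xls files would additionally need roo-xls):

require 'roo'

# Illustrative only: return the default sheet of a spreadsheet as a
# csv string, ready to be written onwards.
def spreadsheet_to_csv(path)
  Roo::Spreadsheet.open(path).to_csv
end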

es-[env]-data-unprocessable

Contains files that cannot be processed, e.g. unknown formats, zips that couldn't be parsed, or spreadsheets that couldn't be converted to csv.

es-[env]-data-amr-data

Contains CSV files ready for processing by the overnight batch job. Within this bucket, folders named archive-* contain archived versions of processed files.

Development and testing

Run bundle install to install the required gems locally.

Run bundle exec rspec to run the test suite. The tests stub calls to the S3 service to monitor requests made and to fake responses.
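The aws-sdk client supports this directly; a minimal sketch of the pattern (not taken from the actual specs):

require 'aws-sdk-s3'

RSpec.describe 'stubbing S3' do
  it 'fakes responses and records the requests made' do
    client = Aws::S3::Client.new(stub_responses: true)
    client.stub_responses(:get_object, body: "date,reading\n")

    object = client.get_object(bucket: 'es-test-data-process', key: 'sheffield/import.csv')

    expect(object.body.read).to eq("date,reading\n")
    expect(client.api_requests.first[:operation_name]).to eq(:get_object)
  end
end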

Run bundle exec guard to run tests automatically as files change.

Serverless

The setup of the buckets, lambdas and associated permissions is managed by the serverless framework which creates and updates a CloudFormation stack on AWS.

Serverless allows us to set a 'stage' and run multiple environments (e.g. test, production).

Serverless automatically creates the S3 buckets that are directly attached to lambda functions in the functions: definitions. S3 buckets that are not directly attached to lambda functions are specified in the resources: section along with an S3 policy that allows SES to add to the inbox bucket.
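For orientation, the shape of serverless.yml is roughly as follows; this is an illustrative fragment with assumed handler and resource names, not the real file:

functions:
  processFile:
    handler: handler.process_file
    events:
      - s3:
          bucket: es-${sls:stage}-data-process
          event: s3:ObjectCreated:*

resources:
  Resources:
    AmrDataBucket:
      Type: AWS::S3::Bucket
      Properties:
        BucketName: es-${sls:stage}-data-amr-data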

The following instructions assume you are working from the project root directory. Note that the region is set manually in the serverless.yml file, so deploying to a different region would require a change to the configuration.

Deployment configuration

Install serverless using homebrew (brew install serverless) or npm. We are using serverless v3, which requires a Node.js version greater than 10.

Serverless plugins need to be installed with npm install.

Add the serverless AWS credentials to a profile called serverless in your ~/.aws/credentials file (these credentials can be found in a document titled 'Serverless AWS credentials' in 1password):

[serverless]
aws_access_key_id = YOURKEYHERE123
aws_secret_access_key = YOURSECRETHERE123

The functions log some errors in Rollbar. You need to add the following files:

.env.development
.env.test
.env.production

In each file, add a ROLLBAR_ACCESS_TOKEN environment variable. For test and production, use the same tokens as the equivalent environments in the live main application; for development, use the test environment token.

Make sure you have Docker installed (on macOS, brew install --cask docker, or download the Apple silicon build).

Deployment

As we are using ruby gems with native dependencies (e.g. Nokogiri), we use Docker to build the gems in an environment that matches AWS Lambda. These are then published to a lambda layer used by the functions.

To build the packages with docker but without deploying, run sls package.

If there is a permissions error when running docker, you may have to add your user to the docker group: sudo usermod -aG docker $USER.

Run sls deploy --stage ENVIRONMENT to build the gems with Docker and deploy the pipeline to AWS, e.g. sls deploy --stage development.

Running sls deploy without a --stage option deploys the development stage by default; pass --stage to target another environment, e.g. sls deploy --stage test.

Monitoring

Logs and usage stats can be found via the Monitoring tab on each lambda's page in the AWS console.

Other configuration

Adding a new school area

The email rule for SES is a catch-all and will use the local part of the email address to prefix the S3 object key. e.g. a file called import.csv sent to sheffield@test.com will have the S3 key sheffield/import.csv. Changes will need to be made to the main application to process files from previously unseen prefixes.
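In other words, the key is derived along these lines (an illustrative helper, not part of the pipeline code):

# The local part of the recipient address becomes the S3 key prefix
def s3_key(recipient, filename)
  "#{recipient.split('@').first}/#{filename}"
end

s3_key('sheffield@test.com', 'import.csv') # => "sheffield/import.csv"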

Adding a stage

To start receiving emails for a new stage, a new SES rule will have to be added to move the email to the es-STAGE-data-pipeline-inbox bucket.

File expiry

File expiry is managed manually through the S3 web interface and will need setting up for new buckets. This is done with lifecycle rules, configurable via the Management tab for the bucket.

Further reading

Deployment of gems with native extensions
