rneher/augur

Augur

Augur is a Python package to track (and eventually forecast) flu evolution. It currently

  • imports public sequence data
  • subsamples, cleans and aligns sequences
  • builds a phylogenetic tree from this data

The program is live on Amazon EC2 with results pushed to Amazon S3. The latest JSON-formatted flu tree is available as tree_streamline.json. This tree is visualized at blab.github.io/auspice/.

Run

You can run augur across platforms using Docker. A public image is available on Docker Hub as trvrb/augur. With this image, you can immediately run augur with

docker pull trvrb/augur
docker run -ti -e "GISAID_USER=$GISAID_USER" -e "GISAID_PASS=$GISAID_PASS" -e "S3_KEY=$S3_KEY" -e "S3_SECRET=$S3_SECRET" -e "S3_BUCKET=$S3_BUCKET" --privileged trvrb/augur

This starts up Supervisor, which keeps augur and its helper programs running. Supervisor uses supervisord.conf as its control file.

To run augur, you will need a GISAID account (to pull sequences) and an Amazon S3 account (to push results). Account information is stored in environment variables:

  • GISAID_USER: GISAID user name
  • GISAID_PASS: GISAID password
  • S3_KEY: Amazon S3 key
  • S3_SECRET: Amazon S3 secret
  • S3_BUCKET: Amazon S3 bucket
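Because a missing variable only shows up once the pipeline tries to log in or upload, it can help to check the environment up front. The following is a minimal sketch, not part of augur: the variable names come from the list above, while the helper `missing_credentials` is hypothetical.

```python
import os

# Variable names are taken from the list above.
REQUIRED_VARS = ("GISAID_USER", "GISAID_PASS", "S3_KEY", "S3_SECRET", "S3_BUCKET")


def missing_credentials(env=None):
    """Return the required variable names that are unset or empty.

    Hypothetical helper for illustration; `env` defaults to os.environ.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling `missing_credentials()` before starting a run returns an empty list when all five variables are set.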

Develop

Full dependency information can be seen in the Dockerfile. To run locally, pull the docker image with

docker pull trvrb/augur

And start up a bash session with

docker run -ti -e "GISAID_USER=$GISAID_USER" -e "GISAID_PASS=$GISAID_PASS" trvrb/augur /bin/bash

From here, the build pipeline can be run with

python augur/run.py

Pipeline notes

Virus ingest, alignment and filtering

Use Selenium to automate downloads from GISAID, which requires login access. User credentials are read from the environment variables GISAID_USER and GISAID_PASS.

Keep viruses with full HA1 sequences, fully specified dates, cell passage and only one sequence per strain name. Subsample to 100 sequences per month over the 3 years before present.
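The deduplication and subsampling steps can be sketched as below. This is an illustration under stated assumptions rather than augur's actual code: the record keys `strain` and `date`, the function name, the random sampling within each month, and the approximation of 3 years as 3 × 365 days are all assumptions. The HA1 and passage filters are not shown.

```python
import random
from collections import defaultdict
from datetime import date, timedelta


def filter_and_subsample(viruses, today, per_month=100, years=3, seed=0):
    """Keep one record per strain, then at most `per_month` records per month.

    `viruses` is a list of dicts with hypothetical keys 'strain' and
    'date' (a datetime.date, or None when the date is not fully specified).
    """
    rng = random.Random(seed)
    cutoff = today - timedelta(days=365 * years)  # approximate 3-year window
    seen = set()
    by_month = defaultdict(list)
    for virus in viruses:
        d = virus.get("date")
        if d is None or not (cutoff <= d <= today):
            continue  # drop incomplete or out-of-window dates
        if virus["strain"] in seen:
            continue  # only one sequence per strain name
        seen.add(virus["strain"])
        by_month[(d.year, d.month)].append(virus)

    kept = []
    for month in sorted(by_month):
        group = by_month[month]
        if len(group) > per_month:
            group = rng.sample(group, per_month)
        kept.extend(group)
    return kept
```

A fixed `seed` makes the subsample reproducible between runs; whether augur does this is not stated in these notes.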

Align sequences with mafft. Testing showed a much lower memory footprint than muscle.

Keep only sequences that have the full 1701 bases of HA in the alignment.
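One way to read this filter is sketched below. It assumes that "full" means exactly 1701 non-gap characters in the aligned sequence and that gaps are written as `-`; both the interpretation and the function name `has_full_ha` are assumptions, not taken from augur.

```python
HA_LENGTH = 1701  # number of HA bases, from the text above


def has_full_ha(aligned_seq, length=HA_LENGTH):
    """Return True if the aligned sequence contains exactly `length` non-gap characters.

    Assumes gaps are represented by '-'.
    """
    return len(aligned_seq.replace("-", "")) == length
```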

Tree processing

Use FastTree to build a starting tree. FastTree will build a tree for ~5000 sequences in a few minutes. Then use RAxML to refine this initial tree. A full RAxML run on a tree with ~5000 sequences could take days or weeks, so RAxML is instead run for a fixed 1 hour and the best tree found during this search is kept. The result will always improve on the FastTree starting tree.
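The fixed time budget can be sketched with the standard library alone. The helper `run_with_time_budget` is hypothetical, and the RAxML command line is deliberately left out; how augur picks the best tree from a stopped search is not covered here.

```python
import subprocess


def run_with_time_budget(cmd, seconds):
    """Run `cmd` (a list of arguments) and stop it after `seconds`.

    Returns True if the process finished on its own, False if it was stopped.
    Hypothetical helper; the caller supplies the tree-search command.
    """
    proc = subprocess.Popen(cmd)
    try:
        proc.wait(timeout=seconds)
        return True
    except subprocess.TimeoutExpired:
        proc.terminate()  # ask the search to stop once the budget is spent
        proc.wait()       # reap the process before returning
        return False
```

For the 1 hour budget described above, `seconds` would be 3600.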

Reroot the tree based on outgroup strain, collapse nodes with zero-length branches and ladderize the tree.
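Collapsing and ladderizing can be illustrated on a toy tree of nested dicts. This sketch does not use augur's tree classes: the `children` and `length` keys and all function names are assumptions, rerooting is not shown, and sorting children by ascending tip count is only one possible ladderize convention.

```python
def collapse_zero_length(node):
    """Attach the children of zero-length internal branches to their parent, in place."""
    new_children = []
    for child in node.get("children", []):
        collapse_zero_length(child)
        if child.get("children") and child.get("length", 0.0) == 0.0:
            # The branch has zero length, so root-to-tip distances are unchanged.
            new_children.extend(child["children"])
        else:
            new_children.append(child)
    if new_children:
        node["children"] = new_children
    return node


def count_tips(node):
    """Count the leaves below `node` (a leaf counts as 1)."""
    children = node.get("children", [])
    return 1 if not children else sum(count_tips(c) for c in children)


def ladderize(node):
    """Sort the children of every node by ascending tip count, in place."""
    for child in node.get("children", []):
        ladderize(child)
    if node.get("children"):
        node["children"].sort(key=count_tips)
    return node
```

Zero-length tips are kept as they are; only internal nodes are merged into their parent.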
