AgrVATE

Agr Variant Assessment & Typing Engine

AgrVATE is a tool that rapidly identifies the Staphylococcus aureus agr locus type and reports possible variants in the agr operon.

WORKFLOW:

AgrVATE accepts a S. aureus genome assembly as input and performs a kmer search using an Agr-group specific kmer database to assign the Agr-group. The agr operon is then extracted using in-silico PCR and variants are called using an Agr-group specific reference operon.

Citation

Please cite the following paper if you use AgrVATE in your research. Thank you!

Raghuram V, Alexander AM, Loo HQ, Petit RA 3rd, Goldberg JB, Read TD. Species-Wide Phylogenomics of the Staphylococcus aureus Agr Operon Revealed Convergent Evolution of Frameshift Mutations. Microbiol Spectr. 2022 Jan 19;10(1):e0133421. doi: 10.1128/spectrum.01334-21. Epub ahead of print. PMID: 35044202; PMCID: PMC8768832.

INSTALLATION:

Please see the PREREQUISITES section for all the tools required to run AgrVATE. For ease of use, I recommend installing AgrVATE using Conda.

conda create -n agrvate -c bioconda agrvate
conda activate agrvate

This will install all necessary dependencies EXCEPT Usearch. Due to Usearch's license, it cannot be provided with the conda installation. Please download and extract usearch11.0.667 (osx32 or linux32) from the USEARCH download page at drive5.com and add it to your PATH.

For example (Use the version appropriate for your operating system):

curl "https://www.drive5.com/downloads/usearch11.0.667_i86linux32.gz" --output usearch11.0.667_i86linux32.gz #Downloads usearch binary

gunzip usearch11.0.667_i86linux32.gz #Decompresses usearch binary

chmod 755 usearch11.0.667_i86linux32 #Changes permissions to executable

cp ./usearch11.0.667_i86linux32 "$(dirname "$(which agrvate)")" #Copies usearch binary to the same directory as agrvate

NOTE: Currently, only the 32-bit version of usearch is free to use. This version is not supported by WSL or macOS (post-Catalina). Therefore, it is recommended to use AgrVATE on Linux machines or older versions of macOS. If you are unable to run usearch, use the -m option to run MUMmer instead (IN BETA). However, if the agr operon contains large insertions or deletions, MUMmer can split the alignment in two, so the extracted agr operon will not be intact and frameshift detection using Snippy may miss these indels.
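
The fallback described in the note can be scripted. The following is a minimal sketch, not part of AgrVATE: the function name `agr_backend_flag` is hypothetical, and the default binary name is assumed from the installation example above. It only checks whether the binary is found on PATH, which does not prove that a 32-bit build can actually execute on the machine.

```shell
# Hypothetical helper (not part of AgrVATE): print "-m" when the given
# usearch binary is not found on PATH, otherwise print nothing.
# Finding the binary does not guarantee that a 32-bit build can run here.
agr_backend_flag() {
    usearch_bin="${1:-usearch11.0.667_i86linux32}"
    if ! command -v "$usearch_bin" >/dev/null 2>&1; then
        printf '%s\n' '-m'
    fi
}
```

It could then be used as `agrvate -i filename.fasta $(agr_backend_flag)`, leaving the substitution unquoted so that an empty result adds no argument.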

PREREQUISITES:

Usearch 32 bit linux
Robert C. Edgar, Search and clustering orders of magnitude faster than BLAST, Bioinformatics, Volume 26, Issue 19, 1 October 2010, Pages 2460–2461, https://doi.org/10.1093/bioinformatics/btq461
NCBI blast+
Camacho, C., Coulouris, G., Avagyan, V. et al. BLAST+: architecture and applications. BMC Bioinformatics 10, 421 (2009). https://doi.org/10.1186/1471-2105-10-421
Snippy
Seemann T (2015). Snippy: fast bacterial variant calling from NGS reads. https://github.com/tseemann/snippy
MUMmer
S. Kurtz. et al (2004). Versatile and open software for comparing large genomes. Genome Biology, R12. https://doi.org/10.1186/gb-2004-5-2-r12
HMMER
S.R. Eddy. Biological sequence analysis using profile hidden Markov models. http://hmmer.org/
SeqKit
Shen W, Le S, Li Y, Hu F (2016) SeqKit: A Cross-Platform and Ultrafast Toolkit for FASTA/Q File Manipulation. PLoS ONE 11(10): e0163962. https://doi.org/10.1371/journal.pone.0163962

Databases folder for agr group typing and variant calling (agrvate_databases/, layout shown at the end of this section)

DREME
DREME is not required for AgrVATE but it was used to build the kmer database for Agr-group typing (gp1234_motifs_all.fasta)
Timothy L. Bailey, DREME: motif discovery in transcription factor ChIP-seq data, Bioinformatics, Volume 27, Issue 12, 15 June 2011, Pages 1653–1659, https://doi.org/10.1093/bioinformatics/btr261

 agrvate_databases/
 	├── agrD_hmm.hmm
 	├── agrD_hmm.hmm.h3f
 	├── agrD_hmm.hmm.h3i
 	├── agrD_hmm.hmm.h3m
 	├── agrD_hmm.hmm.h3p
 	├── agr_operon_primers.fa
 	├── gp1234_motifs_all.fasta
 	└── references
 		├── gp1-operon_ref.gbk
 		├── gp2-operon_ref.gbk
 		├── gp3-operon_ref.gbk
 		├── gp4-operon_ref.gbk
 		└── mummer_ref_operon.fna
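
When the databases path is supplied manually with -d, it may help to confirm that the folder matches the layout above before starting a run. The following is a minimal sketch under that assumption; `check_agrvate_databases` is a hypothetical name and is not provided by AgrVATE.

```shell
# Hypothetical helper (not part of AgrVATE): list files from the tree above
# that are missing under the given databases directory.
# Prints one missing relative path per line; returns non-zero if any are missing.
check_agrvate_databases() {
    db_dir="$1"
    missing=0
    for rel in \
        agrD_hmm.hmm agrD_hmm.hmm.h3f agrD_hmm.hmm.h3i \
        agrD_hmm.hmm.h3m agrD_hmm.hmm.h3p \
        agr_operon_primers.fa gp1234_motifs_all.fasta \
        references/gp1-operon_ref.gbk references/gp2-operon_ref.gbk \
        references/gp3-operon_ref.gbk references/gp4-operon_ref.gbk \
        references/mummer_ref_operon.fna
    do
        if [ ! -f "$db_dir/$rel" ]; then
            printf '%s\n' "$rel"
            missing=1
        fi
    done
    return "$missing"
}
```

For example, `check_agrvate_databases ./agrvate_databases && agrvate -i filename.fasta -d ./agrvate_databases` would only start the run when every listed file is present.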

USAGE:

agrvate -i filename.fasta [options]

FLAGS:
- -i Input S. aureus genome in FASTA format [alternate: --input]
- -t Does agr typing only (skips agr operon extraction and frameshift detection) [alternate: --typing-only]
- -m Uses MUMmer dnadiff instead of usearch [alternate: --mummer]
- -f Force overwrite existing results directory [alternate: --force]
- -d Path to agrvate_databases (Not required if installed using Conda) [alternate: --databases]
- -h Print this help message and exit [alternate: --help]
- -v Print version and exit [alternate: --version]

AgrVATE accepts a single FASTA file per run; the file may be a multi-FASTA. To process multiple S. aureus genomes, it is recommended to keep them as separate files in a common directory and run AgrVATE once per file.
For example:

ls fasta_files/* | xargs -I {} agrvate -i {} [options]
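
An alternative to piping ls into xargs is a plain shell loop, which also copes with spaces in file names. The following is a minimal sketch, not part of AgrVATE: `run_agrvate_batch` is a hypothetical name, and `AGRVATE_CMD` is an assumed override variable introduced here only so the loop can be dry-run (for example with `AGRVATE_CMD=echo`).

```shell
# Hypothetical wrapper (not part of AgrVATE): run agrvate once per file
# in a directory. Extra arguments are passed through to every call.
# AGRVATE_CMD is an assumed override for dry runs (e.g. AGRVATE_CMD=echo).
run_agrvate_batch() {
    fasta_dir="$1"
    shift
    for fasta in "$fasta_dir"/*; do
        [ -f "$fasta" ] || continue   # skip subdirectories and unmatched globs
        "${AGRVATE_CMD:-agrvate}" -i "$fasta" "$@" \
            || printf 'agrvate failed for %s\n' "$fasta" >&2
    done
}
```

For example, `run_agrvate_batch fasta_files -t` would run typing only on every file in fasta_files/ and report any failed run on standard error without stopping the loop.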

OUTPUTS:

RESULTS:

A new directory with the suffix -results will be created, containing all of the following files.

NOTE: There are 15 possible kmers for each agr group per genome. The analysis continues even if only one kmer matches a given agr group, but fewer than 5 matching kmers gives a low-confidence agr group call. Col 3 in fasta-summary.tab shows the number of kmers matched.

fasta-summary.tab:

  col 1: Filename
  col 2: Agr group (gp1/gp2/gp3/gp4). 'u' means unknown. If multiple agr groups were found (col 5 = m), the displayed agr group is the majority/highest confidence. 
  col 3: Match score for agr group (maximum 15; 0 means untypeable; < 5 means low confidence)
  col 4: Canonical or non-canonical agrD ( 1 means canonical; 0 means non-canonical; u means unknown)
  col 5: Whether multiple agr groups were found, which is likely due to multiple S. aureus isolates in the sequence ( s means single, m means multiple, u means unknown )
  col 6: Number of frameshifts found in CDS of extracted agr operon ( Column is 'u' if agr operon was not extracted )

If multiple assemblies are run, use this command from the parent directory to output a consolidated summary table for all samples:

  awk 'FNR==1 && NR!=1 { while (/^#/) getline; } 1 {print}' ./*-results/*-summary.tab > filename.tab
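
Low-confidence and untypeable calls (col 3 below 5) can be pulled out of a summary table with awk. The following is a minimal sketch, not part of AgrVATE: `low_confidence_calls` is a hypothetical name, and the code assumes a tab-separated table whose header lines start with '#', as the consolidation command above suggests.

```shell
# Hypothetical helper (not part of AgrVATE): print filename, agr group and
# match score for rows whose match score (col 3) is below 5.
# Assumes a tab-separated table whose header lines start with '#'.
low_confidence_calls() {
    awk -F '\t' '!/^#/ && $3 ~ /^[0-9]+$/ && $3 < 5 { print $1 "\t" $2 "\t" $3 }' "$1"
}
```

For example, `low_confidence_calls filename.tab` lists the samples whose agr group call deserves a closer look.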

fasta-agr_gp.tab:

  col 1: Assembly Contig ID
  col 2: ID of matched agr group kmer
  col 3: evalue
  col 4: Percentage identity of match
  col 5: Start position of kmer alignment on input sequence
  col 6: End position of kmer alignment on input sequence

fasta-agr_operon_frameshifts.tab:
Frameshift mutations in CDS of extracted agr operon detected by Snippy. An agr-group specific reference sequence is used to call variants.
```
  col 1: Filename
  col 2: Position on agr operon compared to reference
  col 3: Type of frameshift
  col 4: Effect of mutation
  col 5: Gene
```
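
To see which genes carry frameshifts, the rows of this table can be tallied by col 5. The following is a minimal sketch, not part of AgrVATE: `frameshifts_per_gene` is a hypothetical name, and the code assumes the file is tab-separated and that any header line starts with '#'.

```shell
# Hypothetical helper (not part of AgrVATE): count frameshift rows per gene
# (col 5) and print "gene<TAB>count", sorted by gene name.
# Assumes a tab-separated table; lines starting with '#' are skipped.
frameshifts_per_gene() {
    awk -F '\t' '
        !/^#/ && NF >= 5 { count[$5]++ }
        END { for (gene in count) print gene "\t" count[gene] }
    ' "$1" | sort
}
```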

fasta-blastn-log.txt:
Standard output of NCBI blastn.

fasta-agr_operon.fna:
Agr operon in FASTA format, extracted by in-silico PCR using USEARCH -SEARCH_PCR.

fasta-hmm.tab:
Tabular output of nhmmer. This file is present only if the agr group is untypeable.

fasta-hmm-log.txt:
Standard output of nhmmer. This file is present only if the agr group is untypeable.

fasta-pcr-log.tab:
Standard output of USEARCH -SEARCH_PCR.

fasta-snippy_log.txt:
Standard output of Snippy.

fasta-snippy/
All output files of Snippy.

fasta-mummer_log.txt:
Standard output of MUMmer dnadiff.

fasta-mummer/
All output files of MUMmer dnadiff.

TROUBLESHOOTING

An error report summary file with suffix -error-report.tab will be created in the working directory.

The error report file does not contain any results. It merely shows which steps of the process pipeline ran (pass) and which steps did not (fail).

pass does not necessarily mean a result was obtained; it only means the step completed successfully.
fail does not necessarily mean there was an error; it only means the step was not performed. However, possible causes of error for each column are listed below.

The columns are ordered by how the processes are carried out, i.e. col 1 is the first step and col 7 is the last. If one column shows fail, the programme exited at that step, and therefore the remaining columns will also show fail.

error-report.tab:

  col 1: Input name - the argument supplied to the -i flag
  col 2: Input check - If fail, the input did not pass the valid fasta file criteria
  col 3: Databases check - If fail, the databases folder or the path to the databases was not valid. 
  col 4: Outdir check - If fail, the results directory already exists and couldn't be overwritten. Use flag -f or --force. 
  col 5: Agr typing - If fail, the Agr typing kmer search could not be performed. Check if blastn is installed correctly. 
  col 6: Operon check - If fail, in-silico PCR was not performed by usearch or agr operon search was not performed by mummer. Check if usearch/mummer is installed correctly. 
  col 7: Snippy check - If fail, agr operon frameshift detection was not performed. Check if snippy is installed correctly.

If multiple assemblies are run, use this command from the parent directory to output a consolidated report table for all samples:

  awk 'FNR==1 && NR!=1 { while (/^#/) getline; } 1 {print}' ./*-error-report.tab > filename.tab
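
Because every column after the first fail also shows fail, the first failing column identifies where a run stopped. The following is a minimal sketch, not part of AgrVATE: `first_failed_step` and the step labels it prints are hypothetical names chosen here, and the code assumes a tab-separated report whose header lines start with '#'.

```shell
# Hypothetical helper (not part of AgrVATE): for each sample print the input
# name and the first step (cols 2-7) that shows "fail", or "all_pass".
# Step labels are illustrative names for the columns described above.
first_failed_step() {
    awk -F '\t' '
        BEGIN {
            step[2] = "input_check";  step[3] = "databases_check"
            step[4] = "outdir_check"; step[5] = "agr_typing"
            step[6] = "operon_check"; step[7] = "snippy_check"
        }
        /^#/ || NF == 0 { next }      # skip header and blank lines
        {
            result = "all_pass"
            for (i = 2; i <= 7; i++) {
                if ($i == "fail") { result = step[i]; break }
            }
            print $1 "\t" result
        }
    ' "$1"
}
```

For example, `first_failed_step filename.tab` on the consolidated report gives one line per sample, which makes it easy to spot runs that stopped at the outdir check and need -f.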

Author

Vishnu Raghuram
