egonozer/AGEnt

AGEnt

INTRODUCTION:

AGEnt is a program for identifying accessory genomic elements in bacterial genomes by using an in-silico subtractive hybridization approach against a core genome, such as those generated by the Spine algorithm. See http://vfsmspineagent.fsm.northwestern.edu for more information on Spine.

REQUIREMENTS:

  • Perl 5.10 or above
  • MUMmer version 3.22 or above. Install MUMmer as directed by the instructions included with the software.
  • Mac OSX or Linux. We provide no guarantees that this will work on Windows or other operating systems.

INSTALLATION:

Simply move the AGEnt directory to the desired location. The "scripts" directory must remain in the same directory as AGEnt.pl.

USAGE:

Basic command: perl AGEnt.pl -r core_genome.fasta -q query_genome.fasta

For a list of options, call the script without any arguments: perl AGEnt.pl

Required Inputs:

-q
File of query sequence(s) in Fasta or Genbank format. If an annotated Genbank formatted file is used, AGEnt will try to extract CDS coordinates to separate genes into core and accessory groups.

AGEnt will try to guess what type of file you have entered based on the suffix (Fasta if suffix is .fasta or .fa, Genbank if suffix is .gbk or .gb). If you would like to set this manually, use the -Q option (see below).
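To check ahead of time whether a file name will be recognized or whether -Q is needed, the suffix rule above can be sketched in Python. This is only an illustration of the rule as described here, not AGEnt's own code; the function name guess_format is hypothetical, and the exact-case suffix match is an assumption because the text does not say how AGEnt treats upper-case suffixes.

```python
from pathlib import Path

# Suffix-to-type table taken from the description above.
# "F" and "G" are the same codes accepted by the -Q and -R options.
SUFFIX_TYPES = {".fasta": "F", ".fa": "F", ".gbk": "G", ".gb": "G"}


def guess_format(path):
    """Return "F" (Fasta), "G" (Genbank), or None if the suffix is not recognized.

    Hypothetical helper: matches the suffix exactly as written (assumption).
    """
    return SUFFIX_TYPES.get(Path(path).suffix)
```

A result of None suggests setting the file type manually with -Q (or -R for the reference).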

If a Genbank file is given, the CDS records must have "locus_id" tags for gene information to be extracted. Some automated annotation pipelines, such as RAST, do not add "locus_id" tags. These can be added to your file using the online or downloadable program gbk_reformat (see http://vfsmspineagent.fsm.northwestern.edu/cgi-bin/gbk_reformat.cgi for more information).

If your genome is split across multiple files (e.g. multiple chromosomes or plasmids), you can include all files here separated by commas (no spaces). File order is not important. All files must be in the same format, i.e. all fasta or all genbank. No mixing and matching!
Example:

-q /path/to/chrom_I.fasta,/path/to/chrom_II.fasta,/path/to/plasmid.fasta
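When AGEnt is driven from a script, the comma-separated -q value can be assembled programmatically. The following is a minimal Python sketch; the function name build_agent_command and the check for commas inside a path are assumptions made for illustration, and only the -r and -q options documented here are used.

```python
def build_agent_command(reference, query_files, script="AGEnt.pl"):
    """Build the argument list for a basic AGEnt run (hypothetical helper).

    Query files are joined with commas and no spaces, as described above.
    """
    if not query_files:
        raise ValueError("at least one query file is required")
    for path in query_files:
        # Assumption: a comma inside a path would be ambiguous in the -q list.
        if "," in path:
            raise ValueError(f"path contains a comma: {path!r}")
    return ["perl", script, "-r", reference, "-q", ",".join(query_files)]
```

The returned list can be passed to subprocess.run(cmd, check=True) from the Python standard library.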

-r
File of core / reference sequence(s) in Fasta or Genbank format.

AGEnt will try to guess what type of file you have entered based on the suffix (Fasta if suffix is .fasta or .fa, Genbank if suffix is .gbk or .gb). If you would like to set this manually, use the -R option (see below).

If you want to use the core genome output produced by Spine, use the "backbone.fasta" file here.

Optional Inputs:

-b
Also output core (i.e. non-accessory) sequences and coordinates. (default: only accessory sequences and coordinates are output)

-c
Path to file containing names and coordinates of genes in the query genome. This will output a file separating genes into core or accessory categories.

Default file format is "Glimmer" format, i.e.

	>contig_name_1 
	orf_ID_1<space(s)>start_coordinate<space(s)>stop_coordinate
	orf_ID_2<space(s)>start_coordinate<space(s)>stop_coordinate
	>contig_name_2
	orf_ID_3<space(s)>start_coordinate<space(s)>stop_coordinate
	etc...

but different file formats can be chosen with option -f (see below)

Contig names should match those in file given by option -q.
All coordinates are 1-based.
Coordinates that cross the origin of a circular contig will give incorrect results.
ORF IDs should ideally be unique across the whole file (i.e. don't restart the count for each contig).
If an annotated Genbank file is given as the query sequence (-q), gene coordinates entered here will override the Genbank file annotations.
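The "Glimmer" layout described above can be checked before a run with a short Python sketch. This is not AGEnt's parser; the function names are hypothetical, and the sketch assumes that any columns after the stop coordinate can be ignored.

```python
from collections import Counter


def parse_glimmer_coords(lines):
    """Yield (contig, orf_id, start, stop) from Glimmer-style coordinate lines."""
    contig = None
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            name = line[1:].split()
            if not name:
                raise ValueError("contig header has no name")
            contig = name[0]
            continue
        fields = line.split()  # any run of spaces or tabs separates the columns
        if contig is None or len(fields) < 3:
            raise ValueError(f"unexpected line: {raw!r}")
        yield contig, fields[0], int(fields[1]), int(fields[2])


def duplicate_orf_ids(records):
    """Return a sorted list of ORF IDs that occur more than once."""
    counts = Counter(orf_id for _, orf_id, _, _ in records)
    return sorted(orf_id for orf_id, n in counts.items() if n > 1)
```

An empty list from duplicate_orf_ids means every ORF ID is unique, as recommended above.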

If you provided multiple sequence files as query files (-q), you can also provide multiple coordinate files here, again separated by commas with no spaces. File order is not important. All files must be in the same format, i.e. all gff3 or all glimmer, etc. No mixing and matching!
Example:

-c /path/to/chrom_I.gff3,/path/to/chrom_II.gff3,/path/to/plasmid.gff3

-f
Format of the ORF coordinate file given to -c. Options are:

  • 'glimmer'
  • 'genemark'
  • 'prodigal' ('gbk' or web format)
  • 'gff' (accepts gff3 or gff formatted files)

(default: glimmer)

-l
Print license information and quit

-m
Minimum alignment identity between query and reference to be called core, in percent.
(default: 85)

-n
Full path to folder containing MUMmer scripts and executables, i.e. /home/applications/MUMmer/bin
(default: tries to find MUMmer in your PATH)

-o
Prefix for output files.
(default: "output")

-p
Prefix for output sequences.
(default: same as given by option -o)

-Q
Manual override of query file type. Enter "F" for Fasta or "G" for Genbank.

-R
Manual override of reference file type. Enter "F" for Fasta or "G" for Genbank.

-s
Minimum size of fragments to output, in bases.
(default: 10)

-v
Print version information and quit.

OUTPUT FILES:

statistics.txt
First line shows the current software version used.
Second line shows the input parameters given to the software.
Column headers and descriptions:

  • source: Indicates whether the row describes the strain's accessory or core genome
  • total_bp: Sequence size, in bases
  • gc_%: Percent GC content of the sequence
  • num_segs: Number of separate sequence segments output
  • min_seg: Smallest segment size, in bases
  • max_seg: Largest segment size, in bases
  • avg_leng: Average length of the output segments
  • median_leng: Median length of the output segments
  • num_cds (if annotation was provided): Number of coding sequences present. A coding sequence is counted as present in the core or the accessory genome if at least 50% of its length falls within sequences of that genome fraction.
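To make the sequence columns above concrete, here is a minimal Python sketch that derives them from a list of segment sequences. It is an illustration only; the function name segment_stats is hypothetical, and computing GC content over all bases (including any ambiguous ones) is an assumption, since the text does not state how AGEnt handles them.

```python
from statistics import mean, median


def segment_stats(segments):
    """Compute the per-fraction sequence statistics for a list of segment strings."""
    if not segments:
        raise ValueError("no segments given")
    lengths = [len(seq) for seq in segments]
    total = sum(lengths)
    # Assumption: GC percentage is taken over every base in the segments.
    gc = sum(seq.upper().count("G") + seq.upper().count("C") for seq in segments)
    return {
        "total_bp": total,
        "gc_%": 100.0 * gc / total if total else 0.0,
        "num_segs": len(lengths),
        "min_seg": min(lengths),
        "max_seg": max(lengths),
        "avg_leng": mean(lengths),
        "median_leng": median(lengths),
    }
```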

coords.txt
Coordinates of genome sequences.
"*.accessory_coords.txt": Accessory genome sequences for the indicated strain
"*.core_coords.txt": Core genome sequences for the indicated strain [optional: see -b above]
Column headers and descriptions:

  • contig_id: sequence ID of the source strain contig
  • contig_length: length, in bases, of the source strain contig
  • start: start coordinate of the genome segment on the source strain contig
  • stop: stop coordinate of the genome segment on the source strain contig
  • out_seq_id: sequence ID of the segment as found in the corresponding sequence file output by AGEnt

*.fasta
Nucleotide sequences of the genome segments output by AGEnt. Original sources of the sequences can be determined by cross-referencing the sequence IDs with the corresponding coords.txt file.

loci.txt (if annotated genbank file or annotation coordinate file was provided for the query genome)
Lists of coding sequences found in the accessory and core genomes.
"*.accessory_loci.txt": Accessory genome coding sequences for the indicated strain
"*.core_loci.txt": Core genome coding sequences for the indicated strain [optional: see -b above]
Column headers and descriptions:

  • locus_id: locusID of gene
  • gen_contig_id: Source strain contig ID
  • gen_contig_start: Gene start coordinate in source sequence (1-based)
  • gen_contig_stop: Gene stop coordinate in source sequence (1-based)
  • strand: Strand on which the gene is encoded
  • out_seq_id: Output sequence ID (corresponds to sequence IDs in corresponding sequence fasta file above)
  • out_seq_start: Gene start coordinate in output sequence (1-based)
  • out_seq_stop: Gene stop coordinate in output sequence (1-based)
  • pct_locus: Percentage of gene represented in the output sequence
  • overhangs: Number of bases of the gene missing from the end(s) of the output segment. Values are separated by a comma. First value is the number of bases missing from the 5' end of the output segment, second value is the number of bases missing from the 3' end of the output segment.
  • product: Gene product
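The relationship between pct_locus and overhangs can be illustrated with a small Python sketch. This is a reading of the column descriptions above rather than AGEnt's implementation; the function name locus_overlap is hypothetical, and the sketch assumes that the output segment is in the same orientation as the source contig, so that its 5' end is the lower coordinate.

```python
def locus_overlap(gene_start, gene_stop, seg_start, seg_stop):
    """Return (pct_locus, overhang_5prime, overhang_3prime) for one gene and one segment.

    All coordinates are 1-based and inclusive, on the same source contig.
    Gene coordinates may be given in either order.
    """
    lo, hi = sorted((gene_start, gene_stop))
    gene_len = hi - lo + 1
    # Bases of the gene that fall inside the segment.
    overlap = max(0, min(hi, seg_stop) - max(lo, seg_start) + 1)
    # Bases of the gene lying beyond each end of the segment, capped at gene length.
    overhang_5 = min(gene_len, max(0, seg_start - lo))
    overhang_3 = min(gene_len, max(0, hi - seg_stop))
    return 100.0 * overlap / gene_len, overhang_5, overhang_3
```

For example, a 100-base gene at 90..189 against a segment at 100..500 has 10 bases missing from the 5' end of the segment and none from the 3' end, so 90% of the gene is represented.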

LICENSE:

AGEnt Copyright (C) 2016-2018 Egon A. Ozer

This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program. See LICENSE.txt

CONTACT:

Contact Egon Ozer with questions or comments

About

Identification of nucleotide accessory genomic elements in bacterial or other small genome organisms
