Gene Prediction Modes

hyattpd edited this page Aug 11, 2014 · 13 revisions

Prodigal provides support for three modes of gene prediction.

  • Normal Mode, in which Prodigal takes the sequence you provide it, studies it, learns its properties, and then predicts genes based on those properties.
  • Anonymous Mode, in which Prodigal applies pre-calculated training files to the provided input sequence and predicts genes based on the best results.
  • Training Mode, which works like normal mode, but Prodigal saves a training file for future use.

Which mode is right for your sequence depends on the type of data set you are analyzing. Use normal mode whenever you have enough data for Prodigal to train on: 100kb+ for good 3' predictions and 500kb+ for good 5' predictions are safe numbers, though you may be able to get by with less. Use anonymous mode on metagenomic data sets, or on sequences too short to provide good training data.

Normal mode should be used on finished genomes, reasonable-quality draft genomes, and big viruses. Anonymous mode should be used on metagenomes, low-quality draft genomes, small viruses, and small plasmids. Training mode, the third option, works like normal mode but writes a training file that can be used in a later analysis. This is useful primarily when you wish to train on a different sequence than the one you wish to analyze.
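As a rough illustration of the size guidance above, the following sketch counts the bases in a FASTA file and suggests a mode. The function names (fasta_bases, suggest_mode) and the MIN_BASES variable are hypothetical and not part of Prodigal, and the 100kb cutoff is just the lower of the two safe numbers quoted above. It looks only at sequence length, so it cannot tell a metagenome or a low-quality draft from a finished genome; treat the result as a starting point.

```shell
# Hypothetical helper: count sequence characters in a FASTA file,
# skipping header lines and ignoring whitespace.
fasta_bases() {
    grep -v '^>' "$1" | tr -d ' \t\r\n' | wc -c | tr -d ' '
}

# Hypothetical rule of thumb: suggest normal mode when the input has
# at least MIN_BASES bases (default 100000, an assumption taken from
# the 100kb figure above), and anonymous mode otherwise.
suggest_mode() {
    bases=$(fasta_bases "$1")
    if [ "$bases" -ge "${MIN_BASES:-100000}" ]; then
        echo "normal"
    else
        echo "anon"
    fi
}
```

The suggestion could then be passed to the -p option, for example: prodigal -i my.genome.fna -p "$(suggest_mode my.genome.fna)" -o gene.coords.gbk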

A summary from the Prodigal help description:

  -p, --mode:           Specify mode (normal, train, or anon).
                          normal:   Single genome, any number of
                                    sequences.  (Default)
                          train:    Do only training.  Input
                                    should be multiple FASTA of
                                    one or more closely
                                    related genomes.
                          anon:     Anonymous sequences, analyze
                                    using preset training files,
                                    ideal for metagenomic data
                                    or single short sequences.
                          meta:     (Deprecated) Same as anon.
                          single:   (Deprecated) Same as normal.

The modes are explained in more detail below.

Normal Mode

To run Prodigal in normal mode on a single- or multiple-FASTA input file, run:

$ prodigal -i my.genome.fna -o gene.coords.gbk -a protein.translations.faa

The -i option specifies the input file, which can be in single- or multiple-FASTA, Genbank, or EMBL format. However, the Genbank and EMBL parsers are unsophisticated and have not been thoroughly tested, so we recommend using FASTA whenever possible.

The -o option specifies the output file (gene coordinates), and the -a option specifies where to write the protein translations. Protein translations are optional, but most users want them.

With no input or output files specified, Prodigal reads from stdin and writes to stdout, so the following also works:

$ prodigal < my.genome.fna > gene.coords.gbk
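Because Prodigal can read from stdin, it can sit at the end of a pipeline. The sketch below decompresses a gzipped FASTA file on the fly and writes the gene coordinates to a file. The function name predict_from_gzip and the PRODIGAL variable (an override for the executable name) are hypothetical conveniences, not part of Prodigal.

```shell
# Hypothetical wrapper: decompress a gzipped FASTA file and feed it to
# Prodigal on stdin; gene coordinates go to the file named in $2.
# PRODIGAL is an assumed override for the executable name.
predict_from_gzip() {
    gzip -dc "$1" | "${PRODIGAL:-prodigal}" > "$2"
}
```

For example: predict_from_gzip my.genome.fna.gz gene.coords.gbk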

The following options specify input and output files for Prodigal:

Input/Output Parameters

  -i, --input_file:     Specify input file (default stdin).
  -o, --output_file:    Specify output file (default stdout).
  -a, --protein_file:   Specify protein translations file.
  -d, --mrna_file:      Specify nucleotide sequences file.
  -s, --start_file:     Specify complete starts file.
  -w, --summ_file:      Specify summary statistics file.
  -f, --output_format:  Specify output format.
                          gbk:  Genbank-like format (Default)
                          gff:  GFF format
                          sqn:  Sequin feature table format
                          sco:  Simple coordinate output
  -q, --quiet:          Run quietly (suppress logging output).

The various output options and formats are explained in more detail in Understanding the Prodigal Output.
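As an illustration of combining several of the options listed above, this sketch writes GFF gene coordinates, protein translations, and nucleotide sequences that share one file prefix, with logging suppressed. The function name predict_all_outputs, the PRODIGAL variable, and the file extensions are assumptions chosen for the example; only the flags come from the help text.

```shell
# Hypothetical wrapper: run Prodigal on $1 and name every output
# after the prefix in $2. PRODIGAL is an assumed override for the
# executable name; the extensions are illustrative choices.
predict_all_outputs() {
    input=$1
    prefix=$2
    "${PRODIGAL:-prodigal}" -i "$input" -f gff \
        -o "$prefix.gff" -a "$prefix.faa" -d "$prefix.ffn" -q
}
```

For example: predict_all_outputs my.genome.fna my.genome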

Anonymous Mode

To run Prodigal in anonymous mode, add the -p anon option to the command line, e.g.:

$ prodigal -i metagenome.fna -o coords.gbk -a proteins.faa -p anon

Prodigal 2.x: In older versions of Prodigal, this mode was known as metagenomic mode and was invoked using the -p meta option. Prodigal 3.x will continue to support this syntax, although -p meta is considered deprecated. We renamed it anonymous mode because that name more accurately describes its function, which is not limited to metagenomic sequences.
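Anonymous mode needs no training step, so it is straightforward to apply to many input files in turn. The sketch below is one way to do that; the function name anon_batch, the PRODIGAL variable, and the output naming scheme are assumptions made for the example. With a Prodigal 2.x executable, -p anon would need to be replaced by -p meta, as described above.

```shell
# Hypothetical batch helper: run anonymous mode on every FASTA file
# given after the output directory. PRODIGAL is an assumed override
# for the executable name.
anon_batch() {
    outdir=$1
    shift
    mkdir -p "$outdir" || return 1
    for fasta in "$@"; do
        base=$(basename "$fasta")
        base=${base%.*}    # drop the final extension, e.g. ".fna"
        "${PRODIGAL:-prodigal}" -i "$fasta" -p anon \
            -o "$outdir/$base.gbk" -a "$outdir/$base.faa" || return 1
    done
}
```

For example: anon_batch results sample1.fna sample2.fna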

Training Mode

Prodigal also has a training mode, which writes a training data file for later use. The primary reason to use this mode is to train on one input sequence and then analyze another. To train on genome 1 and run on genome 2, for example, first train on genome 1:

$ prodigal -i genome1.fna -p train -t genome1.trn

This writes a training data file to genome1.trn, as specified by the -t option.

  -t, --training_file:  Specify the training file location.  In
                        train mode, writes the training file.
                        In normal mode, reads the training file.

To read in the genome 1 training file and use it to analyze genome 2, one would do:

$ prodigal -i genome2.fna -t genome1.trn -o genome2.gbk -a genome2.faa

Prodigal 2.x: Previous versions of Prodigal did not explicitly have a training mode. Training was invoked by simply calling the -t option with a file that did not already exist. Analyzing using a training file worked as above; if the file specified by -t exists, Prodigal 2.x reads and uses it. We decided this was too clunky and confusing, so we made the user explicitly specify training mode in newer versions.
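The two training-mode commands shown above can be wrapped into one step. This is a minimal sketch for the newer syntax with an explicit -p train; the function name train_then_predict, the PRODIGAL variable, and the output naming are hypothetical. The second command runs only if training succeeds.

```shell
# Hypothetical wrapper: train on $1, then analyze $2 with the
# resulting training file; outputs are named after the prefix in $3.
# PRODIGAL is an assumed override for the executable name.
train_then_predict() {
    train_fasta=$1
    target_fasta=$2
    prefix=$3
    "${PRODIGAL:-prodigal}" -i "$train_fasta" -p train -t "$prefix.trn" || return 1
    "${PRODIGAL:-prodigal}" -i "$target_fasta" -t "$prefix.trn" \
        -o "$prefix.gbk" -a "$prefix.faa"
}
```

For example: train_then_predict genome1.fna genome2.fna genome2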

For additional guidance based on the type of input sequence, see Advice By Input Type.