Understanding the Prodigal Output

hyattpd edited this page Aug 18, 2014 · 15 revisions

By default, Prodigal produces one output file, which consists of gene coordinates and some metadata associated with each gene. However, the program can produce four more output files at the user's request: protein translations (with the -a option), nucleotide sequences (with the -d option), a complete listing of all start/stop pairs along with score information (with the -s option), and a summary of statistical information about the genome or metagenome (with the -w option). Below, we explain the output of each of these types of files.

Gene Coordinates

The gene coordinates file lists the location of each gene as well as some additional scoring information. By default, Prodigal produces a Genbank-like feature table; however, the user can specify some other output types via the -f option:

  -f, --output_format:  Specify output format.
                          gbk:  Genbank-like format (Default)
                          gff:  GFF format
                          sqn:  Sequin feature table format
                          sco:  Simple coordinate output

The gff parameter produces Generic Feature Format Version 3 output. The sqn parameter produces a Sequin Feature Table format, although the user will still have to fill in product information to have a "submission-ready" Sequin file. The sco parameter produces Simple Coordinate Output, suitable if the user only desires gene coordinates and nothing else.
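As a rough illustration, the options described so far can be assembled into a command line programmatically. The sketch below only builds an argument list and does not run anything. The helper name `build_prodigal_command` and its parameters are hypothetical, and the `-i` (input) and `-o` (output) flags are assumed from Prodigal's standard command-line usage rather than taken from this page.

```python
def build_prodigal_command(fasta_path, coords_path, output_format="gbk",
                           proteins=None, nucleotides=None,
                           starts=None, stats=None):
    """Build an argument list for a Prodigal run (hypothetical helper).

    -i and -o are assumed input/output flags; -f, -a, -d, -s and -w are
    the options described on this page.
    """
    allowed = {"gbk", "gff", "sqn", "sco"}
    if output_format not in allowed:
        raise ValueError(f"unknown output format: {output_format!r}")

    cmd = ["prodigal", "-i", fasta_path, "-o", coords_path, "-f", output_format]

    # Each optional output file is requested by its own flag.
    optional_outputs = (
        ("-a", proteins),     # protein translations
        ("-d", nucleotides),  # nucleotide sequences
        ("-s", starts),       # start/stop pairs with scores
        ("-w", stats),        # summary statistics
    )
    for flag, path in optional_outputs:
        if path is not None:
            cmd += [flag, path]
    return cmd
```

The resulting list could be passed to `subprocess.run`, assuming a `prodigal` executable is on the PATH.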

For each individual sequence in the FASTA input file, Prodigal produces a header containing a semicolon-delimited string with information (in the form of name=value pairs) about that sequence and how it was analyzed. In Genbank format, this is placed on a "DEFINITION" line. In GFF3, this information is separated into two comment lines.

Sample Genbank header:

DEFINITION  seqnum=1;seqlen=4639675;seqhdr="NC_000913 # Escherichia coli str. K-12 substr. MG1655, complete genome.";version=Prodigal.v2.60;run_type=Single;model="Ab initio";gc_cont=50.79;transl_table=11;uses_sd=1

Sample GFF3 header:

##gff-version  3
# Sequence Data: seqnum=1;seqlen=4639675;seqhdr="NC_000913 # Escherichia coli str. K-12 substr. MG1655, complete genome."
# Model Data: version=Prodigal.v3.0.0-devel.1.0;run_type=Normal;model="Ab initio";gc_cont=50.79;transl_table=11;uses_sd=1

The fields in this header are as follows:

  • seqnum: An ordinal ID for this sequence, beginning at 1.
  • seqlen: Number of bases in the sequence.
  • seqhdr: The entire FASTA header line.
  • version: Version of Prodigal used to analyze this sequence.
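  • run_type: The mode Prodigal was run in. The sample headers above show "Single" (version 2.60) and "Normal" (version 3.0.0-devel.1.0) for normal mode; anonymous mode is reported as "Anonymous".
  • model: "Ab initio" in normal mode, as in the sample headers above; in anonymous mode, information about the preset training file used to analyze the sequence.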
  • gc_cont: % GC content of the sequence.
  • transl_table: The genetic code used to analyze the sequence.
  • uses_sd: Set to 1 if Prodigal used its default RBS finder, 0 if it scanned for other motifs.
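The header string can be split into its name=value pairs with a small parser. The following is a minimal sketch, not part of Prodigal itself: the function name `parse_sequence_header` is hypothetical, and it assumes that quoted values (such as seqhdr) contain no embedded double quotes. Quoted values are matched as a unit so that a semicolon inside the FASTA header does not split the field.

```python
import re

# A name, an equals sign, then either a double-quoted value or a run of
# characters up to the next semicolon.
_PAIR = re.compile(r'(\w+)=("[^"]*"|[^;]*)')


def parse_sequence_header(text):
    """Return the name=value pairs of a Prodigal sequence header as a dict.

    Works on a Genbank DEFINITION line or on either GFF3 comment line;
    any leading label (e.g. "DEFINITION") is ignored. All values are
    returned as strings with surrounding quotes removed.
    """
    fields = {}
    for name, value in _PAIR.findall(text):
        fields[name] = value.strip('"')
    return fields
```

For the sample Genbank header above, `parse_sequence_header(line)["seqlen"]` would give the string "4639675"; converting numeric fields is left to the caller.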

In addition to the strand (as specified by the "+/-" tag in GFF, or by the "complement" keyword indicating reverse strand in Genbank formats) and gene boundaries, Prodigal produces a semicolon-delimited string of name-value pairs (in the form name=value) with scoring and statistical information about each gene.

In Genbank format, this is placed on a "note" line, like so:

     CDS             2655..2882
                     /note="ID=1_3;partial=00;start_type=ATG;stop_type=TAA;rbs_motif=AGxAGG/AGGxGG;rbs_spacer=5-10bp;gc_cont=0.241;conf=100.00;score=44.71;cscore=32.81;sscore=11.90;rscore=9.40;uscore=-1.05;tscore=3.55;"

In GFF3 format, the string is placed in the last field, so the gene occupies only one line instead of two, like so:

NC_000913	Prodigal_v3.0.0-devel.1.0	CDS	337	2799	338.7	+	0	ID=1_2;partial=00;start_type=ATG;stop_type=TGA;rbs_motif=GGAG/GAGG;rbs_spacer=5-10bp;gc_cont=0.531;conf=99.99;score=338.70;cscore=322.16;sscore=16.54;rscore=11.24;uscore=1.35;tscore=3.95;
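A GFF3 gene line like the one above can be read by splitting on tabs and then on semicolons. This is a minimal sketch under the assumption that each gene line has the nine tab-separated GFF3 columns; the function name `parse_gff3_gene` and the keys of the returned dictionary are illustrative choices, not something Prodigal defines.

```python
def parse_gff3_gene(line):
    """Parse one tab-separated GFF3 gene line into a dictionary."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 9:
        raise ValueError(f"expected 9 tab-separated columns, got {len(cols)}")
    seqid, source, ftype, start, end, score, strand, phase, attrs = cols

    # The last column is the semicolon-delimited name=value string; the
    # trailing semicolon produces an empty piece, which is skipped.
    attributes = dict(pair.split("=", 1) for pair in attrs.split(";") if pair)

    return {
        "seqid": seqid,
        "source": source,
        "type": ftype,
        "start": int(start),
        "end": int(end),
        "score": float(score),
        "strand": strand,
        "phase": phase,
        "attributes": attributes,
    }
```

Attribute values are kept as strings, so a caller interested in, say, the confidence would convert it with `float(record["attributes"]["conf"])`.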

The GFF3 format requires an "id" in the first field. Prodigal pulls the first word from the FASTA header and uses that as its id. This id is not guaranteed to be unique (the first word of various headers in the file could be identical), so we recommend the user rely on the "ID" field in the semicolon-delimited string instead.

The fields in the semicolon-delimited string are as follows:

  • ID: A unique identifier for each gene, consisting of the ordinal ID of the sequence and an ordinal ID of that gene within the sequence (separated by an underscore). For example, "4_1023" indicates the 1023rd gene in the 4th sequence in the file.
  • partial: Indicates whether a gene runs off the edge of a sequence or into a gap. The field has two digits, one for the left boundary and one for the right. A "0" indicates the gene has a true boundary (a start or a stop codon), whereas a "1" indicates the gene is "unfinished" at that edge (i.e. a partial gene). For example, "01" means a gene is partial at the right boundary, "10" means it is partial at the left boundary, "11" indicates both edges are incomplete, and "00" indicates a complete gene with a start and stop codon.
  • start_type: The sequence of the start codon (usually ATG, GTG, or TTG). If the gene has no start codon, this field will be labeled "Edge".
  • stop_type: The sequence of the stop codon (usually TAA, TGA, or TAG). If the gene has no stop codon, this field will be labeled "Edge".
  • rbs_motif: The RBS motif found by Prodigal (e.g. "AGGA" or "GGA").
  • rbs_spacer: The number of bases between the start codon and the observed motif.
  • gc_cont: The GC content of the gene sequence.
  • gc_skew: The GC skew of the gene sequence.
  • conf: A confidence score for this gene, representing the probability that this gene is real. For example, a value of 78.3 means Prodigal estimates a 78.3% chance that the gene is real and a 21.7% chance that it is a false positive.
  • score: The total score for this gene.
  • cscore: The hexamer coding portion of the score, i.e. how much this gene looks like a true protein.
  • sscore: A score for the translation initiation site for this gene; it is the sum of the following three fields.
  • rscore: A score for the RBS motif of this gene.
  • uscore: A score for the sequence surrounding the start codon.
  • tscore: A score for the start codon type (ATG vs. GTG vs. TTG vs. Nonstandard).
  • mscore: A score for the remaining signals (stop codon type and leading/lagging strand information).
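The ID and partial fields described in the list above are compact encodings, and decoding them is straightforward. The sketch below shows one way to do it; the function names `split_gene_id` and `describe_partial` and the returned keys are hypothetical.

```python
def split_gene_id(gene_id):
    """Split an ID such as "4_1023" into (sequence ordinal, gene ordinal)."""
    seq_ordinal, gene_ordinal = gene_id.split("_")
    return int(seq_ordinal), int(gene_ordinal)


def describe_partial(flag):
    """Decode the two-digit partial field.

    The first digit describes the left boundary and the second the right
    boundary; "1" means the gene is incomplete at that edge.
    """
    if len(flag) != 2 or set(flag) - {"0", "1"}:
        raise ValueError(f"unexpected partial value: {flag!r}")
    left_partial = flag[0] == "1"
    right_partial = flag[1] == "1"
    return {
        "left_partial": left_partial,
        "right_partial": right_partial,
        "complete": not (left_partial or right_partial),
    }
```

With these helpers, "4_1023" yields the pair (4, 1023), and "01" is reported as partial at the right boundary only.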

Prodigal 2.x: Older versions of Prodigal do not have the stop_type, gc_skew, or mscore fields. In addition, the uscore field referred to a score of only the sequence upstream of the start codon.
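The statement that sscore is the sum of rscore, uscore, and tscore can be checked against any gene record. A minimal sketch follows, assuming the fields have already been parsed into a dictionary of strings. The function name `sscore_is_consistent` is hypothetical, and the tolerance is an assumption chosen to absorb rounding, since each score is printed to two decimal places.

```python
import math


def sscore_is_consistent(attrs, tolerance=0.02):
    """Check that sscore matches rscore + uscore + tscore within rounding."""
    expected = (
        float(attrs["rscore"])
        + float(attrs["uscore"])
        + float(attrs["tscore"])
    )
    return math.isclose(float(attrs["sscore"]), expected, abs_tol=tolerance)
```

For the sample Genbank record above, 9.40 + (-1.05) + 3.55 = 11.90, which matches the reported sscore; the GFF3 sample likewise gives 11.24 + 1.35 + 3.95 = 16.54.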

Protein Translations

The protein translation file consists of all the proteins from all the sequences in multiple FASTA format. The FASTA header begins with a text id consisting of the first word of the original FASTA sequence header followed by an underscore followed by the ordinal ID of the protein. This text id is not guaranteed to be unique (it depends on the FASTA headers supplied by the user), which is why we recommend using the "ID" field in the final semicolon-delimited string instead.

An example header for the 4th protein in the E. coli genome with id NC_000913:

>NC_000913_4 # 3734 # 5020 # 1 # ID=1_4;partial=00;start_type=ATG;rbs_motif=GGA/GAG/AGG;rbs_spacer=5-10bp;gc_cont=0.528

The next three fields in the header, each preceded by a "#" sign, are the leftmost coordinate in the genome, the rightmost coordinate, and the strand (1 for forward strand genes, -1 for reverse strand genes). Following the coordinate information is a semicolon-delimited string in the same format as the one in the gene coordinates file (see the list there for field definitions), restricted to the following fields: ID, partial, start_type, stop_type, rbs_motif, rbs_spacer, gc_cont, gc_skew, and conf. Apart from conf, the header contains none of the scoring information about the gene.
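The header layout described above can be unpacked with a few string splits. This is a minimal sketch, assuming the text id itself contains no " # " sequence; the function name `parse_protein_header` and the keys of the returned dictionary are illustrative.

```python
def parse_protein_header(header):
    """Parse a Prodigal FASTA header from the protein or nucleotide file."""
    text = header.lstrip(">").strip()

    # Text id, left coordinate, right coordinate, strand, then the
    # semicolon-delimited name=value string.
    text_id, left, right, strand, attrs = text.split(" # ", 4)

    attributes = dict(pair.split("=", 1) for pair in attrs.split(";") if pair)

    return {
        "text_id": text_id,
        "left": int(left),
        "right": int(right),
        "strand": int(strand),  # 1 for forward, -1 for reverse
        "attributes": attributes,
    }
```

Since the text id is not guaranteed to be unique, code that indexes records would typically key them on `attributes["ID"]` instead.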

Nucleotide Sequences

The nucleotide sequence file produces multiple FASTA output following the same rules and conventions as described in the Protein Translations section. The only additional point worth noting is that Prodigal uses the DNA alphabet to produce these sequences, not mRNA (so you will see 'T' in the output and not 'U').

Starts File

Summary Statistics