Advice By Input Type

hyattpd edited this page Aug 21, 2014 · 43 revisions

Finished Genomes

We define a finished genome as one in which each chromosome or plasmid is a single contig and there are no runs of N's (gaps).

Finished genomes should be run in Normal Mode.

For genomes where you are sure the first and last bases of the sequence(s) do not fall inside a gene, you should consider the -c option.

   -c, --closed:         Closed ends.  Do not allow partial genes
                         at edges of sequence.

If the genome consists of multiple chromosomes, you can analyze them together or separately. Chromosomes should only be separated if (1) each chromosome is at least 500kb, and (2) you have reason to believe the chromosomes are quite different in terms of GC content, RBS motif usage, and other parameters.
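The two criteria above (length and compositional similarity) can be checked roughly before deciding. The following is a minimal sketch, not part of Prodigal: the helper names `parse_fasta`, `gc_fraction`, and `summarize_replicons` are hypothetical, and GC content is only one of the parameters mentioned above. The text does not say how large a GC difference counts as "quite different", so the sketch only reports the numbers and leaves that judgment to you.

```python
def parse_fasta(text):
    """Return a dict mapping each FASTA record name to its sequence."""
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {n: "".join(parts) for n, parts in records.items()}


def gc_fraction(seq):
    """Fraction of G and C among the unambiguous bases (A, C, G, T)."""
    counted = sum(seq.count(base) for base in "ACGT")
    if counted == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / counted


def summarize_replicons(text, min_length=500_000):
    """List (name, length, GC fraction, meets min_length?) per sequence.

    min_length defaults to the 500kb figure quoted in the text.
    """
    return [
        (name, len(seq), round(gc_fraction(seq), 3), len(seq) >= min_length)
        for name, seq in parse_fasta(text).items()
    ]
```

If every chromosome meets the length criterion and the GC fractions differ noticeably, separate runs may be worth trying; otherwise analyzing them together is the simpler choice.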

Plasmids are trickier, and it isn't clear what the best approach is. They can either be included alongside the chromosomes (in which case Prodigal will train on the chromosomes and plasmids together) or analyzed separately, as discussed below. As with multiple chromosomes, your decision should be guided by whether the plasmid is similar to or different from the rest of the genome.

Draft Genomes

In most cases, draft genomes should be analyzed in Normal Mode. Prodigal should do fine even if the average contig length is small (3000 bp or more). Even when most contigs are shorter, the presence of one long contig is usually sufficient to provide good training data.
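To see where a draft assembly stands relative to these figures, a quick summary of contig lengths is enough. This is a minimal sketch with hypothetical helper names (`contig_stats`, `mean_length_ok`). The 3000 bp default comes from the text above; how long the "one long contig" must be is not stated, so the sketch only reports the longest contig rather than applying a cutoff.

```python
def contig_stats(lengths):
    """Summarize a list of contig lengths (in bp) for a draft assembly."""
    if not lengths:
        raise ValueError("no contigs supplied")
    total = sum(lengths)
    return {
        "contigs": len(lengths),
        "total_bp": total,
        "mean_bp": total / len(lengths),
        "longest_bp": max(lengths),
    }


def mean_length_ok(lengths, min_mean=3000):
    """True if the average contig length reaches the ~3000 bp figure."""
    return contig_stats(lengths)["mean_bp"] >= min_mean
```

For example, `contig_stats([1000, 2000, 9000])` reports a mean of 4000 bp and a longest contig of 9000 bp.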

If Prodigal is having trouble building a good training set (due to the sequence being in too many contigs), it will output warnings that look like this:

Warning: Training sequence is highly fragmented.
You may get better results with the '-p anon' option.

By default, Prodigal's parameters are ideal for scaffolds and for multiple FASTA files with many contigs. Partial genes are allowed to run into gaps of N's, which means you should get the same results analyzing 1000 contigs in one file, or analyzing one scaffold with the 1000 contigs joined together by runs of N's. In addition, genes are allowed to run off the edges of each sequence.

Prodigal can handle gaps (defined as two or more consecutive codons of completely ambiguous characters) in a variety of ways, using the -e option:

  -e, --gap_mode:       Specify gap-handling behavior.
                          0:    Partial genes run into gaps.
                                (Default)
                          1:    Genes cannot run into gaps.
                          2:    Do not treat N's as gaps.
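To get a feel for where gaps fall in a scaffold, you can locate the runs of N's yourself. The sketch below is an approximation and not Prodigal's own routine: it treats any run of six or more N's as a gap (two codons' worth), ignores reading frame, and does not consider other ambiguity characters. The function names `find_gaps` and `split_at_gaps` are hypothetical.

```python
import re


def find_gaps(seq, min_run=6):
    """Return (start, end) for each run of at least min_run N's.

    Coordinates are 0-based with an exclusive end.
    """
    pattern = re.compile("N{%d,}" % min_run, re.IGNORECASE)
    return [(m.start(), m.end()) for m in pattern.finditer(seq)]


def split_at_gaps(seq, min_run=6):
    """Split a scaffold into contig-like pieces at each detected gap."""
    pieces = []
    pos = 0
    for start, end in find_gaps(seq, min_run):
        if start > pos:
            pieces.append(seq[pos:start])
        pos = end
    if pos < len(seq):
        pieces.append(seq[pos:])
    return pieces
```

Short runs below `min_run` are left in place, so a piece may still contain a few N's.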

In some rare cases, where you are analyzing a scaffold and you are certain you have exactly the correct number of N's in all your gaps (so as to preserve reading frame), you might choose the -e 2 option, which would allow Prodigal to build gene models that span the gaps. You might also use this option if your contigs are large but your sequence is low quality and contains many short runs of N's that are not meant to be treated as gaps.

Prodigal 2.x: Older versions of Prodigal do not contain gap handling (except for the -m option, which acts similarly to -e 1 above, but requires a run of 50 N's before it considers it a gap).

If you feel like your draft genome is in too many contigs to get a good result (or if you see the warnings shown above), an alternative is to find a closely related genome that is finished, train on it, and use that training file to analyze your highly fragmented draft genome. This process is described in the section on Training Mode.

If your genome is a low-quality draft, and you do not have a high-quality, closely related genome to train on, you should analyze the sequence in Anonymous Mode.

TIP: You should be careful using the -c option with draft genomes in many contigs, as this will prevent Prodigal from predicting partial genes. Similarly, if you have a single scaffold with many gaps in it, you should be careful using the -e option, as you may also lose many partial genes.

Metagenomes

The simplest approach for metagenomes is to put all the sequences in one FASTA file and analyze them in Anonymous Mode. This will produce reasonable results (about 95% as good as if Prodigal had been trained on the actual genomes). It also has the advantage of being easily parallelized, as each sequence in the file can be processed independently from any other sequence in the file.
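Because each sequence is processed independently in Anonymous Mode, the input can be divided into several smaller FASTA files that are analyzed in parallel. The sketch below shows one way to split the records. The helper names `chunk_records` and `to_fasta` are hypothetical, and records are assumed to be already parsed into (header, sequence) pairs.

```python
def chunk_records(records, n_chunks):
    """Distribute (header, sequence) records round-robin into n_chunks lists.

    Empty chunks are dropped when there are fewer records than chunks.
    """
    if n_chunks < 1:
        raise ValueError("n_chunks must be at least 1")
    chunks = [[] for _ in range(n_chunks)]
    for i, record in enumerate(records):
        chunks[i % n_chunks].append(record)
    return [chunk for chunk in chunks if chunk]


def to_fasta(records):
    """Render (header, sequence) pairs as FASTA text, one line per sequence."""
    return "".join(f">{header}\n{seq}\n" for header, seq in records)
```

Round-robin assignment keeps chunk sizes balanced by record count but changes the order in which sequences appear across files, so results need to be matched back by header rather than by position.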

A better solution, when possible, is to assemble as many genomes as you can from the sample, put each genome into its own FASTA file, and analyze each genome using Normal Mode. You can then analyze the leftovers using Anonymous Mode.

Similarly, you might bin the sequences using a binning or classification program (these programs usually rely on GC content, BLAST searches, k-mer searches, or other information). You could then make a multiple FASTA file from each bin and analyze it using normal mode.

TIP: Never analyze a multiple FASTA file containing sequences from more than one genome using normal mode. The only exception to this rule would be if the genomes are closely related (strains of the same species).

Both of the above solutions should produce better results than anonymous mode, since Prodigal always does better when it can train on the sequence itself rather than relying on preset training files. These methods involve a lot of preprocessing work, though, and cannot be run as easily in parallel. The fastest solution is just to use anonymous mode.

Sequencing errors, once common in metagenomic samples, are becoming less of a problem as sequencing technology improves. We chose not to spend time addressing this issue in Prodigal. Nonetheless, if your reads contain many insertions/deletions, you may be better off analyzing your sample with a program like FragGeneScan, which is specifically designed to handle sequencing errors.

Alternate Genetic Codes

Prodigal supports all genetic codes defined by NCBI. Most bacteria and archaea use genetic code 11, which uses three stop codons (TAA, TGA, and TAG). Some bacteria do not use TGA as a stop codon. Mycoplasma, Spiroplasma, and Ureaplasma translate TGA (UGA in the mRNA) to tryptophan (W) (genetic code 4), while bacteria using genetic code 25 translate it to glycine (G).
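The practical effect of the choice of code is easiest to see on a toy sequence. The sketch below is illustrative only and not Prodigal's implementation; the names `TGA_MEANING`, `stop_codons`, and `orf_length` are hypothetical. It covers just the three codes discussed above and shows how reading TGA as a stop truncates an open reading frame.

```python
# Meaning of TGA under each table discussed above ("*" marks a stop).
# TAA and TAG are stop codons in all three tables.
TGA_MEANING = {11: "*", 4: "W", 25: "G"}


def stop_codons(table):
    """Return the set of stop codons for genetic code 11, 4, or 25."""
    stops = {"TAA", "TAG"}
    if TGA_MEANING[table] == "*":
        stops.add("TGA")
    return stops


def orf_length(seq, table):
    """Length in bp from the first base to the first in-frame stop codon.

    The stop codon itself is not counted. If no stop is found, the length
    of the whole in-frame portion of the sequence is returned.
    """
    seq = seq.upper()
    stops = stop_codons(table)
    for i in range(0, len(seq) - 2, 3):
        if seq[i:i + 3] in stops:
            return i
    return len(seq) - len(seq) % 3
```

For the sequence `ATGAAATGACCCGGGTAA`, the frame ends after 6 bp under code 11 but extends to 15 bp under codes 4 and 25. This is why a genome that uses code 4 or 25 yields unusually short genes when read with code 11.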

By default, Prodigal tries genetic code 11. If the average gene length is too low, it tries genetic code 4. If the average gene length is still too low, it reverts to genetic code 11 and outputs a warning. This looks like the following:

Building training set using genetic code 11...done!
Checking average training gene length...459.7, too low.
Trying genetic code 4...still bad, reverting to genetic code 11.
Redoing genome with genetic code 11...done.

Warning: Average training gene length is low (459.7).
Double check translation table or check for pseudogenes/gene decay.

Examining upstream regions and training starts...done.

Prodigal cannot automatically distinguish between genetic code 4 and genetic code 25. In such cases, it will likely choose genetic code 4, and you will need to rerun manually using genetic code 25.
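The fallback order described above can be summarized as a small decision function. This is a sketch of the described behavior, not Prodigal's source: the function name `choose_genetic_code` is hypothetical, and the 600 bp cutoff is borrowed from the low-gene-length warning discussed under gene decay. Whether Prodigal uses that same value to decide when to try code 4 is an assumption here.

```python
def choose_genetic_code(mean_len_11, mean_len_4, min_mean_len=600.0):
    """Return (genetic_code, warn) following the 11 -> 4 -> 11 fallback.

    mean_len_11 and mean_len_4 are the average training gene lengths (bp)
    obtained with genetic codes 11 and 4. min_mean_len is an assumed cutoff.
    """
    if mean_len_11 >= min_mean_len:
        return 11, False  # code 11 looks fine, no need to try code 4
    if mean_len_4 >= min_mean_len:
        return 4, False   # code 4 rescues the gene lengths
    return 11, True       # both too low: revert to 11 and warn
```

Note that code 25 never appears as an outcome, which matches the limitation described above: a genome using code 25 would be assigned code 4 or 11 and has to be rerun manually.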

Autodetection between genetic codes 11 and 4 is highly reliable: in our tests on 20,000 genomes, it made no errors in genetic code determination. However, you can also explicitly specify the genetic code using the -g option.

  -g, --trans_table:    Specify a translation table to use.
                          auto: Tries 11 then 4 (Default)
                          11:   Standard Bacteria/Archaea
                          4:    Mycoplasma/Spiroplasma
                          #:    Other genetic codes 1-25

This will be necessary for any organisms using genetic code 25, and potentially for some genetic code 4 organisms that are not recognized by the autodetection. If you know the genetic code of your genome, you might as well override the autodetection and explicitly specify it using this option.

Prodigal 2.x: Previous versions of Prodigal just do genetic code 11 by default, and do not autodetect 11 and 4. To analyze a genome with genetic code 4, you must explicitly specify this translation table with the -g 4 option.

Organisms with Gene Decay

If you run Prodigal in normal mode and you see the following warning:

Warning: Average training gene length is low (459.7).
Double check translation table or check for pseudogenes/gene decay.

then something may be "wrong" with the final gene predictions. Prodigal prints the above warning when the average gene length in its training set is less than 600bp. Some organisms may just have smaller than average genes. If that's the case, you can ignore the warnings and proceed as normal.

Prodigal 2.x: Previous versions of Prodigal do not output a warning about potential gene decay.

However, the organism may have undergone extensive gene decay (as in Mycobacterium leprae, for example), and many of the genes predicted by Prodigal may be gene fragments, pseudogenes, or simply nonsense (false positives). In such cases, you may have to filter out a lot of these predictions.

One alternative that may help is to train on a closely related genome using Training Mode. For example, if Prodigal is trained on Mycobacterium tuberculosis, and that training file is applied to Mycobacterium leprae, then Prodigal predicts far fewer genes than when run in normal mode directly on M. leprae. For this process to work, the training genome must be of good quality (and not have undergone decay itself) and must not be too distant from the genome you wish to analyze.

Plasmids, Phages, Viruses, and Other Short Sequences

Isolated, short sequences (<100kbp) such as plasmids, phages, and viruses should generally be analyzed using Anonymous Mode. In the case of plasmids, you may be better off combining the plasmids with the chromosome(s) and running everything in Normal Mode. If you believe the plasmid to be quite different from the rest of the genome, however, then anonymous mode is the best option. If the sequence is long enough, you should use Normal Mode instead.

We have never extensively examined exactly how much sequence Prodigal requires in order to produce good results. It is our general feeling that 20kbp-100kbp is too short to gather enough statistics to predict genes well, but we have no actual data to support this claim. If you feel comfortable running 50kbp sequences in normal mode, and you're happy with the results you're getting, then, by all means, continue to do so. Start sites, especially, require a lot of data (100kbp may yield only about 80 start sites in the training set, since Prodigal only trains on the starts of longer ORFs). We recommend 500kbp+ to get ideal 5' predictions, although this is likely on the conservative side. If your sequence is in this nebulous range (20kbp to 100kbp for 3' predictions, 100kbp to 500kbp for 5' predictions), you may try running both normal and anonymous mode and manually inspecting the results for any differences.
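The rough length ranges above can be written down as a lookup. This is a sketch of the rules of thumb in this section, not a tested recommendation: the function name `mode_advice` is hypothetical, the text does not say how the boundaries at exactly 20kbp, 100kbp, and 500kbp should be treated, and the numbers themselves are described above as unsupported by hard data.

```python
def mode_advice(length_bp):
    """Map a sequence length to (suggested_mode, compare_both_modes).

    compare_both_modes is True in the "nebulous" ranges, where running both
    normal and anonymous mode and inspecting the differences is suggested.
    """
    if length_bp < 20_000:
        return "anonymous", False
    if length_bp < 100_000:
        # Grey zone for 3' (gene) predictions.
        return "anonymous", True
    if length_bp < 500_000:
        # Likely enough data for genes, but a grey zone for 5' (start) calls.
        return "normal", True
    return "normal", False
```

For example, `mode_advice(50_000)` returns `("anonymous", True)`: Anonymous Mode is the default suggestion for a sequence of that size, but comparing against a Normal Mode run is reasonable.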

Prodigal contains no special routines to deal with viruses. As such, it cannot handle certain phenomena that occur sometimes in viruses, such as translational frame shifts. Viruses should generally be analyzed as above, with short genomes analyzed in anonymous mode and longer ones in normal mode.