spond/gb_taxonomy_tools

Use CMake to install

To build, use

cmake ./

make install (by default into /usr/local)

Convert GID to TaxID

These are four simple utilities that perform the following manipulation and visualization tasks on GenBank taxonomic information.

gid-taxid: convert a list of GenBank IDs and associated counts into a list of triplets: GenBank ID, taxonomy ID, count. It requires access to the (quite large) mapping file maintained by GenBank, ftp://ftp.ncbi.nih.gov/pub/taxonomy/gi_taxid_nucl.dmp.gz, which is a tab-separated list of gid taxid pairs. For example, the input line 160338813 160 is output as 160338813 436308 160

Try running it as $gid-taxid tests/data/test.gid path/to/gi_taxid_nucl.dmp

The result should be as in tests/data/test.taxid
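The lookup described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the implementation used by gid-taxid; the function name map_gids and the first mapping entry in the toy data are made up for the example, and both inputs are assumed to be whitespace-separated two-column text.

```python
import io

def map_gids(gid_lines, mapping_lines):
    """Return (gid, taxid, count) for every input gid found in the mapping."""
    # Read the wanted gids and their counts first; this table is small.
    counts = {}
    for line in gid_lines:
        fields = line.split()
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        counts[fields[0]] = int(fields[1])
    # Stream the (large) mapping once, keeping only the gids we need.
    found = {}
    for line in mapping_lines:
        fields = line.split()
        if len(fields) == 2 and fields[0] in counts:
            found[fields[0]] = fields[1]
    return [(gid, found[gid], counts[gid]) for gid in counts if gid in found]

# Toy data: the second mapping line is the example from this README,
# the first one is a made-up placeholder entry.
gids = io.StringIO("160338813\t160\n")
mapping = io.StringIO("1\t2\n160338813\t436308\n")
for gid, taxid, count in map_gids(gids, mapping):
    print(gid, taxid, count, sep="\t")
```

Streaming the mapping file rather than loading it keeps memory use proportional to the input list, which matters because the mapping file is very large.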

Convert TaxID into full taxonomic rankings

taxonomy-reader: convert the output of gid-taxid (i.e. gid taxid count triplets) into a fully expanded 22 level taxonomy based on NCBI classification. The program requires access to the nodes.dmp and names.dmp files which match taxid data to scientific names and define the taxonomic hierarchy ftp://ftp.ncbi.nih.gov/pub/taxonomy/taxdump.tar.gz.

Try running $taxonomy-reader path/to/names.dmp path/to/nodes.dmp and entering 160338813 436308 160 on standard input.

The output should be 160 436308 root Archaea n n n Thaumarchaeota n n n n n Nitrosopumilales n n Nitrosopumilaceae n n n Nitrosopumilus n Nitrosopumilus maritimus n
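To illustrate how a taxid can be expanded into a lineage, here is a minimal Python sketch that walks parent pointers in nodes.dmp-style records up to the root. It is not how taxonomy-reader is implemented, and it does not reproduce the fixed 22-level layout; the helper names and the taxids in the toy data are hypothetical. It assumes the usual nodes.dmp layout in which the first three "|"-separated fields are taxid, parent taxid, and rank.

```python
def parse_nodes(lines):
    """Map taxid -> (parent taxid, rank) from nodes.dmp-style lines."""
    nodes = {}
    for line in lines:
        fields = [f.strip() for f in line.split("|")]
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        nodes[fields[0]] = (fields[1], fields[2])
    return nodes

def lineage(taxid, nodes):
    """Return [(rank, taxid), ...] ordered from the root down to taxid."""
    path = []
    seen = set()  # guards against cycles in malformed input
    while taxid in nodes and taxid not in seen:
        seen.add(taxid)
        parent, rank = nodes[taxid]
        path.append((rank, taxid))
        if parent == taxid:  # the root is recorded as its own parent
            break
        taxid = parent
    return path[::-1]

# Toy records with hypothetical taxids (not real NCBI assignments).
toy = [
    "1\t|\t1\t|\tno rank\t|",
    "100\t|\t1\t|\tsuperkingdom\t|",
    "200\t|\t100\t|\tgenus\t|",
    "300\t|\t200\t|\tspecies\t|",
]
for rank, taxid in lineage("300", parse_nodes(toy)):
    print(rank, taxid, sep="\t")
```

Mapping each taxid in the path to a scientific name would need a second table built from names.dmp, which is omitted here.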

Typical usage involves piping the output file generated by gid-taxid to taxonomy-reader, e.g.

cat test/data/test.taxid | $taxonomy-reader path/to/names.dmp path/to/nodes.dmp > test/data/test.taxonomy

Convert taxonomic rankings into a tree and a text summary

taxonomy2tree takes the output of taxonomy-reader and converts it to 2 outputs: a Newick tree file representing the hierarchical taxonomy and a summary file.

Try running $taxonomy2tree test/data/test.taxonomy 0 test/data/test.tree test/data/test_summary.txt 0

The output tree uses the standard Newick format, with "branch lengths" giving the number of samples assigned to the given taxonomic group.

((((((Nitrosopumilus maritimus:162)Nitrosopumilus:162)Nitrosopumilaceae:162)Nitrosopumilales:162,uncultured crenarchaeote 74A4:...
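As an illustration of the format, the following Python sketch accumulates counts along root-to-leaf name lists and writes them out in the same shape as the excerpt above. It is a rough sketch under stated assumptions, not the taxonomy2tree code: the function names are made up, and names are written unquoted to match the excerpt.

```python
def add_lineage(tree, names, count):
    """Add one root-to-leaf list of names, adding count to every node on it."""
    node = tree
    for name in names:
        child = node.setdefault(name, {"count": 0, "children": {}})
        child["count"] += count
        node = child["children"]

def to_newick(children):
    """Serialize nested nodes as (child,child)name:count, without the final ';'."""
    parts = []
    for name, child in sorted(children.items()):
        sub = to_newick(child["children"])
        parts.append(f"{sub}{name}:{child['count']}")
    return "(" + ",".join(parts) + ")" if parts else ""

tree = {}
add_lineage(tree, ["Nitrosopumilus", "Nitrosopumilus maritimus"], 162)
print(to_newick(tree) + ";")
```

Because every node on a path receives the leaf's count, an internal node's "branch length" ends up as the total over all leaves below it.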

The output summary file is simply a tab-separated count:

root	root	10401
superkingdom	Archaea	295
superkingdom	Bacteria	9469
superkingdom	Eukaryota	553
superkingdom	Viruses	16
kingdom	Fungi	100
kingdom	Metazoa	231
kingdom	Viridiplantae	110
subkingdom	Dikarya	97
...
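Since the summary is plain tab-separated text, it is easy to post-process. The Python sketch below reads rank, name, count rows and reports each superkingdom as a share of the root count; read_summary is a hypothetical helper, and the sample rows are copied from the listing above.

```python
import csv
import io

def read_summary(handle):
    """Return [(rank, name, count), ...] from a tab-separated summary."""
    rows = []
    for fields in csv.reader(handle, delimiter="\t"):
        if len(fields) != 3:
            continue  # skip blank or truncated lines
        rank, name, count = fields
        rows.append((rank, name, int(count)))
    return rows

sample = io.StringIO(
    "root\troot\t10401\n"
    "superkingdom\tArchaea\t295\n"
    "superkingdom\tBacteria\t9469\n"
)
rows = read_summary(sample)
# The root row carries the total number of counted samples.
total = next(count for rank, _, count in rows if rank == "root")
for rank, name, count in rows:
    if rank == "superkingdom":
        print(name, f"{count / total:.1%}", sep="\t")
```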

Convert the taxonomic tree into a PostScript image

tree2ps takes the Newick tree output of taxonomy2tree and converts it to a PostScript rendering subject to a variety of conditions.

The program arguments are as follows

  1. Newick tree file
  2. The file to write PostScript to
  3. Maximum taxonomic depth -- only show leaves this many or fewer steps away from the root. Use 0 or a negative number to show all levels.
  4. Font size (in points)
  5. Maximum number of leaves -- display the tree up to the depth level (see 3) which has this many or fewer leaves.
  6. Count duplicate tax ids -- this is used for coloring the tree; if set to 0, only count the number of leaves below each node, ignoring the counts associated with the leaves themselves.
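Because all six arguments are positional, it is easy to mix them up when calling tree2ps from a script. A small Python helper can name them and emit them in the order listed above. This is a sketch only: the helper name and its default values are assumptions made for the example, not defaults of tree2ps, which may require all six arguments.

```python
def tree2ps_args(tree_file, ps_file, max_depth=0, font_size=8,
                 max_leaves=0, count_duplicates=True):
    """Build the tree2ps argument list in the documented positional order."""
    return [
        "tree2ps",
        tree_file,                   # 1. Newick tree file
        ps_file,                     # 2. PostScript output file
        str(max_depth),              # 3. maximum taxonomic depth (0 or negative: all levels)
        str(font_size),              # 4. font size in points
        str(max_leaves),             # 5. maximum number of leaves
        str(int(count_duplicates)),  # 6. count duplicate tax ids (1 or 0)
    ]

args = tree2ps_args("test/data/test.tree", "test/data/tree1.ps", max_depth=5)
print(" ".join(args))
```

The resulting list can be passed to subprocess.run(args, check=True) once tree2ps is installed and on the PATH.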

Try

$tree2ps test/data/test.tree test/data/tree1.ps 5 8 0 1

$tree2ps test/data/test.tree test/data/tree2.ps 0 8 256 1

$tree2ps test/data/test.tree test/data/tree3.ps 0 8 256 0