
brunoasm/TaxReformer


TaxReformer

This Python program uses Open Tree Taxonomy and Global Names Architecture to correct misspellings, find senior synonyms and retrieve higher taxonomy for a list of taxonomic names of any rank.

It tries to use Open Tree Taxonomy whenever possible, pulling information from other databases in case a name is not found there. The source of information retrieved is saved in the output, so records not on OTT can be easily filtered out.

We explain how we used TaxReformer to produce a dataset of insect eggs in this publication:

Church SH, Donoughe S, de Medeiros BAS & Extavour CG 2019. A dataset of egg size and shape from more than 6,700 insect species. Scientific Data 6:104. DOI: 10.1038/s41597-019-0049-y

Dependencies

This script runs on Python 3 and needs the following Python libraries:

pandas
fuzzywuzzy
requests

Additionally, you need to download GNparser. TaxReformer is compatible with GNparser v.1.6.7.

In case GNparser is not installed in a folder in your $PATH, you need to provide its location (see usage below).
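To decide whether the --gnparser option is needed, you can check $PATH from Python before running. The sketch below is not part of TaxReformer: the function name `gnparser_args` and the fallback to a copy in the current folder are assumptions made for illustration.

```python
import shutil
from typing import List


def gnparser_args(executable: str = "gnparser") -> List[str]:
    """Return the extra command-line arguments needed to locate GNparser.

    If `executable` is found on $PATH, no extra argument is needed.
    Otherwise fall back to a copy in the current folder (an assumption
    of this sketch) and pass it through the --gnparser option.
    """
    if shutil.which(executable) is not None:
        return []
    return ["--gnparser", "./" + executable]
```

An empty list means TaxReformer should find GNparser by itself; otherwise the returned items can be appended to the command line.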

There is a Docker image available for this program, which includes all dependencies. If you have Docker installed, use the following to run the latest version:

docker run -v $PWD:/input brunoasm/taxreformer input.csv 

(Assuming the input file is named input.csv - see information on input below)

Input

The default input format is a CSV table with a column named name containing the names to be searched. Other columns are ignored during the search and kept in the output. See the examples folder for a valid input file.
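As a minimal sketch of building such a table, the helper below writes a CSV with the required name column plus any extra columns, using only the standard library. The function name `write_input_csv` and the extra column `sample_id` in the usage note are hypothetical; only the name column comes from the description above.

```python
import csv
from typing import Dict, Iterable, List


def write_input_csv(path: str, rows: Iterable[Dict[str, str]]) -> List[str]:
    """Write rows to a CSV file and return the column names used.

    Every row must have a 'name' field, since that is the column
    TaxReformer searches. Column order follows the first row.
    """
    rows = list(rows)
    if not rows or any("name" not in row for row in rows):
        raise ValueError("every row needs a 'name' field")
    fieldnames = list(rows[0].keys())
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
    return fieldnames
```

For example, `write_input_csv("input.csv", [{"name": "Apis mellifera", "sample_id": "s1"}])` produces a one-row table that carries sample_id along with the name.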

Output

After a successful run, the program will write two output files named matched_names.csv and unmatched_names.csv, for names that could and could not be matched, respectively. These include all columns initially present in the input data table, as well as new columns with information retrieved by TaxReformer.
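Since the source of each record is saved in the output, records can be filtered by source after the run. The sketch below shows one way to do it with the standard library. This README does not list the output column names, so the column name and source label are passed in as parameters; `rows_from_source` and any values you pass to it are assumptions to check against your own matched_names.csv.

```python
import csv
from typing import Dict, List


def rows_from_source(path: str, source_column: str, wanted: str) -> List[Dict[str, str]]:
    """Return the rows of a CSV whose `source_column` equals `wanted`.

    `source_column` is whichever output column records the source of
    the match; inspect the header of your output file to find it.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        if source_column not in (reader.fieldnames or []):
            raise KeyError(f"column {source_column!r} not found in {path}")
        return [row for row in reader if row[source_column] == wanted]
```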

Options

-h or --help Shows help

-o or --output Prefix to add to output files. Default is output.

-p or --gnparser Path to GNparser executable. Not needed if it is on $PATH.

-c or --context Taxonomic context to use for Open Tree Taxonomy (see Open Tree of Life API for options). Defaults to "All life"

-f or --tax-filter Taxonomic contexts to use for other services. This is a comma-separated list of names of higher taxa. Results from services other than Open Tree Taxonomy are kept only if they fall within at least one of the listed taxa; all other results are excluded.
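The keep-if-any-match rule described for --tax-filter can be sketched as follows. This is an illustration of the rule, not TaxReformer's actual code: the function names, the case-insensitive comparison and the choice to keep everything when the filter is empty are all assumptions.

```python
from typing import Iterable, List


def parse_tax_filter(value: str) -> List[str]:
    """Split a comma-separated --tax-filter value into taxon names."""
    return [taxon.strip() for taxon in value.split(",") if taxon.strip()]


def passes_tax_filter(lineage: Iterable[str], tax_filter: Iterable[str]) -> bool:
    """Return True if any taxon in `lineage` is listed in `tax_filter`.

    `lineage` is the list of higher taxa reported for a result.
    An empty filter keeps every result (assumption of this sketch).
    """
    wanted = {taxon.lower() for taxon in tax_filter}
    if not wanted:
        return True
    return any(taxon.lower() in wanted for taxon in lineage)
```

With the termite filter `Isoptera,Blattodea,Termitoidea,Blattaria`, a result whose lineage includes Blattodea passes, while a result classified only under Aves does not.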

Examples

  1. To see available options, simply type: python TaxReformer.py -h

    With docker, use: docker run -v $PWD:/input brunoasm/taxreformer -h

  2. To find Arthropod names from a file named input.csv:

python TaxReformer.py --context Arthropods --tax-filter Arthropoda input.csv

With docker, use:

docker run -v $PWD:/input brunoasm/taxreformer --context Arthropods --tax-filter Arthropoda input.csv

  3. To find bird names from a file named input.csv:

python TaxReformer.py --context Birds --tax-filter Aves input.csv

With docker, use:

docker run -v $PWD:/input brunoasm/taxreformer --context Birds --tax-filter Aves input.csv

  4. Same as before, but giving the path to GNparser (in the same folder as the input):

python TaxReformer.py --gnparser ./gnparser --context Birds --tax-filter Aves input.csv

  5. To find names of termites, considering that they might be classified under roaches in some databases. Notice also that Open Tree of Life does not have a context for termites, so we will use insects instead:

python TaxReformer.py --context Insects --tax-filter Isoptera,Blattodea,Termitoidea,Blattaria input.csv

With docker, use:

docker run -v $PWD:/input brunoasm/taxreformer --context Insects --tax-filter Isoptera,Blattodea,Termitoidea,Blattaria input.csv

The folder examples contains a test input file and the expected output when running:

python TaxReformer.py examples/input.csv
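If you prefer to launch these commands from a Python script, the sketch below assembles the argument list from the options documented above. The helper `build_command` is hypothetical, and it assumes TaxReformer.py is in the current folder.

```python
import sys
from typing import List, Optional, Sequence


def build_command(
    input_csv: str,
    context: Optional[str] = None,
    tax_filter: Sequence[str] = (),
    gnparser: Optional[str] = None,
    output: Optional[str] = None,
    script: str = "TaxReformer.py",
) -> List[str]:
    """Build the argument list for one TaxReformer run.

    Only options given a value are added, so TaxReformer's own
    defaults apply to everything else.
    """
    command = [sys.executable, script]
    if output:
        command += ["--output", output]
    if gnparser:
        command += ["--gnparser", gnparser]
    if context:
        command += ["--context", context]
    if tax_filter:
        # The filter is passed as a single comma-separated value.
        command += ["--tax-filter", ",".join(tax_filter)]
    command.append(input_csv)
    return command
```

The result can be passed to `subprocess.run(command, check=True)`. Passing a list rather than a single string avoids shell quoting problems with contexts that contain spaces, such as "All life".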

Warnings

This program was developed for a specific application and I am slowly working to make it more generally useful. If you want to use it and run into trouble, don't hesitate to open an issue: https://github.com/brunoasm/TaxReformer/issues

The program tries its best to find your names in some database, but different databases have different taxon coverage, and their APIs require different inputs. For that reason, bulk searches are hard to implement: each name is searched individually, possibly several times if a match is not easily found. Open Tree of Life and Global Names Server might get mad at you if you make thousands or millions of requests to their servers, so you should only use this tool for a somewhat small number of names per run. In our case, we searched a little less than 10,000 records, which took about one day.
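One way to keep each run small is to split a large table into several input files and process them one at a time. The sketch below does this with the standard library. The function name `split_input_csv`, the file naming scheme and the chunk size are assumptions; TaxReformer itself does not provide this feature as far as this README describes. As a rough guide, 10,000 records in one day works out to about 8.6 seconds per record.

```python
import csv
from typing import List


def split_input_csv(path: str, rows_per_chunk: int, prefix: str = "chunk") -> List[str]:
    """Split a CSV into files of at most `rows_per_chunk` data rows.

    Each chunk repeats the header, so it is a valid input file on its
    own. Returns the paths written, named <prefix>_001.csv, and so on.
    """
    if rows_per_chunk < 1:
        raise ValueError("rows_per_chunk must be at least 1")
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        fieldnames = reader.fieldnames
        rows = list(reader)
    written = []
    for start in range(0, len(rows), rows_per_chunk):
        chunk_path = f"{prefix}_{start // rows_per_chunk + 1:03d}.csv"
        with open(chunk_path, "w", newline="", encoding="utf-8") as out:
            writer = csv.DictWriter(out, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows[start:start + rows_per_chunk])
        written.append(chunk_path)
    return written
```

Running each chunk with a different --output prefix keeps the results of separate runs from overwriting each other.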

Since each database uses a different higher taxonomy, it is hard to delimit contexts. For example, Open Tree Taxonomy uses Birds to constrain a search to birds, but to constrain the same search on other databases we need to filter out taxa not contained in Aves. To delimit the search to your taxa of interest, you will have to play with both --context and --tax-filter (see examples above).

Author

This program was written by Bruno de Medeiros. If you use it in published research, please cite the following publication:

Church SH, Donoughe S, de Medeiros BAS & Extavour CG 2019. A dataset of egg size and shape from more than 6,700 insect species. Scientific Data 6:104. DOI: 10.1038/s41597-019-0049-y

About

A python program to check species names on Open Tree Taxonomy and Global Names Server
