related-sciences/nxontology-data

NXOntology data: making ontologies accessible as simple JSON files

This repository imports public ontologies/taxonomies into Python NXOntology objects and writes the ontologies in the JSON-based node-link data format. The goal is to standardize and simplify data access to ontologies.
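For orientation, the node-link layout can be sketched with only the standard library. The snippet below assumes the key names that networkx writes by default for node-link data (`nodes` entries with an `id`, `links` entries with `source` and `target`). The node identifiers, the `name` attribute, and the helper `targets_by_source` are made up for illustration and do not come from any exported ontology.

```python
import json

# Hypothetical three-node ontology in node-link form.
# Key names follow networkx's default node-link layout (an assumption here).
node_link = {
    "directed": True,
    "multigraph": False,
    "graph": {"name": "toy_ontology"},
    "nodes": [
        {"id": "root", "name": "Root term"},
        {"id": "child_a", "name": "Child A"},
        {"id": "child_b", "name": "Child B"},
    ],
    "links": [
        {"source": "root", "target": "child_a"},
        {"source": "root", "target": "child_b"},
    ],
}


def targets_by_source(data):
    """Map each source node id to the sorted list of its target node ids."""
    targets = {}
    for link in data["links"]:
        targets.setdefault(link["source"], []).append(link["target"])
    return {source: sorted(ids) for source, ids in targets.items()}


# Round-trip through JSON text to show the structure is plain JSON.
restored = json.loads(json.dumps(node_link))
print(targets_by_source(restored))
```

Because the file is ordinary JSON, any language with a JSON parser can read it without graph-specific tooling.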

For ontologies that have been imported into NXOntology and exported to JSON, see the output/* branches on GitHub, for example output/pubchem.

Once you find the ontology you'd like to use, you can read it in Python (after installing any dependencies, for example with pip install nxontology):

# URL to the exported dataset.
# Here we read the ChEMBL protein/target classification hierarchy.
url = "https://github.com/related-sciences/nxontology-data/raw/output/pubchem/087_chembl_target_tree.json"
# Pinning the URL to a commit hash is a good idea, since the branch structure where data is stored might change.
url = "https://github.com/related-sciences/nxontology-data/raw/71cf538dc5c258ada880d58663b0205b7b7f8561/087_chembl_target_tree.json"

# Read as an NXOntology object, which encapsulates the networkx graph.
# This also works for the gzip-compressed files.
from nxontology import NXOntology
nxo = NXOntology.read_node_link_json(url)

# To read as a networkx.DiGraph
import requests
from networkx.readwrite.json_graph import node_link_graph
digraph = node_link_graph(requests.get(url).json())

or in R:

url <- "https://github.com/related-sciences/nxontology-data/raw/71cf538dc5c258ada880d58663b0205b7b7f8561/087_chembl_target_tree.json"
json_ont <- jsonlite::read_json(path = url)
digraph <- tidygraph::tbl_graph(
  nodes = dplyr::bind_rows(json_ont$nodes),
  edges = dplyr::bind_rows(json_ont$links),
)
digraph
#> # A tbl_graph: 904 nodes and 889 edges

Note: There's currently an open issue on reading in json.gz files with the R package jsonlite.
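If a gzip-compressed export has already been downloaded to disk, Python's standard library is enough to decompress and parse it. The following is a minimal sketch, not part of this project's API: `read_json_gz` and the file name `demo.json.gz` are hypothetical, and the demo writes a tiny throwaway file rather than fetching a real export.

```python
import gzip
import json
import tempfile
from pathlib import Path


def read_json_gz(path):
    """Decompress a gzip file and parse its contents as JSON."""
    with gzip.open(path, mode="rt", encoding="utf-8") as handle:
        return json.load(handle)


# Demonstrate on a small temporary file instead of a real export.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = Path(tmp) / "demo.json.gz"  # hypothetical file name
    with gzip.open(demo_path, mode="wt", encoding="utf-8") as handle:
        json.dump({"nodes": [{"id": "a"}], "links": []}, handle)
    demo = read_json_gz(demo_path)

print(len(demo["nodes"]))
```

The returned dict has the same shape as the parsed JSON in the Python example above, so it can be passed to node_link_graph in the same way.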

Sources

The data sources that are currently imported are listed below. Please open an issue if you are interested in contributing support for additional sources.

EFO

This project converts all three variants of the Experimental Factor Ontology (EFO, EFO OTAR Profile, and EFO OTAR Slim) into NXOntology objects. See nxontology_data/efo for a detailed README.

HGNC Gene Groups

HGNC (HUGO Gene Nomenclature Committee) maintains a directed acyclic graph of gene groups/families. See nxontology_data/hgnc for a detailed README. Output data is on the output/hgnc branch.

MeSH

MeSH (Medical Subject Headings) is created by the National Library of Medicine and integrated into many projects including PubMed. See nxontology_data/mesh for a detailed README. Output data is on the output/mesh branch.

PubChem

We import ontologies from the PubChem Classifications service (see browser & docs). Most ontologies indexed by the service do not originate with PubChem, but PubChem provides convenient and standardized bulk access. Output data is on the output/pubchem branch.

Development

# Install the environment
poetry install --no-root

# Update the lock file
poetry update

# Run tests
pytest

# Set up the git pre-commit hooks.
# `git commit` will now trigger automatic checks including linting.
pre-commit install

# Run all pre-commit checks (CI will also run this).
pre-commit run --all-files

License

The source code in this repository is released under the Apache License 2.0 (see LICENSE.md). Source code refers to the contents of the main branch and any other development branches containing code and documentation.

The output branches contain data from external ontologies. Please refer to each respective ontology for its data license. Where available, we include license information in the graph metadata for each ontology, but the ontology data we ingest often does not supply it. Please attribute the source ontology when reusing data obtained from this project and, as a best practice, mention that the data was obtained via NXOntology data.
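To check what metadata a given export carries, one option is to inspect the top-level `graph` object of the node-link JSON, which is where networkx stores graph-level attributes. The sketch below assumes that layout. The helper `graph_metadata` and the metadata keys shown (including `license`) are invented for illustration, and the actual keys vary by ontology.

```python
def graph_metadata(node_link):
    """Return graph-level attributes from a node-link dict, or an empty dict."""
    return dict(node_link.get("graph", {}))


# Hypothetical export: the "license" key name is an assumption,
# not a documented field of this project's output.
example = {
    "graph": {"name": "toy_ontology", "license": "CC0-1.0"},
    "nodes": [],
    "links": [],
}

# List whatever metadata is present without assuming specific keys.
for key, value in sorted(graph_metadata(example).items()):
    print(f"{key}: {value}")
```

If no license entry appears, consult the upstream ontology directly.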

Any original data produced by this repository is released under a CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. As noted above, the underlying ontology data is not original to this repository and upstream licenses should be consulted.