reptar

Compute, store, and analyze manuscript-scale data for computational chemistry and biology

Motivation

The computational chemistry and biology communities often fail to openly provide the raw and/or processed data used to draw their scientific conclusions.

For large projects, frameworks such as QCArchive, Materials Project, Pitt Quantum Repository, ioChem-BD, and many others provide great storage solutions. However, these frameworks are impractical for fluid data pipelines and small-scale projects such as a single manuscript.

Alternatively, you could use individual files in formats such as JSON, XML, YAML, npz, etc. These are great options for customizable data storage with their own advantages and disadvantages. However, you often must choose between (1) a standardized parser that might not support your workflow or (2) writing your own.

Reptar is designed for easy data storage and analysis for individual projects. Customizable parsers provide a simple way to extract new data without submitting issues and pull requests (although contributions are highly encouraged). While files are the heart of reptar, it strives to be file-type agnostic by providing the same interface for all supported file types. The result is a user-specified file streamlined for analysis in Python and archival on platforms such as GitHub and Zenodo.

Installation

You can install reptar from PyPI with pip install reptar. Alternatively, you can install the latest development version from TestPyPI or directly from the GitHub repository:

git clone https://github.com/oasci/reptar
cd reptar
pip install .

File types

Reptar supports four file types with a single interface: exdir, zarr, JSON, and npz. JSON is a text format suited to key-value pairs of low-dimensional data (i.e., no large arrays). NumPy's npz format is useful for arrays; however, it does not support nesting, and loaded data often requires postprocessing because scalar values come back as 0D arrays (e.g., np.array('data')).
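The 0D-array postprocessing mentioned above can be seen with NumPy alone. The following is a minimal sketch that does not use reptar; the names label and energies and the numeric values are made up for illustration.

```python
import io

import numpy as np

# Write a string and a small array into an in-memory npz archive.
# The key names and values are illustrative only.
buffer = io.BytesIO()
np.savez(buffer, label=np.array("data"), energies=np.array([-76.4, -76.3]))
buffer.seek(0)

with np.load(buffer) as archive:
    raw_label = archive["label"]
    energies = archive["energies"]

# The string comes back as a 0D array, not a plain str,
# so it has to be unwrapped before use.
label = raw_label.item()
```

Here raw_label has an empty shape, and calling .item() on it recovers the Python string. Note also that the archive is flat: there is no way to store energies under a nested group.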

Exdir is a simple yet powerful open file format that mimics the HDF5 format, with metadata and data stored in directories of YAML and npy files instead of a single binary file. For more detailed information, please read this Front. Neuroinform. article about exdir. Zarr is a similar hierarchical data format for chunked and compressed NumPy-like arrays and JSON attributes. Both of these file types provide several advantages, such as mixing human-readable and binary files, being easier to track with version control, and loading only the requested portions of arrays into memory.

Key-value pairs

All data is stored as key-value pairs within the reptar framework. The key tells reptar where the data is stored and is conceptually related to standard file paths (without file extensions). Nested data is specified by separating the nested keys with a /. For example, energy_pot, md_run/geometry, and entity_ids are all valid keys. Note that gradients and /gradients would translate to the same value (/ specifies the "root" of the file).
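To make the key semantics concrete, here is a minimal sketch that resolves slash-separated keys against a plain nested dictionary. It is not reptar's implementation; split_key, lookup, and the sample values are hypothetical and only illustrate why gradients and /gradients point to the same data.

```python
def split_key(key):
    """Split a slash-separated key into its nested parts.

    Empty parts are dropped, so a leading "/" (the root) has no effect.
    """
    return [part for part in key.split("/") if part]


def lookup(tree, key):
    """Walk a nested dictionary following the parts of the key."""
    node = tree
    for part in split_key(key):
        node = node[part]
    return node


# Illustrative data only; the values are made up.
data = {
    "energy_pot": -76.4,
    "md_run": {"geometry": [[0.0, 0.0, 0.0]]},
    "gradients": [0.1, -0.1],
}

same_parts = split_key("gradients") == split_key("/gradients")
geometry = lookup(data, "md_run/geometry")
```

Treating the key like a file path keeps nested data addressable with a single string, which is what lets one interface cover both flat and hierarchical file types.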

Workflow

Storing data

We refer to a "reptar file" as any file that can be used with the reptar.File class. Creating a reptar file starts with a set of data files generated from some calculation. Paths to these data files are passed into reptar.Creator.from_calc, which extracts information using a reptar.parser class. The information parsed from these files, parsed_info, is then used to populate a reptar.File object.

Data can also be manually added by using File.put(key, data) where key is a string specifying where to store the data.

Accessing data

Data can be added or retrieved using the same interface regardless of the underlying file format (e.g., exdir, JSON, and npz). The only thing required is the respective key specifying where it is stored. Then, File.get(key) can retrieve the data.

When working with JSON and npz files, File.save() must be explicitly called after any modification.
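The put, get, and save pattern can be illustrated with a toy class backed by a JSON file. This is a sketch under stated assumptions: ToyJSONStore is a hypothetical stand-in written for this example, not reptar.File, and its constructor and internals do not reflect reptar's actual code. It only shows why in-memory changes to a JSON-backed store are not on disk until an explicit save.

```python
import json
import os
import tempfile


class ToyJSONStore:
    """Hypothetical stand-in for illustration; this is NOT reptar.File."""

    def __init__(self, path):
        self.path = path
        self._data = {}
        # Load existing contents if the file is already on disk.
        if os.path.exists(path):
            with open(path) as f:
                self._data = json.load(f)

    def put(self, key, value):
        """Store value under a slash-separated key, creating parents."""
        parts = [p for p in key.split("/") if p]
        node = self._data
        for part in parts[:-1]:
            node = node.setdefault(part, {})
        node[parts[-1]] = value

    def get(self, key):
        """Retrieve the value stored under a slash-separated key."""
        node = self._data
        for part in key.split("/"):
            if part:
                node = node[part]
        return node

    def save(self):
        """Write the in-memory data to disk."""
        with open(self.path, "w") as f:
            json.dump(self._data, f)


path = os.path.join(tempfile.mkdtemp(), "toy.json")
store = ToyJSONStore(path)
store.put("md_run/energy_pot", -76.4)  # made-up value

# Nothing is on disk yet: put() only changed the in-memory data.
on_disk_before_save = os.path.exists(path)

store.save()

# A fresh object reading the same file now sees the stored value.
reloaded = ToyJSONStore(path)
value = reloaded.get("/md_run/energy_pot")
```

Single-file formats such as JSON and npz are rewritten as a whole, which is a plausible reason for the explicit save step; directory-based formats can write individual entries as they change.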

Writing to other formats

Other packages often require data to be formatted in their own specific way. To support this, you can retrieve data from a reptar file with File.get(key) and pass it to the desired reptar.writer function. Reptar currently automates the creation of:

License

Distributed under the MIT License. See LICENSE for more information.