nomad-examples

Examples of using the Novel Materials Discovery (NOMAD) database, especially downloading all chemical formulas.

Installation

Clone or download the repository. To clone:

git clone https://github.com/sparks-baird/nomad-examples.git
cd nomad-examples

Install the dependencies, e.g. via:

pip install -r requirements.txt

Reproducer

Use all_formula_basic_metadata.py to download the data from NOMAD and do some basic processing. This may take about an hour.

python all_formula_basic_metadata.py

Use remove_duplicate_compositions.py to process the chemical formulas down to a list of unique chemical compositions (represented as reduced formulas). This may also take about an hour.

python remove_duplicate_compositions.py

Data Descriptions

The data is available on figshare (DOI: 10.6084/m9.figshare.19319783.v3) and was downloaded from NOMAD on 2022-03-07. There are four files: all-formula.csv, unique-formula.csv, unique-reduced-formula.csv, and bad-formula.csv. They contain 11680557, 764431, 695612, and 15 rows, respectively. Descriptions are given below.
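As a quick sanity check after downloading, the row counts quoted above can be compared against the files. The following is only a minimal sketch: it assumes each CSV has a single header line and that the quoted counts exclude that header, neither of which is stated here. The helper name `count_data_rows` is hypothetical.

```python
import csv
import io

# Row counts quoted in this README (assumed to exclude the header line).
EXPECTED_ROWS = {
    "all-formula.csv": 11680557,
    "unique-formula.csv": 764431,
    "unique-reduced-formula.csv": 695612,
    "bad-formula.csv": 15,
}


def count_data_rows(handle):
    """Count CSV records in an open text handle, skipping one header line."""
    reader = csv.reader(handle)
    next(reader, None)  # assumption: the first line is a header
    return sum(1 for _ in reader)


# Tiny in-memory stand-in for a real file; the column names are illustrative.
sample = io.StringIO("calc_id,formula\nabc,Si2O4\ndef,NaCl\n")
print(count_data_rows(sample))  # → 2
```

For the real files, open each one with `open(name, newline="")` and compare the result to `EXPECTED_ROWS[name]`.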

all-formula.csv

all-formula.csv contains two columns: calc_id (Calculation ID) and formula (Chemical Formula). Entries are restricted to VASP DFT calculations and exclude noble gases and radioactive elements. Some calculation IDs have missing chemical formulas.

unique-formula.csv

unique-formula.csv contains the list filtered down to unique (non-reduced) chemical formulas, along with the calc_id for each unique formula. No structural information is included directly in this data.
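The filtering step described above can be sketched roughly as follows. This is not the repository's script. It assumes that the calc_id kept for each formula is the first one encountered and that rows with a missing formula are dropped, and neither is confirmed here. The function name `unique_formulas` is hypothetical.

```python
def unique_formulas(rows):
    """Keep one (calc_id, formula) pair per distinct formula.

    `rows` is an iterable of (calc_id, formula) pairs. The first calc_id
    seen for a formula is kept (an assumption), and rows whose formula
    is empty or None are skipped.
    """
    first_seen = {}
    for calc_id, formula in rows:
        if not formula:
            continue  # some calculation IDs have no formula
        first_seen.setdefault(formula, calc_id)
    # dicts preserve insertion order, so output follows first appearance
    return [(calc_id, formula) for formula, calc_id in first_seen.items()]


rows = [("a", "Si2O4"), ("b", "SiO2"), ("c", "Si2O4"), ("d", "")]
print(unique_formulas(rows))  # → [('a', 'Si2O4'), ('b', 'SiO2')]
```

Note that Si2O4 and SiO2 remain distinct at this stage, since the formulas are not yet reduced.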

unique-reduced-formula.csv

The file you are probably most interested in is unique-reduced-formula.csv, because it is the most curated and is directly usable with e.g. pymatgen. It contains three columns: calc_id, reduced_formula, and factor, which correspond to the Calculation ID, the reduced formula (e.g. Si2O4 --> SiO2), and the factor by which the formula was reduced (e.g. for Si2O4 --> SiO2 the factor is 2). The formulas were first parsed via the pymatgen.core.Composition class.
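To make the relationship between a formula, its reduced formula, and the factor concrete, here is a simplified standard-library sketch. It is not how pymatgen does it: it handles only integer subscripts without parentheses, silently ignores characters it does not recognize, and keeps elements in input order, whereas pymatgen applies its own ordering and special cases. The name `reduce_formula` is hypothetical. To the best of my knowledge, pymatgen exposes the real thing through `Composition.get_reduced_formula_and_factor()`.

```python
import re
from functools import reduce
from math import gcd

# One element symbol followed by an optional integer amount, e.g. "Si2" or "O".
TOKEN = re.compile(r"([A-Z][a-z]?)(\d*)")


def reduce_formula(formula):
    """Return (reduced_formula, factor) for a simple formula such as 'Si2O4'."""
    counts = {}
    for element, amount in TOKEN.findall(formula):
        counts[element] = counts.get(element, 0) + int(amount or 1)
    if not counts:
        raise ValueError(f"no elements found in {formula!r}")
    # The factor is the greatest common divisor of all element amounts.
    factor = reduce(gcd, counts.values())
    reduced = "".join(
        element + (str(n // factor) if n // factor > 1 else "")
        for element, n in counts.items()
    )
    return reduced, factor


print(reduce_formula("Si2O4"))  # → ('SiO2', 2)
print(reduce_formula("Fe2O3"))  # → ('Fe2O3', 1)
```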

bad-formula.csv

Finally, bad-formula.csv contains the 15 formulas that were skipped during processing, i.e. those that could not be parsed by pymatgen.core.Composition for various reasons.
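The skip-and-record pattern described above might look something like the sketch below. The `parse` function is a hypothetical stand-in for pymatgen.core.Composition: it accepts only simple element-and-integer formulas and raises ValueError otherwise, so what it rejects will not match what pymatgen rejects. The name `split_bad` is also hypothetical.

```python
import re

# Stand-in validity check: one or more element symbols with optional integers.
SIMPLE_FORMULA = re.compile(r"(?:[A-Z][a-z]?\d*)+")


def parse(formula):
    """Hypothetical stand-in for a real parser; raises ValueError on failure."""
    if not SIMPLE_FORMULA.fullmatch(formula):
        raise ValueError(f"cannot parse {formula!r}")
    return formula


def split_bad(formulas):
    """Partition formulas into (parsed_ok, skipped) lists."""
    good, bad = [], []
    for formula in formulas:
        try:
            parse(formula)
        except ValueError:
            bad.append(formula)  # would be written to bad-formula.csv
        else:
            good.append(formula)
    return good, bad


print(split_bad(["SiO2", "???", "NaCl"]))  # → (['SiO2', 'NaCl'], ['???'])
```

Recording the failures rather than discarding them keeps the skipped entries available for later inspection.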

Future Work

Download all of the crystal structures and reduce them to a list of unique phases, each with a CIF file.

Issues

See something missing? Please don't hesitate to drop me a note in issues.