BERTOS: transformer language model for oxidation state prediction
Citation: Fu, Nihang, Jeffrey Hu, Ying Feng, Gregory Morrison, Hans‐Conrad zur Loye, and Jianjun Hu. "Composition Based Oxidation State Prediction of Materials Using Deep Learning Language Models." Advanced Science (2023): 2301011.
Nihang Fu, Jeffrey Hu, Ying Feng, Jianjun Hu*
Machine Learning and Evolution Laboratory
Department of Computer Science and Engineering
University of South Carolina
- Set up virtual environment
conda create -n bertos
conda activate bertos
- Install PyTorch and transformers (for computers with an NVIDIA GPU)
conda install pytorch==1.8.0 torchvision==0.9.0 torchaudio==0.8.0 cudatoolkit=11.1 -c pytorch -c conda-forge
conda install -c conda-forge transformers
If you only have a CPU on your computer, try this:
pip install transformers[torch]
If you are using a Mac with an M1 chip, follow this tutorial or this one to install PyTorch and transformers.
- Other packages
pip install -r requirements.txt
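After installation, a quick check can confirm that the packages are importable and whether a CUDA device is visible. The following is a minimal sketch and not part of the BERTOS repository; the helper name `check_environment` is hypothetical.

```python
import importlib.util


def check_environment():
    """Report whether torch and transformers are importable and whether CUDA is usable."""
    status = {
        "torch": importlib.util.find_spec("torch") is not None,
        "transformers": importlib.util.find_spec("transformers") is not None,
        "cuda": False,
    }
    # Only query CUDA when PyTorch is actually installed.
    if status["torch"]:
        import torch

        status["cuda"] = bool(torch.cuda.is_available())
    return status


if __name__ == "__main__":
    for name, ok in check_environment().items():
        print(f"{name}: {'available' if ok else 'not available'}")
```

On a CPU-only machine, `cuda` is expected to be reported as not available while the other two entries should still be available.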
Our training process is carried out on our BERTOS datasets. After extracting the data under the datasets folder, you will get the following four folders: ICSD, ICSD_CN, ICSD_CN_oxide, and ICSD_oxide.
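To verify that the data was extracted correctly before training, the folder layout can be checked with a few lines of Python. This is a minimal sketch, assuming the four folders sit directly under `datasets`; the helper name `missing_dataset_folders` is hypothetical and not part of the repository.

```python
from pathlib import Path

# Folder names as listed in this README.
EXPECTED_FOLDERS = ("ICSD", "ICSD_CN", "ICSD_CN_oxide", "ICSD_oxide")


def missing_dataset_folders(root="datasets"):
    """Return the expected dataset folders that are not present under `root`."""
    root_path = Path(root)
    return [name for name in EXPECTED_FOLDERS if not (root_path / name).is_dir()]


if __name__ == "__main__":
    missing = missing_dataset_folders()
    if missing:
        print("Missing dataset folders:", ", ".join(missing))
    else:
        print("All four dataset folders were found.")
```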
To quickly train a BERTOS model on the OS-ICSD-CN training set and save it into the ./model_icsdcn folder, run the script:
bash train_BERTOS.sh
The general command to train a BERTOS model is:
python train_BERTOS.py --config_name $CONFIG_NAME$ --dataset_name $DATASET_LOADER$ --max_length $MAX_LENGTH$ --per_device_train_batch_size $BATCH_SIZE$ --learning_rate $LEARNING_RATE$ --num_train_epochs $EPOCHS$ --output_dir $MODEL_OUTPUT_DIRECTORY$
We use the ICSD_CN dataset as an example:
python train_BERTOS.py --config_name ./random_config --dataset_name materials_icsd_cn.py --max_length 100 --per_device_train_batch_size 256 --learning_rate 1e-3 --num_train_epochs 500 --output_dir ./model_icsdcn
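When launching several training runs (for example, one per dataset loader), it can help to assemble the command programmatically. The sketch below only builds the argument list from the flags shown above and prints it; the function name `build_train_command` and its defaults are illustrative assumptions, not part of the repository.

```python
import shlex
import sys


def build_train_command(config_name, dataset_name, output_dir,
                        max_length=100, batch_size=256,
                        learning_rate=1e-3, epochs=500):
    """Assemble the train_BERTOS.py command line as a list of arguments."""
    return [
        sys.executable, "train_BERTOS.py",
        "--config_name", config_name,
        "--dataset_name", dataset_name,
        "--max_length", str(max_length),
        "--per_device_train_batch_size", str(batch_size),
        "--learning_rate", str(learning_rate),
        "--num_train_epochs", str(epochs),
        "--output_dir", output_dir,
    ]


if __name__ == "__main__":
    cmd = build_train_command("./random_config", "materials_icsd_cn.py", "./model_icsdcn")
    # Print the command for inspection; it could be passed to
    # subprocess.run(cmd, check=True) to start training.
    print(shlex.join(cmd))
```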
If you want to change the dataset, you can replace $DATASET_LOADER$ with a different dataset loader file, such as materials_icsd.py, materials_icsdcn.py, materials_icsdcno.py, or materials_icsdo.py. You can also follow the Hugging Face instructions to build your own custom dataset.
Run the getOS.py file to get predicted oxidation states for a single input formula, or for an input formulas.csv file containing multiple formulas.
Using your model:
python getOS.py --i SrTiO3 --model_name_or_path ./model_icsdcn
python getOS.py --f formulas.csv --model_name_or_path ./model_icsdcn
Using a pretrained model:
python getOS.py --i SrTiO3 --model_name_or_path ./trained_models/ICSD_CN
python getOS.py --f formulas.csv --model_name_or_path ./trained_models/ICSD_CN
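A predicted set of oxidation states can be sanity-checked for charge neutrality. For SrTiO3, the common assignment Sr +2, Ti +4, O -2 gives 2 + 4 + 3 × (-2) = 0. The sketch below is independent of getOS.py, whose output format is not described here; the function names are hypothetical, and it assumes a simple formula without parentheses and a single oxidation state per element (so it does not cover mixed-valence compounds).

```python
import re

# An element symbol followed by an optional (possibly fractional) count.
_TOKEN = re.compile(r"([A-Z][a-z]?)(\d+(?:\.\d+)?)?")


def parse_formula(formula):
    """Return {element: count} for a simple formula such as 'SrTiO3'."""
    counts = {}
    for element, amount in _TOKEN.findall(formula):
        counts[element] = counts.get(element, 0.0) + (float(amount) if amount else 1.0)
    return counts


def net_charge(formula, oxidation_states):
    """Sum count * oxidation state over all elements in the formula."""
    return sum(count * oxidation_states[element]
               for element, count in parse_formula(formula).items())


def is_charge_neutral(formula, oxidation_states, tol=1e-8):
    """True when the assigned oxidation states sum to (approximately) zero."""
    return abs(net_charge(formula, oxidation_states)) < tol


if __name__ == "__main__":
    states = {"Sr": 2, "Ti": 4, "O": -2}
    print("SrTiO3 net charge:", net_charge("SrTiO3", states))
```

A nonzero net charge suggests either an incorrect assignment or a compound in which one element takes more than one oxidation state.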
Our trained models can be downloaded from figshare (BERTOS models), and you can use them for testing or prediction.
With the OS prefix removed, the dataset names in the figure correspond to the folders under the datasets folder (for example, OS-ICSD-CN corresponds to ICSD_CN).
We use the transformer model as implemented in Hugging Face Transformers.
@article{wolf2019huggingface,
title={Huggingface's transformers: State-of-the-art natural language processing},
author={Wolf, Thomas and Debut, Lysandre and Sanh, Victor and Chaumond, Julien and Delangue, Clement and Moi, Anthony and Cistac, Pierric and Rault, Tim and Louf, R{\'e}mi and Funtowicz, Morgan and others},
journal={arXiv preprint arXiv:1910.03771},
year={2019}
}
Fu, Nihang, Jeffrey Hu, Ying Feng, Gregory Morrison, Hans‐Conrad zur Loye, and Jianjun Hu. "Composition Based Oxidation State Prediction of Materials Using Deep Learning Language Models." Advanced Science (2023): 2301011. [PDF](https://arxiv.org/pdf/2211.15895)
If you have any problems using BERTOS, feel free to contact us at funihang@gmail.com.