LLM-Prop: Predicting Physical And Electronic Properties Of Crystalline Solids From Their Text Descriptions

This repository contains the implementation of the LLM-Prop model. LLM-Prop is a large language model (the T5 encoder) efficiently finetuned on text descriptions of crystals to predict their properties. Given a text sequence that describes a crystal structure, LLM-Prop encodes the underlying crystal representation from that description and outputs properties such as band gap and volume.

Figure: LLM-Prop architecture

Installation

You can install LLM-Prop by following these steps:

git clone https://github.com/vertaix/LLM-Prop.git
cd LLM-Prop
conda create -n <environment_name> --file requirements.txt
conda activate <environment_name>

Usage

Training LLM-Prop from scratch

Add the following script to scripts/llmprop_train.sh:

#!/usr/bin/env bash

TRAIN_PATH="data/samples/textedge_prop_mp22_train.csv"
VALID_PATH="data/samples/textedge_prop_mp22_valid.csv"
TEST_PATH="data/samples/textedge_prop_mp22_test.csv"
EPOCHS=5 # use the default of 200 epochs to get the best performance
TASK_NAME="regression" # can also be set to "classification"
PROPERTY="band_gap" # can also be "volume" or "is_gap_direct"; with TASK_NAME="classification" only "is_gap_direct" is allowed, and with TASK_NAME="regression" only "band_gap" or "volume" is allowed

python llmprop_train.py \
--train_data_path "$TRAIN_PATH" \
--valid_data_path "$VALID_PATH" \
--test_data_path "$TEST_PATH" \
--epochs "$EPOCHS" \
--task_name "$TASK_NAME" \
--property "$PROPERTY"

Then run bash scripts/llmprop_train.sh
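
The comments on TASK_NAME and PROPERTY in the script above restrict which property goes with which task. A minimal Python sketch of that rule is shown below. The names ALLOWED_PROPERTIES and validate_task_and_property are hypothetical and are not taken from this repository's code, so read it as an illustration of the constraint, not as the check that llmprop_train.py performs.

```python
# Hypothetical helper: not part of the LLM-Prop repository.
# Encodes the task/property pairing stated in the training script comments.
ALLOWED_PROPERTIES = {
    "regression": ("band_gap", "volume"),
    "classification": ("is_gap_direct",),
}


def validate_task_and_property(task_name: str, property_name: str) -> None:
    """Raise ValueError if the property is not allowed for the given task."""
    if task_name not in ALLOWED_PROPERTIES:
        raise ValueError(
            f"Unknown task {task_name!r}; expected one of {sorted(ALLOWED_PROPERTIES)}"
        )
    allowed = ALLOWED_PROPERTIES[task_name]
    if property_name not in allowed:
        raise ValueError(
            f"Property {property_name!r} is not valid for task {task_name!r}; "
            f"expected one of {list(allowed)}"
        )
```

Calling validate_task_and_property("regression", "band_gap") returns without error, while validate_task_and_property("classification", "volume") raises ValueError. Running such a check before launching a long training job would surface a mismatched configuration early.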

Evaluating the pretrained LLM-Prop

Add the following script to scripts/llmprop_evaluate.sh:

#!/usr/bin/env bash

TRAIN_PATH="data/samples/textedge_prop_mp22_train.csv"
TEST_PATH="data/samples/textedge_prop_mp22_test.csv"
TASK_NAME="regression" # can also be set to "classification"
PROPERTY="band_gap" # can also be "volume" or "is_gap_direct"; with TASK_NAME="classification" only "is_gap_direct" is allowed, and with TASK_NAME="regression" only "band_gap" or "volume" is allowed
CKPT_PATH="checkpoints/samples/$TASK_NAME/best_checkpoint_for_$PROPERTY.tar.gz" # path to the best checkpoint for the property to be predicted

python llmprop_evaluate.py \
--train_data_path "$TRAIN_PATH" \
--test_data_path "$TEST_PATH" \
--task_name "$TASK_NAME" \
--property "$PROPERTY" \
--checkpoint "$CKPT_PATH"

Then run bash scripts/llmprop_evaluate.sh
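
The evaluation script derives CKPT_PATH from TASK_NAME and PROPERTY, so a typo in either one points at a checkpoint that does not exist. The sketch below rebuilds the same path pattern in Python and fails early when the file is missing. The helper names checkpoint_path and require_checkpoint are hypothetical and do not come from this repository; only the path layout is taken from the script above.

```python
from pathlib import Path


# Hypothetical helpers: not part of the LLM-Prop repository.
def checkpoint_path(task_name: str, property_name: str,
                    root: str = "checkpoints/samples") -> Path:
    """Build the checkpoint path using the CKPT_PATH pattern from the script."""
    return Path(root) / task_name / f"best_checkpoint_for_{property_name}.tar.gz"


def require_checkpoint(task_name: str, property_name: str,
                       root: str = "checkpoints/samples") -> Path:
    """Return the checkpoint path, or raise FileNotFoundError if it is absent."""
    path = checkpoint_path(task_name, property_name, root)
    if not path.is_file():
        raise FileNotFoundError(f"No checkpoint found at {path}")
    return path
```

For example, checkpoint_path("regression", "band_gap") yields checkpoints/samples/regression/best_checkpoint_for_band_gap.tar.gz, which matches the value CKPT_PATH takes in the script for the same settings.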

Data availability

This work is still under review and the data will be released after the review process.
