
Metal-Insulator Transition Classifiers

This repository contains the code and data used to construct the thermally-driven metal-insulator transition (MIT) classifiers, which are three binary classifiers: a Metal vs. non-Metal model, an Insulator vs. non-Insulator model, and an MIT vs. non-MIT model.

Check out our paper in Chemistry of Materials:

Georgescu, A. B.; Ren, P.; Toland, A. R.; Zhang, S.; Miller, K. D.; Apley, D. W.; Olivetti, E. A.; Wagner, N.; Rondinelli, J. M. Database, Features, and Machine Learning Model to Identify Thermally Driven Metal−Insulator Transition Compounds. Chem. Mater. 2021. DOI: 10.1021/acs.chemmater.1c00905.

Note: The results in the Chem. Mater. paper were produced with the code and data sets in release v1.2.2.

Table of Contents

Model Description

Research Question

The research question of this project is whether a machine learning classification model can predict temperature-driven metal-insulator transition behavior based on a series of compositional and structural descriptors/features of a given compound.

Training Algorithm

The model type chosen for this task is an XGBoost tree classifier implemented in the Python programming language. XGBoost models have helped win numerous Kaggle competitions and have been shown to perform well on classification tasks. If you are wondering why we chose XGBoost over other model types, and binary classification over multi-class classification, you can refer to this section. The takeaway is that XGBoost is consistently among the best-performing model types and is faster to train than other models with comparable performance. Across all model types, performance on the binary classification tasks is also better than on multi-class classification.

A Word of Caution

Since the vast majority of the training data comes from oxides, and there are not many well-documented oxides that exhibit MIT behavior, the training dataset is quite small by machine learning standards (343 observations / rows). The models, especially those trained on a high-dimensional feature set, can therefore easily overfit, and there is an ongoing effort to find new MIT materials and add them to the dataset. As the dataset expands, the models trained on it are also subject to change over time.

We strongly encourage people to contribute temperature-driven MIT materials that aren't already included in our dataset. Please include your name, institution, the CIF file and reference publications in your email and send them to Professor James M. Rondinelli.

You can also suggest new MIT material(s) by opening an issue with the New MIT material template.

General Workflow

1. Data Preparation

1.1 Getting CIF files

The CIF files are obtained from online databases such as the ICSD, Springer Materials, and the Materials Project, in addition to a few hand-generated ones. The vast majority are high-quality experimental structure files from the ICSD, with a few from the Springer Materials and Materials Project databases.

Note: Unfortunately, we cannot directly share the collected CIF files due to copyright concerns. However, you can find the material ID of the compounds included in our dataset here (look at the struct_file_path column to find the IDs). Should you have access, you can use those IDs to download CIF files from ICSD, Springer Materials and the Materials Project. You will find 4 suffixes in struct_file_path, which correspond to the 4 sources as follows.

Suffix         Source
CollCode       ICSD
SD             Springer Materials
MP             Materials Project
HandGenerated  Generated by hand based on publications

1.2 Generate ionization lookup dataframe

This step creates an ionization lookup table that is used in the subsequent featurization process.

1.3 Generate features using the CIF files

A total of 164 compositional and structural features are generated using a combination of matminer and our in-house handbuilt featurizers. These features then undergo further processing and selection down the pipeline.
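As a minimal sketch of what featurizing a single CIF file can look like with pymatgen and matminer (the file name and the specific featurizers below are illustrative assumptions; the repository's actual featurization lives in compound_featurizer.py and the feature-generation notebook):

```python
from pymatgen.core import Structure
from matminer.featurizers.composition import ElementProperty
from matminer.featurizers.structure import DensityFeatures

# Load a structure from a CIF file (hypothetical file name)
structure = Structure.from_file("example_compound.cif")

# Compositional features: Magpie elemental-property statistics
comp_featurizer = ElementProperty.from_preset("magpie")
comp_features = comp_featurizer.featurize(structure.composition)

# Simple structural features: density, volume per atom, packing fraction
struct_featurizer = DensityFeatures()
struct_features = struct_featurizer.featurize(structure)

feature_vector = comp_features + struct_features
feature_names = comp_featurizer.feature_labels() + struct_featurizer.feature_labels()
```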

1.4 Clean up the data

After a brief exploratory data analysis, it is found that the raw output from the featurizers contains features with missing values, zero variance (i.e. the feature value is the same for all compounds) and high linear correlation (greater than 0.95). Therefore, the data cleaning process is carried out in the following order:

  • Drop rows / compounds with more than 10 missing features
  • Impute missing values with KNNImputer
    • For each row with missing values, find the 5 nearest neighbors using features that are not missing
    • Impute missing values based on features in the 5 nearest neighbors weighted by their distance
  • Remove features with zero variance
  • Remove features with high linear correlation
    • Find features with a linear correlation greater than 0.95
    • Drop one of the two features in each pair of highly correlated features

After data cleaning, 106 features remain (105 numeric and 1 one-hot-encoded categorical with 2 levels); this set will be referred to as the full feature set from now on.
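The following is a minimal sketch of this cleaning sequence with pandas and scikit-learn. The thresholds mirror those described above, while the file and variable names are illustrative; it also assumes a purely numeric feature matrix.

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

# Raw featurizer output, one row per compound (hypothetical file name)
df = pd.read_csv("raw_features.csv")

# 1. Drop compounds with more than 10 missing feature values
df = df[df.isna().sum(axis=1) <= 10]

# 2. Impute remaining missing values from the 5 nearest neighbors, weighted by distance
imputer = KNNImputer(n_neighbors=5, weights="distance")
df = pd.DataFrame(imputer.fit_transform(df), columns=df.columns, index=df.index)

# 3. Remove zero-variance features
df = df.loc[:, df.var() > 0]

# 4. Remove one feature from each pair with |correlation| > 0.95
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
to_drop = [col for col in upper.columns if (upper[col] > 0.95).any()]
df = df.drop(columns=to_drop)
```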

2. Model Building

The model building process follows an iterative approach. During the first iteration, the classifiers are trained and evaluated on the cleaned-up full feature set. Then, with the help of SHAP values and domain knowledge, features with high importance are selected and used as input for the second iteration of model training and evaluation.

2.1 Tune the XGBoost model

The training process starts with hyperparameter tuning with grid search cross validation. The default parameter search grid for the XGBClassifier is as follows.

Parameter         Search space
n_estimators      [10, 20, 30, 40, 80, 100, 150, 200]
max_depth         [2, 3, 4, 5]
learning_rate     np.logspace(-3, 2, num=6)
subsample         [0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
scale_pos_weight  [num_of_negative_class / num_of_positive_class]
base_score        [0.3, 0.5, 0.7]
random_state      [seed]

The scoring metric during tuning is f1_weighted. The best parameters are then stored for model evaluation.
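A tuning run along these lines could look like the sketch below, assuming X and y hold the cleaned feature matrix and binary labels; the grid mirrors the table above.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from xgboost import XGBClassifier

seed = 0  # one of the random seeds used for the cross-validation splits

# Class-imbalance ratio used for scale_pos_weight
pos_weight = (y == 0).sum() / (y == 1).sum()

param_grid = {
    "n_estimators": [10, 20, 30, 40, 80, 100, 150, 200],
    "max_depth": [2, 3, 4, 5],
    "learning_rate": np.logspace(-3, 2, num=6),
    "subsample": [0.5, 0.6, 0.7, 0.8, 0.9, 1.0],
    "scale_pos_weight": [pos_weight],
    "base_score": [0.3, 0.5, 0.7],
    "random_state": [seed],
}

search = GridSearchCV(
    estimator=XGBClassifier(),
    param_grid=param_grid,
    scoring="f1_weighted",
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=seed),
    n_jobs=-1,
)
search.fit(X, y)
best_params = search.best_params_  # stored for later evaluation
```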

2.2 Evaluate performance and save models

Due to the scarcity of training examples, stratified 5-fold cross validation (cv) is used to evaluate model performance instead of a hold-out test set. There are 4 evaluation metrics used:

  1. precision_weighted
  2. recall_weighted
  3. roc_auc
  4. f1_weighted

Since the cross-validation splits depend on the random seed, a list of 10 seeds (integers from 0 to 9) is used to account for the variation in model performance caused by different splits. For each seed, a stratified 5-fold cv is carried out, from which the median / mean values of the metrics are obtained. With 10 seeds, there are 10 median / mean values for each metric, and a final median / mean is calculated from those 10 values, along with the interquartile range / standard deviation respectively. Essentially, the reported values are either a median of medians (the default) or an average of averages, should you choose so.
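A sketch of this repeated evaluation, assuming best_model is an XGBClassifier configured with the tuned parameters and X, y are the data from above:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_validate

metrics = ["precision_weighted", "recall_weighted", "roc_auc", "f1_weighted"]
per_seed_medians = {metric: [] for metric in metrics}

for seed in range(10):
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    scores = cross_validate(best_model, X, y, scoring=metrics, cv=cv)
    for metric in metrics:
        per_seed_medians[metric].append(np.median(scores[f"test_{metric}"]))

# Median of the per-seed medians, reported with the interquartile range
for metric, values in per_seed_medians.items():
    q1, q3 = np.percentile(values, [25, 75])
    print(f"{metric}: median={np.median(values):.3f}, IQR={q3 - q1:.3f}")
```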

After model evaluation, the models are trained on the entire dataset (343 compounds with the full feature set) with the best parameters and then stored.

2.3 Select important features and iterate

Using the stored models, a SHAP analysis is carried out to find the most important features. These important features are further screened using domain knowledge. Currently, 10 features are selected to create a reduced feature set. This feature selection step mainly serves to prevent overfitting.
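The sketch below shows a typical SHAP workflow for a tree model, assuming model is a trained XGBClassifier and X is the corresponding feature DataFrame; the repository's own analysis lives in its notebooks.

```python
import shap

# TreeExplainer is the fast, exact explainer for tree ensembles such as XGBoost
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X)

# Rank features by their mean absolute SHAP value across all compounds
shap.summary_plot(shap_values, X, plot_type="bar")
```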

With this reduced feature set, the entire model building process is repeated and the models are re-tuned, re-evaluated and re-trained on the reduced feature set.

3. Deploy & Serve Models

The trained classifiers are made available to the larger materials science community through Jupyter notebooks hosted via the Binder service. One can immediately upload a CIF file and easily make a prediction using our classifiers directly in the web browser.

The models served on the Binder server are by default based on the reduced feature set.
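As a rough, self-contained sketch of what a prediction with a stored classifier looks like (the file names and the joblib storage format are assumptions; the Binder demo notebook handles featurization and model loading for you):

```python
import joblib
import pandas as pd

# Hypothetical file name for a stored classifier trained on the reduced feature set
model = joblib.load("mit_vs_non_mit.joblib")

# One already-featurized compound, with the reduced feature set as columns
new_compound = pd.read_csv("new_compound_features.csv")

mit_probability = model.predict_proba(new_compound)[0, 1]
print(f"Predicted probability of MIT behavior: {mit_probability:.2f}")
```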

Demo Notebooks

[Binder launch badge]

There are several Jupyter notebooks available for easier result replication and demonstration purposes. You can immediately launch interactive versions of these notebooks in your web browser by clicking on the binder icon above or clicking on the subsection titles below.

Note: Any changes made on the server will not be saved unless you download a copy of the notebook onto your local machine.

You can replicate the workflow by using the notebooks in the following order.

This notebook generates the ionization energy lookup spreadsheet.

This notebook allows you to generate features for all the structures. As mentioned before, since we cannot share the structure files, running this notebook will not work due to the absence of CIF files.

This notebook presents an exploratory data analysis along with a data cleaning process on the output dataset from generate_compound_features.ipynb.

This notebook contains the code that tunes, trains and evaluates the models, along with a SHAP analysis of models trained on the full feature set. It is NOT recommended to train the models directly on the Binder server since it is a very memory-intensive process (it will also take a very long time to train!). The Binder container has 2 GB of RAM by default, and if the memory limit is exceeded, the kernel may restart and you'll have to start over. That being said, you are welcome to download the repository onto your local machine and play around with the model parameters and selection.

This notebook demonstrates the prediction pipeline through which a prediction is made on a new structure that is not included in the original training set. You can even upload your own CIF structure and get a prediction! If you just want to play around with the trained models or make a prediction on a structure of your own choice, you can start here.

Supporting notebooks

This notebook answers the question of "Why should one choose XGBoost over some other models?" by comparing the classification performance of 6 model types on the full feature set across 4 classification tasks. The model types are as follows.

Model type                  Description
DummyClassifier             Naive models that guess at random (baseline performance)
LogisticRegression          Linear classifiers with L2 regularization
DecisionTreeClassifier      Generic decision tree classifiers
RandomForestClassifier      Ensemble decision tree classifiers
GradientBoostingClassifier  Gradient-boosting tree classifiers
XGBoostClassifier           Extreme gradient-boosting tree classifiers

The 4 classification tasks are:

  1. Metal vs. non-Metals (Insulators + MITs)
  2. Insulator vs. non-Insulators (Metals + MITs)
  3. MIT vs. non-MIT (Metals + Insulators)
  4. Multi-class classification

The metrics and evaluation method are the same as the process mentioned earlier. The comparison results are summarized in this table. A summary plot is also provided for easier interpretation.

This notebook presents a brief SHAP analysis on models trained with the reduced feature set.

This is a brief tutorial notebook that explains some sub-functions in the compound_featurizer.py file.

This notebook provides a benchmark of how "good" the handbuilt featurizer is against values from Tables 2 and 3 of Torrance et al.

This notebook contains visualization plots to be included in the paper.
