Composition Analyzer/Featurizer (CAF) is a user-interactive Python script that offers tools for generating compositional features. It also provides interactive tools used for tasks such as filtering, sorting chemical formulas, and merging Excel files.
We developed this interactive tool to aid solid-state chemists and materials scientists in generating compositional training data ranging from dozens to tens of thousands of compounds. Both experimentalists and novices can use this tool to generate their own data with a basic understanding of Python. The codebase has been designed to allow beginners to customize it easily.
Execute the following command to begin.
python main.py
Then, a prompt will appear and enter a number to proceed.
Options:
1: Filter chemical formulas and generate periodic table heatmap
2: Sort chemical formulas in Excel
3: Create compositional features for formulas in Excel
4: Match .cif files in a folder against Excel
5: Merge two Excel files based on id/entry
Please enter the number of the option you want to run: 1
CAF provides 5 interactive options detailed below.
Option 1. Filter - offers analysis capabilities for chemical formulas already prepared in Excel sheets or a folder containing .cif files. It counts and identifies unique elements present, detects errors within the data, and generates a periodic table heatmap. Optionally, it provides two filtering methods: one for removing specific elements and another for categorizing compounds into unary, binary, ternary, and quaternary groups.
Option 2. Sort - rearranges a set of chemical formulas in an Excel file based on 3 options.
-
By label - Sorts elements by a pre-configured label for each element. This option is applicable only for binary and ternary compounds. You may modify the predefined set of elements in
data/label.xlsx
to add/remove any elements.# Binary compounds A: Fe, Co, Ni, Ru, Rh, Pd, Os, Ir, Pt B: Si, Ga, Ge, In, Sn, Sb # Ternary compounds R: Sc, Y, La, Ce, Py, Nd, Pm, Sm, Eu, Gd, Tb, Dy, Ho, Er, Tm, Yb, Lu, Th, U M: Fe, Co, Ni, Ru, Rh, Pd, Os, Ir, Pt X: Si, Ga, Ge, In, Sn, Sb
-
By index - sorts compounds by stoichiometric ratio in either increasing or decreasing order. If the index is the same, the formulas are sorted based on the Mendeleev number provided in
data/element_Mendeleev_numbers.xlsx
-
By property - sorts by the value of an elemental chemical property, choosing from 27 options available in the Oliynyk database. This sorting is limited to 33 elements provided in the Oliynyk elemental property Excel file located at
data/element_properties_for_ML.xlsx
.Available columns for sorting 1. Atomic weight 2. Atomic number 3. Period 4. Group 5. Mendeleev number 6. valence e total 7. unpaired electrons 8. Gilman no. of valence electrons 9. Zeff 10. Ionization energy (eV) 11. CN 12. ratio n closest/CN 13. polyhedron distortion (dmin/dn) 14. CIF radius element 15. Pauling, R(CN12) 16. Pauling EN 17. Martynov Batsanov EN 18. Melting point, K 19. Density, g/mL 20. Specific heat, J/g K 21. Cohesive energy 22. Bulk modulus, GPa 23. Abundance in Earth's crust 24. Abundance in solar system (log) 25. HHI production 26. HHI reserve 27. cost, pure ($/100g)
Option 3. Features - generates compositional features for formulas in an Excel file and, optionally, a composition-normalized vector using hot encoding. The database for the featurziation is based on the Oliynyk peer-reviewed data (DOI).
Options:
1: Filter chemical formulas and generate periodic table heatmap.
2: Sort chemical formulas in an Excel file
3: Create compositional features for formulas in an Excel file.
4: Match .cif files in a folder against an Excel file.
5: Merge two Excel files based on id/entry.
Please enter the number of the option you want to run: 3
Available Excel files with 'Formula' in the first column:
1. A-B database.xlsx
2. A-B features.xlsx
3. binary M-X.xlsx
4. binary_excel.xlsx
5. formulas.xlsx
Enter the number corresponding to the Excel file you wish to select: 5
Selected Excel file: /Users/imac/Downloads/CAF/formulas.xlsx
- For binary compounds, 133 binary features defined in
feature/binary.py
are generated and saved in an Excel file. - For ternary compounds, 204 ternary features from
feature/ternary.py
are generated for each formula and stored in an Excel file. - For all compounds, regardless of the formula type, a universal set of 112 sorted features and 156 unsorted (containing first and last element values) generated using
feature/universal.py
- (Optional) Extended features - You can generate an extensive list of features using
feature/universal_long.py
,feature/binary_long.py
, andfeature/ternary_long.py
. These files include arithmetic operations provided infeature/operation.py
. Any columns with overflow or undefined values are removed before the data is saved in Excel files. The column lengths can reach into the thousands.
Option 4. Match - matches .cif files in a folder against an Excel file
This option allows the user to select a folder containing .cif files and an Excel file. The Excel file should include an "Entry" column containing the IDs of the .cif files. The script parses the IDs from the .cif files and compares them with the entries in the Excel file. The Excel file is then updated to filter and display only the entries that match the existing .cif files.
Option 5. Merge - combines two Excel files based on an "Entry" column.
User selects two Excel files to merge based on a common column labeled "Entry". This feature is useful for combining datasets, such as one from a database and another generated using this CAF (Composition Analyzer/Featurizer) or SAF (Structure Analyzer/Featurizer).
git clone https://github.com/bobleesj/composition-analyzer-featurizer.git
cd composition-analyzer-featurizer
conda create -n cif python=3.12
conda activate cif
pip install -r requirements.txt
If you have any questions or any feedback, please feel free to email me at sl5400@columbia.edu. The code is currently actively developed and improved. Any feedback is appreciated.
- Alex Vtorov - feature
- Anton Oliynyk - project lead, ideation
- Danila Shiryaev - sort
- Emil Jaffal - filter
- Nikhil Kumar Barua - feature
- Sangjoon Bob Lee - development lead, integration
- 20240514 - Released repository public