ElMD compositional featurizer #726

sgbaird · 2021-11-20T07:18:37Z

Summary

Add featurization scheme for ElM2D and corresponding UMAP embeddings.

ElM2D distance computations
UMAP embeddings of ElM2D distances

TODO (if any)

Conform to Google Code Style for docs
Once the code is settled, move it into composition.py etc.
add dist_matrix as conda/pip dependency
deal with pandas and scipy dependencies

@ardunn
I spent a few hours refactoring my refactor of ElM2D (chem_wasserstein) to try to conform the code to be (at least closer) to matminer, but I'm realizing I may not be able to justify the time it will take to push the PR all the way through. I'm open to your feedback (especially after you've glanced at the basic skeleton of the changes), but I'm thinking it might be more reasonable for now to use chem_wasserstein as an external, optional dependency.

ardunn · 2021-11-23T02:43:26Z

Hi @sgbaird thanks for the PR!

This looks great and it is encouraging to see it basically in a working form already.

I think it is preferable to have the ElM2D code directly in matminer if it is not too complicated, and from briefly looking over the code it doesn't look like it is too complex to include (from the perspective of keeping matminer self contained and not too sprawling at least.. definitely not saying it is easy to port the code over from wasserstein haha)

We will need to make some changes to the locations of some of the classes and such to keep things organized but this is an excellent start if it is working on your end.

Here are some structural (not formatting) changes I think we'd need to make to take this PR to completion:

PCA class

Would it be possible to use the PCA class from sklearn instead?

Imports (`umap`, `dist_matrix`)

Would it be possible to import these within a try/except ImportError block such that it is clear what dependencies are needed in order to run the class?

ElM2D inheriting BaseFeaturizer

doesn't seem like ElM2D currently implements featurize, which it needs to. Featurize is just the simple operation of (once fitted), returning an array of features for one sample (i.e., one composition)
BaseFeaturizer overrides transform and fit_transform (if fit is defined), so those should probably be moved into featurize if possible.

Save and load

If possible, it would be best to save the variables to a more regular format like JSON or similar, rather than pickle.

Presets

If possible, it would be good to have some default "preset" options for people to load easily. See some of the other featurizers' from_preset methods to see what I mean. The idea being that people can load a "pre-fit" version which will work generally across use cases (again, if this is possible and makes sense)

If this seems like too much to be worth the effort, I think it would be fine to include the wasserstein code as an external dependency and include it in composition.external featurizers.

It would probably be best if the wasserstein code could be pip installable, but if not possible, we can still work with it by including it in the requirements as a git requirement directly. If we did go this way, I think the only things we'd need to implement are:

A basefeaturizer class wrapping the wasserstein code, implementing featurize and fit if needed
The featurizer can also implement from_preset if that is viable

Let me know what you think!

sgbaird · 2021-11-23T03:52:28Z

@ardunn, this is great!

Thanks for looking through the code. Using sklearn PCA shouldn't be a problem (technically I think it would be the MDS algorithm, i.e. distance-matrix based, even though it says PCA in a number of places). The imports, save/load, and presets also seem reasonable. For reference, I have a (refactored) version of ElM2D available on PyPI and Anaconda under the pseudonym chem_wasserstein to avoid conflict with the base, but based on what you've mentioned, I think it's worth pushing a bit further for the standalone matminer version as you mentioned. I also packaged dist_matrix on PyPI and Anaconda, so that dependency shouldn't be a problem. I feel pretty confident in the test suite I developed for dist_matrix as well.

Featurize

There are a few options for featurize and it wasn't immediately obvious to me which is the best one. It could either be:

the vectors used as inputs to the Wasserstein distance
a Wasserstein distance matrix
embedded UMAP coordinates based on the distance matrix

(1) isn't particularly interesting (encodes the composition and a machine-learned scalar representation of each element), but they are "features" in the strictest sense - a fixed compound always maps to the same vector.
(2) is probably the most versatile information available from ElM2D, but distance matrices aren't feature vectors, so I have a hard time imagining this being the output of featurize
(3) is really useful for clustering and visualization (i.e. the embedded coordinates cluster according to chemical similarity). This definitely fits into the category of (2D Cartesian) feature vectors in an embedding space, but the implementation would probably need to be adjusted as mentioned below.

Do you lean towards one of the above? I'll leave this one up to your knowledge of matminer.

Adjust UMAP embedding implementation

If we go with (3), it would probably be better (if not essential) for me to implement a version that allows fit to be called once, and then a transform that can be called repeatedly/reliably on new data (right now, since it depends on distance matrices, you have to feed in both "training" and "validation" data at the same time, hence the quotes). This is something I've been meaning to get to, so I don't consider this "extra" work necessarily. I came up with an easy workaround for the immediate application I had intended it for, so I'm a bit behind on this one.

ardunn · 2021-11-23T05:25:17Z

@sgbaird

External dependencies

For chem_wasserstein on PyPi, I think it's totally fine to just reference it as an external dependency and ElM2D can live in featurizers.composition.external.elm2d (it's own module). If you think you can adapt the code to be more along the structure of what's currently in matminer though, that's fine too - really your call and I'd approve a good implementation of either.

Features

I see two main ways of doing this:

Just have the featurizer output the vectors used as inputs, and include Wasserstein distance matrix/UMAP as totally separate functions that can easily work on the output of the featurizer as a dataframe.
Select between the 3 output types with an argument set in __init__, defining what you want to get out of the featurizer at the time you create the class. If vector is requested, it's easy to return just that. If distance matrix is requested, allow list of formulas as an extra argument to fit to use to compute the wasserstein distances. I.e., "I want to get the distances for each sample I'm featurizing with respect to this set of other formulas I fit on.". For example:

emd = ElM2DFeaturizer(method="distance")
emd.fit(X=df_i_want_to_featurize["composition"], comparison=other_df_want_to_compare_against["composition"])

The .fit syntax is pretty flexible for featurizers so feel free to get more creative with it as well

ardunn · 2022-01-13T05:23:42Z

Hey @sgbaird any thoughts on this? Should I close?

sgbaird · 2022-01-13T05:44:17Z

Sorry, I kind of dropped the ball on this. Thank you for the reminder! I think I'll reference it as an external dependency for now. I'll plan on getting this worked out in the next month. Will that work out?

ardunn · 2022-01-13T05:47:32Z

Sorry, I kind of dropped the ball on this. Thank you for the reminder! I think I'll reference it as an external dependency for now. I'll plan on getting this worked out in the next month. Will that work out?

Yeah no problem! We don't have any time restrictions or anything. Just wanted to see if it was still something you wanted to do :)

sgbaird added 5 commits May 18, 2021 18:47

Update data.py

47d782f

Merge branch 'master' of https://github.com/sparks-baird/matminer

3744f39

Create ElM2D_.py

b6c7751

Create ElM2D_check.csv

af487fc

Create test_ElM2D.py

2b1f6d8

sgbaird changed the title ~~Master~~ ElMD compositional featurizer Aug 4, 2022

sgbaird mentioned this pull request Aug 4, 2022

Consider use of fingerprint distance instead of StructureMatcher for comparison between generated and test sparks-baird/matbench-genmetrics#39

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ElMD compositional featurizer #726

ElMD compositional featurizer #726

sgbaird commented Nov 20, 2021

ardunn commented Nov 23, 2021

sgbaird commented Nov 23, 2021 •

edited

ardunn commented Nov 23, 2021

ardunn commented Jan 13, 2022

sgbaird commented Jan 13, 2022

ardunn commented Jan 13, 2022

ElMD compositional featurizer #726

Are you sure you want to change the base?

ElMD compositional featurizer #726

Conversation

sgbaird commented Nov 20, 2021

Summary

TODO (if any)

ardunn commented Nov 23, 2021

PCA class

Imports (umap, dist_matrix)

ElM2D inheriting BaseFeaturizer

Save and load

Presets

sgbaird commented Nov 23, 2021 • edited

Featurize

Adjust UMAP embedding implementation

ardunn commented Nov 23, 2021

External dependencies

Features

ardunn commented Jan 13, 2022

sgbaird commented Jan 13, 2022

ardunn commented Jan 13, 2022

Imports (`umap`, `dist_matrix`)

sgbaird commented Nov 23, 2021 •

edited