
Saving trained models and their metadata for inference and reproducibility #41

Open
felker opened this issue Dec 5, 2019 · 1 comment
felker commented Dec 5, 2019

Following the discussion in Wednesday's (2019-12-04) FRNN group meeting in San Diego, we need to start systematically saving the best trained models for:

  1. Collaboration (no need for multiple users to waste GPU hours retraining the same models)
  2. Practical inference (@mdboyer wants a Python interface, derived from performance_analysis.py, that would let a user load a trained model and feed it a set of shots for inference, without the bloated shot list and preprocessing pipeline that has been oriented toward training during the first phase of the project. This would enable exploratory studies of proximity to disruption, UQ, clustering, etc., and is an important intermediate step toward the C-based real-time inference tool in the PCS.)
  3. Reproducibility

As a part of a broader effort towards improving reproducibility of our workflow, these models should be stored with:

  • .h5 file containing the tunable parameters (can be directly loaded by Keras or C-translated inference software)
  • Input configuration conf.yaml and/or dumped final configuration used in specifying and training the model
  • Output performance metrics of the trained model (train/validate/test ROC)
  • Normalization .npz pickled class. For VarNormalizer, this would only consist of the standard deviations of each channel of each signal from the set of shots used to train the normalizer. However, it is serialized and saved as a "fat" class object that requires the entire plasma module to load. We might want to dump a simple non-pickled array, or even a .txt file, alongside the pickle, so that the Keras-C wrapper has a simple file to load.
  • Some metadata about the layout of a preprocessed shot in processed_shots/signal_group_*/*.npz (order of channels and signals, sampling rates, thresholding? etc.), so that any real-time inference wrapper could apply a similar preprocessing to the incoming data.
  • Exact individual shot numbers used in the training, validation, and testing sets, so that anyone using the model for inference will know if the shot being supplied to the model has already been used to train the model.
  • SHA1 of Git commit
  • Conda environment; versions of dependencies such as TensorFlow, Keras, PyTorch, scikit-learn
  • Computer used for training, MPI library, CuDNN library, etc.
  • Number of devices and MPI ranks used in training (least important)
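A minimal sketch of gathering the plain-text pieces of such a bundle (Git SHA, dependency versions, host) into one file next to the .h5 weights. The helper name `collect_metadata`, the output file names, and the field names are illustrative, not an existing interface in this repo; only the git/platform/importlib calls are standard:

```python
import json
import platform
import subprocess
from importlib import metadata as importlib_metadata

def collect_metadata(conf_path="conf.yaml"):  # hypothetical helper
    """Gather reproducibility metadata for a trained-model bundle."""
    try:
        sha = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True
        ).strip()
    except (OSError, subprocess.CalledProcessError):
        sha = "unknown"  # e.g. not running inside a git checkout
    deps = {}
    for pkg in ("tensorflow", "keras", "scikit-learn"):
        try:
            deps[pkg] = importlib_metadata.version(pkg)
        except importlib_metadata.PackageNotFoundError:
            deps[pkg] = None  # dependency not installed in this env
    return {
        "git_sha": sha,
        "config": conf_path,       # path to the dumped conf.yaml
        "dependencies": deps,
        "host": platform.node(),
        "python": platform.python_version(),
    }

# Dump next to model.h5 so the bundle is self-describing:
with open("model_metadata.json", "w") as f:
    json.dump(collect_metadata(), f, indent=2)
```

The shot lists and performance metrics would slot into the same dict; keeping everything in one plain-text JSON/YAML file is what makes the version-control question below tractable.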

Given the binary .h5 and .npz files, we probably don't want to keep everything under version control. But we might want to version-control the plain-text metadata about the trained models. Store it in this repository alongside the code, or in a new repository under our GitHub organization?

Also, should we consider ONNX?

@felker felker self-assigned this Dec 5, 2019

mdboyer commented Dec 6, 2019

Initially, maybe archive both ONNX and .h5, since we may use either for PCS deployment.

I'd advocate saving normalization as txt/h5 instead of npz to facilitate reading by PCS.
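For instance, the per-channel standard deviations could be written as one "name value" pair per line, which is trivial to parse from C in the PCS. The channel names, values, and file name here are illustrative:

```python
# Per-channel standard deviations from a trained VarNormalizer
# (channel names and values below are made up for illustration).
stds = {"ip": 1.2e6, "q95": 1.8, "li": 0.4}

# One "name value" pair per line, full float precision:
with open("normalization_stds.txt", "w") as f:
    for name, sigma in stds.items():
        f.write(f"{name} {sigma:.8e}\n")
```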

Better yet, could the normalization just be added as a layer to the model post-training so it is saved in the ONNX/H5 file? This would make implementation of the inference even simpler since the unnormalized data could be used as input to the deployed model.
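A related option that avoids adding a layer type at all: fold the standardization (x - mu) / sigma into the first dense layer's weights after training, so the exported .h5/ONNX model accepts unnormalized input directly. A NumPy sketch under the assumption of a single dense first layer (all shapes and values here are synthetic):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 4, 3
W = rng.standard_normal((d_in, d_out))   # trained first-layer weights
b = rng.standard_normal(d_out)           # trained first-layer bias
mu = rng.standard_normal(d_in)           # per-channel means
sigma = rng.uniform(0.5, 2.0, d_in)      # per-channel std deviations

# Fold (x - mu) / sigma into the layer:
#   ((x - mu) / sigma) @ W + b  ==  x @ W_f + b_f
W_f = W / sigma[:, None]
b_f = b - (mu / sigma) @ W

x = rng.standard_normal((5, d_in))       # raw, unnormalized input
y_ref = ((x - mu) / sigma) @ W + b       # normalize-then-apply
y_folded = x @ W_f + b_f                 # folded model, raw input
assert np.allclose(y_ref, y_folded)
```

This gives the same simplification as a baked-in normalization layer: the deployed model consumes raw signals, and no separate normalization file is needed at inference time.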

Text files for signal names would also be easier to use in the PCS.

I would think having some example trained models in the main repo would be useful, but maybe a larger library of models could be maintained separately?
