Database format #7

Open
shyuep opened this issue Mar 10, 2020 · 1 comment

Comments

@shyuep
Collaborator

shyuep commented Mar 10, 2020

A database is a good idea, but I think we should use something widely supported. We can even support a few options. Any recommendations? The obvious ones are HDF5, JSON and MySQL. MongoDB is probably too heavy-duty, though it can be an option since translating to JSON is easy.

@chc273
Contributor

chc273 commented Mar 10, 2020

For saving models, we currently use JSON, HDF5 and pickle.

JSON is mainly for saving the configuration parameters, which can be obtained from as_dict or get_params (the sklearn method).
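As a minimal sketch of the JSON part, assuming a hypothetical `ToyModel` class (not part of maml) whose `as_dict` returns only its constructor arguments, the configuration can be round-tripped like this. The `from_dict` helper is also an assumption made for illustration.

```python
import json


class ToyModel:
    """Hypothetical model, used only to illustrate the idea."""

    def __init__(self, alpha=1.0, degree=2):
        self.alpha = alpha
        self.degree = degree

    def as_dict(self):
        # Only constructor arguments go here, so the result is JSON-safe.
        return {"alpha": self.alpha, "degree": self.degree}

    @classmethod
    def from_dict(cls, d):
        # Rebuild an unfitted model from its saved configuration.
        return cls(**d)


model = ToyModel(alpha=0.5, degree=3)
config_json = json.dumps(model.as_dict(), sort_keys=True)
restored = ToyModel.from_dict(json.loads(config_json))
```

Because the dictionary holds only plain numbers and strings, the same text could be stored in a file or in any document database without further conversion.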

pickle and HDF5 are used to save the state of the model. The state is, for example, model weights that are not passed to __init__ but are computed from training data. So far we support two types of models/packages: sklearn and keras. For sklearn models, the official way to save weights is pickle, and sklearn models provide the __getstate__ and __setstate__ API for working with the pickle format. keras/tensorflow deep learning models, on the other hand, use HDF5. If more model types/packages (e.g., lightgbm, xgboost, pytorch) are added, I think pickle and HDF5 can be adapted to work with them. We will find out as we go.
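To illustrate the difference between configuration and state, here is a rough sketch using only the standard library. `ToyRegressor` is a hypothetical stand-in for a fitted estimator, not an sklearn or maml class; it only shows how `__getstate__` and `__setstate__` let pickle carry the weights computed in `fit`.

```python
import pickle


class ToyRegressor:
    """Hypothetical estimator: fits y = w * x by least squares through the origin."""

    def __init__(self, scale=1.0):
        self.scale = scale    # configuration, given in __init__
        self.weight_ = None   # state, only known after fit()

    def fit(self, xs, ys):
        num = sum(x * y for x, y in zip(xs, ys))
        den = sum(x * x for x in xs)
        self.weight_ = self.scale * num / den
        return self

    def predict(self, x):
        return self.weight_ * x

    def __getstate__(self):
        # Return exactly what pickle should store for this object.
        return {"scale": self.scale, "weight_": self.weight_}

    def __setstate__(self, state):
        # Restore attributes without calling __init__ or fit() again.
        self.__dict__.update(state)


model = ToyRegressor().fit([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
blob = pickle.dumps(model)
restored = pickle.loads(blob)
```

The restored object can predict immediately because the fitted weight travelled inside the pickle, which is something the JSON configuration alone would not provide.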

The database part is meant to ease the model training process and improve the reproducibility of the models. So far, I think MongoDB is better suited to this task than MySQL, since the data and model results will be highly heterogeneous. We also have prior experience with MongoDB. Unless we find something better, we can use MongoDB for now. This is not a core function of maml, though. It is more of a tool to help curate data, build and save models, and store model results.
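The point about heterogeneous records can be sketched with plain dictionaries. This is not MongoDB's API; the records, field names and the `find` helper below are all hypothetical, and the list simply stands in for a document collection where each entry carries only the fields that apply to it.

```python
import json

# Hypothetical records: datasets, models and results share no fixed schema.
records = [
    {"kind": "dataset", "name": "toy_structures", "n_samples": 120},
    {"kind": "model", "name": "toy_ridge", "params": {"alpha": 0.5}},
    {"kind": "result", "model": "toy_ridge", "dataset": "toy_structures",
     "metrics": {"mae": 0.12}},
]


def find(docs, **criteria):
    """Return documents whose top-level fields equal all given criteria."""
    return [d for d in docs if all(d.get(k) == v for k, v in criteria.items())]


# Documents are plain dicts, so they serialise to JSON without a table schema.
dumped = json.dumps(records)
loaded = json.loads(dumped)
model_docs = find(loaded, kind="model")
```

A relational table would need either one table per record type or many nullable columns to hold the same three entries, which is the trade-off behind preferring a document store here.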
