
Is there a way to save the model for future prediction? #172

Closed
econkc opened this issue Feb 3, 2018 · 5 comments

Comments

@econkc

econkc commented Feb 3, 2018

Hi,

I wonder if there is a way to save the final model for future prediction. I understand that we can save a joblib object for tuning purposes, which might speed up the calculation, but is there a way to import the model back into Python and use it to predict new data points without refitting? I am not sure if the generate_prediction_data() function is for this purpose, and I cannot find a clear explanation of it anywhere in the documentation.

Thanks,

@lmcinnes
Collaborator

lmcinnes commented Feb 4, 2018

There is an approximate_predict function that can take a given model and make predictions for new data points. generate_prediction_data needs to be run on the model before approximate_predict will work. You should be able to pickle a model and restore it later for predictions.

Now the caveat is that approximate_predict is just that: an approximation based on the clusters already assigned. It will not necessarily give the same answer you would get if you added the new data points and re-clustered from scratch. Hopefully it fills your needs, however.

@econkc
Author

econkc commented Feb 4, 2018

I understand that predicting the cluster for new data rests on the assumption that the clusters remain the same, so the approximate_predict function is exactly what I need.

So, to be clear, are the following steps correct?

fit the model (prediction_data = True) >> generate_prediction_data >> pickle the model >> make prediction later

I guess my question is: which object should I pickle? Since I have already set prediction_data=True, I believe I don't need to run generate_prediction_data() afterward; is that correct? If I do need to run it, can you give me a code example? Is it something like clusterer.generate_prediction_data()?

Thanks,

PS. I am very thankful for your contributions and for always answering questions so quickly. Many people on my data science team are now aware of this model, and they all love it.

@lmcinnes
Collaborator

lmcinnes commented Feb 4, 2018

The following code would work:

import hdbscan

model = hdbscan.HDBSCAN(prediction_data=True).fit(data)
labels, membership_strengths = hdbscan.approximate_predict(model, new_data)

If you wanted to save the model to disk and later load it back up in another script, you would pickle the model to disk, then load the pickled model in the other script.

@econkc
Author

econkc commented Feb 4, 2018

Thank you

@econkc econkc closed this as completed Feb 4, 2018
@omarhossam214

Yes, you can use joblib:

import joblib
import hdbscan

clusterer = hdbscan.HDBSCAN(min_cluster_size=50, prediction_data=True).fit(data)
filename = 'model.joblib'
joblib.dump(clusterer, filename)
