
Is there a way to save the model for future prediction? #172

Closed
econkc opened this issue Feb 3, 2018 · 5 comments

Comments

@econkc

econkc commented Feb 3, 2018

Hi,

I wonder if there is a way to save the final model for future prediction. I understand that we can save a joblib object for tuning purposes, which might speed up the calculation, but is there a way to import the model back into Python and use it to predict new data points without refitting? I am not sure if the generate_prediction_data() function is for this purpose, and I cannot find a clear explanation of it anywhere in the documentation.

Thanks,

@lmcinnes
Collaborator

lmcinnes commented Feb 4, 2018

There is an approximate_predict function that can take a given model and make predictions for new data points. generate_prediction_data needs to be run on the model before approximate_predict will work. You should be able to pickle a model and restore it later for predictions.

Now the caveat is that approximate_predict is just that: an approximation based on the clusters already assigned. It will not necessarily give the same answer you would get if you added the new data points and re-clustered from scratch. Hopefully it fills your needs, however.

@econkc
Author

econkc commented Feb 4, 2018

I understand that predicting the cluster for new data rests on the assumption that the clusters remain the same, so the approximate_predict function is exactly what I need.

So, to be clear, are the following steps correct?

fit the model (prediction_data = True) >> generate_prediction_data >> pickle the model >> make prediction later

I guess my question is: which object should I pickle? Since I have already set prediction_data=True, I believe I don't need to run generate_prediction_data() afterward; is that correct? If I do need to run it, can you give me a code example? Is it something like clusterer.generate_prediction_data()?

Thanks,

PS. I am very thankful for your contributions and for always answering questions so quickly. Many people on my data science team are now aware of this model, and they all love it.

@lmcinnes
Collaborator

lmcinnes commented Feb 4, 2018

The following code would work:

import hdbscan

model = hdbscan.HDBSCAN(prediction_data=True).fit(data)
labels, membership_strengths = hdbscan.approximate_predict(model, new_data)

If you wanted to save the model to disk and later load it back up in another script, you would pickle the model to disk, then load the pickled model in the other script.

@econkc
Author

econkc commented Feb 4, 2018

Thank you

@econkc econkc closed this as completed Feb 4, 2018
@omarhossam214

Yes, you can use joblib:

import joblib
import hdbscan

clusterer = hdbscan.HDBSCAN(min_cluster_size=50, prediction_data=True).fit(data)
filename = 'model.joblib'
joblib.dump(clusterer, filename)
