
ETA On Compare Models #2428

Closed
rajatscibi opened this issue Apr 18, 2022 · 11 comments
Labels
compare_models enhancement New feature or request

Comments

@rajatscibi

rajatscibi commented Apr 18, 2022

Is your feature request related to a problem? Please describe.
Often, when the dataset is large or has many features, the compare_models() function takes a long time, sometimes hours on a regular laptop with 4 GB RAM, SSD storage, and no GPU. It's often not clear how much time it will take, which makes it difficult to plan, and one may feel stuck.

Describe the solution you'd like
If we could have an ETA (estimated time remaining) on compare_models(), it would be very helpful.

Describe alternatives you've considered

Additional context

rajatscibi added the enhancement (New feature or request) label on Apr 18, 2022
@ngupta23
Collaborator

ngupta23 commented Apr 18, 2022

I think this is not possible right now, since there is no way to probe the underlying models for the amount of time remaining (a limitation of scikit-learn).

@Yard1 - any comments on this request?

@rajatscibi
Author

rajatscibi commented Apr 18, 2022

Found something here

@rajatscibi
Author

There is something called scitime, as mentioned here:

Scitime is a Python package requiring at least Python 3.6, with pandas, scikit-learn, psutil, and joblib as dependencies. You will find the Scitime repo here.

The main function in this package is called “time”. Given a feature matrix X and target vector Y, along with the scikit-learn model of your choice, time will output both the estimated training time and its confidence interval.
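
A minimal sketch of that usage, adapted from the scitime README (the class was named Estimator in earlier releases and RuntimeEstimator later, so the exact names and arguments may differ by version):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from scitime import RuntimeEstimator  # `Estimator` in older scitime releases

# Meta-model trained by scitime to predict the fit time of the wrapped estimator
runtime_estimator = RuntimeEstimator(meta_algo='RF', verbose=3)
rf = RandomForestRegressor()

X, y = np.random.rand(100_000, 10), np.random.rand(100_000, 1)
# Returns the estimated fit time (in seconds) plus confidence-interval bounds
estimation, lower_bound, upper_bound = runtime_estimator.time(rf, X, y)
```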

@ngupta23
Collaborator

ngupta23 commented Apr 18, 2022

This seems to be quite an invasive method of computing run time: you are essentially building models around the model that you really want to build. Also, I'm not sure this applies to all sklearn functionality, such as cross_validate, GridSearchCV, etc.

I would prefer a method that is more tightly coupled with sklearn functionality. I would also like to hear the thoughts of the other core developers.

@rajatscibi
Author

Understandable. Let's keep this feature request open for now so someone else can come up with an idea for how to implement it.

@Yard1
Member

Yard1 commented Apr 18, 2022

Yeah, none of the available solutions would work here (or at least not for all of our models).

@rajatscibi
Author

Is it possible to estimate once training has started and some time has elapsed? For example, if 10% completed in 60 seconds, the ETA to reach 100% would be approximately 600 seconds in total, more or less.
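
To make the arithmetic concrete, here is a hypothetical sketch of that linear extrapolation; it assumes the model exposes a progress fraction, which (as noted in the reply below) most sklearn estimators do not:

```python
import time

def naive_eta(start_time: float, progress: float) -> float:
    """Linearly extrapolate the remaining seconds from elapsed time and
    the fraction of work completed (0 < progress <= 1).

    Assumes progress is measurable and roughly linear in time; neither
    assumption generally holds for sklearn model training.
    """
    elapsed = time.monotonic() - start_time
    return elapsed * (1.0 - progress) / progress

# 10% done after 60 s -> naive_eta(...) ~= 540 s remaining, ~600 s total
```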

@rajatscibi
Author

I'll post this on the sklearn repo as well and see what they have to say about it. I'll also share it on Stack Overflow so the computer science community can suggest something.

@ngupta23
Collaborator

ngupta23 commented Apr 18, 2022

> Is it possible to estimate once training has started and some time has elapsed? For example, if 10% completed in 60 seconds, the ETA to reach 100% would be approximately 600 seconds in total, more or less.

This does not hold, since training is not a linear process.

Moreover, what would "training is 10% complete" even mean? This may work for deep neural networks, where you have epochs, but for the most part that is not the case for machine learning models.

@ngupta23
Collaborator

> I'll post this on the sklearn repo as well and see what they have to say about it. I'll also share it on Stack Overflow so the computer science community can suggest something.

This may be the best path forward. This should be handled in the sklearn repo itself rather than as a wrapper around the models outside sklearn. That would be the most sustainable way to do this (if it is even possible).

@rajatscibi
Author

Reply from scikit-learn:

> This is a recurrent feature request that will be resolved by the ongoing work on a callback API (#22000)
>
> I'm closing this issue because it is a duplicate of #10973 and #78.

Link to post

github-actions bot locked this issue as resolved and limited the conversation to collaborators on May 1, 2022