-
Notifications
You must be signed in to change notification settings - Fork 60
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH avoid extraneous calls to set_data() #454
Comments
what is set_data doing in your case? can you do the precomputation in the
init/constructor and not with the set_data?
… Message ID: ***@***.***>
|
It fits a full lasso path to find a lambda such that a particular deviance ratio (R^2) is obtained. That would solve it, but I need access to X and y, which I don't have in the constructor. |
Can't you do it in the |
No, because I need access to the parameters to compute the path, and they're not available in get_data() unless I'm mistaken. |
Then, I think we should have an extra method for this. Maybe one point that might be interesting to raise: this feature is not so easy to implement with our current design ensuring that this does not get called multiple times because we aim for embarassingly parallel tasks i.e., each task should be able to run independently of all the others. This means that for every run, one should ensure that the heavy path computations have been made. The caching helps with this as we can say from the disk if a task has been run before or not. But even in this case, if 2 processes are launched simultaneously, both will need to compute the solution as it is not cached yet and they both need it. Another solution would be to do a graph of computation but this makes the code much more complicated so I am not sure we want to go down this road. This is also kind of related to #331. |
Good point! I guess there's no reason for why every user need to re-run this locally since the values should stay constant unless the data changes. Maybe the only good way to do this would be to have an extra script in the benchmark repo that runs these computations and stores these values in the repo somewhere and then refuses to run for this particular value of a parameter unless the necessary values are cached in the repo. It seems maybe a little too hacky for me. Maybe this is too much of an edge case to be worth catering to. |
This line
benchopt/benchopt/runner.py
Line 178 in bcbee8a
means that we re-run
set_data()
for each solver (plus additional calls torun()
for pre-compilation).We are currently working on a benchmark where the call to
set_data()
is computationally intensive, which ends up inflating the total amount of compute time orders of magnitude.I understand the need to run
skip()
for each solver, butset_data()
should produce exactly the same results for all the solvers, so we should not need to run it repeatedly, right?The text was updated successfully, but these errors were encountered: