ENH avoid extraneous calls to set_data() #454

jolars · 2022-09-27T08:39:33Z

This line

Line 178 in bcbee8a

skip, reason = objective.set_dataset(dataset)

means that we re-run set_data() for each solver (plus additional calls to run() for pre-compilation).

We are currently working on a benchmark where the call to set_data() is computationally intensive, which ends up inflating the total compute time by orders of magnitude.

I understand the need to run skip() for each solver, but set_data() should produce exactly the same results for all the solvers, so we should not need to run it repeatedly, right?
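To make the cost concrete, here is a toy sketch of the call pattern described above. `ToyObjective` and both runner functions are hypothetical stand-ins, not benchopt's actual classes; the only purpose is to count how often an expensive `set_data()` would run under each pattern.

```python
class ToyObjective:
    """Hypothetical objective whose set_data() is expensive."""

    def __init__(self):
        self.n_set_data_calls = 0

    def set_data(self, X, y):
        # Stand-in for a costly precomputation (e.g. fitting a full path).
        self.n_set_data_calls += 1
        self.X, self.y = X, y


def run_per_solver(objective, data, solvers):
    # Pattern from the issue: the data is set again for every solver.
    for _ in solvers:
        objective.set_data(**data)


def run_once_per_dataset(objective, data, solvers):
    # Proposed pattern: set the data once, then loop over the solvers.
    objective.set_data(**data)
    for _ in solvers:
        pass  # each solver would run against the already-prepared objective


data = {"X": [[1.0, 2.0], [3.0, 4.0]], "y": [0.0, 1.0]}
solvers = ["solver_a", "solver_b", "solver_c"]

repeated = ToyObjective()
run_per_solver(repeated, data, solvers)

single = ToyObjective()
run_once_per_dataset(single, data, solvers)
```

With three solvers, the first pattern pays the `set_data()` cost three times and the second pays it once.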


agramfort · 2022-09-27T08:57:08Z

What is set_data doing in your case? Can you do the precomputation in the init/constructor rather than in set_data?


jolars · 2022-09-27T09:06:13Z

It fits a full lasso path to find a lambda such that a particular deviance ratio (R^2) is obtained.

That would solve it, but I need access to X and y, which I don't have in the constructor.

tomMoral · 2022-09-27T11:22:46Z

Can't you do it in the get_data method of the Dataset? This should only get called once.

jolars · 2022-09-27T17:28:51Z

No, because I need access to the parameters to compute the path, and they're not available in get_data() unless I'm mistaken.

tomMoral · 2022-10-02T07:20:12Z

Then, I think we should have an extra method for this.
Something like setup_data, that gets called after set_data only if it has not been called before or the data is different (for instance with caching)?
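A minimal sketch of what such a guard could look like, assuming the heavy step depends only on the data and the objective's parameters. `setup_data` is the name floated above, not an existing benchopt hook, and `CachedSetupObjective`, `fingerprint`, and the placeholder computation are hypothetical.

```python
import hashlib
import pickle


def fingerprint(*objs):
    """Hash of the pickled inputs; assumes they pickle deterministically."""
    return hashlib.sha256(pickle.dumps(objs)).hexdigest()


class CachedSetupObjective:
    """Hypothetical objective with a guarded setup_data() step."""

    def __init__(self, ratio):
        self.ratio = ratio
        self._setup_key = None
        self.n_setup_calls = 0

    def set_data(self, X, y):
        self.X, self.y = X, y
        # Re-run the heavy step only if the data or the parameters changed.
        key = fingerprint(X, y, self.ratio)
        if key != self._setup_key:
            self.setup_data()
            self._setup_key = key

    def setup_data(self):
        self.n_setup_calls += 1
        # Placeholder for the expensive computation (e.g. a full lasso path).
        self.lmbda = self.ratio * max(abs(v) for v in self.y)


obj = CachedSetupObjective(ratio=0.5)
X = [[1.0], [2.0]]
obj.set_data(X, [1.0, -3.0])
obj.set_data(X, [1.0, -3.0])  # same data: setup_data() is skipped
calls_after_repeat = obj.n_setup_calls
obj.set_data(X, [1.0, -4.0])  # different data: setup_data() runs again
```

This guard only lives in memory, so it helps within a single process and does not by itself cover tasks that run in separate processes.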

One point that may be worth raising: with our current design, it is not easy to guarantee that this is not called multiple times, because we aim for embarrassingly parallel tasks, i.e., each task should be able to run independently of all the others. This means that every run must ensure that the heavy path computations have been made. Caching helps here, since we can tell from the disk whether a task has been run before. But even then, if two processes are launched simultaneously, both will need to compute the solution, as it is not cached yet and they both need it. Another solution would be to build a computation graph, but this makes the code much more complicated, so I am not sure we want to go down this road.
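The disk-caching idea can be sketched roughly as follows, using only the standard library. `cached_on_disk` and `heavy` are hypothetical helpers, not part of benchopt. As noted above, two processes started at the same time may both compute the result; the atomic rename only ensures that neither ever reads a half-written cache file.

```python
import hashlib
import os
import pickle
import tempfile


def cached_on_disk(cache_dir, key_parts, compute):
    """Return the cached result for key_parts, computing it if missing."""
    key = hashlib.sha256(pickle.dumps(key_parts)).hexdigest()
    path = os.path.join(cache_dir, key + ".pkl")
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)

    result = compute()
    # Write to a temporary file in the same directory, then rename it
    # atomically so a concurrent reader never sees a partial entry.
    fd, tmp_path = tempfile.mkstemp(dir=cache_dir)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(result, f)
    os.replace(tmp_path, path)
    return result


calls = []


def heavy():
    # Stand-in for the expensive path computation.
    calls.append(1)
    return {"lmbda": 0.1}


cache_dir = tempfile.mkdtemp()
first = cached_on_disk(cache_dir, ("dataset", 0.5), heavy)
second = cached_on_disk(cache_dir, ("dataset", 0.5), heavy)  # read from disk
```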

This is also kind of related to #331.

jolars · 2022-10-04T08:01:08Z

Good point! I guess there's no reason why every user needs to re-run this locally, since the values should stay constant unless the data changes. Maybe the only good way to do this would be to have an extra script in the benchmark repo that runs these computations, stores the values somewhere in the repo, and then refuses to run for a particular parameter value unless the necessary values are cached in the repo. That seems a little too hacky to me. Maybe this is too much of an edge case to be worth catering to.
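For illustration, the "precompute and store in the repo" idea could look something like this sketch. The JSON file layout, the key format, and both function names are assumptions made for the example, not anything the benchmark repo actually contains.

```python
import json
import os
import tempfile


def store_precomputed(path, dataset_name, ratio, value):
    """Add or update one precomputed value in a JSON table."""
    table = {}
    if os.path.exists(path):
        with open(path) as f:
            table = json.load(f)
    table[f"{dataset_name}:{ratio}"] = value
    with open(path, "w") as f:
        json.dump(table, f, indent=2, sort_keys=True)


def load_precomputed(path, dataset_name, ratio):
    """Look up a stored value; refuse to continue if it is missing."""
    with open(path) as f:
        table = json.load(f)
    key = f"{dataset_name}:{ratio}"
    if key not in table:
        raise KeyError(f"No precomputed value for {key}; run the script first.")
    return table[key]


# A temporary file stands in for a file that would be committed to the repo.
path = os.path.join(tempfile.mkdtemp(), "precomputed.json")
store_precomputed(path, "toy", 0.5, 0.25)
value = load_precomputed(path, "toy", 0.5)

try:
    load_precomputed(path, "toy", 0.9)  # never precomputed
    refused = False
except KeyError:
    refused = True
```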

jolars mentioned this issue Oct 10, 2022

refactor: use glmnet instead of sklearn for path benchopt/benchmark_lasso_path#20

Merged

jolars linked a pull request Nov 7, 2022 that will close this issue

ENH: Only call set_data() once per dataset #466

Open
