Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH avoid extraneous calls to set_data() #454

Open
jolars opened this issue Sep 27, 2022 · 6 comments · May be fixed by #466
Open

ENH avoid extraneous calls to set_data() #454

jolars opened this issue Sep 27, 2022 · 6 comments · May be fixed by #466

Comments

@jolars
Copy link
Contributor

jolars commented Sep 27, 2022

This line

skip, reason = objective.set_dataset(dataset)

means that we re-run set_data() for each solver (plus additional calls to run() for pre-compilation).

We are currently working on a benchmark where the call to set_data() is computationally intensive, which ends up inflating the total amount of compute time orders of magnitude.

I understand the need to run skip() for each solver, but set_data() should produce exactly the same results for all the solvers, so we should not need to run it repeatedly, right?

@agramfort
Copy link
Collaborator

agramfort commented Sep 27, 2022 via email

@jolars
Copy link
Contributor Author

jolars commented Sep 27, 2022

It fits a full lasso path to find a lambda such that a particular deviance ratio (R^2) is obtained.

That would solve it, but I need access to X and y, which I don't have in the constructor.

@tomMoral
Copy link
Member

Can't you do it in the get_data method of the Dataset? this should only gets called once.

@jolars
Copy link
Contributor Author

jolars commented Sep 27, 2022

No, because I need access to the parameters to compute the path, and they're not available in get_data() unless I'm mistaken.

@tomMoral
Copy link
Member

tomMoral commented Oct 2, 2022

Then, I think we should have an extra method for this.
Something like setup_data, that gets called after set_data only if it has not been called before or the data is different (for instance with caching)?

Maybe one point that might be interesting to raise: this feature is not so easy to implement with our current design ensuring that this does not get called multiple times because we aim for embarassingly parallel tasks i.e., each task should be able to run independently of all the others. This means that for every run, one should ensure that the heavy path computations have been made. The caching helps with this as we can say from the disk if a task has been run before or not. But even in this case, if 2 processes are launched simultaneously, both will need to compute the solution as it is not cached yet and they both need it. Another solution would be to do a graph of computation but this makes the code much more complicated so I am not sure we want to go down this road.

This is also kind of related to #331.

@jolars
Copy link
Contributor Author

jolars commented Oct 4, 2022

Good point! I guess there's no reason for why every user need to re-run this locally since the values should stay constant unless the data changes. Maybe the only good way to do this would be to have an extra script in the benchmark repo that runs these computations and stores these values in the repo somewhere and then refuses to run for this particular value of a parameter unless the necessary values are cached in the repo. It seems maybe a little too hacky for me. Maybe this is too much of an edge case to be worth catering to.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants