
Adhering to computation cost budget better #30

Open · Bronzila opened this issue Jun 26, 2023 · 3 comments
Labels: bug (Something isn't working), enhancement (New feature or request)

Comments

@Bronzila
Collaborator

Bronzila commented Jun 26, 2023

The current implementation waits for all started jobs when the runtime budget is exhausted. This makes sense when the budget is a number of function evaluations or iterations, but not when it is a maximum computation cost in seconds.

Toy failure mode:
The computational budget is 1 h, but a new job that would take, e.g., 30 minutes is submitted after 59 minutes of optimization. The optimizer then waits for this job to finish and therefore overshoots the maximum computational budget of 1 h.

For now, a quick fix could be to simply stop all workers when the runtime budget is exhausted; however, this would potentially lose compute time. It might therefore also be interesting to think of a way to checkpoint the optimizer's state in order to resume training later.
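A checkpoint/resume approach could be sketched roughly like this; all names here are hypothetical and not part of the current codebase, and `state` stands in for whatever the optimizer actually tracks:

```python
import pickle

# Hypothetical sketch: persist the optimizer state when the runtime budget
# runs out, so an interrupted run can be resumed later. `state` stands in
# for whatever the optimizer tracks (brackets, evaluation history, rng, ...).

def save_checkpoint(state, path):
    """Serialize the optimizer state to disk."""
    with open(path, "wb") as f:
        pickle.dump(state, f)

def load_checkpoint(path):
    """Restore a previously saved optimizer state."""
    with open(path, "rb") as f:
        return pickle.load(f)
```

On resume, the optimizer would load the checkpoint and continue its main loop with the remaining budget instead of starting from scratch.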

@Bronzila Bronzila added bug Something isn't working enhancement New feature or request labels Jun 26, 2023
@eddiebergman
Contributor

You could look at this answer, which uses sched: https://stackoverflow.com/a/474543/5332072

From there, Dask has a way to essentially shutdown() the Client and then close() it.
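For illustration, the sched approach from the linked answer could look roughly like this; `run_with_deadline` and `shutdown` are hypothetical stand-ins for whatever would close the Dask client and cancel the futures:

```python
import sched
import time

def run_with_deadline(budget_seconds, shutdown):
    # Schedule `shutdown` to fire once the runtime budget elapses.
    scheduler = sched.scheduler(time.monotonic, time.sleep)
    scheduler.enter(budget_seconds, 1, shutdown)  # delay, priority, action
    scheduler.run()  # blocks until the shutdown event has fired
```

Note that scheduler.run() blocks, so in practice this would live in a background thread, or the deadline would simply be checked inside the existing hot loop.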

@Neeratyoy
Collaborator

@eddiebergman, what do we do with the interrupted evaluation?
Assuming the evaluation is a deep learning model training run, is it okay to still exceed the runtime in order to trigger saving the current state?
@Bronzila feel free to share your thoughts too.

@eddiebergman
Contributor

eddiebergman commented Jul 21, 2023

Based on a look over the code, the "hot loop" is here, with the break condition here:

if self._is_run_budget_exhausted(fevals, brackets, total_cost):
break


To return on time

For the Dask case, I would probably do something along the lines of this; it should basically cancel all jobs running in Dask and wait for all of them to return. The wait part isn't fully necessary, but in principle it should be fine.

from dask.distributed import wait

# Cancel all outstanding futures first, then close the client.
for future in self.futures:
    future.cancel()

# Wait for every future to settle (completed or cancelled); note that
# Dask futures need Dask's wait, not concurrent.futures.wait.
wait(self.futures, return_when="ALL_COMPLETED")
self.client.close()

Dask has the property that you can cancel running jobs, but in the non-Dask case (here), where the function is just called directly, you can't cancel it because it runs in the same process; killing it would mean killing the whole optimizer.

else:
    # skipping scheduling to a Dask worker to avoid added overheads
    # in the synchronous case
    self.futures.append(self._f_objective(job_info))

To circumvent this, you would need to run the function in a subprocess of some kind and use psutil to kill that process when needed.
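A minimal sketch of that idea, using the stdlib multiprocessing module rather than psutil (the function, its arguments, and `run_with_kill` itself are all hypothetical stand-ins, not part of the current codebase):

```python
import multiprocessing

# Sketch: run the target function in its own process so the optimizer can
# kill it once the budget is exhausted. `fn` and `args` stand in for the
# real objective call.

def run_with_kill(fn, args, budget_seconds):
    proc = multiprocessing.Process(target=fn, args=args)
    proc.start()
    proc.join(timeout=budget_seconds)   # give it the remaining budget
    if proc.is_alive():                 # still running: budget exhausted
        proc.terminate()                # sends SIGTERM on POSIX
        proc.join()
    return proc.exitcode
```

With psutil, the same terminate-and-wait pattern would apply, just addressed by PID instead of through a Process handle the parent created itself.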


To inform the process so you can save

This is much harder, especially when you don't control the target function. The first thing you need is a handle on the process that is running the target function; then you can send it a SIGTERM with .terminate():

import psutil

process = psutil.Process(<process-id of the thing to signal>)
process.terminate()  # sends SIGTERM on POSIX

The correct procedure here, by OS conventions, is for the program to clean up and exit promptly. The way to hook into this from Python is the signal module, specifically signal.signal:

import signal

def callback(signal_num, framestack) -> None:
    # ... cleanup, save a model, whatever
    ...

signal.signal(signal.SIGTERM, callback)

The tricky part is that users have to set this up themselves: their target function is going to be called, and this callback has to be registered once inside the process that runs the target function. I do not know how you'd like to do that. I think your best approach is to simply give an example and move on; trying to handle this automatically would be a nightmare to implement and maintain.
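Such a user-facing example might look like the following sketch; `target_function` and `save_model` are hypothetical names for the user's evaluation function and their own checkpointing routine:

```python
import signal

def target_function(config, save_model):
    # The user registers the handler inside the process that runs the
    # evaluation, so a terminate() from the optimizer triggers one final
    # checkpoint before exiting.
    def on_sigterm(signum, frame):
        save_model()          # persist whatever state exists so far
        raise SystemExit(0)   # then exit cleanly

    signal.signal(signal.SIGTERM, on_sigterm)
    # ... run the actual (long) training loop here ...
```

The optimizer side then only needs to send SIGTERM; everything evaluation-specific stays in user code.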

P.S.

This won't work when using a custom remote Dask server, as you have no way to send a signal to the process running on that other machine (or maybe Dask does?); it only works if things are done with local processes. Perhaps Dask has some unified way of handling this.

3 participants