Due to a known memory leak when instantiating subclasses of SymEngine's Symbol objects (SymEngine is one of our upstream dependencies; see symengine/symengine.py#379), running ESPEI with parallelization will cause memory to grow in each worker.
Only parallel runs trigger significant memory growth, because parallelization uses the pickle library to serialize and deserialize symbol objects, creating new objects that can never be freed. When running without parallelization (mcmc.scheduler: null), no new symbols are created.
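As a minimal sketch of why this matters (MySymbol and the loop are purely illustrative; the only assumption is the upstream report that instances of Symbol subclasses are never freed):

import symengine

# pycalphad's symbols (e.g. v.SiteFraction) are subclasses of symengine's
# Symbol; MySymbol is a hypothetical stand-in for those classes.
class MySymbol(symengine.Symbol):
    pass

# Per symengine/symengine.py#379, every instantiation of a Symbol subclass
# leaks: the objects created below are unreachable after each iteration,
# yet resident memory keeps growing. Unpickling symbols inside a worker
# performs exactly this kind of instantiation, which is why only parallel
# runs show the growth.
for i in range(100_000):
    MySymbol(f"x{i}")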
Until symengine/symengine.py#379 is fixed, some mitigation strategies to avoid running out of memory are:
Run ESPEI without parallelization by setting scheduler: null (see the YAML sketch after this list)
(Under consideration to implement): when parallelization is active, add an option to restart the workers every N iterations (a sketch of this is in the patch below).
(Under consideration to implement): remove Model objects from the keyword arguments of ESPEI's likelihood functions. Model objects contribute many symbol instances in the form of v.SiteFraction objects. We should be able to get away with only using PhaseRecord objects, but a few places use Model.constituents to infer the sublattice model and internal degrees of freedom, and those would need to be rewritten.
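For the first mitigation, disabling parallelization is a one-line change in the ESPEI input file. A minimal sketch (the file names and surrounding keys are hypothetical placeholders; only mcmc.scheduler: null is the setting in question):

system:
  phase_models: my-phases.json    # hypothetical input files
  datasets: my-datasets/
mcmc:
  iterations: 1000
  input_db: my-dft.tdb
  scheduler: null                 # single process: no workers, no pickling of symbols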
Code to restart the workers would look something like the following patch to opt_mcmc.py. This hard-codes the restart interval, but it could be added as a parameter and propagated through:
diff --git a/espei/optimizers/opt_mcmc.py b/espei/optimizers/opt_mcmc.py
index 4d08ccf..c167e9e 100644
--- a/espei/optimizers/opt_mcmc.py
+++ b/espei/optimizers/opt_mcmc.py
@@ -167,12 +167,19 @@ class EmceeOptimizer(OptimizerBase):
     def do_sampling(self, chains, iterations):
         progbar_width = 30
         _log.info('Running MCMC for %s iterations.', iterations)
+        scheduler_restart_interval = 50  # iterations
         try:
             for i, result in enumerate(self.sampler.sample(chains, iterations=iterations)):
                 # progress bar
                 if (i + 1) % self.save_interval == 0:
                     self.save_sampler_state()
                     _log.trace('Acceptance ratios for parameters: %s', self.sampler.acceptance_fraction)
+                if (self.scheduler is not None) and ((i + 1) % scheduler_restart_interval == 0):
+                    # Note: resetting the scheduler will reset the logger settings for the workers.
+                    # You'd typically want to run the following, but the verbosity/filename are out of scope here:
+                    # self.scheduler.run(espei.logger.config_logger, verbosity=log_verbosity, filename=log_filename)
+                    self.scheduler.restart()
+                    _log.info(f"Restarting scheduler (interval = {scheduler_restart_interval} iterations)")
                 n = int((progbar_width) * float(i + 1) / iterations)
                 _log.info("\r[%s%s] (%d of %d)\n", '#' * n, ' ' * (progbar_width - n), i + 1, iterations)
         except KeyboardInterrupt:
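For reference, the restart primitive the patch leans on: in ESPEI's parallel mode the scheduler is expected to be a dask.distributed Client (an assumption here, based on the .restart() call), whose restart() method kills and relaunches every worker process, releasing whatever memory the workers had accumulated:

from dask.distributed import Client

client = Client()   # e.g. a local cluster of worker processes
client.restart()    # relaunch all workers, discarding leaked symbol instances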