
Add support for saving and loading simulation state to / from files #1227

Open
wants to merge 21 commits into master

Conversation

@matt-graham (Collaborator) commented Dec 11, 2023

Potentially resolves #86, though currently this doesn't deal with exposing this functionality via scenarios.

Adds new methods save_to_pickle and load_from_pickle that respectively save the current simulation state to a pickle file and load the simulation state from a pickle file (the latter being a class method so that a simulation can be loaded directly as Simulation.load_from_pickle). Pickling is handled by dill, as this supports a much wider range of Python objects than the built-in pickle implementation. dill is added to the package dependencies here, but the import is currently wrapped in logic to avoid ImportErrors in environments which do not have dill available (with save_to_pickle and load_from_pickle raising informative exceptions in this case).
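
For context, a minimal sketch of the guarded-import pattern and the two methods described above (the bodies are simplified relative to the actual diff, and names other than Simulation, save_to_pickle and load_from_pickle are illustrative):

```python
from pathlib import Path

try:
    import dill
    DILL_AVAILABLE = True
except ImportError:
    DILL_AVAILABLE = False


class Simulation:
    ...

    def save_to_pickle(self, pickle_path: Path) -> None:
        """Serialise the current simulation state to a pickle file using dill."""
        if not DILL_AVAILABLE:
            raise RuntimeError("Cannot save to pickle: dill is not installed")
        with open(pickle_path, "wb") as pickle_file:
            dill.dump(self, pickle_file)

    @classmethod
    def load_from_pickle(cls, pickle_path: Path) -> "Simulation":
        """Load a previously saved simulation state from a pickle file."""
        if not DILL_AVAILABLE:
            raise RuntimeError("Cannot load from pickle: dill is not installed")
        with open(pickle_path, "rb") as pickle_file:
            return dill.load(pickle_file)
```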

The contents of the current Simulation.simulate method have also been factored out into three separate methods, Simulation.initialise, Simulation.run_simulation_to and Simulation.finalise, that allow initialising, running and finalising the simulation separately, with Simulation.simulate retaining the same behaviour by just calling the three in sequence. This allows simulations to be partially run to an intermediate date before the simulation end date, saved to file, and then reloaded and continued.
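
As a rough illustration of the intended usage (not taken verbatim from the diff; the constructor arguments, keyword names and dates here are assumptions), partially running, saving and resuming a simulation would look something like:

```python
from pathlib import Path

from tlo import Date, Simulation

sim = Simulation(start_date=Date(2010, 1, 1))
# ... register modules and create the initial population as usual (omitted) ...

# Equivalent to sim.simulate(end_date=Date(2012, 1, 1)), but split into stages
sim.initialise(end_date=Date(2012, 1, 1))
sim.run_simulation_to(to_date=Date(2011, 1, 1))  # run part-way only
sim.save_to_pickle(pickle_path=Path("suspended_simulation.pkl"))

# Later, possibly in a separate process: reload and continue to the end date
resumed_sim = Simulation.load_from_pickle(Path("suspended_simulation.pkl"))
resumed_sim.run_simulation_to(to_date=Date(2012, 1, 1))
resumed_sim.finalise()
```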

As there is also global state recorded in tlo.logging, we also need to reconfigure logging when loading a simulation from file. I initially included some logic in load_from_pickle to inject the loaded simulation into the tlo logger and set the output file to the previous logging path (the logging FileHandler loaded by dill is not able to acquire a lock, causing deadlocks if used), but decided it would be better to make this step explicit, particularly as I'd guess we would often want to write to a new log file when resuming a loaded simulation.
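
Concretely, the explicit step on the caller's side might look something like the following sketch (configure_logging and its arguments here stand in for whatever reconfiguration hook is used and are assumptions rather than a definitive API):

```python
from pathlib import Path

from tlo import Simulation

resumed_sim = Simulation.load_from_pickle(Path("suspended_simulation.pkl"))
# Point logging at a fresh output file rather than reusing the FileHandler
# carried over in the pickled state, which cannot reacquire its lock
resumed_sim.configure_logging(filename="resumed_run", directory=Path("./outputs"))
```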

A set of tests that check for consistency of simulations when saving to and loading from file, including using this to resume a partially run simulation, is also added.
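
The shape of these tests is roughly as follows (a simplified sketch rather than the actual test code; make_simulation, the dates and the comparison via population.props are illustrative assumptions):

```python
import pandas as pd
from tlo import Date, Simulation

END_DATE = Date(2012, 1, 1)
MID_DATE = Date(2011, 1, 1)


def test_resumed_simulation_matches_full_run(tmp_path, make_simulation):
    # make_simulation is a hypothetical fixture returning identically seeded,
    # identically configured Simulation instances
    full_sim = make_simulation()
    full_sim.simulate(end_date=END_DATE)

    partial_sim = make_simulation()
    partial_sim.initialise(end_date=END_DATE)
    partial_sim.run_simulation_to(to_date=MID_DATE)
    partial_sim.save_to_pickle(pickle_path=tmp_path / "simulation.pkl")

    resumed_sim = Simulation.load_from_pickle(tmp_path / "simulation.pkl")
    resumed_sim.run_simulation_to(to_date=END_DATE)
    resumed_sim.finalise()

    # The resumed simulation should end up with the same population state
    pd.testing.assert_frame_equal(
        full_sim.population.props, resumed_sim.population.props
    )
```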

Apologies for the unrelated formatting changes in src/tlo/simulation.py; I was using black to autoformat my changes and forgot it would also reformat the rest of the module. I can revert these bits if they make this a pain to review.

Questions

  • Instead of save_to_pickle and load_from_pickle, should we just use the names save and load?
  • Should we explicitly set simulation.output_file to None in Simulation.load_from_pickle to guard against accidental use of the previous log file (and potential deadlock issues)?

@matt-graham (Collaborator, Author)

The failing test in tests/test_malaria.py is due to #1230, which #1231 should fix.

@tamuri (Collaborator) commented Dec 12, 2023

> As there is also global state recorded in tlo.logging, we also need to reconfigure logging when loading a simulation from file. I initially included some logic in load_from_pickle to inject the loaded simulation into the tlo logger and set the output file to the previous logging path (the logging FileHandler loaded by dill is not able to acquire a lock, causing deadlocks if used), but decided it would be better to make this step explicit, particularly as I'd guess we would often want to write to a new log file when resuming a loaded simulation.

Yes, we want new log files when restoring a simulation, because you'd potentially run many simulations from the same saved simulation. And they'd go in as separate Azure Batch runs, so different directories etc.

@tamuri (Collaborator) commented Dec 18, 2023

I'm working on scenarios using this - will give comments in light of that.

@matt-graham (Collaborator, Author)

@tamuri what would be the next steps to work on for this? I think you mentioned there was some issue with non-determinism in logging that your testing with this identified. Did you get anywhere with looking at how to use this with scenarios, and is there something for me to pick up there?

@tamuri (Collaborator) commented May 18, 2024

I've pushed my changes to scenario.py here, as well as three scenario files and a script to check log output matches.

@matt-graham (Collaborator, Author)

I've started to look at the differences that arise between the logs from a 'full' scenario run without any suspending or resuming and the merged logs from a pair of suspended and resumed runs (with otherwise identical scenario settings), using the scenario files and script @tamuri added on the branch tamuri/suspend-restore-scenario.

From what I can see so far, most (possibly all) of the discrepancies arise from bugs that also affect logging without suspending and resuming, specifically that the columns entry logged in the header message for the first log entry is not consistent with later log entries, because of one or more of the following:

  1. Keys in the (dict / dataframe) data logged are in different orders between the first and subsequent calls to the logger
  2. Keys in the (dict / dataframe) data logged on the first and subsequent calls to the logger differ (this seems to typically be when logging a multi-index series generated by a group-by operation and converted to a dict, where not all combinations of the index are present on each log iteration, for example if grouping on age_years one or more specific age_years values may be missing for any given set of data)
  3. The types of the values associated with the data logged on the first and subsequent calls differ (this one is very common)

The ordering issues (1) are easy enough to resolve by always sorting the data dictionary by key before logging both the header and value messages.
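
A minimal sketch of what I mean for (1) (the helper name is made up, and this isn't the exact change that would go into tlo.logging):

```python
def _with_sorted_keys(data: dict) -> dict:
    """Return a copy of the logged data with keys in a deterministic order."""
    return {key: data[key] for key in sorted(data)}

# Applied to the data dict before emitting both the header message (which
# records the column names / types) and each subsequent value message, so the
# column order is identical across all log entries for a given log key.
```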

The non-overlapping dict keys (2) will probably require manual fixing in each case as the current structured logging approach fundamentally relies on entries being alignable with each other.

For the non-constant column types (3) I am not sure what the implication is; often this is, for example, a value initially of int type subsequently being float, or a bool subsequently being NoneType, along with some cases of types swapping between scalar types and lists (mainly in the RTI module for the latter).
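
As an illustrative (made-up, not taken from the actual logs) example of one way the int to float case can arise with pandas, a missing value in a later iteration silently promotes the whole column:

```python
import pandas as pd

first_logged = pd.Series({"0": 5, "1": 3, "2": 4})
print(first_logged.dtype)  # int64 -> header message records an int column

later_logged = pd.Series({"0": 5, "1": None, "2": 4})
print(later_logged.dtype)  # float64 once a value is missing -> inconsistent with the header
```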
