Inference file writing #198

Closed
saraloo opened this issue Apr 2, 2024 · 5 comments · May be fixed by #203
Labels
inference Concerns the parameter inference framework

Comments

@saraloo
Contributor

saraloo commented Apr 2, 2024

Inference is writing so many files for each iteration that it uses up too much space. This causes issues when running batch runs in particular as space fills up.

We should add an option to skip saving every iteration. I'm not sure whether there's a way to limit the number of files while still keeping the inference information we need to see the evolution of parameters or likelihoods, or whether we should just add an option to turn saving off and accept losing this information.

@saraloo saraloo added the inference Concerns the parameter inference framework label Apr 2, 2024
@saraloo
Contributor Author

saraloo commented Apr 3, 2024

Fixing these things in https://github.com/HopkinsIDD/flepiMoP/tree/breaking-improvements-fixsaving

From Shaun: We need to do a couple things, some already started:

  • fix initial conditions to not resave full seir every iteration for every chain --> DONE
    • I have now fixed this so it filters to only the start date of the config -- this is what gempyor does too. -- this will substantially reduce the storage burden, but not fix it
  • change the gempyor code in seeding_ic.py to not read a new file from what was defined by the config every time. Is this what it's actually doing? It should read from the init files after the first setup run (run by initialize_mcmc_first_block()) when running inference.
  • fix it so it's not saving files for every iteration, especially SEIR files
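
The first item above (keeping only the config start date when deriving initial conditions from SEIR output) can be sketched roughly as follows. This is a minimal illustration, not the flepiMoP or gempyor code: the row layout, the `filter_to_start_date` helper, and the example values are all assumptions made for the sketch.

```python
from datetime import date


def filter_to_start_date(seir_rows, start_date):
    """Keep only the rows whose 'date' equals the config start date.

    Hypothetical helper: `seir_rows` is assumed to be a list of dicts,
    one per (date, compartment) pair.
    """
    return [row for row in seir_rows if row["date"] == start_date]


# Made-up SEIR output covering two dates.
seir_rows = [
    {"date": date(2024, 1, 1), "compartment": "S", "value": 990.0},
    {"date": date(2024, 1, 1), "compartment": "I", "value": 10.0},
    {"date": date(2024, 1, 2), "compartment": "S", "value": 985.0},
    {"date": date(2024, 1, 2), "compartment": "I", "value": 14.0},
]

# Only the start-date rows are needed as initial conditions,
# so only these would be written to the init file.
init_rows = filter_to_start_date(seir_rows, date(2024, 1, 1))
```

The saving grows with the length of the simulation: the init file holds one date's worth of rows instead of the full time series.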

@saraloo
Contributor Author

saraloo commented Apr 3, 2024

From Alison some more notes:

  • Re SEIR files: We have to decide when/if we want intermediate files saved. Here are the options
    • for seir/chimeric/intermediate files - should almost never be saved, they have no meaning (there is no actual simulation corresponding to this output - it’s taking timeseries outputs from different simulations for each state depending on whether they had a chimeric acceptance or rejection this iteration). They are not currently being saved except at the start of a block, which may be necessary; I don’t understand the interblock file stuff.
    • for seir/global/intermediate files - options are a) currently - always save, so we can see the model fits for any iteration of the MCMC chain, but this takes up a lot of space. b) save only on global acceptances. Then we can still see model fits for any iteration, since whenever there is a rejection you just look at the last accepted value. This will reduce file size, but only to about 20-50% of the current value. c) never save - then we can never see the model fits for intermediate iterations, but we could always re-run the model with the parameters saved each iteration, and ideally the variables of interest should be in hosp files anyway. Doing (c) will dramatically reduce disk usage.
    • Note that if we do (b) - save only global acceptances, we should probably use this rule for ALL global intermediate files (ie snpi, hnpi, init etc). When we make plots of parameter traces during MCMC evolution, we can easily just plot the last accepted value for iterations that had rejections.
  • Re initial conditions: We have to decide when/if we want intermediate initial conditions files saved. Here are the options:
    • for init/chimeric/intermediate files: a) always save these, since sometimes initial conditions are perturbed. b) only save these if initial conditions are perturbed. (b) is inconsistent with how we handle other file types - for example hnpi files are saved each iteration independent of whether they are perturbed or not. In either case, we should make sure that we do what Shaun said above - if initial conditions are taken from a prior run’s SEIR file (ie, method FromFile), we need to first filter only for the date being used for initial conditions, not all the dates in that file.
    • for init/global/intermediate files: a) currently - always save, so we can see initial conditions for every iteration. b) save only on global acceptances. Then we can still see initial conditions for any iteration, since whenever there is a rejection you just look at the last accepted value. This will reduce file size to about 20-50% of the current value. c) only save these if initial conditions are perturbed, but as for chimeric above this is inconsistent with how we handle other file types.
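
For option (b), rebuilding a full parameter trace from files saved only on global acceptances amounts to carrying the last accepted value forward. A minimal sketch, assuming a hypothetical `saved` mapping from 1-based iteration number to the accepted value (the names and numbers are illustrative only):

```python
def reconstruct_trace(saved, n_iterations):
    """Rebuild a per-iteration trace from values saved only on acceptance.

    `saved` maps iteration number (1-based) to the accepted value.
    Iterations missing from `saved` were rejections, so they repeat
    the most recent accepted value.
    """
    if 1 not in saved:
        raise ValueError("the first iteration must be saved as a starting point")
    trace = []
    last = None
    for iteration in range(1, n_iterations + 1):
        if iteration in saved:
            last = saved[iteration]
        trace.append(last)
    return trace


# Acceptances happened at iterations 1, 4 and 5 only.
saved = {1: 0.30, 4: 0.27, 5: 0.25}
trace = reconstruct_trace(saved, 7)
```

Here `trace` has seven entries even though only three values were stored, which is why plots of parameter evolution remain possible under option (b).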

Include a toggle controlled by an input argument to turn this saving on or off
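
One way such a toggle could look is a single policy value covering options (a), (b) and (c). This is only a sketch: `SavePolicy` and `should_save` are hypothetical names, not existing flepiMoP arguments.

```python
from enum import Enum


class SavePolicy(Enum):
    """Hypothetical policy names for the three options discussed above."""
    ALWAYS = "always"        # option (a): save every iteration
    ON_ACCEPT = "on_accept"  # option (b): save only on global acceptance
    NEVER = "never"          # option (c): never save intermediate files


def should_save(policy, accepted):
    """Decide whether to write intermediate files for this iteration."""
    if policy is SavePolicy.ALWAYS:
        return True
    if policy is SavePolicy.ON_ACCEPT:
        return accepted
    return False


# Made-up acceptance history for five iterations.
accept_history = [True, False, False, True, False]

# Number of iterations that would write files under each policy.
counts = {
    policy.value: sum(should_save(policy, a) for a in accept_history)
    for policy in SavePolicy
}
```

Applying one policy to all global intermediate file types (snpi, hnpi, init, seir) would keep the behaviour consistent across them.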

@jcblemai
Collaborator

jcblemai commented Apr 4, 2024

Blocks are an automatically scheduled resume. I agree we should try to keep that (not easy, and not so useful on slurm clusters).

"change the gempyor code in seeding_ic.py to not read a new file from what was defined by the config every time -- is this what it's actually doing?"

Do you mean always reading the first index (00000001 or 0000000) instead of following the index?
It is not necessary for gempyor to save SEIR files (it doesn't on emcee); I will work on making that the default for inference.
The way it works now on emcee is that the user specifies:

nwalkers = 256 # This is equivalent to slots
niter = 400    # length of the chains
nsamples = 100 # number of likelihood eval to write to disk...
thin=5         # how to thin the chain to produce the samples. Every "thin" iteration will be taken into account

Then it runs 400 iterations of emcee without writing anything to disk except an HDF5 dataset of the chain (just the accepted values of the fitted parameters, very compact -- updated in a crash-safe way at every iteration). Then it produces 100 samples that it writes fully to disk (seir, hosp and all). Since in this example there are 256 slots, it randomly chooses some of them. But if I had asked for 1000 samples instead, it would have produced 256 samples from the last iteration (all 256 slots), the next 256 samples from five iterations earlier (because thin=5, to avoid overly correlated samples), then from ten iterations earlier, and so on. I am also planning on making it compatible with our scheme.
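
The sample-selection rule described above can be sketched as follows. This is not emcee's API or the gempyor implementation, just an illustration of the rule as stated: `select_samples` is a hypothetical helper, iterations and walkers are 0-based, and the fixed seed is an assumption made for reproducibility.

```python
import random


def select_samples(nwalkers, niter, nsamples, thin, seed=0):
    """Return (iteration, walker) pairs, starting from the last iteration.

    Takes every walker from the last iteration, then from the iteration
    `thin` steps earlier, and so on. When fewer than `nwalkers` samples
    are still needed, a random subset of walkers is drawn instead.
    """
    rng = random.Random(seed)
    picks = []
    iteration = niter - 1
    while len(picks) < nsamples:
        if iteration < 0:
            raise ValueError("chain too short for the requested nsamples and thin")
        remaining = nsamples - len(picks)
        if remaining >= nwalkers:
            walkers = list(range(nwalkers))
        else:
            walkers = sorted(rng.sample(range(nwalkers), remaining))
        picks.extend((iteration, w) for w in walkers)
        iteration -= thin
    return picks


# 100 samples fit within one iteration: a random subset of the 256 walkers.
small = select_samples(nwalkers=256, niter=400, nsamples=100, thin=5)

# 1000 samples need four thinned iterations (256 + 256 + 256 + 232).
large = select_samples(nwalkers=256, niter=400, nsamples=1000, thin=5)
```

Only the selected (iteration, walker) pairs would then be re-simulated and written fully to disk, which is what keeps the file count tied to `nsamples` rather than to the chain length.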

This is very convenient because I can stop and restart the run as I want/when it crashes. We never got around to doing it with our classical scheme. It's also intuitive to specify the number of samples and to plot from a single file.
Perhaps we could add /mcmc/ and /plot/ as file types in model_output, for plots and chains.

Now for our old inference -- I think we should distinguish the file operations needed to communicate between gempyor and R from those needed to save a chain on disk. It's very confusing how conflated these are at the moment.

  • For communication: Perhaps we could have gempyor always read from the same files (overwritten by inference, e.g. the index 00000000) and write the same files at a certain index.
  • For chains: Then inference would read the gempyor files and choose whether to store them or not according to our current naming scheme. From Alison's message, I'd agree (and suggest c) for seir/global/intermediate. For initial conditions I'm not really sure. Perhaps I'd keep track of the full chimeric chain and just the global accepts?

Do we have the machinery to fit initial conditions on the inference side?

@jcblemai
Collaborator

Some work on that in #199

@jcblemai jcblemai linked a pull request Apr 16, 2024 that will close this issue
15 tasks
@alsnhll
Collaborator

alsnhll commented Apr 22, 2024

The main issues here regarding inference file saving are fixed with #205. The later discussion about how and where gempyor saves files and where R looks for them hasn't been addressed, but it is not the main part of this issue anyway.

@alsnhll alsnhll closed this as completed Apr 22, 2024
3 participants