Stepsize no longer reset to 1 if term_buffer = 0 (Issue #3023) #3027

Open · wants to merge 7 commits into base: develop from bugfix/issue-3023

Conversation

@bbbales2 (Member)

Submission Checklist

  • Run unit tests: ./runTests.py src/test/unit
  • Run cpplint: make cpplint
  • Declare copyright holder and open-source license: see below

Summary

Fix for #3023

Side Effects

I changed how the return codes from covar_adaptation::learn_covariance and var_adaptation::learn_variance work.

Copyright and Licensing

Please list the copyright holder for the work you are submitting (this will be you or your assignee, such as a university or company): Columbia University

By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses:

@bbbales2 (Member, Author)

@funko-unko want to verify this fixes the problem?

To get a copy of cmdstan with this patch do:

git clone --recursive https://github.com/stan-dev/cmdstan cmdstan-issue-3023
cd cmdstan-issue-3023/stan
git checkout bugfix/issue-3023
cd ../
make build
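
To check it end to end, something like the following should work (assuming the bernoulli example that ships with cmdstan; file names may differ by version):

make examples/bernoulli/bernoulli
./examples/bernoulli/bernoulli sample adapt term_buffer=0 data file=examples/bernoulli/bernoulli.data.R output file=output.csv
grep '# Step size' output.csv

Before the patch the step size reported after warmup is always 1 when term_buffer=0; with the patch it should be the last adapted value.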

@stan-buildbot (Contributor)

| Name | Old Result | New Result | Ratio | Performance change (1 - new/old) |
| --- | --- | --- | --- | --- |
| gp_pois_regr/gp_pois_regr.stan | 3.47 | 3.42 | 1.01 | 1.43% faster |
| low_dim_corr_gauss/low_dim_corr_gauss.stan | 0.02 | 0.02 | 0.98 | -2.46% slower |
| eight_schools/eight_schools.stan | 0.11 | 0.11 | 0.96 | -3.67% slower |
| gp_regr/gp_regr.stan | 0.16 | 0.16 | 1.0 | -0.5% slower |
| irt_2pl/irt_2pl.stan | 5.31 | 5.23 | 1.01 | 1.42% faster |
| performance.compilation | 91.43 | 88.53 | 1.03 | 3.17% faster |
| low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan | 9.01 | 8.89 | 1.01 | 1.3% faster |
| pkpd/one_comp_mm_elim_abs.stan | 31.32 | 30.09 | 1.04 | 3.91% faster |
| sir/sir.stan | 130.41 | 131.79 | 0.99 | -1.06% slower |
| gp_regr/gen_gp_data.stan | 0.04 | 0.04 | 0.97 | -3.6% slower |
| low_dim_gauss_mix/low_dim_gauss_mix.stan | 3.13 | 3.15 | 0.99 | -0.61% slower |
| pkpd/sim_one_comp_mm_elim_abs.stan | 0.37 | 0.39 | 0.94 | -6.04% slower |
| arK/arK.stan | 1.91 | 1.9 | 1.01 | 0.65% faster |
| arma/arma.stan | 0.63 | 0.71 | 0.89 | -12.22% slower |
| garch/garch.stan | 0.52 | 0.51 | 1.02 | 2.17% faster |

Mean result: 0.99084398021

Jenkins Console Log
Blue Ocean
Commit hash: 4b0ae48


Machine information:
ProductName: Mac OS X
ProductVersion: 10.11.6
BuildVersion: 15G22010

CPU:
Intel(R) Xeon(R) CPU E5-1680 v2 @ 3.00GHz

G++:
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

Clang:
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

@funko-unko

I'll try it out later.

@nhuurre (Contributor) commented Mar 20, 2021

Changing the return code means that stepsize adaptation does not restart after the last metric window. The last window is usually longer than term_buffer (the defaults give a last window size of 500, compared to term_buffer = 50). Does that have some bad side effects?

I think it would be more conservative to simply change the restart value of x_bar_.

void restart() {
  counter_ = 0;
  s_bar_ = 0;  // running average of (adapt_delta - accept_stat)
  x_bar_ = 0;  // running average of log(stepsize); the final stepsize is exp(x_bar_)
}

The first call to learn_stepsize() overwrites x_bar_ anyway, so the initial value here is irrelevant unless you call complete_adaptation() immediately, without learn_stepsize() in between (i.e. term_buffer = 0), in which case the stepsize becomes exp(x_bar_) = exp(0) = 1.
The sampler sets mu_ = log(10*stepsize) right before restart(), so the natural restart value for x_bar_ would be mu_ - log(10).
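
In code, that suggestion would be something like this (a sketch of the proposed change, not what this PR currently does):

void restart() {
  counter_ = 0;
  s_bar_ = 0;
  // mu_ was just set to log(10 * stepsize), so this makes
  // exp(x_bar_) equal the current stepsize instead of 1:
  x_bar_ = mu_ - std::log(10);
}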

@bbbales2 (Member, Author)

Is the trick that:

x_eta = std::pow(counter_, -kappa_);
x_bar_ = (1.0 - x_eta) * x_bar_ + x_eta * x;

and 1.0 - x_eta is always zero on the first iteration because counter_ is always 1?

I think the annoying thing here is that we deal with the magic constant 10 in two places. I wouldn't mind getting rid of the 10x and restarting at x_bar_ = mu_. Regarding this:

does not restart after the last metric window.

I think this should still restart the step size adaptation if there is a term buffer following. At least that was the intention. If there is no term buffer, there is no reset.

The logic to detect the end of the last window with a term buffer is:

void compute_next_window() {
  if (adapt_next_window_ == num_warmup_ - adapt_term_buffer_ - 1)
    return;
  // ...
}

And the finished() check here is instead detecting:

bool finished() {
  if (adapt_window_counter_ + 1 >= num_warmup_) {
    return true;
  }
  // ...
}
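
So with the CmdStan defaults num_warmup = 1000 and term_buffer = 50, for example, compute_next_window() stops scheduling new windows once adapt_next_window_ reaches 949 (= 1000 - 50 - 1), while finished() only triggers at the final warmup iteration, when adapt_window_counter_ + 1 reaches 1000.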

@nhuurre (Contributor) commented Mar 20, 2021

1.0 - x_eta is always zero on the first iteration because counter_ is always 1?

Yes

I think the annoying thing here is that we deal with the magic constant 10 in two places.

No, not two.

this->stepsize_adaptation_.set_mu(log(10 * this->nom_epsilon_));
this->stepsize_adaptation_.set_mu(log(10 * this->nom_epsilon_));
sampler.get_stepsize_adaptation().set_mu(log(10 * stepsize));
sampler.get_stepsize_adaptation().set_mu(log(10 * stepsize));

(each line above is a distinct call site in the codebase)

(And that's just nuts, there's also static_hmc)

I wouldn't mind getting rid of the 10x

Related to issue #2670

That 10x isn't the only janky thing about this. init_stepsize() seems to hardcode adapt_delta=0.8

int direction = delta_H > std::log(0.8) ? 1 : -1;

(and shouldn't that be abs(delta_H) > abs(std::log(0.8))?)

I think the "correct" stepsize multiplier when changing windows is probably something like the largest eigenvalue of (old metric / new metric).

I think this should still restart the step size adaptation if there is a term buffer following.

Ah, you're right. adapt_window_counter_ doesn't reach num_warmup_ if term_buffer_ > 0. Nevermind then.

@betanalpha (Contributor)

The current behavior is intended.

After the final windowed adaptation update there is little guidance on the step size for the terminal buffer window. The step size is initialized to 1, which is roughly near the optimal value for a low-dimensional multivariate normal target if the metric has been adapted perfectly. More generally, however, the optimal step size for sampling purposes is not simply related to the eigenvalues of M^{-1} Covar(target), and hence the scaling in the optimal step size is not simply related to the eigenvalues of M_{new}^{-1} M_{old}.

The dual averaging adaptation sets mu to 10 times the initial epsilon to encourage more aggressive adaptation early on; this is a common heuristic for setting mu when optimizing.

The objective of the step size initialization is only to find a rough value, and the precise adaptation target is not especially important. In particular, the initialization considers the acceptance probability after a single step, which is not characteristic of the acceptance probability averaged over the entire trajectory (either uniformly weighted, as is currently implemented, or properly weighted, as it should be implemented).

Because there is no great way to predict a good step size after updating the inverse metric, a non-zero terminal buffer window was always intended to be run. Given that users are running without a terminal buffer window, a message warning users that the step size will not be adapted when term_buffer = 0 would certainly be helpful.

@funko-unko

I don't know whether I'm the only one doing this, but please don't issue a warning and please report the last used step size.

Thanks.

@bbbales2 (Member, Author)

The current behavior is intended.

Well that doesn't mean we can't change it.

I don't really see the importance of setting this to 1.0. When we continue adaptation we're setting it to 10x the old timestep anyway. If anyone wants to use the stepsize 1 for something, they can just use stepsize 1.

Sure, the adaptation might be bad, but the current behavior is confusing, stepsize = 1 will probably be bad too, and this is easy to fix.

@yizhang-yiz commented Mar 24, 2021

I agree with @bbbales2 & @funko-unko: even though it's not a practice we want to recommend, if someone sets term_buffer=0 then the outcome of stepsize=1 violates the principle of least surprise.

@funko-unko

My use case is of course to use only the first two adaptation windows on a series of converging/contracting posteriors, and to use term_buffer != 0 only for the last one, as described here: https://discourse.mc-stan.org/t/fitting-ode-models-best-efficient-practices/21018/60?u=funko_unko

@nhuurre (Contributor) commented Mar 27, 2021

The step size is initialized to 1, which is roughly near the optimal value for a low-dimensional multivariate normal target if the metric has been adapted perfectly. More generally, however, the optimal step size for sampling purposes is not simply related to the eigenvalues of M^{-1} Covar(target), and hence the scaling in the optimal step size is not simply related to the eigenvalues of M_{new}^{-1} M_{old}.

I thought more about this because it feels like a multivariate normal target with perfect adaptation should be simple enough to solve analytically. Working with a small-stepsize approximation, I got fairly strong bounds on the single-step accept probability, but of course that may not be indicative of longer trajectories or more complicated target distributions.
Notes attached if someone wants to check my math.
stepsize-bounds.pdf

@bbbales2 (Member, Author)

@betanalpha Any further thoughts on this? I would like to make this change.

@yizhang-yiz left a review comment

To directly address #3023, can we add a test that calls a service function like hmc_nuts_dense_e_adapt with term_buffer=0 and checks the adapted stepsize?
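
For context, the reset-to-1 mechanism is easy to demonstrate at the adapter level (an illustrative sketch of the pre-fix behavior, not the service-level test requested above):

#include <cmath>
#include <gtest/gtest.h>
#include <stan/mcmc/stepsize_adaptation.hpp>

// With an empty term buffer, complete_adaptation() runs right after
// restart() with no learn_stepsize() calls in between, so the stepsize
// comes back as exp(x_bar_) = exp(0) = 1 regardless of what was learned.
TEST(StepsizeAdaptation, emptyTermBufferResetsToOne) {
  stan::mcmc::stepsize_adaptation adaptation;
  double epsilon = 0.25;  // pretend this was adapted in earlier windows
  adaptation.set_mu(std::log(10 * epsilon));
  adaptation.restart();
  adaptation.complete_adaptation(epsilon);
  EXPECT_FLOAT_EQ(1.0, epsilon);
}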

Review comments on src/stan/mcmc/windowed_adaptation.hpp (outdated, resolved) and src/test/unit/mcmc/covar_adaptation_test.cpp (resolved).
@funko-unko

Oh, right, I'll have a look whenever my current computation finishes...

@bbbales2 (Member, Author) commented Apr 3, 2021

To directly address #3023

I added a direct test. It seems a little weird without context (like someone coming along might not understand why the stepsize should not be 1), so I'm down to add a comment about why it is there, but it serves the purpose.

Btw don't approve or merge this until @betanalpha okays it or we just disagree and vote on it. Thanks for the review though.

@stan-buildbot (Contributor)

| Name | Old Result | New Result | Ratio | Performance change (1 - new/old) |
| --- | --- | --- | --- | --- |
| gp_pois_regr/gp_pois_regr.stan | 3.39 | 3.34 | 1.01 | 1.34% faster |
| low_dim_corr_gauss/low_dim_corr_gauss.stan | 0.02 | 0.02 | 0.99 | -1.04% slower |
| eight_schools/eight_schools.stan | 0.12 | 0.11 | 1.04 | 4.03% faster |
| gp_regr/gp_regr.stan | 0.16 | 0.16 | 1.01 | 0.61% faster |
| irt_2pl/irt_2pl.stan | 5.95 | 6.0 | 0.99 | -0.74% slower |
| performance.compilation | 91.11 | 89.51 | 1.02 | 1.76% faster |
| low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan | 8.63 | 8.74 | 0.99 | -1.31% slower |
| pkpd/one_comp_mm_elim_abs.stan | 28.63 | 28.54 | 1.0 | 0.32% faster |
| sir/sir.stan | 123.0 | 136.59 | 0.9 | -11.05% slower |
| gp_regr/gen_gp_data.stan | 0.03 | 0.04 | 0.98 | -1.68% slower |
| low_dim_gauss_mix/low_dim_gauss_mix.stan | 3.12 | 3.06 | 1.02 | 1.96% faster |
| pkpd/sim_one_comp_mm_elim_abs.stan | 0.4 | 0.4 | 1.0 | -0.2% slower |
| arK/arK.stan | 1.87 | 1.88 | 1.0 | -0.37% slower |
| arma/arma.stan | 0.85 | 0.63 | 1.34 | 25.4% faster |
| garch/garch.stan | 0.6 | 0.56 | 1.08 | 7.42% faster |

Mean result: 1.02476505228

Jenkins Console Log
Blue Ocean
Commit hash: 40ee1c2


Machine information:
ProductName: Mac OS X
ProductVersion: 10.11.6
BuildVersion: 15G22010

CPU:
Intel(R) Xeon(R) CPU E5-1680 v2 @ 3.00GHz

G++:
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

Clang:
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

@betanalpha (Contributor) commented Apr 5, 2021 via email

@nhuurre (Contributor) commented Apr 5, 2021

After each adaptation window the inverse metric is updated with the aggregated information and the step size is reset to a default near 1

Actually, the stepsize does not reset to one between the windows. Instead the algorithm invokes init_stepsize(), which doubles or halves the stepsize a few times, aiming for a Langevin-step accept probability in the neighbourhood of 0.8.
If you have term_buffer >= 1, this is the stepsize you get.
The "surprising" reset-to-one behaviour is the result of how the stepsize adapter restarts. The initial value of x_bar_ is set to zero instead of the log of the current stepsize. The initial value is essentially irrelevant because the dual averaging gives it zero weight; the first real update erases it completely--unless the window is empty and no update ever happens.

Is the jump from || . ||_{-} to | | correct here? L^{-1} M^{-1} isn’t positive definite and so the distribution of acceptance probabilities shouldn’t be symmetric. Indeed even in the normal case the Hamiltonian error doesn’t oscillate symmetrically around zero but rather is biased towards larger energies.

I'm only using the leading order epsilon^3 term for the single-step energy error. That term is linear in p and the distribution of p is certainly symmetric. The bias you note is of order epsilon^4 so it's negligible in this approximation. (The approximation may well be too crude to be interesting.)

@bbbales2 (Member, Author) commented Apr 6, 2021

step size is reset to a default near 1 because there is no other decent guess

As far as I know the initial guess for the stepsize in the next interval is 10x the last one.

in misconceptions about how the intermediate adaptation windows work.

Nah nah, I don't think it's a misconception -- I see what you're saying here (after the metric is updated, the stepsize still corresponds to the previous metric, so we'd want to update it).

I don't think that changing the behavior for term_buffer = 0 is going to compromise anything. Like sure the sampler is in a dubious state, but the user requested term_buffer = 0 so, yeah, that's intentional, and they're following up with more adaptation where the stepsize can change (and doing a non-zero term buffer at the end of it all).

(The approximation may well be too crude to be interesting.)

I do like the idea of not doing the 10x, which seems pretty crude.

@funko-unko how are you currently working around this?

@bbbales2 (Member, Author) commented Apr 6, 2021

Oh yeah I guess -- is everyone here comfortable if we just try to do a vote on this? Like, the SGB rules say votes are for major issues and this seems quite small, but I'm happy just manufacturing a decision and rolling with it rather than arguing back and forth too long.

If I'm stopping the discussion too early just say so.

Niko's timestep adaptation seems interesting re: the discussions here + the thing here, so either way this PR goes we can still keep that in mind.

@nhuurre (Contributor) commented Apr 6, 2021

As far as I know the initial guess for the stepsize in the next interval is 10x the last one.

It's not. See this code

this->init_stepsize(logger);
this->stepsize_adaptation_.set_mu(log(10 * this->nom_epsilon_));

The call to init_stepsize() adjusts the stepsize between the intervals, and set_mu does not set the initial guess but rather some kind of shrinkage target.

is everyone here comfortable if we just try to do a vote on this?

I don't think we're at the point where people agree to disagree. It looks more like people can't even agree on what the current behaviour is.

I made this plot with init_buffer=20 term_buffer=20 window=30. Here epsilon is the current stepsize, mu is the shrinkage target stepsize and xbar is the stepsize we would have if adaptation stopped right now.

[plot: adapt-stepsize]

  • epsilon moves continuously
  • mu is constant over the whole window, fixed at 10x the first epsilon.
  • xbar resets to 1.0 between windows but then immediately jumps to meet epsilon. Upthread I proposed changing that last part--just make xbar restart at epsilon.

@funko-unko

@bbbales2

@funko-unko how are you currently working around this?

Currently I'm just using the step sizes used for the last sample, as extracted from the output CSV, to pass to the next warmup step. (Actually, currently I compute the mean across chains, but really I haven't settled on anything yet.)

@betanalpha

Did you find any documentation explaining that this is a valid use case of the adaptation?

No, do you think this is an invalid use case? Is there any metric that could answer that question?

@bbbales2 (Member, Author) commented Apr 6, 2021

set_mu does not set the initial guess but some kind of shrinkage target.

I'll have to read up on what the dual averaging thing does. I'm clueless here. Thanks for the plot.

Currently I'm just using the step sizes used for the last sample as extracted from the output csv

So this is the stepsize from before the stepsize reset, right? So you're able to get the thing you need, it's just kinda inconvenient?

@nhuurre (Contributor) commented Apr 6, 2021

term_buffer = 0 was included in case the adaptation windows were turned off. In other words, if one wanted to run only an initial buffer window to learn a step size but keep the inverse metric at its initial value (either the identity or whatever is specified externally).

I double-checked and that's not supported either. Regardless of whether window is zero or not, term_buffer=0 causes the adaptive sampler to reset stepsize to 1. And window=0 also forces the metric to 0.001 times the identity matrix. This PR does not fix the metric reset issue.

@funko-unko

So this is the stepsize pre-stepsize reset, right?

I think so, yes.

So you're able to get the thing you need it's just kinda inconvenient?

I wouldn't say I get what I need, as I do not know what I need. Using a stepsize of one did appear to work. Using the chainwise last step size appeared to work slightly better. Using the mean/median across chains appeared to work slightly better still. Using the maximal step size across chains worked well sometimes, badly other times.

@betanalpha (Contributor) commented Apr 15, 2021 via email

@nhuurre (Contributor) commented Apr 16, 2021

when adapt_base_window_ = 0 the window termination code is called after every iteration between adapt_init_buffer and adapt_term_buffer_ which continuously restarts the step size adaptation

Not quite. adapt_next_window_ is not the first iteration of the next window but the last iteration of the current window. When adapt_base_window_ = 0 it's always the last iteration of the init buffer. There is only one reset if init_buffer > 0 and no reset at all if init_buffer = 0. In effect, the term buffer extends to fill the void left by the missing metric adaptation windows.
I do like your quick fix. It makes it clear that the behavior is intentional and also prevents the reset at the end of the init buffer. That solves the same problem as #3037 in a cleaner manner.

The initial step size can either be used to set x_bar_ = std::log(epsilon) instead of zero, which changes the optimization behavior

The initial value of x_bar_ does not change the optimization behavior. learn_stepsize() updates x_bar_ according to

x_eta = std::pow(counter_, -kappa_);
x_bar_ = (1.0 - x_eta) * x_bar_ + x_eta * x;

The first update has counter_ = 1. Regardless of the value of kappa_, the weight (1.0 - x_eta) is 0.0, so any finite initial value of x_bar_ is completely erased.

I did notice a couple of places where the optimization behavior probably should be changed (or at least doesn't make sense to me).

  1. The first is visible in the plot I posted upthread.
    [plot: adapt-stepsize, same plot as above]
    You can see that mu_ (the green line) is higher in the init buffer than in the term buffer, even though the init buffer starts with a smaller stepsize than the term buffer. What's going on? Isn't mu_ supposed to be 10 times the initial stepsize?
    The answer is that for the initial window mu_ is set before the first init_stepsize(), but for later windows mu_ resets after init_stepsize(). That seems inconsistent. The stepsize is not very good before init_stepsize().

  2. This discussion has been concerned with the "surprising" result when term_buffer = 0. But let's take a quick look at term_buffer = 1. Of course the step size then depends on what the accept_stat__ is for that single adaptation transition, but I think that if the accept_stat__ just happens to equal adapt_delta then the "least surprising" result is that the step size does not change. It might not be exactly optimal, but there's no clear reason to move it in either direction, right?
    The current behavior in that scenario is to set the step size to exp(mu_), i.e. 10 times the initial stepsize, which feels a bit too large.
    There's an easy fix: make restart() initialize s_bar_ = gamma*log(10) instead of the current s_bar_ = 0.
    The most obvious side effect is removing the high spike right after every metric update.
    [plot: stepsize-trace]

  3. The algorithm for updating the step size during adaptation is as follows, where n is the iteration number (counter_); a self-contained sketch follows this list:

eta   <- 1/(t0 + n)
s_bar <- s_bar*(1-eta) + eta*(adapt_delta - accept_stat)
x     <- mu - s_bar*sqrt(n)/gamma
epsilon <- exp(x)

Alternatively this can be written as

x <- mu + (x - mu)*(1-1/(t0+n))*sqrt(n/(n-1)) - (adapt_delta - accept_stat)*sqrt(n)/(gamma*(t0+n))

The last term is identical to the adaptation scheme in section 3.2 of the Hoffman and Gelman (2014) NUTS paper.
The first term just regularizes the stepsize towards mu.
But wait, the multiplier (1-1/(t0+n))*sqrt(n/(n-1)) is larger than 1.0 when n < t0, so for the first ten iterations it pushes the stepsize away from mu rather than pulling it towards mu. That can't be right, can it?
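
For anyone who wants to play with these formulas, here is a minimal self-contained version of the update (a sketch mirroring the scheme above, assuming the usual defaults gamma = 0.05, t0 = 10, kappa = 0.75, adapt_delta = 0.8; not the actual Stan class):

#include <cmath>

struct dual_averaging {
  double mu = 0;     // shrinkage target for log(epsilon)
  double s_bar = 0;  // running average of (adapt_delta - accept_stat)
  double x_bar = 0;  // weighted average of log(epsilon)
  int n = 0;
  double gamma = 0.05, t0 = 10, kappa = 0.75, adapt_delta = 0.8;

  // One adaptation step; returns the new epsilon.
  double learn_stepsize(double accept_stat) {
    ++n;
    double eta = 1.0 / (t0 + n);
    s_bar = (1.0 - eta) * s_bar + eta * (adapt_delta - accept_stat);
    double x = mu - s_bar * std::sqrt(n) / gamma;
    double x_eta = std::pow(n, -kappa);         // equals 1 when n == 1,
    x_bar = (1.0 - x_eta) * x_bar + x_eta * x;  // so the first call erases x_bar
    return std::exp(x);
  }

  // Final stepsize when adaptation completes.
  double complete_adaptation() const { return std::exp(x_bar); }
};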

@betanalpha (Contributor) commented May 3, 2021 via email
