Stepsize no longer reset to 1 if term_buffer = 0 (Issue #3023) #3027

Open · wants to merge 7 commits into base: develop from bugfix/issue-3023

Conversation

@bbbales2 (Member)

Submission Checklist

  • Run unit tests: ./runTests.py src/test/unit
  • Run cpplint: make cpplint
  • Declare copyright holder and open-source license: see below

Summary

Fix for #3023

Side Effects

I changed how the return codes from covar_adaptation::learn_covariance and var_adaptation::learn_variance work.

Copyright and Licensing

Please list the copyright holder for the work you are submitting (this will be you or your assignee, such as a university or company): Columbia University

By submitting this pull request, the copyright holder is agreeing to license the submitted work under the following licenses:

@bbbales2 (Member, Author)

@funko-unko want to verify this fixes the problem?

To get a copy of cmdstan with this patch do:

git clone --recursive https://github.com/stan-dev/cmdstan cmdstan-issue-3023
cd cmdstan-issue-3023/stan
git checkout bugfix/issue-3023
cd ../
make build
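
To check it end to end, something like the following should work (assuming the bernoulli example that ships with cmdstan; file names may differ by version):

make examples/bernoulli/bernoulli
./examples/bernoulli/bernoulli sample adapt term_buffer=0 data file=examples/bernoulli/bernoulli.data.R output file=output.csv
grep '# Step size' output.csv

Before the patch the step size reported after warmup is always 1 when term_buffer=0; with the patch it should be the last adapted value.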

@stan-buildbot (Contributor)

| Name | Old Result | New Result | Ratio | Performance change (1 - new/old) |
| --- | --- | --- | --- | --- |
| gp_pois_regr/gp_pois_regr.stan | 3.47 | 3.42 | 1.01 | 1.43% faster |
| low_dim_corr_gauss/low_dim_corr_gauss.stan | 0.02 | 0.02 | 0.98 | -2.46% slower |
| eight_schools/eight_schools.stan | 0.11 | 0.11 | 0.96 | -3.67% slower |
| gp_regr/gp_regr.stan | 0.16 | 0.16 | 1.0 | -0.5% slower |
| irt_2pl/irt_2pl.stan | 5.31 | 5.23 | 1.01 | 1.42% faster |
| performance.compilation | 91.43 | 88.53 | 1.03 | 3.17% faster |
| low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan | 9.01 | 8.89 | 1.01 | 1.3% faster |
| pkpd/one_comp_mm_elim_abs.stan | 31.32 | 30.09 | 1.04 | 3.91% faster |
| sir/sir.stan | 130.41 | 131.79 | 0.99 | -1.06% slower |
| gp_regr/gen_gp_data.stan | 0.04 | 0.04 | 0.97 | -3.6% slower |
| low_dim_gauss_mix/low_dim_gauss_mix.stan | 3.13 | 3.15 | 0.99 | -0.61% slower |
| pkpd/sim_one_comp_mm_elim_abs.stan | 0.37 | 0.39 | 0.94 | -6.04% slower |
| arK/arK.stan | 1.91 | 1.9 | 1.01 | 0.65% faster |
| arma/arma.stan | 0.63 | 0.71 | 0.89 | -12.22% slower |
| garch/garch.stan | 0.52 | 0.51 | 1.02 | 2.17% faster |

Mean result: 0.99084398021

Jenkins Console Log
Blue Ocean
Commit hash: 4b0ae48


Machine information:
ProductName: Mac OS X
ProductVersion: 10.11.6
BuildVersion: 15G22010

CPU:
Intel(R) Xeon(R) CPU E5-1680 v2 @ 3.00GHz

G++:
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

Clang:
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

@funko-unko

I'll try it out later.

@nhuurre (Contributor) commented Mar 20, 2021

Changing the return code means that stepsize adaptation does not restart after the last metric window. The last window is usually longer than term_buffer (the defaults give a last window size of 500, compared to term_buffer = 50). Does that have some bad side effects?

I think it would be more conservative to simply change the restart value of x_bar_.

void restart() {
  counter_ = 0;
  s_bar_ = 0;  // running average of (adapt_delta - accept_stat)
  x_bar_ = 0;  // running average of log(stepsize); the final stepsize is exp(x_bar_)
}

The first call to learn_stepsize() overwrites x_bar_ anyway, so the initial value here is irrelevant unless you call complete_adaptation() immediately, without learn_stepsize() in between (i.e. term_buffer = 0), in which case the stepsize becomes exp(x_bar_) = exp(0) = 1.
The sampler sets mu_ = log(10*stepsize) right before restart(), so the natural restart value for x_bar_ would be mu_ - log(10).
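
In code, that suggestion would be something like this (a sketch of the proposed change, not what this PR currently does):

void restart() {
  counter_ = 0;
  s_bar_ = 0;
  // mu_ was just set to log(10 * stepsize), so this makes
  // exp(x_bar_) equal the current stepsize instead of 1:
  x_bar_ = mu_ - std::log(10);
}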

@bbbales2 (Member, Author)

Is the trick that:

x_eta = std::pow(counter_, -kappa_);
x_bar_ = (1.0 - x_eta) * x_bar_ + x_eta * x;

and 1.0 - x_eta is always zero on the first iteration because counter_ is always 1?

I think the annoying thing here is that we deal with the magic constant 10 in two places. I wouldn't mind getting rid of the 10x and restarting at x_bar_ = mu_. Regarding this:

does not restart after the last metric window.

I think this should still restart the step size adaptation if there is a term buffer following. At least that was the intention. If there is no term buffer, there is no reset.

The logic to detect the end of the last window with a term buffer is:

void compute_next_window() {
  if (adapt_next_window_ == num_warmup_ - adapt_term_buffer_ - 1)
    return;
  // ...
}

And the finished() check here is instead detecting:

bool finished() {
  if (adapt_window_counter_ + 1 >= num_warmup_) {
    return true;
  }
  // ...
}
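
So with the CmdStan defaults num_warmup = 1000 and term_buffer = 50, for example, compute_next_window() stops scheduling new windows once adapt_next_window_ reaches 949 (= 1000 - 50 - 1), while finished() only triggers at the final warmup iteration, when adapt_window_counter_ + 1 reaches 1000.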

@nhuurre (Contributor) commented Mar 20, 2021

1.0 - x_eta is always zero on the first iteration because counter_ is always 1?

Yes

I think the annoying thing here is that we deal with the magic constant 10 in two places.

No, not two.

this->stepsize_adaptation_.set_mu(log(10 * this->nom_epsilon_));
this->stepsize_adaptation_.set_mu(log(10 * this->nom_epsilon_));
sampler.get_stepsize_adaptation().set_mu(log(10 * stepsize));
sampler.get_stepsize_adaptation().set_mu(log(10 * stepsize));

(each line above is a distinct call site in the codebase)

(And that's just nuts, there's also static_hmc)

I wouldn't mind getting rid of the 10x

Related to issue #2670

That 10x isn't the only janky thing about this. init_stepsize() seems to hardcode adapt_delta=0.8

int direction = delta_H > std::log(0.8) ? 1 : -1;

(and shouldn't that be abs(delta_H) > abs(std::log(0.8))?)

I think the "correct" stepsize multiplier when changing windows is probably something like the largest eigenvalue of (old metric / new metric).

I think this should still restart the step size adaptation if there is a term buffer following.

Ah, you're right. adapt_window_counter_ doesn't reach num_warmup_ if term_buffer_ > 0. Nevermind then.

@betanalpha (Contributor)

The current behavior is intended.

After the final windowed adaptation update there is little guidance on the step size for the terminal buffer window. The step size is initialized to 1, which is roughly near the optimal value for a low-dimensional multivariate normal target if the metric has been adapted perfectly. More generally, however, the optimal step size for sampling purposes is not simply related to the eigenvalues of M^{-1} Covar(target), and hence the scaling in the optimal step size is not simply related to the eigenvalues of M_{new}^{-1} M_{old}.

The dual averaging adaptation sets mu to 10 times the initial epsilon to encourage more aggressive adaptation early on; this is a common heuristic for setting mu when optimizing.

The objective of the step size initialization is only to find a rough value, and the precise adaptation target is not especially important. In particular, the initialization considers the acceptance probability after a single step, which is not characteristic of the acceptance probability averaged over the entire trajectory (either uniformly weighted, as is currently implemented, or properly weighted, as it should be implemented).

Because there is no great way to predict a good step size after updating the inverse metric, a non-zero terminal buffer window was always intended to be run. Given that users are running without a terminal buffer window, a message warning users that the step size will not be adapted when term_buffer = 0 would certainly be helpful.

@funko-unko

I don't know whether I'm the only one doing this, but please don't issue a warning and please report the last used step size.

Thanks.

@bbbales2 (Member, Author)

The current behavior is intended.

Well that doesn't mean we can't change it.

I don't really see the importance of setting this to 1.0. When we continue adaptation we're setting it to 10x the old timestep anyway. If anyone wants to use the stepsize 1 for something, they can just use stepsize 1.

Sure, the adaptation might be bad, but the current behavior is confusing, stepsize = 1 will probably be bad too, and this is easy to fix.

@yizhang-yiz commented Mar 24, 2021

I agree with @bbbales2 & @funko-unko: even though it's not a practice we want to recommend, if someone sets term_buffer=0 then the outcome of stepsize=1 violates the principle of least surprise.

@funko-unko

My use case is of course to use only the first two adaptation windows on a series of converging/contracting posteriors, and to use term_buffer != 0 only for the last one, as described here: https://discourse.mc-stan.org/t/fitting-ode-models-best-efficient-practices/21018/60?u=funko_unko

@nhuurre (Contributor) commented Mar 27, 2021

The step size is initialized to 1, which is roughly near the optimal value for a low-dimensional multivariate normal target if the metric has been adapted perfectly. More generally, however, the optimal step size for sampling purposes is not simply related to the eigenvalues of M^{-1} Covar(target), and hence the scaling in the optimal step size is not simply related to the eigenvalues of M_{new}^{-1} M_{old}.

I thought more about this because it feels like a multivariate normal target with perfect adaptation should be simple enough to solve analytically. Working with a small-stepsize approximation, I got fairly strong bounds on the single-step accept probability, but of course that may not be indicative of longer trajectories or more complicated target distributions.
Notes attached if someone wants to check my math.
stepsize-bounds.pdf

@bbbales2 (Member, Author)

@betanalpha Any further thoughts on this? I would like to make this change.

@yizhang-yiz left a review comment

To directly address #3023, can we add a test that calls a service function like hmc_nuts_dense_e_adapt with term_buffer=0 and checks the adapted stepsize?
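
For context, the reset-to-1 mechanism is easy to demonstrate at the adapter level (an illustrative sketch of the pre-fix behavior, not the service-level test requested above):

#include <cmath>
#include <gtest/gtest.h>
#include <stan/mcmc/stepsize_adaptation.hpp>

// With an empty term buffer, complete_adaptation() runs right after
// restart() with no learn_stepsize() calls in between, so the stepsize
// comes back as exp(x_bar_) = exp(0) = 1 regardless of what was learned.
TEST(StepsizeAdaptation, emptyTermBufferResetsToOne) {
  stan::mcmc::stepsize_adaptation adaptation;
  double epsilon = 0.25;  // pretend this was adapted in earlier windows
  adaptation.set_mu(std::log(10 * epsilon));
  adaptation.restart();
  adaptation.complete_adaptation(epsilon);
  EXPECT_FLOAT_EQ(1.0, epsilon);
}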

Review comments on src/stan/mcmc/windowed_adaptation.hpp (outdated, resolved) and src/test/unit/mcmc/covar_adaptation_test.cpp (resolved).
@funko-unko

Oh, right, I'll have a look whenever my current computation finishes...

@bbbales2 (Member, Author) commented Apr 3, 2021

To directly address #3023

I added a direct test. It seems a little weird without context (like someone coming along might not understand why the stepsize should not be 1), so I'm down to add a comment about why it is there, but it serves the purpose.

Btw don't approve or merge this until @betanalpha okays it or we just disagree and vote on it. Thanks for the review though.

@stan-buildbot (Contributor)

| Name | Old Result | New Result | Ratio | Performance change (1 - new/old) |
| --- | --- | --- | --- | --- |
| gp_pois_regr/gp_pois_regr.stan | 3.39 | 3.34 | 1.01 | 1.34% faster |
| low_dim_corr_gauss/low_dim_corr_gauss.stan | 0.02 | 0.02 | 0.99 | -1.04% slower |
| eight_schools/eight_schools.stan | 0.12 | 0.11 | 1.04 | 4.03% faster |
| gp_regr/gp_regr.stan | 0.16 | 0.16 | 1.01 | 0.61% faster |
| irt_2pl/irt_2pl.stan | 5.95 | 6.0 | 0.99 | -0.74% slower |
| performance.compilation | 91.11 | 89.51 | 1.02 | 1.76% faster |
| low_dim_gauss_mix_collapse/low_dim_gauss_mix_collapse.stan | 8.63 | 8.74 | 0.99 | -1.31% slower |
| pkpd/one_comp_mm_elim_abs.stan | 28.63 | 28.54 | 1.0 | 0.32% faster |
| sir/sir.stan | 123.0 | 136.59 | 0.9 | -11.05% slower |
| gp_regr/gen_gp_data.stan | 0.03 | 0.04 | 0.98 | -1.68% slower |
| low_dim_gauss_mix/low_dim_gauss_mix.stan | 3.12 | 3.06 | 1.02 | 1.96% faster |
| pkpd/sim_one_comp_mm_elim_abs.stan | 0.4 | 0.4 | 1.0 | -0.2% slower |
| arK/arK.stan | 1.87 | 1.88 | 1.0 | -0.37% slower |
| arma/arma.stan | 0.85 | 0.63 | 1.34 | 25.4% faster |
| garch/garch.stan | 0.6 | 0.56 | 1.08 | 7.42% faster |

Mean result: 1.02476505228

Jenkins Console Log
Blue Ocean
Commit hash: 40ee1c2


Machine information:
ProductName: Mac OS X
ProductVersion: 10.11.6
BuildVersion: 15G22010

CPU:
Intel(R) Xeon(R) CPU E5-1680 v2 @ 3.00GHz

G++:
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/usr/include/c++/4.2.1
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

Clang:
Apple LLVM version 7.0.2 (clang-700.1.81)
Target: x86_64-apple-darwin15.6.0
Thread model: posix

@betanalpha (Contributor) commented Apr 5, 2021 via email

@nhuurre (Contributor) commented Apr 5, 2021

After each adaptation window the inverse metric is updated with the aggregated information and the step size is reset to a default near 1

Actually, the stepsize does not reset to one between the windows. Instead the algorithm invokes init_stepsize(), which doubles or halves the stepsize a few times, aiming for a Langevin-step accept probability in the neighbourhood of 0.8.
If you have term_buffer >= 1, this is the stepsize you get.
The "surprising" reset-to-one behaviour is the result of how the stepsize adapter restarts. The initial value of x_bar_ is set to zero instead of the log of the current stepsize. The initial value is essentially irrelevant because the dual averaging gives it zero weight; the first real update erases it completely--unless the window is empty and no update ever happens.

Is the jump from || . ||_{-} to | | correct here? L^{-1} M^{-1} isn’t positive definite and so the distribution of acceptance probabilities shouldn’t be symmetric. Indeed even in the normal case the Hamiltonian error doesn’t oscillate symmetrically around zero but rather is biased towards larger energies.

I'm only using the leading order epsilon^3 term for the single-step energy error. That term is linear in p and the distribution of p is certainly symmetric. The bias you note is of order epsilon^4 so it's negligible in this approximation. (The approximation may well be too crude to be interesting.)

@bbbales2 (Member, Author) commented Apr 6, 2021

step size is reset to a default near 1 because there is no other decent guess

As far as I know the initial guess for the stepsize in the next interval is 10x the last one.

in misconceptions about how the intermediate adaptation windows work.

Nah nah, I don't think it's a misconception -- I see what you're saying here (after the metric is updated, the stepsize still corresponds to the previous metric, so we'd want to update it).

I don't think that changing the behavior for term_buffer = 0 is going to compromise anything. Like sure the sampler is in a dubious state, but the user requested term_buffer = 0 so, yeah, that's intentional, and they're following up with more adaptation where the stepsize can change (and doing a non-zero term buffer at the end of it all).

(The approximation may well be too crude to be interesting.)

I do like the idea of not doing the 10x, which seems pretty crude.

@funko-unko how are you currently working around this?

@bbbales2 (Member, Author) commented Apr 6, 2021

Oh yeah I guess -- is everyone here comfortable if we just try to do a vote on this? Like, the SGB rules say votes are for major issues and this seems quite small, but I'm happy just manufacturing a decision and rolling with it rather than arguing back and forth too long.

If I'm stopping the discussion too early just say so.

Niko's timestep adaptation seems interesting re: the discussions here + the thing here, so either way this PR goes we can still keep that in mind.

@nhuurre (Contributor) commented Apr 6, 2021

As far as I know the initial guess for the stepsize in the next interval is 10x the last one.

It's not. See this code

this->init_stepsize(logger);
this->stepsize_adaptation_.set_mu(log(10 * this->nom_epsilon_));

The call to init_stepsize() adjusts the stepsize between the intervals, and set_mu does not set the initial guess but rather some kind of shrinkage target.

is everyone here comfortable if we just try to do a vote on this?

I don't think we're at the point where people agree to disagree. It looks more like people can't even agree on what the current behaviour is.

I made this plot with init_buffer=20 term_buffer=20 window=30. Here epsilon is the current stepsize, mu is the shrinkage target stepsize and xbar is the stepsize we would have if adaptation stopped right now.

[plot: adapt-stepsize]

  • epsilon moves continuously
  • mu is constant over the whole window, fixed at 10x the first epsilon.
  • xbar resets to 1.0 between windows but then immediately jumps to meet epsilon. Upthread I proposed changing that last part--just make xbar restart at epsilon.

@funko-unko

@bbbales2

@funko-unko how are you currently working around this?

Currently I'm just using the step sizes used for the last sample, as extracted from the output CSV, to pass to the next warmup step. (Actually, currently I compute the mean across chains, but really I haven't settled on anything yet.)

@betanalpha

Did you find any documentation explaining that this is a valid use case of the adaptation?

No, do you think this is an invalid use case? Is there any metric that could answer that question?

@bbbales2 (Member, Author) commented Apr 6, 2021

set_mu does not set the initial guess but some kind of shrinkage target.

I'll have to read up on what the dual averaging thing does. I'm clueless here. Thanks for the plot.

Currently I'm just using the step sizes used for the last sample as extracted from the output csv

So this is the stepsize from before the stepsize reset, right? So you're able to get the thing you need, it's just kinda inconvenient?

@nhuurre (Contributor) commented Apr 6, 2021

term_buffer = 0 was included in case the adaptation windows were turned off. In other words, if one wanted to run only an initial buffer window to learn a step size but keep the inverse metric at its initial value (either the identity or whatever is specified externally).

I double-checked and that's not supported either. Regardless of whether window is zero or not, term_buffer=0 causes the adaptive sampler to reset stepsize to 1. And window=0 also forces the metric to 0.001 times the identity matrix. This PR does not fix the metric reset issue.

@funko-unko

So this is the stepsize pre-stepsize reset, right?

I think so, yes.

So you're able to get the thing you need it's just kinda inconvenient?

I wouldn't say I get what I need, as I do not know what I need. Using a stepsize of one did appear to work. Using the chainwise last step size appeared to work slightly better. Using the mean/median across chains appeared to work slightly better still. Using the maximal step size across chains worked well sometimes, badly other times.

@betanalpha (Contributor) commented Apr 15, 2021 via email

@nhuurre (Contributor) commented Apr 16, 2021

when adapt_base_window_ = 0 the window termination code is called after every iteration between adapt_init_buffer and adapt_term_buffer_ which continuously restarts the step size adaptation

Not quite. adapt_next_window_ is not the first iteration of the next window but the last iteration of the current window. When adapt_base_window_ = 0 it's always the last iteration of the init buffer. There is only one reset if init_buffer > 0 and no reset at all if init_buffer = 0. In effect, the term buffer extends to fill the void left by the missing metric adaptation windows.
I do like your quick fix. It makes it clear that the behavior is intentional and also prevents the reset at the end of the init buffer. That solves the same problem as #3037 in a cleaner manner.

The initial step size can either be used to set x_bar_ = std::log(epsilon) instead of zero, which changes the optimization behavior

The initial value of x_bar_ does not change the optimization behavior. learn_stepsize() updates x_bar_ according to

x_eta = std::pow(counter_, -kappa_);
x_bar_ = (1.0 - x_eta) * x_bar_ + x_eta * x;

The first update has counter_ = 1. Regardless of the value of kappa_, the weight (1.0 - x_eta) is 0.0, so any finite initial value of x_bar_ is completely erased.

I did notice a couple of places where the optimization behavior probably should be changed (or at least doesn't make sense to me).

  1. The first is visible in the plot I posted upthread.
    [plot: adapt-stepsize, same plot as above]
    You can see that mu_ (the green line) is higher in the init buffer than in the term buffer, even though the init buffer starts with a smaller stepsize than the term buffer. What's going on? Isn't mu_ supposed to be 10 times the initial stepsize?
    The answer is that for the initial window mu_ is set before the first init_stepsize(), but for later windows mu_ resets after init_stepsize(). That seems inconsistent. The stepsize is not very good before init_stepsize().

  2. This discussion has been concerned with the "surprising" result when term_buffer = 0. But let's take a quick look at term_buffer = 1. Of course the step size then depends on what the accept_stat__ is for that single adaptation transition, but I think that if the accept_stat__ just happens to equal adapt_delta then the "least surprising" result is that the step size does not change. It might not be exactly optimal, but there's no clear reason to move it in either direction, right?
    The current behavior in that scenario is to set the step size to exp(mu_), i.e. 10 times the initial stepsize, which feels a bit too large.
    There's an easy fix: make restart() initialize s_bar_ = gamma*log(10) instead of the current s_bar_ = 0.
    The most obvious side effect is removing the high spike right after every metric update.
    [plot: stepsize-trace]

  3. The algorithm for updating the step size during adaptation is as follows, where n is the iteration number (counter_); a self-contained sketch follows this list:

eta   <- 1/(t0 + n)
s_bar <- s_bar*(1-eta) + eta*(adapt_delta - accept_stat)
x     <- mu - s_bar*sqrt(n)/gamma
epsilon <- exp(x)

Alternatively this can be written as

x <- mu + (x - mu)*(1-1/(t0+n))*sqrt(n/(n-1)) - (adapt_delta - accept_stat)*sqrt(n)/(gamma*(t0+n))

The last term is identical to the adaptation scheme in section 3.2 of the Hoffman and Gelman (2014) NUTS paper.
The first term just regularizes the stepsize towards mu.
But wait, the multiplier (1-1/(t0+n))*sqrt(n/(n-1)) is larger than 1.0 when n < t0, so for the first ten iterations it pushes the stepsize away from mu rather than pulling it towards mu. That can't be right, can it?
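
For anyone who wants to play with these formulas, here is a minimal self-contained version of the update (a sketch mirroring the scheme above, assuming the usual defaults gamma = 0.05, t0 = 10, kappa = 0.75, adapt_delta = 0.8; not the actual Stan class):

#include <cmath>

struct dual_averaging {
  double mu = 0;     // shrinkage target for log(epsilon)
  double s_bar = 0;  // running average of (adapt_delta - accept_stat)
  double x_bar = 0;  // weighted average of log(epsilon)
  int n = 0;
  double gamma = 0.05, t0 = 10, kappa = 0.75, adapt_delta = 0.8;

  // One adaptation step; returns the new epsilon.
  double learn_stepsize(double accept_stat) {
    ++n;
    double eta = 1.0 / (t0 + n);
    s_bar = (1.0 - eta) * s_bar + eta * (adapt_delta - accept_stat);
    double x = mu - s_bar * std::sqrt(n) / gamma;
    double x_eta = std::pow(n, -kappa);         // equals 1 when n == 1,
    x_bar = (1.0 - x_eta) * x_bar + x_eta * x;  // so the first call erases x_bar
    return std::exp(x);
  }

  // Final stepsize when adaptation completes.
  double complete_adaptation() const { return std::exp(x_bar); }
};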

@betanalpha (Contributor) commented May 3, 2021 via email
