
If a model evaluates at the initial conditions without autodiff but fails with autodiff, the error messages are not great #3034

Open
bbbales2 opened this issue Mar 31, 2021 · 10 comments

Comments

@bbbales2
Member

Summary:

For instance, with an ODE model @jtimonen found output like this:

Rejecting initial value:
  Error evaluating the log probability at the initial value.
Exception: ode_bdf_tol:  Failed to integrate to next output time (1.542) in less than max_num_steps steps (in 'lv.stan', line 34, column 2 to line 35, column 48)
Rejecting initial value:
  Error evaluating the log probability at the initial value.
Exception: ode_bdf_tol:  Failed to integrate to next output time (1.542) in less than max_num_steps steps (in 'lv.stan', line 34, column 2 to line 35, column 48)
Rejecting initial value:
  Error evaluating the log probability at the initial value.
Exception: ode_bdf_tol:  Failed to integrate to next output time (1.542) in less than max_num_steps steps (in 'lv.stan', line 34, column 2 to line 35, column 48)
Exception: ode_bdf_tol:  Failed to integrate to next output time (1.542) in less than max_num_steps steps (in 'lv.stan', line 34, column 2 to line 35, column 48)
Exception: ode_bdf_tol:  Failed to integrate to next output time (1.542) in less than max_num_steps steps (in 'lv.stan', line 34, column 2 to line 35, column 48)

We would expect an error message like:

Rejecting initial value:
  Error evaluating the log probability with gradients at the initial value.
Exception: ode_bdf_tol:  Failed to integrate to next output time (1.542) in less than max_num_steps steps (in 'lv.stan', line 34, column 2 to line 35, column 48)

Initialization between (-2, 2) failed after 100 attempts. 
 Try specifying initial values, reducing ranges of constrained values, or reparameterizing the model.
Initialization failed.

Here is a model that exhibits this behavior for certain seeds:

functions {
  // Lotka-Volterra system
  vector derivative_fun(real t, vector y, data int[] a0, vector theta) {
    vector[2] dydt;
    dydt[1] = theta[1]*y[1] - y[1]*y[2];
    dydt[2] = y[1]*y[2] - theta[2]*y[2];
    return dydt;
  }
}

data {
  int<lower=1> N;
  real t_eval[N]; // must be increasing
  vector[2] y_data[N];
  vector[2] y0;
  real t0;
  real<lower=0> ABS_TOL;
  real<lower=0> REL_TOL;
  int<lower=1> MAX_NUM_STEPS;
}

transformed data {
  int a0[0];
}

parameters {
  vector<lower=0>[2] theta;
  real<lower=0> sigma;
}

model {
  theta ~ normal(1, 0.3);
  sigma ~ normal(0, 2.0);
  vector[2] y_hat[N] = ode_bdf_tol(derivative_fun, y0, t0, t_eval,
    REL_TOL, ABS_TOL, MAX_NUM_STEPS, a0, theta);
  for(n in 1:N){
    target += normal_lpdf(y_data[n] | y_hat[n], sigma);
  }
}

JSON data:

{
  "N": 20,
  "t_eval": [1.542, 3.314, 3.582, 3.942, 4.618, 5.974, 6.338, 6.546, 6.698, 7.106, 7.53, 7.602, 7.758, 8.334, 8.45, 9.018, 9.17, 10.886, 11.366, 11.574],
  "y0": [1, 2],
  "t0": 0,
  "y_data": [
    [0.344797088393873, 1.6930242060351],
    [0.635418289383446, 0.0906638098070269],
    [0.519516090322978, 0.821024324067281],
    [1.07889503979525, 0.477741038303985],
    [0.773776915048449, 0.255210793923543],
    [1.72581484166438, 1.5527620189067],
    [0.851008213891165, 1.80589311943048],
    [1.62769164960979, 1.57197163973671],
    [1.06907814440105, 1.77751680937944],
    [0.559461058239121, 1.77778520957167],
    [0.319224257399646, 2.09229342823705],
    [0.656623356899921, 2.35616034285975],
    [0.266800987984575, 1.59088141802012],
    [-0.0843099951671533, 1.29952132205262],
    [0.510139049085921, 1.16569483217513],
    [0.739499156159327, 1.5954691564362],
    [0.300537914354303, 1.33024642117391],
    [0.453591698548004, 0.627159819286871],
    [1.13100675948511, 0.372307762827462],
    [1.63136411638082, 1.04814114632482]
  ],
  "REL_TOL": 1e-06,
  "ABS_TOL": 1e-06,
  "MAX_NUM_STEPS": 30
}

Current Version:

v2.26.1

@jtimonen

Yeah, so it seems that the current code assumes that if computing the log probability succeeds, then computing the gradient will also succeed. With ODEs, however, it is possible that log_prob succeeds for a given max_num_steps, but computing the gradient requires solving an extended system, and that extended solve can fail. There could also be other models with this sort of behaviour.
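To make the "extended system" a bit more concrete, here is a rough sketch of the standard forward-sensitivity formulation (the exact system that Stan Math sets up through CVODES may differ in details). For N states y and K ODE parameters theta, the gradient evaluation has to integrate

\frac{dy}{dt} = f(t, y, \theta), \qquad y(t_0) = y_0,

\frac{dS_k}{dt} = \frac{\partial f}{\partial y}\, S_k + \frac{\partial f}{\partial \theta_k}, \qquad S_k(t_0) = \frac{\partial y_0}{\partial \theta_k}, \qquad k = 1, \ldots, K,

i.e. roughly N(K + 1) coupled equations instead of N. For the Lotka-Volterra model above (N = 2, K = 2) that is 6 equations instead of 2, so the same MAX_NUM_STEPS = 30 can be enough for the double-only log_prob evaluation but not for the autodiff (var) evaluation.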

@betanalpha
Contributor

betanalpha commented Apr 5, 2021 via email

@bbbales2
Member Author

bbbales2 commented Apr 6, 2021

It is throwing an error in log_prob_grad. Seems like it is printing the error twice. Not sure why that's happening.

I modified the initialize.hpp code and added some stuff around line 160:

    std::cout << "meow" << std::endl << std::flush;

    std::stringstream log_prob_msg;
    std::vector<double> gradient;
    auto start = std::chrono::steady_clock::now();
    try {
      // we evaluate this with propto=true since we're
      // evaluating with autodiff variables
      log_prob = stan::model::log_prob_grad<true, Jacobian>(
          model, unconstrained, disc_vector, gradient, &log_prob_msg);
    } catch (const std::exception& e) {
      if (log_prob_msg.str().length() > 0)
        logger.info(log_prob_msg);
      logger.info(e.what());
      throw;
    }

    std::cout << "woof" << std::endl << std::flush;

Output is:

Exception: ode_bdf_tol:  Failed to integrate to next output time (1.542) in less than max_num_steps steps (in 'lv.stan', line 34, column 2 to line 35, column 48)
Rejecting initial value:
  Error evaluating the log probability at the initial value.
Exception: ode_bdf_tol:  Failed to integrate to next output time (1.542) in less than max_num_steps steps (in 'lv.stan', line 34, column 2 to line 35, column 48)
meow
Exception: ode_bdf_tol:  Failed to integrate to next output time (1.542) in less than max_num_steps steps (in 'lv.stan', line 34, column 2 to line 35, column 48)
Exception: ode_bdf_tol:  Failed to integrate to next output time (1.542) in less than max_num_steps steps (in 'lv.stan', line 34, column 2 to line 35, column 48)
~/cmdstan-develop$ 

@betanalpha
Contributor

You need to separate out loop iterations, for example by printing "num_init_tries". When "Error evaluating the log probability at the initial value." is passed to the logger, the gradient evaluation isn't even attempted because of the "continue" statement on line 140. My guess is that the first attempt fails with a bad function evaluation, and then on the second loop iteration the function evaluation passes but the gradient evaluation fails, which triggers the throw on line 172.

If this is the case then all we need to do is add a logger message before line 171 to clarify the failure of log_prob_grad:

      logger.info(
          "  Error evaluating the gradient of the log probability"
          " at the initial value.");

@bbbales2
Member Author

We could throw a more informative error, but the code will behave more robustly if we just initialize somewhere the gradients also work, so that it's possible to start running HMC.

I believe that is the intention of this code (just find somewhere to start), and it's the difficulty of the ODE solves that causes this problem.

@betanalpha
Contributor

betanalpha commented May 3, 2021 via email

@bbbales2
Member Author

bbbales2 commented May 9, 2021

> and is symmetric between double and var evaluations.

I don't know what symmetry means here. If you mean it's the same checks between the tries on lines 124 and 163, they are not.

The try on line 163 does not catch domain errors and continue running; any exception it catches, it re-throws.

> CVODES throws a non-domain exception when the integration of the sensitivity states fails, triggering the initialization to terminate as expected from the current implementation.

It's a domain exception here.

You can change the code to:

auto start = std::chrono::steady_clock::now();
try {
  // we evaluate this with propto=true since we're
  // evaluating with autodiff variables
  log_prob = stan::model::log_prob_grad<true, Jacobian>(
      model, unconstrained, disc_vector, gradient, &log_prob_msg);
} catch (const std::domain_error& e) {
  std::cout << "It's a domain error" << std::endl;
  if (log_prob_msg.str().length() > 0)
    logger.info(log_prob_msg);
  logger.info(e.what());
  throw;
} catch (const std::exception& e) {
  if (log_prob_msg.str().length() > 0)
    logger.info(log_prob_msg);
  logger.info(e.what());
  throw;
}
auto end = std::chrono::steady_clock::now();

And you'll see in the output when this problem comes up:

Rejecting initial value:
  Error evaluating the log probability at the initial value.
Exception: ode_bdf_tol:  Failed to integrate to next output time (1.542) in less than max_num_steps steps (in 'lv.stan', line 34, column 2 to line 35, column 48)
It's a domain error
Exception: ode_bdf_tol:  Failed to integrate to next output time (1.542) in less than max_num_steps steps (in 'lv.stan', line 34, column 2 to line 35, column 48)
Exception: ode_bdf_tol:  Failed to integrate to next output time (1.542) in less than max_num_steps steps (in 'lv.stan', line 34, column 2 to line 35, column 48)

Anyway, this is the problem that I want to fix. These are recoverable errors, and we should be recovering from them in the same way we already do.

We should pick an initialization for the sampler where it is possible to evaluate both the log density and the gradients. Right now we do not check that we are able to evaluate the gradients. I do not think that is correct, considering we need the gradients to do the sampling. That is the change I want to make here.
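For concreteness, here is a rough sketch of the kind of change I mean (written against the initialize.hpp snippet above, not a tested patch; the surrounding loop, Jacobian, unconstrained, disc_vector, and logger are the existing locals): treat a std::domain_error from log_prob_grad the same way the double-only evaluation treats one, i.e. report the rejection and continue to the next initialization try, while still re-throwing anything that is not a domain error.

try {
  // we evaluate this with propto=true since we're
  // evaluating with autodiff variables
  log_prob = stan::model::log_prob_grad<true, Jacobian>(
      model, unconstrained, disc_vector, gradient, &log_prob_msg);
} catch (const std::domain_error& e) {
  // recoverable: reject this initial value and draw a new one
  logger.info("Rejecting initial value:");
  logger.info(
      "  Error evaluating the gradient of the log probability"
      " at the initial value.");
  if (log_prob_msg.str().length() > 0)
    logger.info(log_prob_msg);
  logger.info(e.what());
  continue;
} catch (const std::exception& e) {
  // anything else stays fatal, as it is now
  if (log_prob_msg.str().length() > 0)
    logger.info(log_prob_msg);
  logger.info(e.what());
  throw;
}

Whether the rejection message should mention the gradient explicitly (as suggested above) is a separate question; the behavioural change is the continue in place of the throw.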

@bbbales2
Member Author

@betanalpha

@wds15
Contributor

wds15 commented May 17, 2021

Bump. @betanalpha this all looks good to me. The intent to error out when the chosen initial value leads to a NaN for the log-likelihood or its gradient is sensible, since the sampler needs both in order to move. I am happy to review the PR for this so that we get it into the release.

@betanalpha
Contributor

betanalpha commented May 19, 2021 via email
