Explosion of steps for specific parameter values #386

FFroehlich opened this issue Mar 13, 2024 · 12 comments

@FFroehlich commented Mar 13, 2024

I have been experiencing odd integration failures in large sets of solves of relatively small, simple systems of equations. I have narrowed this down to a small example:

import diffrax
import jax
import jax.numpy as jnp
import matplotlib.pyplot as plt
import numpy as np

jax.config.update("jax_enable_x64", True)

a = jnp.array(
    [
        6.026932645397832,
        4.41195014234956,
        5.884199824299863,
        3.673504195449191,
        4.17957753821087,
    ]
)
b = jnp.float64(
    -2.823760940491063
)


def xdot(t, x, _):
    c = jnp.exp(a[1]) - x[0]
    d = x[1] / (c + jnp.exp(b))
    dxdt = jnp.asarray(
        (
            jnp.exp(a[3]) * d * c - jnp.exp(a[4]) * x[0],
            jnp.exp(a[3]) * (jnp.exp(a[0]) - x[1]),
        )
    )
    return dxdt


def xdot_lse(t, x, _):
    c = jnp.log(jnp.exp(a[1]) - x[0])
    d = jnp.log(x[1]) - jax.nn.logsumexp(jnp.array([b, c]))
    dxdt = jnp.asarray(
        (
            jnp.exp(a[3] + d + c) - jnp.exp(a[4]) * x[0],
            jnp.exp(a[3]) * (jnp.exp(a[0]) - x[1]),
        )
    )
    return dxdt


y0 = jnp.array([4.1154706432848185, 6.831774897154676])
ts = np.concatenate((np.array([0]), np.logspace(-6, 2, 20)))

sol = diffrax.diffeqsolve(
    diffrax.ODETerm(xdot_lse),
    solver=diffrax.Kvaerno5(),
    t0=0.0,
    t1=ts[-1],
    dt0=1e-8,
    y0=y0,
    stepsize_controller=diffrax.PIDController(
        atol=1e-8,
        rtol=1e-6,
        pcoeff=0.4,
        icoeff=0.3,
        dcoeff=0,
    ),
    max_steps=int(1e5),
    saveat=diffrax.SaveAt(ts=ts),
    throw=False,
)
x = sol.ys[-2, :]

for mag in np.logspace(-14, -4, 6):
    xx = np.linspace(-mag, mag, 200)
    fx_lse = jnp.asarray(
        [xdot_lse(0.0, x + jnp.asarray([eps, 0.0]), None)[0] for eps in xx]
    )
    fx = jnp.asarray(
        [xdot(0.0, x + jnp.asarray([eps, 0.0]), None)[0] for eps in xx]
    )
    f, axes = plt.subplots(1, 2)
    axes[0].plot(xx, fx_lse, marker='o', label='xdot_lse')
    axes[0].plot(xx, fx, marker='.', label='xdot')
    axes[0].set_ylabel('value of dx/dt[0] at x+eps')
    axes[0].set_xlabel('eps')
    axes[0].legend()
    axes[1].plot(xx, fx - fx_lse, marker='o')
    axes[1].set_ylabel('implementation difference dx/dt[0] at x+eps')
    axes[1].set_xlabel('eps')
    plt.tight_layout()
    plt.show()

The example should fail at the last time-point with ~50k rejected steps and ~50k accepted steps. Minuscule changes to the parameters, e.g. changing the first entry in a from 6.026932645397832 to 6.02693264539783, allow the system to be solved in ~90 steps. This is odd, as the system is pretty close to a steady state when the integration fails and should be easy to integrate.

I initially thought this might be the result of some numerical instability, but I'm no longer convinced that this is the case. For example, changing d to d = x[1] / (c + jnp.exp(b)) (as implemented in xdot) resolves the integration failure, but doesn't result in any appreciable improvement in the numerical stability with which the right-hand side can be evaluated (see the plots generated at the end of the script). The magnitude of the changes I see is in the range of 1e-11 to 1e-12, which in my understanding shouldn't matter too much for the tolerances that I am using. Therefore, my conclusion is that I might be hitting some weird numerical edge case.

@FFroehlich commented Mar 13, 2024

Turns out I was still using version 0.4.1 of diffrax in that project; this error does not reproduce under 0.5.0. However, I want to note that the integration would succeed in 0.4.1 after adding jax.debug.print("state: {state}", state=controller_state, ordered=True) at the linked code location, which to me suggests a more serious issue. Will close for now and report back if I encounter something similar.

@FFroehlich commented Mar 13, 2024

The problem persists in 0.5.0 with a slightly different parameter value; I have updated the example above. The integration failure can still be "fixed" by adding jax.debug.print("state: {state}", state=controller_state, ordered=True) at the linked code location.
FFroehlich reopened this Mar 13, 2024

@patrick-kidger

Hmm, interesting. The fact that things change when you add jax.debug.print suggests that this is probably a floating-point thing.

Maybe try adding the debug statement at the very start or very end of the loop? I think it's much less likely that any optimisation/etc. passes are being applied to it there (JAX/XLA do far fewer optimisations across control flow boundaries), in which case that might offer a window into what's going on.

Another option for debugging might be to use SaveAt(steps=True), possibly in conjunction with varying t1 whilst looking at sol.stats.
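
A minimal sketch of what that could look like, reusing xdot_lse and y0 from the example above (the sol.stats keys here are from memory and may differ across diffrax versions):

import numpy as np
import diffrax

for t1 in np.logspace(-2, 2, 9):  # sweep the end time
    sol = diffrax.diffeqsolve(
        diffrax.ODETerm(xdot_lse),
        solver=diffrax.Kvaerno5(),
        t0=0.0,
        t1=float(t1),
        dt0=1e-8,
        y0=y0,
        stepsize_controller=diffrax.PIDController(
            atol=1e-8, rtol=1e-6, pcoeff=0.4, icoeff=0.3, dcoeff=0
        ),
        max_steps=int(1e5),
        saveat=diffrax.SaveAt(steps=True),  # save every internal step
        throw=False,
    )
    # how many steps were accepted/rejected to reach this t1
    print(t1, sol.stats["num_accepted_steps"], sol.stats["num_rejected_steps"])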

@FFroehlich

It only happens with a debug statement that involves the controller state.

@FFroehlich

I have tried debugging this with steps=True and looking at the output, but it hasn't been particularly insightful. Any slight change to the overall configuration tends to make the problem magically disappear, but in sets of thousands of ODE solves there always tend to be a handful that fail (assuming there is only one underlying problem).

Why would the debug statement point to a floating-point issue?

@patrick-kidger

So adding in a debug statement changes the nature of the program very little. The only thing it really changes is to require that the printed value must actually exist -- and in particular, that its node in the computation graph cannot be optimised away by the compiler. For example, given an integer a, then

b = a + 1
c = b - 1

would probably just get optimised down to c = a. However if we added in a jax.debug.print('{}', b), then we will in fact have to compute b, and so the computation has changed ever-so-slightly.

In practice, optimisations are of course meant not to change the behaviour of the program. So if they do, it's usually due to any of the many twiddly floating point gotchas that can subtly adjust results.

None of the above is 100% btw, it's more a rule of thumb.
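
A runnable toy version of that sketch (a hypothetical example, not taken from diffrax):

import jax
import jax.numpy as jnp


@jax.jit
def f(a):
    b = a + 1
    # For integers, (a + 1) - 1 can legitimately be folded down to just a,
    # so b need never exist. Printing b forces it to be materialised, which
    # changes the compiled program slightly.
    jax.debug.print("b = {}", b)
    return b - 1


f(jnp.int32(41))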


If it's only the controller state that causes issues with debugging, then that sounds like we can still insert jax.debug.print statements for everything else, and in doing so mostly see what's going on?

@FFroehlich

Right, so I've done that and I understand what is going on: the nonlinear solve diverges once and then appears to get stuck, repeatedly failing. However, I don't see why the nonlinear solve would fail: it's a pretty simple problem close to steady state, and the solver had been taking huge steps with small predicted errors just before.

You can see an edited debugging output (linked below) obtained with the following statements:

  • jax.debug.print("### diffsize: {diffsize}, diffsize_prev: {diffsize_prev}, rate: {rate}, factor: {factor}, small: {small}, diverged: {diverged}, converged: {converged}", diffsize=state.diffsize, diffsize_prev=state.diffsize_prev, rate=rate, factor=factor, small=small, diverged=diverged, converged=converged) just before the linked return terminate, result line
  • jax.debug.print("## stage_index={stage_index}, result={result}", stage_index=stage_index, result=result) at the linked location
  • jax.debug.print("# tprev: {tprev}, tnext:{tnext}, y: {y}, y_error: {y_error}, state: {state}, result: {result}", tprev=state.tprev, tnext=state.tnext, y=y, y_error=y_error, result=solver_result, state=state.solver_state) at the linked location

Unfortunately, I don't seem to be able to use ordered=True in these print statements as it produces an error, but the printed statements appear to be well-ordered anyway. I have removed most of the early output, as nothing interesting happens there, and only kept the statements from _integrate to illustrate the step sizes the solver was taking before the failure. I have also set max_steps=int(1e2) to keep the output manageable:

https://gist.github.com/FFroehlich/a7378fcba87af32d307894e43dd82367

@patrick-kidger

If it's specifically the nonlinear solve that is doing odd things, then perhaps this is an issue with Optimistix. (Or with VeryChord?) Would it be possible to extract the inputs to the nonlinear solve we make, and consider that in isolation?

One immediate possible culprit that comes to mind is that you might be right on the edge of this acceptance criterion:

https://github.com/patrick-kidger/diffrax/blob/main/diffrax/_root_finder/_verychord.py#L18-L22

which might have tolerances that are still slightly too loose.

@FFroehlich

Well, isolating the inputs to the nonlinear solve is non-trivial, since it requires reconstructing the nonlinear function, which depends on Butcher tableaus etc. The way the linked code is implemented means that I would have to copy a ton of code to reconstruct the inputs to the nonlinear_sol = optx.root_find(...) call.

I tried, but lost interest after assembling ~100 lines of code that were scattered throughout the whole file and imported from elsewhere.

In the end, I am not really convinced that going through this exercise would prove particularly insightful. The non-linear solver diverges, which is bound to happen with Newton schemes without a globalisation strategy.

I suppose both the termination criteria and the (lack of a) globalisation strategy are largely motivated by tribal knowledge and empirical choices based on sets of test problems established in the mathematics community decades ago. For example, SUNDIALS uses slightly different criteria (https://github.com/LLNL/sundials/blob/2abd63bd6cbc354fb4861bba8e98d0b95d65e24a/src/cvodes/cvodes_nls.c#L303) compared to the conditions in diffrax, which seem to be based on the 1996 book by Wanner & Hairer. I haven't found any documentation/code on what the Julia folks are using at the moment, but they appear to plan to offer more flexibility through https://github.com/SciML/NonlinearSolve.jl.

@patrick-kidger commented Mar 29, 2024

That's fair! Reconstructing the inputs for that isn't super easy.

FWIW NonlinearSolve.jl is basically the equivalent of our Optimistix. In terms of globalisation mechanisms, you might find the relevant page of the Optimistix documentation interesting, as it discusses how Optimistix makes it possible to mix-and-match the various pieces of a globalisation mechanism. (Although we elected not to use the 'globalisation' terminology, which mostly just seems to promote confusion.)

Whilst such strategies aren't used by default in Diffrax, they are available to the advanced user. Diffrax will use any Optimistix root-finder you like (more precisely, anything implementing the optx.AbstractRootFinder interface), so you could implement one with a line search etc. if you wish. We've been meaning to add built-in support for this to optx.Newton, optx.Chord and diffrax.VeryChord -- analogous to what is already done in e.g. optx.GaussNewton -- and just haven't gotten around to it yet.

(If you think that's the appropriate solution to your problem, and are feeling sufficiently motivated, then we'd certainly be happy to take a PR on that!)

@FFroehlich commented Apr 6, 2024

Okay, I managed to consider the non-linear solve in isolation by starting the integration at the problematic step.

The non-linear solver failures can be remedied in a variety of ways: "re-starting" the solver by not passing the respective solver_state, or changing the divergence check to be a bit more lenient (the Newton method converges in the next step).

I no longer think this is a bug, but rather the non-linear solver not being sufficiently robust for my purposes. It's probably the combination of large sample sizes and the limited numerical accuracy of the right-hand side. Close to or at steady state, the Newton step is effectively just noise, so the divergence check will occasionally fail. In this case it seems to fail remarkably often, even in successive integration steps, which might hint at some other structure to the problem that I am still missing. _small, which should likely handle such settings, checks for diffsize < 1e-13 when using double precision, which I think is too strict. diffsize is computed as ||diff / (atol + y_new * rtol)||, so for example with rtol=0.0 and atol=1e-8 this would check for ||diff|| < 1e-21, which does not seem reasonable.
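
To spell that arithmetic out (a tiny sketch using the tolerances mentioned above):

atol, rtol = 1e-8, 0.0
threshold = 1e-13  # the _small cutoff in double precision
# diffsize = ||diff / (atol + y_new * rtol)||, so with rtol = 0 the scale is
# just atol, and diffsize < threshold requires
max_diff = threshold * atol
print(max_diff)  # 1e-21, far stricter than the solve tolerances themselves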

I have re-implemented a convergence check that is similar to what is done in SUNDIALS (except that rate is not stored in the state):

import optimistix as optx  # for optx.RESULTS below


class SundialsChord(diffrax.VeryChord):
    def terminate(
        self,
        fn: callable,
        y,
        args,
        options,
        state,
        tags: frozenset[object],
    ):
        del fn, y, args, options, tags
        rate = state.diffsize / state.diffsize_prev
        factor = state.diffsize * jnp.nanmin(jnp.array([1.0, rate])) / 0.2
        converged = jnp.logical_and(state.step >= 2, factor < 1.0)
        diverged = jnp.logical_and(
            state.step >= 2, state.diffsize > 2.0 * state.diffsize_prev
        )
        terminate = diverged | converged
        terminate_result = optx.RESULTS.where(
            diverged | jnp.invert(converged),
            optx.RESULTS.nonlinear_divergence,
            optx.RESULTS.successful,
        )
        linsolve_fail = state.result != optx.RESULTS.successful
        result = optx.RESULTS.where(
            linsolve_fail, state.result, terminate_result
        )
        terminate = linsolve_fail | terminate
        return terminate, result

which solves the particular problem and uses a bit fewer steps, but doesn't appear to work well on other problems.
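
For completeness, a sketch of how this could be plugged into the solve above (assuming, as I believe is the case, that the implicit solvers accept a root_finder argument and that VeryChord takes rtol/atol in its constructor; the tolerance values are purely illustrative):

solver = diffrax.Kvaerno5(
    # illustrative tolerances; I believe the default root finder instead picks
    # these up from the step-size controller
    root_finder=SundialsChord(rtol=1e-6, atol=1e-8)
)
# then pass solver=solver to diffrax.diffeqsolve as in the example above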

@patrick-kidger

Nice! I'm really glad you got this working. :)
