Explosion of steps for specific parameter values #386
Turns out I was still using version 0.4.1 of diffrax in that project; this error does not reproduce under 0.5.0. However, I want to note that the integration would succeed in 0.4.1 after adding a debug statement at Line 253 in 7f30854.
Problem persists with a debug statement at Line 274 in d97ba20.
Hmm, interesting. The fact that things change when you add a debug statement is suspicious. Maybe try adding the debug statement at the very start or very end of the loop? I think it's much less likely that any optimisation passes are being applied to it there (JAX/XLA do far fewer optimisations across control flow boundaries), in which case that might offer a window into what's going on. Another option for debugging might be to use …
It only happens with a debug statement that involves the controller state.
I have tried debugging this with `steps=True` and looking at the individual steps, but it hasn't been particularly insightful. Any slight changes to the overall configuration tend to make the problem disappear. Why would the debug statement point to a floating point issue?
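For reference, a minimal sketch of that kind of step-level inspection — the vector field, solver, and tolerances here are illustrative stand-ins, not the original example:

```python
import diffrax
import jax.numpy as jnp

def vector_field(t, y, args):
    # Hypothetical stand-in right-hand side.
    return -y

sol = diffrax.diffeqsolve(
    diffrax.ODETerm(vector_field),
    diffrax.Kvaerno5(),
    t0=0.0,
    t1=10.0,
    dt0=0.1,
    y0=jnp.array([1.0]),
    stepsize_controller=diffrax.PIDController(rtol=1e-8, atol=1e-8),
    saveat=diffrax.SaveAt(steps=True),  # save the state after every accepted step
    max_steps=100_000,
)
ts = sol.ts[jnp.isfinite(sol.ts)]  # unused step slots are padded with inf
print(sol.stats["num_accepted_steps"], sol.stats["num_rejected_steps"])
```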
So adding in a debug statement changes the nature of the program very little. The only thing it really changes is to require that the outputted value must exist -- and in particular, that its node in the computation graph cannot be optimised away by the compiler. For example, given an integer `a`, then

```python
b = a + 1
c = b - 1
```

would probably just get optimised down to `c = a`.

In practice, optimisations are of course meant not to change the behaviour of the program. So if they do, it's usually due to one of the many twiddly floating point gotchas that can subtly adjust results. (None of the above is 100%, btw; it's more a rule of thumb.)

If it's only the controller state that causes issues with debugging, then that sounds like we can still insert debug statements everywhere else …
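As a concrete illustration (a sketch, not from the thread — whether XLA actually performs any particular simplification is situational):

```python
import jax
import jax.numpy as jnp

@jax.jit
def f(a):
    b = a + 1.0
    # Without the print below, the compiler is free to simplify (a + 1) - 1;
    # printing b forces its node to be kept alive in the compiled graph, which
    # can change which rewrites are applied and hence the exact rounding.
    jax.debug.print("b = {}", b)
    return b - 1.0

# In float32, (0.1 + 1) - 1 != 0.1 exactly, so such rewrites are observable.
print(f(jnp.float32(0.1)))
```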
Right, so I've done that, and I understand what is going on: the nonlinear solve diverges once and then appears to get stuck, repeatedly failing. However, I don't see why the nonlinear solve would fail: it's a pretty simple problem, pretty close to steady state, and the solver had been taking huge steps with small predicted errors before. You can see an edited debugging output here: https://gist.github.com/FFroehlich/a7378fcba87af32d307894e43dd82367 (unfortunately, I don't seem to be able to use …).
If it's specifically the nonlinear solve that is doing odd things, then perhaps this is specifically an issue with Optimistix (or with …).

One immediate possible culprit that comes to mind is that you might be right on the edge of this acceptance criterion, whose tolerances might still be slightly too loose: https://github.com/patrick-kidger/diffrax/blob/main/diffrax/_root_finder/_verychord.py#L18-L22
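If that's the culprit, one thing to try — a sketch, with illustrative tolerance values — is to hand the implicit solver a root finder constructed with tighter tolerances:

```python
import diffrax

# Kvaerno5 stands in for whichever implicit solver is being used; VeryChord's
# rtol/atol control when the nonlinear solve is accepted as converged.
solver = diffrax.Kvaerno5(
    root_finder=diffrax.VeryChord(rtol=1e-10, atol=1e-10),
)
```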
Well, isolating the inputs to the nonlinear solve is … — the relevant call sites are:

diffrax/diffrax/_solver/runge_kutta.py, Line 444 in d97ba20
diffrax/diffrax/_solver/runge_kutta.py, Line 984 in d97ba20
I tried, but lost interest after assembling ~100 lines of code that were scattered throughout the whole file and imported from elsewhere. In the end, I am not really convinced that going through this exercise would prove particularly insightful. The nonlinear solver diverges, which is bound to happen with Newton schemes without a globalisation strategy. I suppose both the termination criteria and the (lack of) globalisation strategy are largely motivated by tribal knowledge and empirical choices, based on a set of test problems that were established in the mathematics community decades ago. For example, SUNDIALS uses slightly different criteria: https://github.com/LLNL/sundials/blob/2abd63bd6cbc354fb4861bba8e98d0b95d65e24a/src/cvodes/cvodes_nls.c#L303
That's fair! Reconstructing the inputs for that isn't super easy.

FWIW, whilst such strategies aren't used by default in Diffrax, they are available to the advanced user: Diffrax will use any Optimistix root finder you like (more precisely, anything implementing the `optimistix.AbstractRootFinder` interface).

(If you think that's the appropriate solution to your problem, and are feeling sufficiently motivated, then we'd certainly be happy to take a PR on that!)
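For instance, a sketch of swapping in a full Newton iteration (re-linearising at every step, rather than the default chord method); the tolerance values are illustrative:

```python
import diffrax
import optimistix as optx

# Any optx.AbstractRootFinder subclass can be passed here.
solver = diffrax.Kvaerno5(root_finder=optx.Newton(rtol=1e-8, atol=1e-8))
```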
Okay, I managed to consider the nonlinear solve in isolation by starting the integration at the problematic integration step. The nonlinear solver failures can be remedied by a variety of things: "re-starting" the solver by not passing the respective `solver_state`, or changing the divergence check to be a bit more lenient (the Newton method will converge in the next step). I no longer think this is a bug, but rather the nonlinear solver not being sufficiently robust for my purposes. It's probably the combination of large sample sizes and the limited numerical accuracy of the right-hand side. Close to/at steady state, the Newton step is effectively just noise, so the divergence check will just fail occasionally. In this case it seems to fail remarkably often, even in successive integration steps, which might hint at some other structure to the problem that I am still missing.

I have re-implemented a convergence check that is similar to what is done in SUNDIALS (the rate is not stored in the state), which solves the particular problem and uses a bit fewer steps, but doesn't appear to work well on other problems.
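For reference, a rough sketch of the SUNDIALS-style test being described — this is my reading of CVODE's cvodes_nls.c, with its CRDOWN = 0.3 and RDIV = 2.0 constants, simplified so that the convergence rate is recomputed from the last two updates rather than carried in the solver state:

```python
import jax.numpy as jnp

def sundials_style_check(delta_norm, prev_delta_norm, crate, tol):
    # delta_norm:      weighted norm of the current Newton update
    # prev_delta_norm: norm of the previous update (inf on the first iteration)
    # crate:           running estimate of the convergence rate
    rate = delta_norm / prev_delta_norm
    crate = jnp.maximum(0.3 * crate, rate)  # damped rate estimate (CRDOWN)
    # Accept if the rate-extrapolated remaining error is below tolerance.
    converged = delta_norm * jnp.minimum(1.0, crate) < tol
    # Declare divergence if the update grew by more than a factor of two (RDIV).
    diverged = rate > 2.0
    return converged, diverged, crate
```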
Nice! I'm really glad you got this working. :)
I have been experiencing odd integration failures in large sets of solves of relatively small, simple systems of equations. I have narrowed this down to a small example:
The example should fail at the last time point with about ~50k rejected steps and ~50k accepted steps. Minuscule changes to the parameters, e.g. changing the first entry in `a` from `6.026932645397832` to `6.02693264539783`, allow the system to be solved in ~90 steps. This is odd, as the system is pretty close to a steady state when the integration fails and should be easy to integrate.

I initially thought this might be the result of some numerical instability, but I'm no longer convinced that this is the case. For example, changing `d` to `d = x[1]/(c + jnp.exp(b[0]))` (implemented in `xdot`) resolves the integration failure, but doesn't result in any appreciable improvement in the numerical stability with which the right-hand side can be evaluated (see the plots generated at the end of the script). The magnitude of the changes that I see is in the range of 1e-11 to 1e-12, which in my understanding shouldn't matter too much for the tolerances that I am using. Therefore, my conclusion is that I might be hitting some weird numerical edge case.
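To put "minuscule" in numbers (a quick illustrative check, not part of the original script): the two values of `a[0]` differ by about 2e-15, i.e. on the order of a few float64 ulps at that magnitude:

```python
import jax
import jax.numpy as jnp

jax.config.update("jax_enable_x64", True)  # the comparison below needs float64

a0_fail = jnp.float64(6.026932645397832)
a0_ok = jnp.float64(6.02693264539783)
diff = a0_fail - a0_ok
ulp = jnp.finfo(a0_ok.dtype).eps * a0_ok  # approximate spacing at this magnitude
print(diff, diff / ulp)  # ~2e-15, i.e. just a few ulps
```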