Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

structure quasi-static solver try/catch too general #658

Open
richardschu opened this issue Apr 17, 2024 · 8 comments
Open

structure quasi-static solver try/catch too general #658

richardschu opened this issue Apr 17, 2024 · 8 comments

Comments

@richardschu
Copy link
Contributor

I think the try/catch block in the structure's quasi-static solver is too generous and can make finding bugs in any called linear solver quite hard. If you had any assertion thrown within the block, the solver would just state something like:

An error occurred in line <188> of file </home/richardschussnig/dealii-candi/exadg/include/exadg/structure/time_integration/driver_quasi_static_problems.cpp> in function
void ExaDG::Structure::DriverQuasiStatic<dim, Number>::do_solve() [with int dim = 3; Number = double]
The violated condition was:
success
Additional information:
Could not solve quasi static problem even after reducing the load
increment.

The only assertion we would like to catch is Newton solver divergence, which expresses itself in overflow/underflow some NaN possibly and linear solver nonconvergence ... maybe more? So identifying all these expections to catch would be one way.

Another way would be to pass in a boolean flag newton_converged, whichs state is false originally and set to true only if everything went as expected. Then, we would remove the AssertThrows here and here in the Newton solver and simply rely on the returned flag (+ returnstatements). Then we could get rid of the try/catch block.

Yet another option would be to define custom exceptions we can then throw and catch outside - maybe the cleanest option, were we do not rely on a flag passed around several software layers!

What are your opinions?
Once we converge at something, I volunteer to make a draft 🚀

@nfehn
Copy link
Member

nfehn commented Apr 18, 2024

I think I do not have a clear answer to this right now.

Let me start with general considerations: A main problem that I see is that it is generally difficult to identify the precise reason for convergence failure of a linear/nonlinear solver. Many users have probably experienced that too large a time step for incompressible flows causes convergence failure of the pressure Poisson solver, and that it needs some experience to link this convergence failure to a violation of the CFL condition. For this example, I currently consider it difficult to considerably improve the situation from a user perspective by doing more refined throw/catch techniques for the linear solver, apart from listing/printing many possible reasons for convergence failure (collected over time from user experience) in case the problem fails to converge. So I have some concerns that we can come up with kind of an automatic solution to this type of problems, but I am very happy if you see possibilities for improvements in this direction. I could imagine it can also happen that the same problem might cause failure of the Newton solver or failure of the linear solver (depending on some other parameters), making it hard to draw conclusions from try/catch techniques. Regarding e.g. the (other) problem with invalid deformation states (Jacobian) for structural problems with large deformation, in my experience it helped more to explicitly do such checks (is the Jacobian valid?) than trying to react to non-convergence of a solver. This would imply we need to understand the problem that causes convergence failure and then integrate this knowledge into the code.

In the beginning, you write "bugs", where I am not sure what you mean exactly, mistakes in implementation? Currently, I consider bugs a different topic than try/catch of run time errors.

Another way would be to pass in a boolean flag newton_converged, whichs state is false originally and set to true only if everything went as expected. Then, we would remove the AssertThrows here and here in the Newton solver and simply rely on the returned flag (+ returnstatements). Then we could get rid of the try/catch block.

I am not sure if I understood this correctly. Wouldn't this mean the simulation terminates in case the linear solver fails / throws an error (in case the outer try/catch block is removed)? However, from a user perspective there might still be the motivation to repeat the load step with a reduced load factor for example. Another disadvantage of removing Asserts from the Newton solver is that one would then have to implement corresponding messages at many places in the code that make use of the Newton solver. So I currently think additional parameters for the convergence state in the Newton solver (which might principally be helpful) do not allow to remove outer try/catch blocks.

@kronbichler
Copy link
Contributor

I think one of the intents of this issue is to discuss if we would like to try to separate convergence issues of any kind from other assertions we might throw, assertions that check missing parameters or missing cases. So in a sense, I believe that

catch(...)
{
// undo changes in solution vector
solution = old_solution;
++re_try_counter;

should not just pick any exception ..., but only a specific type of exceptions, e.g. those where the linear solver or nonlinear solver fail to converge, with infinite norms and other behavior. Unfortunately, deal.II is not really stringent in throwing useful exceptions for NaN or similar conditions (so I think we would need to do it in ExaDG, but I admit I have not looked closely), many of those actually use Assert instead of AssertThrow.

To give an example, if one of the ingredients inside the solver would throw an exception that a particular feature is not implemented inside the Newton solver, then this catch(...) block would misleadingly report that some convergence issue is happening, and re-try the solver again. Is this what you observed @richardschu? If that is the kind of error you are looking at, I think we should try to improve the situation by filtering the exception type in the try/catch block.

@richardschu
Copy link
Contributor Author

Thank you for your detailed answers and taking the time to type them down here!
I fully agree that automatically catching and handling exceptions that hint at divergence of a nonlinear or linear solver is tricky and is a complex matter. Here, I want to focus on letting the user or developer know what was the problem encountered. So I think this comes even before trying to redo some timesteps or redo some load steps etc. Also, I want to preserve the functionality of the quasi-static solver being able to redo some load steps with reduced step size, since this can safe ones 🍑 in some cases.

In the beginning, you write "bugs", where I am not sure what you mean exactly, mistakes in implementation? Currently, I consider bugs a different topic than try/catch of run time errors.
I was having AssertThrow(false, dealii::ExcMessage("reason 1"));, and did not realize this bug, since the "reason 1" was never printed to screen, but instead the loadstep was reduced until the final step with basically zero loading was failing as well. I think we should communicate the specific exception messages and try to continue with the nonlinear solve.

However, from a user perspective there might still be the motivation to repeat the load step with a reduced load factor for example. Another disadvantage of removing Asserts from the Newton solver is that one would then have to implement corresponding messages at many places in the code that make use of the Newton solver.

I fully agree, I suggested to preserve the functionality, but using a different implementation (flags instead of exceptions). That was maybe a bit unclear, but I think passing around the flags is tedious anyways and custom exceptions might do the trick in a more elegant fashion.

many of those actually use Assert instead of AssertThrow
That is unfortunate, but better than nothing. Mayb we can/should(?) check them in ExaDG. The question would be if they are performance critical in production runs or if one wants those anyways only for debugging or during development. I will present some tests below.

Is this what you observed @richardschu?
Yes, exactly. The problem I encountered is however similar to more relevant ones as described above. I would argue we need asserts and we need to know what was actually going wrong --- and then still continue to do the load stepping just as we do now.

@richardschu
Copy link
Contributor Author

My current solution goes along the lines of what @kronbichler suggested: introduce custom exceptions and catch them:
The exceptions:
exc def
Throwing them:
exc 1
exc 2
-> note that these reproduce exactly the behavior we had before.
The try/catch block in the quasi-static solver:
trycatch
So the Newton exceptions are caught properly, the exc.what() lets the user know which kind of exception was triggered. We could go ahead and also catch the linear solver non-convergence etc.

@richardschu
Copy link
Contributor Author

The current behavior of this example code in debug mode is:
1.) value / 0.0 in operator evaluation:
asserts with detailed message:
An error occurred in line <1640> of file </home/richard/dealii-candi/dealii/source/lac/trilinos_sparse_matrix.cc> in function void dealii::TrilinosWrappers::SparseMatrix::add(dealii::TrilinosWrappers::SparseMatrix::size_type, dealii::TrilinosWrappers::SparseMatrix::size_type, const size_type*, const TrilinosScalar*, bool, bool) The violated condition was: dealii::numbers::is_finite(values[j])
(and ofc does then not reduce the load step, but aborts)
2) linear solver non-convergence: (testing with one iteration of the linear solver and requiring zero residual, which is impossible in general)
asserts, repeats load step as intended, but with detailed error message:
An error occurred in line <2237> of file </home/richard/dealii-candi/dealii/include/deal.II/lac/solver_gmres.h> in function void dealii::SolverFGMRES<VectorType>::solve(const MatrixType&, VectorType&, const VectorType&, const PreconditionerType&) [with MatrixType = ExaDG::Structure::NonLinearOperator<3, double>; PreconditionerType = ExaDG::PreconditionerBase<double>; VectorType = dealii::LinearAlgebra::distributed::Vector<double, dealii::MemorySpace::Host>] Additional information: Iterative method reported convergence failure in step 1. The residual in the last step was 4.25468.
3) nonlinear solver non-converge: (testing with a single Newton step and requiring zero residual, which is impossible in general)
prints short message, reduces loadstep

@richardschu
Copy link
Contributor Author

The current behavior of this example code in release mode is:
1.) value / 0.0 in operator evaluation:
asserts, but with the exception in the linear solver with detailed message:
The violated condition was: solver_state == SolverControl::success Additional information: Iterative method reported convergence failure in step 1. The residual in the last step was nan.
So this does not detect the invalid value in the operator evaluation, but the solver cannot converge, and in fact exits in iteration one (probably a good hint already?).
2) linear solver non-convergence: (testing with one iteration of the linear solver and requiring zero residual, which is impossible in general)
asserts, repeats load step as intended, but with detailed error message:
Additional information: Iterative method reported convergence failure in step 1. The residual in the last step was 4.19586.
3) nonlinear solver non-converge: (testing with a single Newton step and requiring zero residual, which is impossible in general)
prints short message, reduces loadstep

@richardschu
Copy link
Contributor Author

I think this is actually already quite close to what we want as a first step?
Reducing the error message might be interesting, but how could that be achieved?
Also, for the debug and "*1/0" zero case it would be great, if the solver would not just abort but continue in reducing the load etc. as it does in release mode.
What do you guys think?

@kronbichler
Copy link
Contributor

Thank you for the experiment. I think case 2 and case 3 perform as expected (apart from the error message possibly, but that is a minor issue). For case 1, the difference in behavior for debug and release mode is a bit annoying, so here we should think about a better approach. I think that the case when we have NaN values should still try to reduce the load step/time step or do whatever other action we have foreseen for non-convergence, because the source of NaN could be run-time problems, such as a too aggressive update of a position. On the other hand, cases like "not implemented" should abort immediately and not try to repeat steps, and this is where

This means that I agree with the points @nfehn made about the difficulty for automatically detecting non-convergence and the usefulness of repeating steps. Of course we should also work on other mechanisms to detect failure where they fit, because they might be more informative than the linear/nonlinear solver no-convergence.

Regarding your code @richardschu I believe we need to be more restrictive, because if we use something like AssertThrow(false, dealii::ExcNotImplemented());, then that assertion also inherits from std::exception (via dealii::ExceptionBase), so what I imagined would not work yet. So this means we would need to catch our no-convergence exceptions more explicitly. Or how would you approach it?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants