Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Edge collapse cause Numerical Issue in solving L2 projection #302

Open
1 of 3 tasks
liuyulinn opened this issue Jan 16, 2024 · 12 comments · Fixed by #303
Open
1 of 3 tasks

Edge collapse cause Numerical Issue in solving L2 projection #302

liuyulinn opened this issue Jan 16, 2024 · 12 comments · Fixed by #303
Assignees
Labels
bug Something isn't working

Comments

@liuyulinn
Copy link

Describe the bug
Thanks for your amazing work!
However, when I run your in-timestep-remeshing code, I encountered Segmentation fault (core dumped). I find that this bug encounters the first time successfully collapsing an edge, and occurs in constrained_L2_projection() function, al_solver.solve_reduced(nl_solver, problem, sol);
when the Newton solver (in PolySolve) uses linear_solver0>analyze_pattern(hessian, hessian.rows()); .

I guess it's a bug related to edge collapse or L2 projection because when I try sizing field method, after successfully collapsing the first edge, in unconstrained_L2_projection() function, it causes Numerical Issue in solver->factorize(M).

I find that it may due to Eigen::PardisoLLT solver used in Newton solver or Eigen::SimplicialLDLT solver used in unconstrained_L2_projection requires that the matrix to be positive-definite. However, in the above cases, the determinate of the matrix is zero.

This bug can be forcely solved by adding a small positive diagonal matrix before using these solvers, however, it will leads to cpu down (get stucked in solver) and terminal killed. So I guess there's some underlying bug.

To re-produce the bug, you can try:
./PolyFEM_bin -j in-timestep-remeshing-data/scripts/masticator/h=0.1.json
I'm using the json file in https://github.com/zfergus/in-timestep-remeshing-data.git.

Platform

  • Windows
  • macOS
  • Linux
@liuyulinn liuyulinn added the bug Something isn't working label Jan 16, 2024
@zfergus zfergus self-assigned this Jan 17, 2024
@zfergus
Copy link
Member

zfergus commented Jan 17, 2024

I have a fix for this I will push it today.

@liuyulinn
Copy link
Author

I have checked your pull request: Fix mass matrix in L2 projection #303. However, it still can't work. Got Segmentation fault (core dumped) again.

@zfergus zfergus reopened this Jan 17, 2024
@zfergus
Copy link
Member

zfergus commented Jan 17, 2024

Strange... it is working for me.

Can you run it in debug mode? The easiest way to do this is to run the sim in release until it crashes and then use the restart data to restart in debug mode. For example

./PolyFEM_bin -j ~/Desktop/bug/restart_087.json -o ~/Desktop/bug-restart

Can you also try the updated in-timestep data? I made some changes to common.json to make it compatible with the latest PolyFEM.

@liuyulinn
Copy link
Author

When I restart I still get:

[2024-01-17 14:12:57.745] [polyfem] [debug] [accept] E0=0.157682   E1=0.157639   (E0-E1)=4.2706e-05  tol=-2.5e-11 local_ndof=20 n_iters=1
[2024-01-17 14:12:57.779] [polyfem] [error] [SparseProjectedNewton][Backtracking] Finished: Iteration limit reached. Took 0.00458614s (niters=5 f=0.294432 Δf=3.92343e-05 ‖∇f‖=0.0300763 ‖Δx‖=0.000478029) (stopping criteria: max_iters=5 Δf=0 ‖∇f‖=5e-10 ‖Δx‖=5e-05)
[2024-01-17 14:12:57.781] [polyfem] [debug] [reject] E0=0.123252   E1=0.294424   (E0-E1)=-0.171172   tol=-2.5e-11 local_ndof=18 n_iters=5
[2024-01-17 14:12:57.781] [wmtk] [info] cnt_success 1 cnt_fail 4
[2024-01-17 14:12:57.782] [polyfem] [info] [split]    aggregate_cnt_success 2 aggregate_cnt_fail 142
[2024-01-17 14:12:57.782] [polyfem] [info] [collapse] aggregate_cnt_success 1 aggregate_cnt_fail 4
[2024-01-17 14:12:57.782] [polyfem] [info] [swap]     aggregate_cnt_success 0 aggregate_cnt_fail 0
[2024-01-17 14:12:57.782] [polyfem] [info] [smooth]   aggregate_cnt_success 0 aggregate_cnt_fail 0
[2024-01-17 14:12:57.819] [polyfem] [debug] Successfully applied boundary conditions; solving in reduced space:
[2024-01-17 14:12:57.820] [polyfem] [debug] Starting SparseNewton with Backtracking solve f₀=-0.207228 (stopping criteria: max_iters=3000 Δf=0 ‖∇f‖=5e-12 ‖Δx‖=5e-05)
Segmentation fault (core dumped)

But I will try the updated in-timestep data next.

@zfergus
Copy link
Member

zfergus commented Jan 17, 2024

Is this run in debug mode (i.e. run CMake with -DCMAKE_BUILD_TYPE=Debug)?

@liuyulinn
Copy link
Author

liuyulinn commented Jan 17, 2024

It is run in debug mode. I still got:

[2024-01-17 14:34:20.257] [wmtk] [info] cnt_success 1 cnt_fail 4
[2024-01-17 14:34:20.258] [polyfem] [info] [split]    aggregate_cnt_success 2 aggregate_cnt_fail 142
[2024-01-17 14:34:20.258] [polyfem] [info] [collapse] aggregate_cnt_success 1 aggregate_cnt_fail 4
[2024-01-17 14:34:20.258] [polyfem] [info] [swap]     aggregate_cnt_success 0 aggregate_cnt_fail 0
[2024-01-17 14:34:20.258] [polyfem] [info] [smooth]   aggregate_cnt_success 0 aggregate_cnt_fail 0
[2024-01-17 14:34:20.972] [polyfem] [debug] Successfully applied boundary conditions; solving in reduced space:
[2024-01-17 14:34:20.973] [polyfem] [debug] Starting SparseNewton with Backtracking solve f₀=-0.206704 (stopping criteria: max_iters=3000 Δf=0 ‖∇f‖=5e-12 ‖Δx‖=5e-05)
Segmentation fault (core dumped)

@zfergus
Copy link
Member

zfergus commented Jan 17, 2024

Can you use gdb or lldb to find where the segfault comes from?

@liuyulinn
Copy link
Author

Thread 1 "PolyFEM_bin" received signal SIGSEGV, Segmentation fault.
0x000055555747a0e7 in mkl_pds_lp64_sssfct_pardiso ()
(gdb) bt
#0  0x000055555747a0e7 in mkl_pds_lp64_sssfct_pardiso ()
#1  0x000055555743f128 in mkl_pds_lp64_factorize_pardiso ()
#2  0x000055555743b2cd in mkl_pds_lp64_do_all_pardiso_fc ()
#3  0x00005555571d971c in mkl_pds_lp64_pardiso_c ()

#4  0x00005555571d0e74 in mkl_pds_lp64_pardiso ()
#5  0x0000555556c712fd in Eigen::internal::pardiso_run_selector<int>::run (pt=0x555560b923a8, maxfct=1, mnum=1, type=2, phase=22, n=1072, a=0x555560f5cf20, 
    ia=0x555560f58c00, ja=0x555560f6bad0, perm=0x555560f32390, nrhs=0, iparm=0x555560b925a8, msglvl=0, b=0x0, x=0x0)
    at /home/yulin/.cache/CPM/eigen/9cb760a6fcc3143cbb7856ce4548b1cf4b385f2e/Eigen/src/PardisoSupport/PardisoSupport.h:50
#6  0x0000555556c8c86e in Eigen::PardisoImpl<Eigen::PardisoLLT<Eigen::SparseMatrix<double, 0, int>, 2> >::factorize (this=0x555560b92348, a=...)
    at /home/yulin/.cache/CPM/eigen/9cb760a6fcc3143cbb7856ce4548b1cf4b385f2e/Eigen/src/PardisoSupport/PardisoSupport.h:311
#7  0x0000555556c83b38 in polysolve::linear::EigenDirect<Eigen::PardisoLLT<Eigen::SparseMatrix<double, 0, int>, 2> >::factorize (this=0x555560b92340, A=...)
    at /home/yulin/.cache/CPM/polysolve/238636140220c51ae4bf003f144270f07a52a00f/src/polysolve/linear/EigenSolver.tpp:47
--Type <RET> for more, q to quit, c to continue without paging--
#8  0x0000555556bcdb17 in polysolve::nonlinear::Newton::solve_sparse_linear_system (this=0x555560d90f70, objFunc=..., x=..., grad=..., direction=...)
    at /home/yulin/.cache/CPM/polysolve/238636140220c51ae4bf003f144270f07a52a00f/src/polysolve/nonlinear/descent_strategies/Newton.cpp:181
#9  0x0000555556bcd760 in polysolve::nonlinear::Newton::compute_update_direction (this=0x555560d90f70, objFunc=..., x=..., grad=..., direction=...)
    at /home/yulin/.cache/CPM/polysolve/238636140220c51ae4bf003f144270f07a52a00f/src/polysolve/nonlinear/descent_strategies/Newton.cpp:140
#10 0x0000555556baa8d2 in polysolve::nonlinear::Solver::compute_update_direction (this=0x7fffa001dc50, objFunc=..., x=..., grad=..., direction=...)
    at /home/yulin/.cache/CPM/polysolve/238636140220c51ae4bf003f144270f07a52a00f/src/polysolve/nonlinear/Solver.hpp:99
#11 0x0000555556ba6b38 in polysolve::nonlinear::Solver::minimize (this=0x7fffa001dc50, objFunc=..., x=...)
    at /home/yulin/.cache/CPM/polysolve/238636140220c51ae4bf003f144270f07a52a00f/src/polysolve/nonlinear/Solver.cpp:360
#12 0x00005555566c1a81 in polyfem::solver::ALSolver::solve_reduced (this=0x7fffffffaa00, 
    nl_solver=std::shared_ptr<polysolve::nonlinear::Solver> (use count 2, weak count 0) = {...}, nl_problem=..., sol=...)
    at /home/yulin/Documents/code/polyfem-1/src/polyfem/solver/ALSolver.cpp:107
#13 0x00005555568f8d30 in polyfem::mesh::constrained_L2_projection (

@zfergus
Copy link
Member

zfergus commented Jan 17, 2024

Ah, okay. I am using CHOLMOD instead of MKL Pardiso. I can try to reproduce it on a Linux machine with MKL support.

In the meantime try to switch to CHOLMOD on your side and see if that fixes the issue.

@liuyulinn
Copy link
Author

Thanks! But how can I switch to CHOLMOD?

@zfergus
Copy link
Member

zfergus commented Jan 17, 2024

Change

    "solver": {
        "linear": {
            "solver": ["Eigen::PardisoLLT", "Eigen::CholmodSupernodalLLT"]
        },

in scripts/common.json
to

    "solver": {
        "linear": {
            "solver": "Eigen::CholmodSupernodalLLT"
        },

@liuyulinn
Copy link
Author

Thanks! It works with CHOLMOD.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants