Crashes when running predict_instances on a 3D image #235

Closed
ximion opened this issue Jun 27, 2023 · 6 comments · Fixed by #249 · May be fixed by #248

Labels
bug Something isn't working

Comments

@ximion
Contributor

ximion commented Jun 27, 2023

Hi!

Recently, with some newly trained models but the same data on which StarDist has worked before, I am seeing crashes when running model.predict_instances (where model is a StarDist3D).

Inference runs to 100% completion, but then, just before finishing, the software crashes with a segmentation fault.

To reproduce
The code used is basically from the example:

from stardist.models import StarDist3D  # imports as in the StarDist example
from csbdeep.utils import normalize

model = StarDist3D(None, name=model_name, basedir=model_dir)
img = normalize(img_input, 1, 99.8, axis=axis_norm)
labels, details = model.predict_instances(img, n_tiles=model._guess_n_tiles(img))

This worked fine before; the only thing that has changed is that more labeled data was added for model training. The model itself was trained without issues.

Expected behavior
I would expect no crash, or at least an actionable message about what to do next in case something is wrong with the input data, instead of a segmentation fault.

Data and screenshots
Here is a debug backtrace of the crash (fortunately StarDist had debug symbols):

Thread 149054 "python" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7ffafdffb6c0 (LWP 1798761)]
0x00007ffeef9cf6b8 in _Z27_COMMON_polyhedron_to_labelPKfS0_S0_PKiiiiS2_iiiiiiiPi._omp_fn.0(void) () at stardist/lib/stardist3d_impl.cpp:1516
1516    stardist/lib/stardist3d_impl.cpp: No such file or directory.
(gdb) bt full
#0  0x00007ffeef9cf6b8 in _Z27_COMMON_polyhedron_to_labelPKfS0_S0_PKiiiiS2_iiiiiiiPi._omp_fn.0(void) () at stardist/lib/stardist3d_impl.cpp:1516
        new_label_if_labeled = <optimized out>
        inside = <optimized out>
        offset = <optimized out>
        x = 2434
        y = 38
        z = 229
        faces = <optimized out>
        n_rays = <optimized out>
        n_faces = <optimized out>
        labels = <optimized out>
        nz = <optimized out>
        ny = 4166
        nx = 4166
        render_mode = 0
        use_overlap_label = 0
        overlap_label = 0
        result = 0x7fef4f756010
        polyverts = <optimized out>
        bbox = {207, 317, 30, 247, 2357, 2626}
        i = <optimized out>
        curr_center = 0x77bcfe60
        hs_convex = {<std::_Vector_base<std::array<double, 4>, std::allocator<std::array<double, 4> > >> = {
            _M_impl = {<std::allocator<std::array<double, 4> >> = {<std::__new_allocator<std::array<double, 4> >> = {<No data fields>}, <No data fields>}, <std::_Vector_base<std::array<double, 4>, std::allocator<std::array<double, 4> > >::_Vector_impl_data> = {_M_start = 0x79857840, 
                _M_finish = 0x798585c0, _M_end_of_storage = 0x79858840}, <No data fields>}}, <No data fields>}
        hs_kernel = {<std::_Vector_base<std::array<double, 4>, std::allocator<std::array<double, 4> > >> = {
            _M_impl = {<std::allocator<std::array<double, 4> >> = {<std::__new_allocator<std::array<double, 4> >> = {<No data fields>}, <No data fields>}, <std::_Vector_base<std::array<double, 4>, std::allocator<std::array<double, 4> > >::_Vector_impl_data> = {_M_start = 0x77d68760, 
                _M_finish = 0x77d69ee0, _M_end_of_storage = 0x77d6a760}, <No data fields>}}, <No data fields>}

Environment:

  • StarDist version: 0.8.3
  • CSBDeep version: 0.7.3
  • TensorFlow version: 2.12.0
  • OS: Linux (Debian 12)
  • Training & inference run on an AMD EPYC 7443P CPU

If needed, I can probably share some of the training or sample data, but it is fairly large.
Thank you for making StarDist! :-)

@ximion ximion added the bug Something isn't working label Jun 27, 2023
@uschmidt83
Member

Hi @ximion and sorry for the late reply.

Inference runs to 100% completion, but then, just before finishing, the software crashes with a segmentation fault.

The segfault happens in one of StarDist's C++ extensions (used for post-processing) and might be related to shared-library issues and/or OpenMP parallelization.

You can control the number of threads with the environment variable OMP_NUM_THREADS. To debug this, try setting it to 1 from within Python, e.g. by pasting the following at the very top of your script or notebook:

import os
os.environ['OMP_NUM_THREADS'] = '1'

Even if this runs without crashing, it doesn't really solve the problem because the code will run much slower.

@ximion
Contributor Author

ximion commented Aug 29, 2023

@uschmidt83 Hi, sorry for the long delay; I had to wait a bit for some hardware repairs, but everything is fully up and running again (except for StarDist ^^).

I tried your suggestion, but it still crashes the same way as before:

Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x00007fffdb7a605e in _Z27_COMMON_polyhedron_to_labelPKfS0_S0_PKiiiiS2_iiiiiiiPi._omp_fn.0(void) () at stardist/lib/stardist3d_impl.cpp:1516
1516    stardist/lib/stardist3d_impl.cpp: No such file or directory.

The only observable difference was that it ran slightly slower.

@ximion
Contributor Author

ximion commented Aug 31, 2023

There is also no difference whether I execute it on the GPU or the CPU (I would not have expected that to matter, but I tried it just to be sure).

ximion added a commit to ximion/stardist that referenced this issue Sep 1, 2023
This is an attempt to resolve stardist#235; it does not fix the underlying cause, but it seems to make the crash extremely unlikely.
The actual issue appears to be a kind of stack corruption caused elsewhere in the program.
@ximion ximion mentioned this issue Sep 1, 2023
@ximion
Contributor Author

ximion commented Sep 1, 2023

So, I've tracked the crash down to the qh_fprintf function of Qhull, where the newMessage C array gets implicitly converted into a std::string, which is subsequently deallocated and somehow frees memory that was never actually allocated. From my understanding of the code this should not happen, so I suspect some memory corruption from elsewhere that is not caused by OpenMP (the bug happens even when compiling without it). So it's either a thread race condition in another part of the code, or a good old regular stack corruption. Very fun.
It probably only appeared now because some piece of the code relies on undefined behavior and a compiler update has exposed it (at least that's my guess; I could be completely wrong).
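
To illustrate the pattern (a hypothetical sketch, not the actual Qhull code): passing a plain C string to a function that takes a std::string creates a temporary that allocates and later frees heap memory; if the heap is already corrupted elsewhere, that innocent deallocation is where the crash, or ASan's "free not malloced" report, surfaces.

#include <cstdio>
#include <string>

// Hypothetical sketch (not the actual Qhull code): a message handler that
// takes std::string by const reference.
static void log_message(const std::string &msg) {
  std::fprintf(stderr, "%s\n", msg.c_str());
}

int main() {
  char newMessage[64];
  std::snprintf(newMessage, sizeof(newMessage), "qhull warning %d", 42);

  // The char array is implicitly converted into a temporary std::string
  // (heap allocation + copy) that is destroyed right after the call.
  // That destruction is an ordinary deallocation; if the heap has already
  // been corrupted elsewhere, this is merely where the crash shows up.
  log_message(newMessage);
  return 0;
}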

It would likely take an extended Valgrind session to track this down, which is extra painful because TensorFlow is in the mix (and will yield tons of reports that may or may not be false positives), Python as well, and on top of that even a small example allocates huge chunks of memory.

Somehow, though, upgrading Qhull made the issue appear only very rarely, and adding guards around the QhullQh message functions made the crash disappear for me entirely (which would hint at threading issues, but I'd be careful with that conclusion; it might just be coincidental). Unfortunately, AddressSanitizer is still unhappy with the code, so the problem likely wasn't fixed, just made to manifest less often or not at all.
Regardless, applying the above PR resolves(-ish) the issue for me! :-)

One thing to note is that the code now always crashes in qh_fprintf; that, together with the fact that the original stack trace pointed to a different place, makes me think we have memory corruption. AddressSanitizer didn't catch any bad writes, though, so it may actually be a different library that I haven't checked that is corrupting memory here (because ASan really should have caught that!). So looking into the places that call the message functions would be a good next step, followed by, unfortunately, Valgrind.

ASan trace for reference:

AddressSanitizer: CHECK failed: asan_errors.cpp:109 "((free_stack->size)) > ((0))" (0x0, 0x0) (tid=249280)
    #0 0x7f7befec22aa in CheckUnwind ../../../../src/libsanitizer/asan/asan_rtl.cpp:67
    #1 0x7f7befee2c55 in __sanitizer::CheckFailed(char const*, int, char const*, unsigned long long, unsigned long long) ../../../../src/libsanitizer/sanitizer_common/sanitizer_termination.cpp:86
    #2 0x7f7befe3078c in __asan::ErrorFreeNotMalloced::Print() ../../../../src/libsanitizer/asan/asan_errors.cpp:109
    #3 0x7f7befec1e27 in __asan::ScopedInErrorReport::~ScopedInErrorReport() ../../../../src/libsanitizer/asan/asan_report.cpp:141
    #4 0x7f7befebef1e in __asan::ReportFreeNotMalloced(unsigned long, __sanitizer::BufferedStackTrace*) ../../../../src/libsanitizer/asan/asan_report.cpp:239
    #5 0x7f7befeb9ed7 in operator delete(void*) ../../../../src/libsanitizer/asan/asan_new_delete.cpp:152
    #6 0x7f7bed967399 in std::__new_allocator<char>::deallocate(char*, unsigned long) /usr/include/c++/12/bits/new_allocator.h:158
    #7 0x7f7bed967399 in std::allocator_traits<std::allocator<char> >::deallocate(std::allocator<char>&, char*, unsigned long) /usr/include/c++/12/bits/alloc_traits.h:496
    #8 0x7f7bed967399 in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_destroy(unsigned long) /usr/include/c++/12/bits/basic_string.h:292
    #9 0x7f7bed967399 in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_dispose() /usr/include/c++/12/bits/basic_string.h:286
    #10 0x7f7bed967399 in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::~basic_string() /usr/include/c++/12/bits/basic_string.h:795
    #11 0x7f7bed967399 in qh_fprintf stardist/lib/external/qhull_src/src/libqhullcpp/QhullUser.cpp:220
    #12 0x7f7bed87c359 in qh_sethalfspace stardist/lib/external/qhull_src/src/libqhull_r/geom2_r.c:2051
    #13 0x7f7bed87cb4a in qh_sethalfspace_all stardist/lib/external/qhull_src/src/libqhull_r/geom2_r.c:2104
    #14 0x7f7bed949c63 in orgQhull::Qhull::runQhull(char const*, int, int, double const*, char const*) stardist/lib/external/qhull_src/src/libqhullcpp/Qhull.cpp:326
    #15 0x7f7bed98e150 in qhull_volume_halfspace_intersection(double const*, double const*, int, float) stardist/lib/stardist3d_impl.cpp:692
    #16 0x7f7bed986f59 in qhull_overlap_kernel(float const*, float const*, float const*, float const*, int const*, int, int, int) stardist/lib/stardist3d_impl.cpp:862
    #17 0x7f7bed986f59 in _COMMON_non_maximum_suppression_sparse(float const*, float const*, float const*, int, int, int, float const*, int const*, float, int, int, int, bool*) [clone ._omp_fn.2] stardist/lib/stardist3d_impl.cpp:1261
    #18 0x7f7bfc7d4d8d  (/lib/x86_64-linux-gnu/libgomp.so.1+0x1cd8d)
    #19 0x7f7d04e77043  (/lib/x86_64-linux-gnu/libc.so.6+0x89043)
    #20 0x7f7d04ef75fb  (/lib/x86_64-linux-gnu/libc.so.6+0x1095fb)

@ximion
Contributor Author

ximion commented Sep 2, 2023

Very interesting: since the issue was back with my large dataset, I decided to comment out the offending message-print lines.
As soon as they were gone, GDB gave me the original crash backtrace again:

#0  0x00007ffedf4d76f9 in _Z27_COMMON_polyhedron_to_labelPKfS0_S0_PKiiiiS2_iiiiiiiPi._omp_fn.0(void) () at stardist/lib/stardist3d_impl.cpp:1514
        new_label_if_labeled = <optimized out>
        inside = <optimized out>
        offset = <optimized out>
        x = <optimized out>
        y = 1148
        z = 131
        faces = 0xbe682c0
        n_rays = <optimized out>
        n_faces = 188
        labels = 0x1802d800
        nz = <optimized out>
        ny = 4070
        nx = 4153
        render_mode = 0
        use_overlap_label = 0
        overlap_label = 0
        result = 0x7ff0e9bf2010
        polyverts = 0x619000849e80
        bbox = {127, 135, 1147, 1192, 3717, 3758}
        i = <optimized out>
        curr_center = 0x1c6f9b8c
        hs_convex = {<std::_Vector_base<std::array<double, 4>, std::allocator<std::array<double, 4> > >> = {
            _M_impl = {<std::allocator<std::array<double, 4> >> = {<std::__new_allocator<std::array<double, 4> >> = {<No data fields>}, <No data fields>}, <std::_Vector_base<std::array<double, 4>, std::allocator<std::array<double, 4> > >::_Vector_impl_data> = {_M_start = 0x62500191e100, 
                _M_finish = 0x62500191f800, _M_end_of_storage = 0x625001920100}, <No data fields>}}, <No data fields>}
        hs_kernel = {<std::_Vector_base<std::array<double, 4>, std::allocator<std::array<double, 4> > >> = {
            _M_impl = {<std::allocator<std::array<double, 4> >> = {<std::__new_allocator<std::array<double, 4> >> = {<No data fields>}, <No data fields>}, <std::_Vector_base<std::array<double, 4>, std::allocator<std::array<double, 4> > >::_Vector_impl_data> = {_M_start = 0x625000992100, 
                _M_finish = 0x625000993880, _M_end_of_storage = 0x625000994100}, <No data fields>}}, <No data fields>}
#1  0x00007ffeef5720b6 in GOMP_parallel () from /lib/x86_64-linux-gnu/libgomp.so.1
No symbol table info available.
#2  0x00007ffedf4e8e6b in _COMMON_polyhedron_to_label (dist=dist@entry=0x1baeb450, points=points@entry=0x1c6f9430, verts=verts@entry=0xa527a80, 
    faces=faces@entry=0xbe682c0, n_polys=n_polys@entry=839, n_rays=n_rays@entry=96, n_faces=<optimized out>, labels=<optimized out>, 
    nz=<optimized out>, ny=<optimized out>, nx=<optimized out>, render_mode=<optimized out>, verbose=<optimized out>, 
    use_overlap_label=<optimized out>, overlap_label=<optimized out>, result=<optimized out>) at stardist/lib/stardist3d_impl.cpp:1460
        curr_dist = 0x1baf9fd0
        curr_center = 0x1c6f9b8c
        hs_convex = <optimized out>
        hs_kernel = <optimized out>
        i = 157
        old_sigint_handler = 0x678c70
        polyverts = 0x619000849e80
        bbox = <optimized out>
#3  0x00007ffedf4be097 in c_polyhedron_to_label (self=<optimized out>, args=<optimized out>) at stardist/lib/stardist3d.cpp:130
        arr_dist = <optimized out>
        arr_points = <optimized out>
        arr_verts = <optimized out>
        arr_faces = <optimized out>
        arr_labels = <optimized out>
        arr_result = 0x7ffc640fcb10
        nz = <optimized out>
        ny = <optimized out>
        nx = <optimized out>
        render_mode = <optimized out>
        verbose = <optimized out>
        use_overlap_label = <optimized out>
        overlap_label = <optimized out>
        n_polys = 839
        n_rays = 96
        n_faces = 188
        dist = 0x1baeb450
        points = 0x1c6f9430
        verts = 0xa527a80
        faces = 0xbe682c0
        labels = 0x1802d800
        dims_result = <optimized out>
        result = <optimized out>
#4  0x0000000000547ae8 in ?? ()
No symbol table info available.
#5  0x0000000000517eb3 in _PyObject_MakeTpCall ()

The offending line is

int new_label_if_labeled = use_overlap_label?overlap_label:result[offset];

offset was unfortunately optimized out, but since result appears to be a valid pointer, the offset is probably wrong. Either that, or the array construction is wrong:

  npy_intp dims_result[3];
  dims_result[0] = nz;
  dims_result[1] = ny;
  dims_result[2] = nx;

  // result array 
  arr_result = (PyArrayObject*)PyArray_ZEROS(3,dims_result,NPY_INT32,0);

  int * result = (int*) PyArray_DATA(arr_result);

@maweigert / @uschmidt83 If you could have a look at this, that would be much appreciated. I will probably also have a go at it next week, but you know the code a lot better and can navigate it more easily, so the issue might be more obvious to you. I would love to provide an example to reproduce, but the crash only appears reliably with a very large dataset.

ximion added a commit to ximion/stardist that referenced this issue Sep 2, 2023
In a computation like x + y*nx + z*(nx*ny), if the sub-computation y*nx or z*(nx*ny) overflows a 32-bit integer, we get a bogus result even though the final destination is a 64-bit integer.
The solution is to promote at least one operand to a wider integer type. Since long can also be 32-bit on 32-bit architectures, we use explicitly sized types here.
This patch also fixes a few cases of comparisons between integers of different signedness.

Resolves: stardist#235
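
For illustration, here is a minimal standalone sketch of the overflow pattern this commit describes, using the dimensions from the backtrace above (hypothetical example code, not the actual StarDist patch):

#include <cstdint>
#include <cstdio>

int main() {
  // Dimensions taken from the first backtrace in this issue:
  // nx = ny = 4166, and a voxel at x = 2434, y = 38, z = 229.
  int nx = 4166, ny = 4166;
  int x = 2434, y = 38, z = 229;

  // Broken: all operands are 32-bit ints, so z*(nx*ny) (~3.97e9) overflows
  // (technically undefined behavior; in practice it wraps to a negative
  // value) before the result is widened by the assignment.
  std::int64_t bad = x + y * nx + z * (nx * ny);

  // Fixed: promote one operand of each sub-product to a 64-bit type so the
  // whole expression is evaluated in 64-bit arithmetic.
  std::int64_t good = x + static_cast<std::int64_t>(y) * nx
                        + static_cast<std::int64_t>(z) * nx * ny;

  std::printf("bad:  %lld\ngood: %lld\n",
              static_cast<long long>(bad), static_cast<long long>(good));
  return 0;
}

With these values the "bad" offset comes out negative, which would index far outside the result array and corrupt memory, matching the symptoms seen above.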
@ximion
Contributor Author

ximion commented Sep 2, 2023

As soon as I turned off the computer yesterday, I realized the answer to this issue had been staring me right in the face...
It's just a classic integer overflow, easily fixed (and one of the really nice traps that C sets for programmers...).
With the PR above, this issue is fully resolved. ASan still complains, but the program doesn't seem to crash anymore.

I still see a lot of compiler warnings about unused variables and integer comparisons, but those are (so far) unrelated :-)
