Crashes when running predict_instances on a 3D image #235
Comments
Hi @ximion and sorry for the late reply.
The segfault happens in one of StarDist's C++ extensions (used for post-processing) and might be related to shared-library issues and/or OpenMP parallelization. You can control the number of threads with an environment variable:

```python
import os
os.environ['OMP_NUM_THREADS'] = '1'
```

Even if this runs without crashing, it doesn't really solve the problem, because the code will run much slower.
@uschmidt83 Hi, sorry for the long delay, I had to wait a bit for some hardware repairs, but everything is fully up and running again (except for StarDist ^^). I tried your suggestion, but it still crashes the same way as before:

```
Thread 1 "python" received signal SIGSEGV, Segmentation fault.
0x00007fffdb7a605e in _Z27_COMMON_polyhedron_to_labelPKfS0_S0_PKiiiiS2_iiiiiiiPi._omp_fn.0(void) () at stardist/lib/stardist3d_impl.cpp:1516
1516    stardist/lib/stardist3d_impl.cpp: No such file or directory.
```

The only observed difference was that it ran slightly slower.
There is also no difference whether I execute it on the GPU or the CPU (I would not have assumed that to matter, but tried it just to be sure).
This is an attempt to resolve stardist#235. It does not fully fix the underlying cause, but it seems to make the crash extremely unlikely. The actual issue appears to be a kind of stack corruption caused elsewhere in the program.
So, I've tracked the crash down further. It would likely take an extended Valgrind session to find the root cause, which is extra painful because TensorFlow is in the mix (and will yield tons of issues which may or may not be false positives), Python as well, and on top of that even a small example allocates huge chunks of memory. Somehow though, upgrading Qhull made the issue appear only extremely rarely, and adding guards around the affected code helped further. One thing to note is that the code always crashes in the same place. ASan trace for reference:

```
AddressSanitizer: CHECK failed: asan_errors.cpp:109 "((free_stack->size)) > ((0))" (0x0, 0x0) (tid=249280)
    #0 0x7f7befec22aa in CheckUnwind ../../../../src/libsanitizer/asan/asan_rtl.cpp:67
    #1 0x7f7befee2c55 in __sanitizer::CheckFailed(char const*, int, char const*, unsigned long long, unsigned long long) ../../../../src/libsanitizer/sanitizer_common/sanitizer_termination.cpp:86
    #2 0x7f7befe3078c in __asan::ErrorFreeNotMalloced::Print() ../../../../src/libsanitizer/asan/asan_errors.cpp:109
    #3 0x7f7befec1e27 in __asan::ScopedInErrorReport::~ScopedInErrorReport() ../../../../src/libsanitizer/asan/asan_report.cpp:141
    #4 0x7f7befebef1e in __asan::ReportFreeNotMalloced(unsigned long, __sanitizer::BufferedStackTrace*) ../../../../src/libsanitizer/asan/asan_report.cpp:239
    #5 0x7f7befeb9ed7 in operator delete(void*) ../../../../src/libsanitizer/asan/asan_new_delete.cpp:152
    #6 0x7f7bed967399 in std::__new_allocator<char>::deallocate(char*, unsigned long) /usr/include/c++/12/bits/new_allocator.h:158
    #7 0x7f7bed967399 in std::allocator_traits<std::allocator<char> >::deallocate(std::allocator<char>&, char*, unsigned long) /usr/include/c++/12/bits/alloc_traits.h:496
    #8 0x7f7bed967399 in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_destroy(unsigned long) /usr/include/c++/12/bits/basic_string.h:292
    #9 0x7f7bed967399 in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::_M_dispose() /usr/include/c++/12/bits/basic_string.h:286
    #10 0x7f7bed967399 in std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >::~basic_string() /usr/include/c++/12/bits/basic_string.h:795
    #11 0x7f7bed967399 in qh_fprintf stardist/lib/external/qhull_src/src/libqhullcpp/QhullUser.cpp:220
    #12 0x7f7bed87c359 in qh_sethalfspace stardist/lib/external/qhull_src/src/libqhull_r/geom2_r.c:2051
    #13 0x7f7bed87cb4a in qh_sethalfspace_all stardist/lib/external/qhull_src/src/libqhull_r/geom2_r.c:2104
    #14 0x7f7bed949c63 in orgQhull::Qhull::runQhull(char const*, int, int, double const*, char const*) stardist/lib/external/qhull_src/src/libqhullcpp/Qhull.cpp:326
    #15 0x7f7bed98e150 in qhull_volume_halfspace_intersection(double const*, double const*, int, float) stardist/lib/stardist3d_impl.cpp:692
    #16 0x7f7bed986f59 in qhull_overlap_kernel(float const*, float const*, float const*, float const*, int const*, int, int, int) stardist/lib/stardist3d_impl.cpp:862
    #17 0x7f7bed986f59 in _COMMON_non_maximum_suppression_sparse(float const*, float const*, float const*, int, int, int, float const*, int const*, float, int, int, int, bool*) [clone ._omp_fn.2] stardist/lib/stardist3d_impl.cpp:1261
    #18 0x7f7bfc7d4d8d (/lib/x86_64-linux-gnu/libgomp.so.1+0x1cd8d)
    #19 0x7f7d04e77043 (/lib/x86_64-linux-gnu/libc.so.6+0x89043)
    #20 0x7f7d04ef75fb (/lib/x86_64-linux-gnu/libc.so.6+0x1095fb)
```
Very interesting. Since the issue came back with my large dataset, I decided to comment out the offending message-print lines, and got the following backtrace:

```
#0  0x00007ffedf4d76f9 in _Z27_COMMON_polyhedron_to_labelPKfS0_S0_PKiiiiS2_iiiiiiiPi._omp_fn.0(void) () at stardist/lib/stardist3d_impl.cpp:1514
        new_label_if_labeled = <optimized out>
        inside = <optimized out>
        offset = <optimized out>
        x = <optimized out>
        y = 1148
        z = 131
        faces = 0xbe682c0
        n_rays = <optimized out>
        n_faces = 188
        labels = 0x1802d800
        nz = <optimized out>
        ny = 4070
        nx = 4153
        render_mode = 0
        use_overlap_label = 0
        overlap_label = 0
        result = 0x7ff0e9bf2010
        polyverts = 0x619000849e80
        bbox = {127, 135, 1147, 1192, 3717, 3758}
        i = <optimized out>
        curr_center = 0x1c6f9b8c
        hs_convex = {<std::_Vector_base<std::array<double, 4>, std::allocator<std::array<double, 4> > >> = {
            _M_impl = {<std::allocator<std::array<double, 4> >> = {<std::__new_allocator<std::array<double, 4> >> = {<No data fields>}, <No data fields>}, <std::_Vector_base<std::array<double, 4>, std::allocator<std::array<double, 4> > >::_Vector_impl_data> = {_M_start = 0x62500191e100,
              _M_finish = 0x62500191f800, _M_end_of_storage = 0x625001920100}, <No data fields>}}, <No data fields>}
        hs_kernel = {<std::_Vector_base<std::array<double, 4>, std::allocator<std::array<double, 4> > >> = {
            _M_impl = {<std::allocator<std::array<double, 4> >> = {<std::__new_allocator<std::array<double, 4> >> = {<No data fields>}, <No data fields>}, <std::_Vector_base<std::array<double, 4>, std::allocator<std::array<double, 4> > >::_Vector_impl_data> = {_M_start = 0x625000992100,
              _M_finish = 0x625000993880, _M_end_of_storage = 0x625000994100}, <No data fields>}}, <No data fields>}
#1  0x00007ffeef5720b6 in GOMP_parallel () from /lib/x86_64-linux-gnu/libgomp.so.1
No symbol table info available.
#2  0x00007ffedf4e8e6b in _COMMON_polyhedron_to_label (dist=dist@entry=0x1baeb450, points=points@entry=0x1c6f9430, verts=verts@entry=0xa527a80,
    faces=faces@entry=0xbe682c0, n_polys=n_polys@entry=839, n_rays=n_rays@entry=96, n_faces=<optimized out>, labels=<optimized out>,
    nz=<optimized out>, ny=<optimized out>, nx=<optimized out>, render_mode=<optimized out>, verbose=<optimized out>,
    use_overlap_label=<optimized out>, overlap_label=<optimized out>, result=<optimized out>) at stardist/lib/stardist3d_impl.cpp:1460
        curr_dist = 0x1baf9fd0
        curr_center = 0x1c6f9b8c
        hs_convex = <optimized out>
        hs_kernel = <optimized out>
        i = 157
        old_sigint_handler = 0x678c70
        polyverts = 0x619000849e80
        bbox = <optimized out>
#3  0x00007ffedf4be097 in c_polyhedron_to_label (self=<optimized out>, args=<optimized out>) at stardist/lib/stardist3d.cpp:130
        arr_dist = <optimized out>
        arr_points = <optimized out>
        arr_verts = <optimized out>
        arr_faces = <optimized out>
        arr_labels = <optimized out>
        arr_result = 0x7ffc640fcb10
        nz = <optimized out>
        ny = <optimized out>
        nx = <optimized out>
        render_mode = <optimized out>
        verbose = <optimized out>
        use_overlap_label = <optimized out>
        overlap_label = <optimized out>
        n_polys = 839
        n_rays = 96
        n_faces = 188
        dist = 0x1baeb450
        points = 0x1c6f9430
        verts = 0xa527a80
        faces = 0xbe682c0
        labels = 0x1802d800
        dims_result = <optimized out>
        result = <optimized out>
#4  0x0000000000547ae8 in ?? ()
No symbol table info available.
#5  0x0000000000517eb3 in _PyObject_MakeTpCall ()
```

The offending line is:

```cpp
int new_label_if_labeled = use_overlap_label?overlap_label:result[offset];
```

where `result` is allocated as:

```cpp
npy_intp dims_result[3];
dims_result[0] = nz;
dims_result[1] = ny;
dims_result[2] = nx;
// result array
arr_result = (PyArrayObject*)PyArray_ZEROS(3,dims_result,NPY_INT32,0);
int * result = (int*) PyArray_DATA(arr_result);
```

@maweigert / @uschmidt83 If you could have a look at this, that would be very appreciated. I will also probably have a go at it next week, but you know the code a lot better and can navigate it more easily - so the issue might be more obvious to you. I would love to give an example to reproduce, but the crash only appears reliably with a very large dataset.
In a computation like `x + y*nx + z*(nx*ny)`, if the sub-computation `y*nx` or `z*(nx*ny)` overflows a 32-bit integer, we get a bogus result even though the final destination is a 64-bit integer. The solution is to promote at least one operand to a wider integer type. Since `long` can also be 32-bit (e.g. on 32-bit architectures), we use explicitly sized types here. This patch also fixes a few cases of comparisons between integers of different signedness.

Resolves: stardist#235
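To illustrate the overflow described above (the dimensions mirror the values from the backtrace earlier in this thread, but the snippet is a hedged sketch, not the StarDist code):

```cpp
#include <cstdint>

// Computing the flat index with all-`int` operands can overflow:
//     int64_t bad = x + y*nx + z*(nx*ny);
// With nx=4153, ny=4070, z=131 (values from the backtrace), the
// sub-product z*(nx*ny) = 131 * 16,902,710 = 2,214,255,010 exceeds
// INT32_MAX (2,147,483,647), so the 32-bit multiply wraps (formally
// undefined behavior) before the assignment widens it to 64 bits.
//
// Fixed version: promoting one operand forces the whole expression
// to be evaluated in 64-bit arithmetic.
int64_t flat_index(int x, int y, int z, int nx, int ny) {
    return x + (int64_t)y * nx + (int64_t)z * nx * ny;
}
```

With the backtrace values (`x=100, y=1148, z=131, nx=4153, ny=4070`) the correct index is 2,219,022,754, which does not fit in 32 bits at all.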
As soon as I turned off the computer yesterday, I realized the answer to this issue had been staring me right in the face... I still see a lot of compiler warnings about unused variables and integer comparisons, but those are (so far) unrelated :-)
Hi!
Recently, with some newly trained models but data on which StarDist has worked before, I am observing crashes when running `model.predict_instances` (where `model` is a `StarDist3D`). The inference will run to 100% completion, but then, when it is almost done, the software crashes with a segmentation fault.
To reproduce
The code used is basically from the example:
This also worked fine before, the only thing that has changed was that more labeled data was added for model training. The model itself was trained without issues.
Expected behavior
I would expect no crash, or at least an actionable message on what to do next in case there was something wrong with the input data, instead of a segmentation fault.
Data and screenshots
Here is a debug backtrace of the crash - fortunately StarDist had debug symbols:
Environment:
If needed, I can probably share some of the training or sample data, but it is fairly large.
Thank you for making StarDist! :-)