[BUG] Intermittent Segmentation Faults on CI #404

Closed
willGraham01 opened this issue Apr 18, 2024 · 14 comments
Labels
bug Something isn't working

Comments

@willGraham01
Collaborator

Moving the discussion from #403 here.

First reported: @K-Meech

That seems to fix codecov! No more codecov complaints on the latest actions run. The "run tests with numba disabled" action is hanging though, due to a segmentation fault. Not sure what is causing this - it seems unrelated to any changes in this PR.

@IgorTatarnikov

The "run tests with numba disabled" segfault seems to be sporadic. I ran into the same issue here, but re-running it made it pass. Not sure what's going on!

@willGraham01

My only guess off the top of my head is that the runs got runners with different specs (maybe an older runner vs a newer one). But looking at the attempts Igor linked to, it seems to be the same machine (specs-wise), so that rules that out.

Can we replicate this seg-fault locally? Or is it just a CI thing?

@IgorTatarnikov

... Seems I marked my previous comment as a duplicate and can't undo it. Reposting for posterity's sake.

I tried running the tests locally on Ubuntu 22.04 with NUMBA_DISABLE_JIT=1 set, but I couldn't replicate the segfault over 5 runs. I first ran all the tests, then focused on just test_detection.py to save time. Seems it might be a CI thing?
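
For reference, a rough local-repro sketch of the above (not something from the repo; the test path is the one that appears in the tracebacks later in this thread, and the number of repeats is arbitrary):

import os
import subprocess
import sys

# Disable numba's JIT, as on the failing CI job, and repeat the sporadic test a few times
env = dict(os.environ, NUMBA_DISABLE_JIT="1")
for attempt in range(5):
    result = subprocess.run(
        [sys.executable, "-m", "pytest", "tests/core/test_integration/test_detection.py"],
        env=env,
    )
    print(f"run {attempt + 1}: pytest exit code {result.returncode}")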

@willGraham01

A brief glance at the logs indicates that the seg-fault is thrown during garbage collection whilst in a multiprocessing thread.

Python's garbage collector is not deterministic, which is the main motivation behind the following guesses 😅

  • One thread is finishing / exiting earlier than the others and cleans up some (implicitly) shared memory before the other threads can finish (see the sketch at the end of this comment)? This is also supported by Kimberly's discovery that the most recent test run hung for 6 hrs and was then killed - we might be ending up in a deadlock caused by something like this.
  • We're not respecting the private/shared memory for each thread. Don't know if Python multiprocessing cares about these concepts in the same way that something like C++ does, though.
  • Some combination of the above combined with how GitHub runners handle threading.

But I could be well off the mark on each of those, though.
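
For illustration only, a minimal sketch of that first guess (hypothetical, not cellfinder's actual code): proxies handed out by a multiprocessing.Manager stop working the moment the manager's server process goes away, so tearing the manager down (or letting it be garbage-collected) before the pooled workers are finished produces exactly this kind of broken-connection behaviour.

import multiprocessing as mp

def use_lock(lock):
    # each worker talks to the manager's server process through the proxy
    with lock:
        return "ok"

if __name__ == "__main__":
    manager = mp.Manager()
    locks = [manager.Lock() for _ in range(4)]
    with mp.Pool(processes=4) as pool:
        async_result = pool.map_async(use_lock, locks)
        # Shutting the manager down here (or dropping the last reference to it)
        # before the workers finish would leave them with dead proxies, and they
        # would fail with EOFError / ConnectionResetError.
        results = async_result.get()
    manager.shutdown()
    print(results)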

@willGraham01 willGraham01 added the bug Something isn't working label Apr 18, 2024
@willGraham01
Collaborator Author

willGraham01 commented Apr 18, 2024

@adamltyson

Can this be reproduced locally by limiting the number of CPUs cellfinder can use? This may explain why it only happens on the relatively low-spec GitHub Actions runners.

Possibly? I believe Ubuntu runners on GitHub only have 4 cores, so there's an upper thread limit.

@willGraham01 willGraham01 mentioned this issue Apr 18, 2024
@IgorTatarnikov
Member

Success! I was able to replicate the failure on an Ubuntu 22.04 VM with 4 GB of RAM and 4 cores.

More clues in the full trace:

================================================================== FAILURES ===================================================================
_______________________________________________________________ test_callbacks ________________________________________________________________

signal_array = dask.array<getitem, shape=(20, 510, 667), dtype=uint16, chunksize=(1, 510, 667), chunktype=numpy.ndarray>
background_array = dask.array<getitem, shape=(20, 510, 667), dtype=uint16, chunksize=(1, 510, 667), chunktype=numpy.ndarray>, no_free_cpus = 0

    def test_callbacks(signal_array, background_array, no_free_cpus):
        # 20 is minimum number of planes needed to find > 0 cells
        signal_array = signal_array[0:20]
        background_array = background_array[0:20]
    
        planes_done = []
        batches_classified = []
        points_found = []
    
        def detect_callback(plane):
            planes_done.append(plane)
    
        def classify_callback(batch):
            batches_classified.append(batch)
    
        def detect_finished_callback(points):
            points_found.append(points)
    
>       main(
            signal_array,
            background_array,
            voxel_sizes,
            detect_callback=detect_callback,
            classify_callback=classify_callback,
            detect_finished_callback=detect_finished_callback,
            n_free_cpus=no_free_cpus,
        )

tests/core/test_integration/test_detection.py:125: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
cellfinder/core/main.py:70: in main
    points = detect.main(
cellfinder/core/detect/detect.py:222: in main
    async_results, locks = _map_with_locks(
cellfinder/core/detect/detect.py:279: in _map_with_locks
    locks = [m.Lock() for _ in range(len(iterable))]
cellfinder/core/detect/detect.py:279: in <listcomp>
    locks = [m.Lock() for _ in range(len(iterable))]
../../miniforge3/envs/cellfinder/lib/python3.10/multiprocessing/managers.py:728: in temp
    conn = self._Client(token.address, authkey=self._authkey)
../../miniforge3/envs/cellfinder/lib/python3.10/multiprocessing/connection.py:508: in Client
    answer_challenge(c, authkey)
../../miniforge3/envs/cellfinder/lib/python3.10/multiprocessing/connection.py:752: in answer_challenge
    message = connection.recv_bytes(256)         # reject large message
../../miniforge3/envs/cellfinder/lib/python3.10/multiprocessing/connection.py:216: in recv_bytes
    buf = self._recv_bytes(maxlength)
../../miniforge3/envs/cellfinder/lib/python3.10/multiprocessing/connection.py:414: in _recv_bytes
    buf = self._recv(4)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <multiprocessing.connection.Connection object at 0x72ffa069af80>, size = 4, read = <built-in function read>

    def _recv(self, size, read=_read):
        buf = io.BytesIO()
        handle = self._handle
        remaining = size
        while remaining > 0:
            chunk = read(handle, remaining)
            n = len(chunk)
            if n == 0:
                if remaining == size:
>                   raise EOFError
E                   EOFError

../../miniforge3/envs/cellfinder/lib/python3.10/multiprocessing/connection.py:383: EOFError

@IgorTatarnikov
Member

Seems the error itself is sporadic!

_______________________________________________________________ test_callbacks ________________________________________________________________

signal_array = dask.array<getitem, shape=(20, 510, 667), dtype=uint16, chunksize=(1, 510, 667), chunktype=numpy.ndarray>
background_array = dask.array<getitem, shape=(20, 510, 667), dtype=uint16, chunksize=(1, 510, 667), chunktype=numpy.ndarray>, no_free_cpus = 0

    def test_callbacks(signal_array, background_array, no_free_cpus):
        # 20 is minimum number of planes needed to find > 0 cells
        signal_array = signal_array[0:20]
        background_array = background_array[0:20]
    
        planes_done = []
        batches_classified = []
        points_found = []
    
        def detect_callback(plane):
            planes_done.append(plane)
    
        def classify_callback(batch):
            batches_classified.append(batch)
    
        def detect_finished_callback(points):
            points_found.append(points)
    
>       main(
            signal_array,
            background_array,
            voxel_sizes,
            detect_callback=detect_callback,
            classify_callback=classify_callback,
            detect_finished_callback=detect_finished_callback,
            n_free_cpus=no_free_cpus,
        )

tests/core/test_integration/test_detection.py:125: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
cellfinder/core/main.py:70: in main
    points = detect.main(
cellfinder/core/detect/detect.py:222: in main
    async_results, locks = _map_with_locks(
cellfinder/core/detect/detect.py:279: in _map_with_locks
    locks = [m.Lock() for _ in range(len(iterable))]
cellfinder/core/detect/detect.py:279: in <listcomp>
    locks = [m.Lock() for _ in range(len(iterable))]
../../miniforge3/envs/cellfinder/lib/python3.10/multiprocessing/managers.py:724: in temp
    proxy = proxytype(
../../miniforge3/envs/cellfinder/lib/python3.10/multiprocessing/managers.py:792: in __init__
    self._incref()
../../miniforge3/envs/cellfinder/lib/python3.10/multiprocessing/managers.py:846: in _incref
    conn = self._Client(self._token.address, authkey=self._authkey)
../../miniforge3/envs/cellfinder/lib/python3.10/multiprocessing/connection.py:508: in Client
    answer_challenge(c, authkey)
../../miniforge3/envs/cellfinder/lib/python3.10/multiprocessing/connection.py:752: in answer_challenge
    message = connection.recv_bytes(256)         # reject large message
../../miniforge3/envs/cellfinder/lib/python3.10/multiprocessing/connection.py:216: in recv_bytes
    buf = self._recv_bytes(maxlength)
../../miniforge3/envs/cellfinder/lib/python3.10/multiprocessing/connection.py:414: in _recv_bytes
    buf = self._recv(4)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <multiprocessing.connection.Connection object at 0x738964243280>, size = 4, read = <built-in function read>

    def _recv(self, size, read=_read):
        buf = io.BytesIO()
        handle = self._handle
        remaining = size
        while remaining > 0:
>           chunk = read(handle, remaining)
E           ConnectionResetError: [Errno 104] Connection reset by peer

../../miniforge3/envs/cellfinder/lib/python3.10/multiprocessing/connection.py:379: ConnectionResetError

@adamltyson
Member

@IgorTatarnikov cellfinder allows you to limit the number of CPU cores used. Just in case that's enough to reproduce it, so we don't need to spin up VMs.
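
For illustration, one hypothetical way to try that on a larger machine, using the n_free_cpus keyword visible in the tracebacks above; the arrays and voxel sizes here are placeholders rather than the real test data, and the "leave all but 4 cores free" calculation is just a guess at emulating a 4-core runner:

import os
import numpy as np
from cellfinder.core.main import main

# Leave all but ~4 cores "free" so cellfinder behaves as if it were on a 4-core runner
n_free = max((os.cpu_count() or 4) - 4, 0)

signal = np.random.randint(0, 2**16, size=(20, 510, 667), dtype=np.uint16)
background = np.random.randint(0, 2**16, size=(20, 510, 667), dtype=np.uint16)

main(signal, background, (5, 2, 2), n_free_cpus=n_free)  # placeholder voxel sizes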

@willGraham01
Collaborator Author

ConnectionResetError: [Errno 104] Connection reset by peer

and

raise EOFError

both raised in _recv (which, I presume, is multiprocessing's method of passing information between pooled workers).

Maybe we're opening a file and asynchronously writing to it when we shouldn't? Either way, it looks like something that's supposed to be shared across the threads isn't being treated properly.

Could try running valgrind across the failing test to get more info on what Python's trying to read?

@IgorTatarnikov
Member

Interestingly, I couldn't reproduce the failure no matter how I played with the n_free_cpus parameter; I could only reproduce it in the VM.

Setting the tests to run with 1 free CPU core seemed to make it disappear. Can we just run the test suite keeping at least one CPU core free? Change this to be one_free_cpu.
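
A hypothetical sketch of what that change might look like (the real fixture/parameter lives in the test suite and may well differ):

import pytest

@pytest.fixture
def one_free_cpu() -> int:
    # leave one core unused so the multiprocessing manager process has headroom
    return 1

Tests that currently take no_free_cpus (as in the tracebacks above) would then request one_free_cpu instead.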

@willGraham01
Collaborator Author

willGraham01 commented Apr 19, 2024

Can we just run the test suite keeping at least one CPU core free? Change this to be one_free_cpu.

I feel like this isn't the healthiest approach - I imagine it's not uncommon for our users to want to run cellfinder using 100% of their machine's resources, so we should be aware that there's a potential problem with that.

Though if it can only be replicated on VMs (which I presume includes GH runners), maybe the bug lies there. Maybe on a VM our method of reading the number of available cores is incorrect?

@adamltyson
Member

I imagine it's not uncommon for our users to want to run cellfinder using 100% of their machine's resources, so we should be aware that there's a potential problem with that.

I think the default is always to leave 2 CPU cores free though, as we've often observed issues otherwise.

@IgorTatarnikov
Member

IgorTatarnikov commented Apr 19, 2024

It seems to lie at the intersection of a low core-count machine and JIT compilation being disabled in numba. If I set the VM core count to 8, I can no longer reproduce the error.

I'm running the tests under valgrind to see if that gives us any new information.

@adamltyson
Member

Maybe on a VM our method of reading the number of available cores is incorrect?

This happens on HPC systems too (other than SLURM, which we explicitly support), when the number of cores that's actually available doesn't match the number of cores Python can "see".

@willGraham01
Collaborator Author

This happens on HPC systems too (other than SLURM, which we explicitly support), when the number of cores that's actually available doesn't match the number of cores Python can "see".

Could be the same issue here then. Maybe we shouldn't be using os.cpu_count() (or whatever function we're using to read the CPU count) to set the number of CPUs?
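
To illustrate the kind of mismatch being described, a quick check one could run on the VM or runner: os.cpu_count() reports the machine's logical cores, while os.sched_getaffinity(0) (Linux-only) reports the cores this process is actually allowed to run on, which is what schedulers and cpuset-style limits typically restrict.

import os

print("os.cpu_count():              ", os.cpu_count())
print("len(os.sched_getaffinity(0)):", len(os.sched_getaffinity(0)))  # Linux-only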

@adamltyson
Member

At least on SLURM, it didn't seem to matter what function was used; it always returned the number of physical CPU cores on the machine. The only way to find the number allocated by SLURM was to interface directly with the scheduler.
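
A minimal sketch (not cellfinder code) of asking the scheduler directly, assuming the allocation is exposed through SLURM's standard environment variables; allocated_cpus is a hypothetical helper:

import os

def allocated_cpus() -> int:
    # SLURM exports the allocation as environment variables; fall back to what
    # Python can "see" when not running under SLURM.
    for var in ("SLURM_CPUS_PER_TASK", "SLURM_CPUS_ON_NODE"):
        value = os.environ.get(var)
        if value is not None:
            return int(value)
    return os.cpu_count() or 1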

@IgorTatarnikov
Member

I can no longer reproduce this when running on main. I ran the tests 33 times without a single failure. I'm happy to close this issue for now and reopen it if it crops up again!

@adamltyson
Member

👍
