[BUG] Intermittent Segmentation Faults on CI #404

Closed
willGraham01 opened this issue Apr 18, 2024 · 14 comments
Labels
bug Something isn't working

Comments

@willGraham01
Collaborator

Moving the discussion from #403 here.

First reported: @K-Meech

That seems to fix codecov! No more codecov complaints on the latest actions run. The "run tests with numba disabled" action is hanging though, due to a segmentation fault. Not sure what is causing this - it seems unrelated to any changes in this PR.

@IgorTatarnikov

The "run tests with numba disabled" segfault seems to be sporadic. I ran into the same issue here, but re-running it made it pass. Not sure what's going on!

@willGraham01

My only guess off the top of my head is that the runs got runners with different specs (maybe an older runner vs a newer one). But looking at the attempts Igor linked to, it seems to be the same machine (specs-wise), so that rules that out.

Can we replicate this seg-fault locally? Or is it just a CI thing?

@IgorTatarnikov

... Seems I marked my previous comment as a duplicate and can't undo it. Reposting for posterity's sake.

I tried running the tests locally on Ubuntu 22.04 with NUMBA_DISABLE_JIT=1 set, but I couldn't replicate the segfault over 5 runs. I first ran all the tests, then focused on just test_detection.py to save time. Seems it might be a CI thing?
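
For reference, a rough local-repro sketch of the above (not something from the repo; the test path is the one that appears in the tracebacks later in this thread, and the number of repeats is arbitrary):

import os
import subprocess
import sys

# Disable numba's JIT, as on the failing CI job, and repeat the sporadic test a few times
env = dict(os.environ, NUMBA_DISABLE_JIT="1")
for attempt in range(5):
    result = subprocess.run(
        [sys.executable, "-m", "pytest", "tests/core/test_integration/test_detection.py"],
        env=env,
    )
    print(f"run {attempt + 1}: pytest exit code {result.returncode}")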

@willGraham01

A brief glance at the logs indicates that the seg-fault is thrown during garbage collection whilst in a multiprocessing thread.

Python's garbage collector is not deterministic, which is the main motivation behind the following guesses 😅

  • One thread is finishing / exiting earlier than the others and cleans up some (implicitly) shared memory before the other threads can finish (see the sketch at the end of this comment)? This is also supported by Kimberly's discovery that the most recent test run hung for 6 hrs and was then killed - we might be ending up in a deadlock caused by something like this.
  • We're not respecting the private/shared memory for each thread. Don't know if Python multiprocessing cares about these concepts in the same way that something like C++ does, though.
  • Some combination of the above combined with how GitHub runners handle threading.

But I could be well off the mark on each of those, though.
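
For illustration only, a minimal sketch of that first guess (hypothetical, not cellfinder's actual code): proxies handed out by a multiprocessing.Manager stop working the moment the manager's server process goes away, so tearing the manager down (or letting it be garbage-collected) before the pooled workers are finished produces exactly this kind of broken-connection behaviour.

import multiprocessing as mp

def use_lock(lock):
    # each worker talks to the manager's server process through the proxy
    with lock:
        return "ok"

if __name__ == "__main__":
    manager = mp.Manager()
    locks = [manager.Lock() for _ in range(4)]
    with mp.Pool(processes=4) as pool:
        async_result = pool.map_async(use_lock, locks)
        # Shutting the manager down here (or dropping the last reference to it)
        # before the workers finish would leave them with dead proxies, and they
        # would fail with EOFError / ConnectionResetError.
        results = async_result.get()
    manager.shutdown()
    print(results)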

@willGraham01 willGraham01 added the bug Something isn't working label Apr 18, 2024
@willGraham01
Collaborator Author

willGraham01 commented Apr 18, 2024

@adamltyson

Can this be reproduced locally by limiting the number of CPUs cellfinder can use? This may explain why it only happens on the relatively low-spec GitHub Actions runners.

Possibly? I believe Ubuntu runners on GitHub only have 4 cores, so there's an upper thread limit.

@willGraham01 willGraham01 mentioned this issue Apr 18, 2024
@IgorTatarnikov
Member

Success! I was able to replicate the failure on an Ubuntu 22.04 VM with 4 GB of RAM and 4 cores.

More clues in the full trace:

================================================================== FAILURES ===================================================================
_______________________________________________________________ test_callbacks ________________________________________________________________

signal_array = dask.array<getitem, shape=(20, 510, 667), dtype=uint16, chunksize=(1, 510, 667), chunktype=numpy.ndarray>
background_array = dask.array<getitem, shape=(20, 510, 667), dtype=uint16, chunksize=(1, 510, 667), chunktype=numpy.ndarray>, no_free_cpus = 0

    def test_callbacks(signal_array, background_array, no_free_cpus):
        # 20 is minimum number of planes needed to find > 0 cells
        signal_array = signal_array[0:20]
        background_array = background_array[0:20]
    
        planes_done = []
        batches_classified = []
        points_found = []
    
        def detect_callback(plane):
            planes_done.append(plane)
    
        def classify_callback(batch):
            batches_classified.append(batch)
    
        def detect_finished_callback(points):
            points_found.append(points)
    
>       main(
            signal_array,
            background_array,
            voxel_sizes,
            detect_callback=detect_callback,
            classify_callback=classify_callback,
            detect_finished_callback=detect_finished_callback,
            n_free_cpus=no_free_cpus,
        )

tests/core/test_integration/test_detection.py:125: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
cellfinder/core/main.py:70: in main
    points = detect.main(
cellfinder/core/detect/detect.py:222: in main
    async_results, locks = _map_with_locks(
cellfinder/core/detect/detect.py:279: in _map_with_locks
    locks = [m.Lock() for _ in range(len(iterable))]
cellfinder/core/detect/detect.py:279: in <listcomp>
    locks = [m.Lock() for _ in range(len(iterable))]
../../miniforge3/envs/cellfinder/lib/python3.10/multiprocessing/managers.py:728: in temp
    conn = self._Client(token.address, authkey=self._authkey)
../../miniforge3/envs/cellfinder/lib/python3.10/multiprocessing/connection.py:508: in Client
    answer_challenge(c, authkey)
../../miniforge3/envs/cellfinder/lib/python3.10/multiprocessing/connection.py:752: in answer_challenge
    message = connection.recv_bytes(256)         # reject large message
../../miniforge3/envs/cellfinder/lib/python3.10/multiprocessing/connection.py:216: in recv_bytes
    buf = self._recv_bytes(maxlength)
../../miniforge3/envs/cellfinder/lib/python3.10/multiprocessing/connection.py:414: in _recv_bytes
    buf = self._recv(4)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <multiprocessing.connection.Connection object at 0x72ffa069af80>, size = 4, read = <built-in function read>

    def _recv(self, size, read=_read):
        buf = io.BytesIO()
        handle = self._handle
        remaining = size
        while remaining > 0:
            chunk = read(handle, remaining)
            n = len(chunk)
            if n == 0:
                if remaining == size:
>                   raise EOFError
E                   EOFError

../../miniforge3/envs/cellfinder/lib/python3.10/multiprocessing/connection.py:383: EOFError

@IgorTatarnikov
Member

Seems the error itself is sporadic!

_______________________________________________________________ test_callbacks ________________________________________________________________

signal_array = dask.array<getitem, shape=(20, 510, 667), dtype=uint16, chunksize=(1, 510, 667), chunktype=numpy.ndarray>
background_array = dask.array<getitem, shape=(20, 510, 667), dtype=uint16, chunksize=(1, 510, 667), chunktype=numpy.ndarray>, no_free_cpus = 0

    def test_callbacks(signal_array, background_array, no_free_cpus):
        # 20 is minimum number of planes needed to find > 0 cells
        signal_array = signal_array[0:20]
        background_array = background_array[0:20]
    
        planes_done = []
        batches_classified = []
        points_found = []
    
        def detect_callback(plane):
            planes_done.append(plane)
    
        def classify_callback(batch):
            batches_classified.append(batch)
    
        def detect_finished_callback(points):
            points_found.append(points)
    
>       main(
            signal_array,
            background_array,
            voxel_sizes,
            detect_callback=detect_callback,
            classify_callback=classify_callback,
            detect_finished_callback=detect_finished_callback,
            n_free_cpus=no_free_cpus,
        )

tests/core/test_integration/test_detection.py:125: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
cellfinder/core/main.py:70: in main
    points = detect.main(
cellfinder/core/detect/detect.py:222: in main
    async_results, locks = _map_with_locks(
cellfinder/core/detect/detect.py:279: in _map_with_locks
    locks = [m.Lock() for _ in range(len(iterable))]
cellfinder/core/detect/detect.py:279: in <listcomp>
    locks = [m.Lock() for _ in range(len(iterable))]
../../miniforge3/envs/cellfinder/lib/python3.10/multiprocessing/managers.py:724: in temp
    proxy = proxytype(
../../miniforge3/envs/cellfinder/lib/python3.10/multiprocessing/managers.py:792: in __init__
    self._incref()
../../miniforge3/envs/cellfinder/lib/python3.10/multiprocessing/managers.py:846: in _incref
    conn = self._Client(self._token.address, authkey=self._authkey)
../../miniforge3/envs/cellfinder/lib/python3.10/multiprocessing/connection.py:508: in Client
    answer_challenge(c, authkey)
../../miniforge3/envs/cellfinder/lib/python3.10/multiprocessing/connection.py:752: in answer_challenge
    message = connection.recv_bytes(256)         # reject large message
../../miniforge3/envs/cellfinder/lib/python3.10/multiprocessing/connection.py:216: in recv_bytes
    buf = self._recv_bytes(maxlength)
../../miniforge3/envs/cellfinder/lib/python3.10/multiprocessing/connection.py:414: in _recv_bytes
    buf = self._recv(4)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

self = <multiprocessing.connection.Connection object at 0x738964243280>, size = 4, read = <built-in function read>

    def _recv(self, size, read=_read):
        buf = io.BytesIO()
        handle = self._handle
        remaining = size
        while remaining > 0:
>           chunk = read(handle, remaining)
E           ConnectionResetError: [Errno 104] Connection reset by peer

../../miniforge3/envs/cellfinder/lib/python3.10/multiprocessing/connection.py:379: ConnectionResetError

@adamltyson
Member

@IgorTatarnikov cellfinder allows you to limit the number of CPU cores used. Just in case that's enough to reproduce it, so we don't need to spin up VMs.
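
For illustration, one hypothetical way to try that on a larger machine, using the n_free_cpus keyword visible in the tracebacks above; the arrays and voxel sizes here are placeholders rather than the real test data, and the "leave all but 4 cores free" calculation is just a guess at emulating a 4-core runner:

import os
import numpy as np
from cellfinder.core.main import main

# Leave all but ~4 cores "free" so cellfinder behaves as if it were on a 4-core runner
n_free = max((os.cpu_count() or 4) - 4, 0)

signal = np.random.randint(0, 2**16, size=(20, 510, 667), dtype=np.uint16)
background = np.random.randint(0, 2**16, size=(20, 510, 667), dtype=np.uint16)

main(signal, background, (5, 2, 2), n_free_cpus=n_free)  # placeholder voxel sizes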

@willGraham01
Collaborator Author

ConnectionResetError: [Errno 104] Connection reset by peer

and

raise EOFError

both raised in _recv (which, I presume, is multiprocessing's method of passing information between pooled workers).

Maybe we're opening a file and asynchronously writing to it when we shouldn't? Either way, it looks like something that's supposed to be shared across the threads isn't being treated properly.

Could try running valgrind across the failing test to get more info on what Python's trying to read?

@IgorTatarnikov
Member

Interestingly, I couldn't reproduce the failure no matter how I played with the n_free_cpus parameter; I could only reproduce it in the VM.

Setting the tests to run with 1 free CPU core seemed to make it disappear. Can we just run the test suite keeping at least one CPU core free? Change this to be one_free_cpu.
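
A hypothetical sketch of what that change might look like (the real fixture/parameter lives in the test suite and may well differ):

import pytest

@pytest.fixture
def one_free_cpu() -> int:
    # leave one core unused so the multiprocessing manager process has headroom
    return 1

Tests that currently take no_free_cpus (as in the tracebacks above) would then request one_free_cpu instead.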

@willGraham01
Collaborator Author

willGraham01 commented Apr 19, 2024

Can we just run the test suite keeping at least one CPU core free? Change this to be one_free_cpu.

I feel like this isn't the healthiest approach - I imagine it's not uncommon for our users to want to run cellfinder using 100% of their machine's resources, so we should be aware that there's a potential problem with that.

Though if it can only be replicated on VMs (which I presume includes GH runners), maybe the bug lies there. Maybe on a VM our method of reading the number of available cores is incorrect?

@adamltyson
Member

I imagine it's not uncommon for our users to want to run cellfinder using 100% of their machine's resources, so we should be aware that there's a potential problem with that.

I think the default is always to leave 2 CPU cores free though, as we've often observed issues otherwise.

@IgorTatarnikov
Member

IgorTatarnikov commented Apr 19, 2024

It seems to lie at the intersection of a low core-count machine and JIT compilation being disabled in numba. If I set the VM core count to 8, I can no longer reproduce the error.

I'm running the tests under valgrind to see if that gives us any new information.

@adamltyson
Member

Maybe on a VM our method of reading the number of available cores is incorrect?

This happens on HPC systems too (other than SLURM, which we explicitly support), when the number of cores that's actually available doesn't match the number of cores Python can "see".

@willGraham01
Collaborator Author

This happens on HPC systems too (other than SLURM, which we explicitly support), when the number of cores that's actually available doesn't match the number of cores Python can "see".

Could be the same issue here then. Maybe we shouldn't be using os.cpu_count() (or whatever function we're using to read the CPU count) to set the number of CPUs?
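
To illustrate the kind of mismatch being described, a quick check one could run on the VM or runner: os.cpu_count() reports the machine's logical cores, while os.sched_getaffinity(0) (Linux-only) reports the cores this process is actually allowed to run on, which is what schedulers and cpuset-style limits typically restrict.

import os

print("os.cpu_count():              ", os.cpu_count())
print("len(os.sched_getaffinity(0)):", len(os.sched_getaffinity(0)))  # Linux-only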

@adamltyson
Member

At least on SLURM, it didn't seem to matter what function was used; it always returned the number of physical CPU cores on the machine. The only way to find the number allocated by SLURM was to interface directly with the scheduler.
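
A minimal sketch (not cellfinder code) of asking the scheduler directly, assuming the allocation is exposed through SLURM's standard environment variables; allocated_cpus is a hypothetical helper:

import os

def allocated_cpus() -> int:
    # SLURM exports the allocation as environment variables; fall back to what
    # Python can "see" when not running under SLURM.
    for var in ("SLURM_CPUS_PER_TASK", "SLURM_CPUS_ON_NODE"):
        value = os.environ.get(var)
        if value is not None:
            return int(value)
    return os.cpu_count() or 1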

@IgorTatarnikov
Member

I can no longer reproduce this when running on main. I ran the tests 33 times without a single failure. I'm happy to close this issue for now and reopen it if it crops up again!

@adamltyson
Member

👍
