[BUG] Intermittent Segmentation Faults on CI #404
Possibly? GitHub's Ubuntu runners only have 4 cores, I believe, so there's an upper thread limit.
Success! I was able to replicate the failure on an Ubuntu 22.04 VM with 4 GB of RAM and 4 cores. More clues in the full trace:
Seems the error itself is sporadic!
@IgorTatarnikov cellfinder allows you to limit the number of CPU cores used. Just in case that's enough to reproduce, and we don't need to spin up VMs.
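As a minimal sketch of the core-limiting arithmetic behind an `n_free_cpus`-style parameter (the helper name and default below are illustrative, not cellfinder's actual implementation):

```python
# Sketch of "leave N cores free" logic; function name and default value are
# hypothetical, not cellfinder's real API.
import multiprocessing


def n_processes_to_use(n_free_cpus: int = 2) -> int:
    """Return the number of worker processes, keeping some cores free."""
    total = multiprocessing.cpu_count()
    # Always leave n_free_cpus cores idle, but never drop below one worker.
    return max(1, total - n_free_cpus)


print(n_processes_to_use(n_free_cpus=2))
```

Clamping to a minimum of one worker is what keeps this usable on low-core machines like the 4-core CI runners discussed above.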
… and … were both raised in … Maybe we're opening a file and asynchronously writing to it when we shouldn't? Either way, it looks like something that's supposed to be shared across the threads isn't being handled properly. Could try running …
Interestingly, I couldn't reproduce the failure no matter how I played with the n_free_cpus parameter. I could only reproduce it in the VM, and setting the tests to run with 1 free CPU core seemed to make it disappear. Can we just run the test suite keeping at least one CPU core free? Change this to be …
I feel like this isn't the healthiest approach. I imagine it's not uncommon for our users to want to run … Though if it can only be replicated on VMs (which I presume includes GH runners), maybe the bug lies there. Maybe on a VM our method of reading the number of available cores is incorrect?
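One way core-count detection can go wrong on a VM or cgroup-limited runner: `os.cpu_count()` reports the hardware count, while the scheduler affinity mask reflects what the process is actually allowed to use. A small sketch of the difference (the function name is illustrative):

```python
# Two ways Python can count cores, which may disagree inside a VM,
# container, or cgroup-limited CI runner.
import os


def visible_cores() -> int:
    """Prefer the scheduler affinity mask, which respects cgroup/taskset
    limits; fall back to the raw hardware count where it's unavailable
    (e.g. macOS/Windows)."""
    if hasattr(os, "sched_getaffinity"):
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


print(os.cpu_count(), visible_cores())
```

If the two numbers disagree on the runner, code sized from `os.cpu_count()` would oversubscribe the cores actually available.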
I think the default is to always leave 2 CPU cores free though, as we've often observed issues otherwise.
It seems to lie at the intersection of a low core count machine with JIT compilation disabled in … I'm running the tests with …
This happens on HPC systems too (other than SLURM, which we explicitly support), when the number of cores that's actually available doesn't match the number of cores Python can "see".
Could be the same issue here then. Maybe we shouldn't be using …
At least on SLURM, it didn't seem to matter what function was used; it always returned the number of physical CPU cores on the machine. The only way to find the number allocated by SLURM was to directly interface with the scheduler.
I can no longer reproduce when running on …
👍
Moving the discussion from #403 here.
First reported: @K-Meech