Failure to start with "Bad file descriptor" for AMD EPYC 7452 and 7502 CPUs running EL9 #2166
Comments
I've also now reproduced this with
This might be the same issue described in sylabs/singularity#7
I built 1.3.0 both with and without sylabs/singularity#8 and it seems to fix the issue.
I suspect this might be the same thing as what was seen in #1947.
We found another fix for the same problem as sylabs/singularity#8 that was less disruptive in apptainer/singularity#5946. I don't know if @cclerget would be in favor of disabling the starter gc.
Applying the fix from sylabs/singularity#8 dramatically reduced the rate at which we see this failure with apptainer 1.3.0, so I think something more than apptainer/singularity#5946 is needed: The remaining failures appear to be CVMFS propagation delays and errors which I misidentify when trying to isolate this in the monitoring.
@chrisburr thanks for the detailed report. I don't think that disabling the golang garbage collector is a good fix, as it could potentially lead to other issues. I will try to identify the root cause for the 1.3.2 release based on your report.
Version of Apptainer
Expected behavior
Containers start reliably.
Actual behavior
When using `AMD EPYC 7452 32-Core Processor` and `AMD EPYC 7502 32-Core Processor` CPUs I see this about half of the time:

This only happens on machines running EL9-like kernels; machines running EL7-like kernels aren't affected. I'm not sure exactly which distribution the hosts run, as my tests run inside containers, so I only have access to the host's kernel info via `uname -a`.

If I use `strace` on an affected machine and compare the successful and failing output I see:

i.e. in the bad case it closes `fd(3)` immediately before calling `fstat` on that file handle.

The source of the `UNIX-STREAM:[` appears to be:

apptainer/cmd/starter/c/starter.c
Lines 1380 to 1391 in bcce314
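The close-then-`fstat` sequence described above can be reproduced in isolation. The following is only a minimal sketch in Python, not Apptainer's code: the helper name `fstat_after_close` is made up for illustration, and a pipe stands in for the real descriptor. It shows that calling `fstat` on a descriptor number that has already been closed fails with `EBADF`, the errno whose message is "Bad file descriptor".

```python
import errno
import os


def fstat_after_close():
    """Close a descriptor, then fstat it; return the errno that results.

    Hypothetical helper for illustration only. A pipe is used as a
    stand-in for whatever descriptor the starter was operating on.
    """
    read_fd, write_fd = os.pipe()
    os.close(write_fd)
    os.close(read_fd)        # the descriptor is closed here ...
    try:
        os.fstat(read_fd)    # ... so this call targets a stale fd number
    except OSError as exc:
        return exc.errno
    return None


if __name__ == "__main__":
    code = fstat_after_close()
    # Print the symbolic errno name and the platform's message for it.
    print(errno.errorcode.get(code), os.strerror(code))
```

This only demonstrates the symptom. It says nothing about why the descriptor gets closed early in the failing runs.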
Looking back earlier in the log I see in the working case:
and in the failing case:
i.e. in the failing case there is a missing call to `fcntl(3</cvmfs>, F_SETFD, FD_CLOEXEC) = 0`, which I presume corresponds to:

apptainer/cmd/starter/c/starter.c
Lines 991 to 1001 in bcce314
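For readers unfamiliar with the `fcntl(..., F_SETFD, FD_CLOEXEC)` call seen in the working trace, here is a small sketch of what that operation does to a descriptor. It is written in Python with the standard `fcntl` module rather than taken from `starter.c`, and the helper names `set_cloexec` and `has_cloexec` are assumptions made for this example. Note one difference: the traced call sets the descriptor flags to exactly `FD_CLOEXEC`, while the helper below does a read-modify-write to preserve any other flags.

```python
import fcntl
import os


def has_cloexec(fd):
    """Return True if the close-on-exec flag is set on fd."""
    return bool(fcntl.fcntl(fd, fcntl.F_GETFD) & fcntl.FD_CLOEXEC)


def set_cloexec(fd):
    """Set close-on-exec on fd, keeping any other descriptor flags."""
    flags = fcntl.fcntl(fd, fcntl.F_GETFD)
    fcntl.fcntl(fd, fcntl.F_SETFD, flags | fcntl.FD_CLOEXEC)


if __name__ == "__main__":
    fd = os.open(os.devnull, os.O_RDONLY)
    try:
        # Python opens descriptors non-inheritable by default, so clear
        # the flag first to start from the "missing fcntl" state.
        os.set_inheritable(fd, True)
        print("before:", has_cloexec(fd))
        set_cloexec(fd)
        print("after:", has_cloexec(fd))
    finally:
        os.close(fd)
```

A descriptor without this flag stays open across `exec`, so a missing call changes which descriptors are visible later in the startup sequence.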
It's still not clear to me why this is happening, and I'm now struggling to get access to a node to continue debugging. I'll update if I manage to make any more progress.
Steps to reproduce this behavior
I've only been able to reproduce with quite a complex command:
How did you install Apptainer
From conda-forge.