Couldn't always determine user account information in slurm #2213
Comments
singularity-3.8.3 is awfully old. Can you instead do
I asked for the update and can now use the newer version. I rebuilt the image with it. The errors above seem slightly less frequent, but they are still there.
With the new version: FATAL: Couldn't determine user account information: user: lookup userid numeric_UID connection refused
I'm afraid that looks to me like a slurm problem, and I don't really know anything about how to debug that. Maybe you can get some help from the slurm project, or otherwise we'll need you to try to collect more debugging info. Where is the "Failed to get singularity version" coming from? Is that a slurm message? Or something in your scripts? Perhaps you can show before running singularity whether or not the user account exists in /etc/passwd. Apptainer requires that. This looks related to #1066.
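A minimal sketch of the check suggested above, assuming Python 3 is available on the compute node. It reports whether the current numeric UID has an entry in the local /etc/passwd and whether the system's name service can resolve it to a user name. The helper names `uid_in_passwd_file` and `uid_resolves` are made up for illustration. The second check goes through the C library's lookup, which may not be the same code path Apptainer uses, so treat the result as a hint rather than proof.

```python
import os
import pwd


def uid_in_passwd_file(uid, path="/etc/passwd"):
    """Return True if the numeric uid has a line in the given passwd-format file."""
    try:
        with open(path) as fh:
            for line in fh:
                fields = line.strip().split(":")
                # Field 2 (0-based) of a passwd line is the numeric UID.
                if len(fields) > 2 and fields[2] == str(uid):
                    return True
    except OSError:
        # Unreadable or missing file: treat as "not found".
        pass
    return False


def uid_resolves(uid):
    """Return the user name the system resolves for uid, or None if the lookup fails."""
    try:
        return pwd.getpwuid(uid).pw_name
    except KeyError:
        return None


if __name__ == "__main__":
    uid = os.getuid()
    print(f"uid={uid} listed in /etc/passwd: {uid_in_passwd_file(uid)}")
    print(f"uid={uid} resolves to user name: {uid_resolves(uid)}")
```

Running this at the top of the job script, before the container starts, would show in the job log whether the account was resolvable on that node at that moment.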
Thanks. I am not sure it is a slurm problem as sometimes (very rarely) I get the same error when I start the singularity image from the login node of the HPC without SLURM. The error appears after launching There are almost no users in If you have any suggestions on how to debug this, I would be very happy to try it out |
I would like to see the exact output from when it is failing outside of SLURM because it might not be the same. If it's a rare issue, maybe you could try to reproduce it by starting up the image many times from a script in a loop until it fails. Separately, if it fails much more often under SLURM, perhaps you could run it with
That shouldn't matter because apptainer should be able to read it from the network and create a corresponding
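The "run it in a loop until it fails" idea could be sketched as below, assuming Python 3 on the login node. The function name `run_until_failure`, the attempt limit, and the script name in the usage line are hypothetical; the command to repeat is taken from the command line, so nothing about the image or its path is assumed here.

```python
import subprocess
import sys


def run_until_failure(cmd, max_attempts=1000):
    """Run cmd repeatedly and stop at the first non-zero exit.

    Returns (attempt_number, CompletedProcess) for the failing run,
    or None if every attempt succeeded.
    """
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(cmd, capture_output=True, text=True)
        if result.returncode != 0:
            return attempt, result
    return None


if __name__ == "__main__":
    # Hypothetical usage: python repro_loop.py apptainer exec image.sif true
    cmd = sys.argv[1:]
    if not cmd:
        sys.exit("usage: repro_loop.py COMMAND [ARGS...]")
    failure = run_until_failure(cmd)
    if failure is None:
        print("no failure observed")
    else:
        attempt, result = failure
        print(f"failed on attempt {attempt} with exit code {result.returncode}")
        print("stdout:", result.stdout)
        print("stderr:", result.stderr)
```

Capturing stdout and stderr of the failing attempt gives the exact output that was asked for, without having to watch the terminal.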
Thanks for the suggestions. I will try them and update the issue with the results.
Version of Apptainer
What version of Apptainer (or Singularity) are you using? Run
Expected behavior
I am launching singularity from a SLURM job script in an HPC setting.
The expected behaviour is that the SLURM job starts and each array task starts a process that opens the singularity image (accessible to all nodes). This happens about 80% of the time.
Actual behavior
Often I get the following error
The numeric_UID is my actual numeric UID on the HPC, but the step that maps it to my non-numeric UID (the user name) fails.
Example: my UID is user002, my numeric_UID is 1234567.
As far as I understand, the mapping between the two happens via Active Directory, as the UID is the ID used for all services (including Windows).
Steps to reproduce this behavior
I have not been able to fully reproduce this behaviour. It seems to be a combination of a limited network connection to the Active Directory component, the number of users connected, and something else not yet identified.
Is there a way to capture the behaviour of singularity in the SLURM file? As I cannot reproduce it every time, it is difficult to catch.
Is there a way to set the UID manually? As it would always be the same in this case, that could help.
Any other suggestions would be welcome.
What OS/distro are you running
How did you install Apptainer
It is installed as a module on the HPC, not sure how the administrators installed it.