Couldn't always determine user account information in slurm #2213

Open
laurapoggio-sptools opened this issue May 7, 2024 · 7 comments

@laurapoggio-sptools

Version of Apptainer

What version of Apptainer (or Singularity) are you using? Run

$ singularity --version 
singularity version 3.8.3 

Expected behavior

I am launching singularity from a SLURM file in an HPC setting

module load singularity/3.8.3 
F=${SLURM_ARRAY_TASK_ID} 
FILE_LIST="path/to/list/of/filesToprocess.txt"

singularity exec -B /storage/:/storage/ /path/to/singularity_image \
    path_to_my_bash_script.sh \
    voi="something" \
    infile="$(sort "$FILE_LIST" | sed -n "${F}p")"

The expected behaviour is that the SLURM job starts and each array task starts a process that opens the singularity image (accessible to all nodes). This happens about 80% of the time.

Actual behavior

Often I get the following error:

Failed to get singularity version:  
WARNING: Could not lookup the current user's information: user: lookup userid numeric_UID: no such file or directory  
FATAL: Couldn't determine user account information: user: lookup userid numeric_UID: no such file or directory 

Here numeric_UID stands for my actual numeric UID on the HPC. What seems to fail is the step that maps the numeric UID to my username (the non-numeric ID).
Example: my username is user002 and my numeric UID is 1234567.
As far as I understand, the mapping between the two is done via Active Directory, as the username is the ID used for all services (including Windows).

Steps to reproduce this behavior

I have not been able to fully reproduce this behaviour. It seems to be a combination of a limited network connection to the Active Directory component, the number of users connected, and something else not yet identified.

Is there a way to catch this behaviour of singularity in the SLURM file? As I cannot reproduce it every time, it is difficult to catch.
Is there a way to set the UID manually? As it would always be the same in this case, it could help.
Any other suggestions would be welcome.

What OS/distro are you running

$ cat /etc/os-release 
NAME="Ubuntu" 
VERSION="20.04 LTS (Focal Fossa)" 
ID=ubuntu 
ID_LIKE=debian 
PRETTY_NAME="Ubuntu 20.04 LTS" 
VERSION_ID="20.04" 
HOME_URL="https://www.ubuntu.com/" 
SUPPORT_URL="https://help.ubuntu.com/" 
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/" 
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy" 
VERSION_CODENAME=focal 
UBUNTU_CODENAME=focal

How did you install Apptainer

It is installed as a module on the HPC, not sure how the administrators installed it.

@DrDaveD
Contributor

DrDaveD commented May 7, 2024

singularity-3.8.3 is awfully old. Can you instead do module load apptainer?

@laurapoggio-sptools
Author

laurapoggio-sptools commented May 8, 2024

I asked for the update. I can now use apptainer version 1.3.0

I rebuilt the image with the newer version. The errors above seem to be slightly less frequent, but they are still there.

@laurapoggio-sptools
Author

With the new version (apptainer version 1.3.0) the error changed slightly as well. Now it is

FATAL:   Couldn't determine user account information: user: lookup userid numeric_UID connection refused

@DrDaveD
Contributor

DrDaveD commented May 9, 2024

I'm afraid that looks to me like a slurm problem, and I don't really know anything about how to debug that. Maybe you can get some help from the slurm project, or otherwise we'll need you to try to collect more debugging info.

Where is the "Failed to get singularity version" coming from? Is that a slurm message? Or something in your scripts? Perhaps you can show before running singularity whether or not the user account exists in /etc/passwd. Apptainer requires that.

This looks related to #1066.
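
One way to act on the suggestion above, checking whether the account can be resolved before the container starts, is sketched below. It is only a sketch: the function names `lookup_user` and `wait_for_user_lookup` and their defaults are hypothetical, and it assumes that `getent passwd` goes through the same name-service lookup that fails for Apptainer, which has not been confirmed in this thread.

```shell
# Hypothetical helpers; names and defaults are illustrative only.

# lookup_user UID: succeeds if the numeric UID resolves to a passwd entry
# through the system's name service (local files, SSSD, LDAP, ...).
lookup_user() {
    getent passwd "$1" > /dev/null
}

# wait_for_user_lookup [ATTEMPTS] [DELAY_SECONDS]
# Returns 0 as soon as the lookup works, 1 if every attempt fails.
wait_for_user_lookup() {
    wl_attempts="${1:-5}"
    wl_delay="${2:-10}"
    wl_uid="$(id -u)"
    wl_i=1
    while [ "$wl_i" -le "$wl_attempts" ]; do
        if lookup_user "$wl_uid"; then
            return 0
        fi
        # Record when and where the lookup failed so it can be correlated
        # with the container errors later.
        echo "$(date '+%Y-%m-%d %H:%M:%S') $(uname -n): lookup of UID $wl_uid failed (attempt $wl_i/$wl_attempts)" >&2
        if [ "$wl_i" -lt "$wl_attempts" ]; then
            sleep "$wl_delay"
        fi
        wl_i=$((wl_i + 1))
    done
    return 1
}
```

In the SLURM file this could be called as `wait_for_user_lookup 5 10 || exit 1` just before the `singularity exec` line. If the messages show up in the job's stderr on the same nodes and at the same times as the container errors, that would point at the node's user lookup rather than at the container runtime.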

@laurapoggio-sptools
Author

Thanks. I am not sure it is a slurm problem as sometimes (very rarely) I get the same error when I start the singularity image from the login node of the HPC without SLURM.

The error appears after launching singularity exec and sometimes even singularity shell. I think it is coming from singularity.

There are almost no users in /etc/passwd because user management is based on a network type of login.

If you have any suggestions on how to debug this, I would be very happy to try it out

@DrDaveD
Contributor

DrDaveD commented May 14, 2024

I would like to see the exact output from when it is failing outside of SLURM because it might not be the same. If it's a rare issue, maybe you could try to reproduce it by starting up the image many times from a script in a loop until it fails.
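
The loop suggested above could look roughly like the following sketch. The function name `run_until_failure` and its argument order are hypothetical; it simply reruns a command until the first non-zero exit and keeps the output of the failing run.

```shell
# Hypothetical helper; the name and argument order are illustrative only.

# run_until_failure MAX_RUNS LOGFILE COMMAND [ARGS...]
# Runs COMMAND up to MAX_RUNS times and stops at the first non-zero exit.
# Returns 1 if a failure was caught (its output is in LOGFILE), 0 otherwise.
run_until_failure() {
    ruf_max="$1"
    ruf_log="$2"
    shift 2
    ruf_i=1
    while [ "$ruf_i" -le "$ruf_max" ]; do
        # Each run overwrites the log, so after a failure the file holds
        # exactly the output of the failing run.
        if ! "$@" > "$ruf_log" 2>&1; then
            echo "run $ruf_i failed on $(uname -n); output saved in $ruf_log" >&2
            return 1
        fi
        ruf_i=$((ruf_i + 1))
    done
    echo "no failure in $ruf_max runs" >&2
    return 0
}
```

On the login node this might be called as `run_until_failure 1000 failure.log apptainer exec /path/to/singularity_image true`, assuming the image provides a `true` command. The log file name and the run count are placeholders.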

Separately, if it fails much more often under SLURM, perhaps you could run it with apptainer -d for debugging (or set environment variable APPTAINER_DEBUG=1) and show us the debug output for a failing case. That might help.
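
Since only some array tasks fail, it may help to keep the debug output just for those. The sketch below is one way to do that; the function name `run_with_debug_log` is hypothetical, and it assumes the debug messages are written to stderr.

```shell
# Hypothetical helper; the name is illustrative only.

# run_with_debug_log LOGFILE COMMAND [ARGS...]
# Runs COMMAND with APPTAINER_DEBUG=1 and sends its stderr to LOGFILE.
# The log is removed when the command succeeds and kept when it fails.
run_with_debug_log() {
    rdl_log="$1"
    shift
    if APPTAINER_DEBUG=1 "$@" 2> "$rdl_log"; then
        rm -f "$rdl_log"
        return 0
    else
        # In the else branch, $? still holds the exit status of the command.
        rdl_status=$?
        echo "command failed with status $rdl_status; debug output kept in $rdl_log" >&2
        return "$rdl_status"
    fi
}
```

In the SLURM file the container launch could then be wrapped as `run_with_debug_log "debug_${SLURM_ARRAY_JOB_ID}_${SLURM_ARRAY_TASK_ID}.log" apptainer exec ...`, with the original arguments unchanged, so that each failing task leaves one log file behind. Note that the command's normal stderr also ends up in that file.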

> There are almost no users in /etc/passwd because the users management is based on a network type of login.

That shouldn't matter because apptainer should be able to read it from the network and create a corresponding /etc/passwd entry inside the container.

@laurapoggio-sptools
Author

Thanks for the suggestions. I will try them out and update the issue with the results.
