
Slow first execution inside the container #2050

Open
JorgeRuizITCL opened this issue Feb 28, 2024 · 13 comments

Comments

@JorgeRuizITCL

Version of Apptainer

What version of Apptainer (or Singularity) are you using? Run apptainer --version (or singularity --version).

We are running apptainer version 1.2.5

Expected behavior

Each invocation of a command is expected to take roughly the same time to execute inside the container.

Example Command:
for i in {1..3}; do time (ls -la / > /dev/null); done

Host output

real    0m0.005s
user    0m0.001s
sys     0m0.004s

real    0m0.004s
user    0m0.001s
sys     0m0.004s

real    0m0.004s
user    0m0.004s
sys     0m0.000s

Actual behavior

What actually happened? Why was it incorrect?

The same loop run inside a container takes noticeably longer the first time.

for i in {1..3}; do time (ls -la / > /dev/null); done

real    0m0.174s
user    0m0.001s
sys     0m0.012s

real    0m0.023s
user    0m0.006s
sys     0m0.005s

real    0m0.021s
user    0m0.004s
sys     0m0.006s

The issue is far more noticeable when running the Nvidia Isaac Sim container, where the first execution takes 30 s whereas subsequent executions take only about 3.4 s. The latter is the expected behaviour: even the very first run of the container with Docker (without any caching, on a fresh Docker + NVIDIA Container Toolkit install) takes about 3.4 s.

In this case we are not initializing the full container; we only measure the time from start until the logger reports the installed GPUs (via nvidia-smi), since fully initializing the container takes longer and also compiles the simulator's shaders.
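For reference, the per-run numbers above can be captured with a small portable helper (a sketch using only ls and GNU date; the run_ms name is our own):

```shell
#!/bin/sh
# Minimal timing helper: print elapsed wall-clock milliseconds for each of
# three runs of the same command, mirroring the loop used above.
run_ms() {
  start=$(date +%s%N)     # nanoseconds since the epoch (GNU date)
  "$@" > /dev/null
  end=$(date +%s%N)
  echo $(( (end - start) / 1000000 ))
}

for i in 1 2 3; do
  run_ms ls -la /
done
```

Note that date +%s%N requires GNU coreutils; a busybox date prints %N literally.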

Steps to reproduce this behavior

For the ls example, run an ubuntu:20.04 container with the following command.
apptainer shell -c ubuntu-20.04.sif
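A complete reproduction sketch of the ls measurement (assuming the .sif is pulled from Docker Hub; the image filename is ours):

```shell
# Pull the image once, then time the same command three times inside the
# container; the first run pays the mount/startup cost.
apptainer pull ubuntu-20.04.sif docker://ubuntu:20.04
apptainer exec -c ubuntu-20.04.sif bash -c \
  'for i in 1 2 3; do time (ls -la / > /dev/null); done'
```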

For the Isaac Sim execution, based on the isaac-sim.headless.native.sh script, the same results happened with the following configurations.

CFG 1

  • With --nv
  • With -c
  • With the cache bind mounts (listed below)
  • With --writable-tmpfs
  • With --fakeroot

CFG 2

  • With --nv
  • With -c
  • With the cache bind mounts (listed below)
  • With --writable-tmpfs

CFG 3

  • With --nv
  • With -c
  • With the cache bind mounts (listed below)
  • With --overlay using a 10GB sparse disk.

CFG 4

  • With --nv
  • With -c
  • With --overlay using a 10GB sparse disk. (should store the new cache)

CFG 5

  • With --nv
  • With -c

The cache bind mounts map the following directories so that the shader cache compiled by the program persists across runs.

    --bind /tmp/docker/isaac-sim/cache/kit:/isaac-sim/kit/cache:rw \
    --bind /tmp/docker/isaac-sim/cache/ov:/root/.cache/ov:rw \
    --bind /tmp/docker/isaac-sim/cache/pip:/root/.cache/pip:rw \
    --bind /tmp/docker/isaac-sim/cache/glcache:/root/.cache/nvidia/GLCache:rw \
    --bind /tmp/docker/isaac-sim/cache/computecache:/root/.nv/ComputeCache:rw \
    --bind /tmp/docker/isaac-sim/logs:/root/.nvidia-omniverse/logs:rw \
    --bind /tmp/docker/isaac-sim/data:/root/.local/share/ov/data:rw \
    --bind /tmp/docker/isaac-sim/documents:/root/Documents:rw \

What OS/distro are you running

$ cat /etc/os-release

NAME="Ubuntu"
VERSION="20.04.6 LTS (Focal Fossa)"
ID=ubuntu
ID_LIKE=debian
PRETTY_NAME="Ubuntu 20.04.6 LTS"
VERSION_ID="20.04"
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
VERSION_CODENAME=focal
UBUNTU_CODENAME=focal

How did you install Apptainer

Via the APT repository.

Important Information

The $HOME of the user is located on an NFS mount, which is why we avoid mounting $HOME and other directories by using the -c argument. All files required for the experiment, such as the .sif file, the .img overlay, and the bind-mounted folders, are stored in the local /tmp directory.

@DrDaveD
Contributor

DrDaveD commented Feb 28, 2024

It's not very surprising to me that initializing things takes extra time, especially when using overlays. What could be helpful is if you could find a combination that was considerably faster, so the slowest subsystem could be identified. I suggest trying it with apptainer-suid installed to see if that makes a difference, and trying it with sandbox containers and/or overlays.
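One way to narrow down the slow subsystem, as suggested above, is to time a no-op startup across image formats (a sketch; filenames are assumptions):

```shell
# Build a sandbox (plain directory) copy of the SIF, then compare the
# cold-start cost of each format with a command that does nothing.
apptainer build --sandbox ubuntu-20.04.dir/ ubuntu-20.04.sif
for img in ubuntu-20.04.sif ubuntu-20.04.dir/; do
  echo "== $img"
  time apptainer exec -c "$img" true
done
```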

@JorgeRuizITCL
Author

Thank you for your response!
There is indeed a slight speedup with the sandbox container (30 s -> 20 s for Isaac Sim). I cannot test the performance with apptainer-suid as I am running these containers on production nodes and I am not allowed to modify them in any way.

@DrDaveD
Contributor

DrDaveD commented Feb 29, 2024

It's unfortunate that you can't try suid mode, it would be good to have that comparison. Maybe you could ask a system admin to help you out temporarily for a test?

Make sure that it is running squashfuse_ll out of /usr/libexec/apptainer/bin. It should be if it was installed via apt by the system administrator. The benchmarks I ran in #665 didn't see much startup cost compared to a sandbox, but maybe your application is that much more punishing on squashfuse_ll. Can you tell if there's any slowdown compared to a sandbox once the application is up and running?
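To check which FUSE helper is actually doing the work during startup, something like the following can help (a sketch; exact process names may vary by version):

```shell
# In another terminal while the container is starting up:
pgrep -af squashfuse_ll             # mounts the SIF's squashfs partitions
pgrep -af fuse2fs                   # mounts ext3 overlay images
ls -l /usr/libexec/apptainer/bin/   # confirm the bundled helpers are present
```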

@DrDaveD
Contributor

DrDaveD commented Feb 29, 2024

I take it back, the final (lhcb-gen-sim-bmk) version of that benchmark also did see 15 seconds additional time for squashfuse_ll vs local disk sandbox. I just didn't break down startup time vs execute time. It saw 11 seconds additional time for squashfs vs local disk sandbox. It's hard to say how much of that was within the margin of error, however, since I didn't distinguish between startup and execute time. The primary benchmark I was using (atlas-gen-bmk) showed 8 seconds slower for squashfuse_ll vs sandbox, and it showed the kernel squashfs to be 4 seconds slower than that. That was non-intuitive and I chalked that up to margin of error.

@JorgeRuizITCL
Author

Hi DrDaveD,

I'm unsure if it's running squashfuse_ll; the binary is in that path, but there is high CPU usage related to fuse2fs.

Even a pip install with an overlay takes ages, and it's the only bottleneck we have in our system.
At the moment we are working with rootless Docker, but we have developed a tool based on Apptainer that we cannot migrate, and it is not as responsive due to the filesystem :(.

How can I detect the bottleneck?

@DrDaveD
Contributor

DrDaveD commented Mar 4, 2024

fuse2fs is slow. Avoid using overlay images if you need performance. Have you tried an overlay sandbox?
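A directory overlay, unlike an ext3 overlay image, avoids fuse2fs entirely (a sketch; it requires kernel support for unprivileged overlayfs, or the suid installation, and the paths are ours):

```shell
# Writes land in the host directory via overlayfs, not through a FUSE
# filesystem image, so there is no fuse2fs in the data path.
mkdir -p /tmp/my_overlay
apptainer exec --overlay /tmp/my_overlay ubuntu-20.04.sif \
  sh -c 'touch /persisted && ls -l /persisted'
```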

@JorgeRuizITCL
Author

Hi, if I try to run a sandbox container (with fakeroot) with an overlay (sparse and normal, plus fakeroot), the apptainer exec command hangs forever and the container is never instantiated.

@DrDaveD
Contributor

DrDaveD commented Mar 6, 2024

I doubt you'll be able to give me instructions on how to reproduce that, so you're going to have to dive in deeper to see where it is hanging. Maybe the -d debug option will help, or possibly strace.

If you can arrange to run it without using overlay at all that would probably speed things up considerably.
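A sketch of combining the -d flag with strace to see where the startup hangs (the log path and image names are placeholders):

```shell
# Follow child processes, timestamp each syscall, and write to a log file;
# the tail of the log shows the last thing a process did before hanging.
strace -f -tt -o /tmp/apptainer-trace.log \
  apptainer -d exec --fakeroot --overlay overlay.img image.sif true
tail -n 20 /tmp/apptainer-trace.log
```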

@JorgeRuizITCL
Author

Hi,

For this test I'm using the sandboxed nvcr.io/nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04 OCI image with a --sparse --fakeroot --size 1024 overlay.

The -d option prints the following line forever.
VERBOSE [U=0,P=394912] mountGeneric() Overlay mount failed with invalid argument, mounting without xino option

Notes:

  • The sandboxed image was converted into a sif from a docker-archive .tar and then built from that sif with --sandbox.
  • Here is the full -d log (https://pastebin.com/Mj676fvM)
  • I ran the following command from the /tmp dir: apptainer -d exec --fakeroot --overlay overlay.img cuda11dev2204s.sif ls -la

@DrDaveD
Contributor

DrDaveD commented Mar 7, 2024

That is indeed a bad bug. I can reproduce it with a base ubuntu20.04 image. Fortunately it does not happen with the 1.3.0-rc.2 release. Please upgrade to that version for this test. I will create a separate issue just to document this.

@JorgeRuizITCL
Author

Okay, we will update Apptainer and try to replace it with the SUID installation in the following weeks.
I will let you know whether apptainer-suid + sandbox + writable overlays improves the overall performance.

@JorgeRuizITCL
Author

Hi,

I have had the opportunity to run apptainer-suid + sandbox images + sparse overlay in a different machine and the results are pretty bad.

Installing TensorFlow with this configuration inside the CUDA 11 Ubuntu 22.04 container takes up to 5 minutes, whereas the very same command takes only around 30 s with rootless Docker.

I know that using overlays is far from ideal, but our workflow is designed to avoid creating new images, as .sif files are slow to build and modify, and not all of our users are used to creating container images; they prefer to install dependencies via pip and apt.

@DrDaveD
Contributor

DrDaveD commented Mar 15, 2024

I was suggesting using a directory overlay, not a sparse overlay image. An overlay image will use fuse2fs. Docker does not use overlay images, they use plain directories. Whether the underlying image is a sandbox or SIF doesn't make a lot of difference to the performance in my experience, not when you're running only one.
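Concretely, the suggestion above amounts to replacing the overlay image with a plain directory (a sketch; the image name is taken from earlier in the thread):

```shell
# No fuse2fs in the data path: writes go straight to the host filesystem
# through overlayfs, which should behave much more like rootless Docker.
mkdir -p /tmp/ovl-dir
time apptainer exec --fakeroot --overlay /tmp/ovl-dir \
  cuda11dev2204s.sif pip install --no-cache-dir tensorflow
```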
