Troubleshooting

Copyright 2022,2023 Nvidia Corporation. All rights reserved.

...for when things don't go as expected

Installing DUE

Symptom: Docker (or Podman) isn't installed

Installing Docker without a DUE .deb

If you've downloaded DUE as source, install its dependencies by running: sudo apt update ; sudo apt install docker.io git rsync binfmt-support qemu qemu-user-static

Note 1: docker.io can be replaced with docker-ce, or podman.
Note 2: Newer host system distributions may provide the systemd-binfmt package to support the execution of non-native binaries, so that the binfmt-support package is not explicitly needed.

The last three packages in that command (binfmt-support, qemu, and qemu-user-static) are optional, but they are required if you want to run containers for other architectures.

TIP: If you are on the master Git branch of DUE, you can run make install to install DUE without going through package management.

Installing Docker through the DUE .deb

The lack of Docker will be obvious on the initial install of the DUE .deb, as you'll see the error:
due depends on docker.io | docker-ce | podman; however:
Package docker.io is not installed.
Package docker-ce is not installed.
Package podman is not installed.

To resolve this, try:
sudo apt update
sudo apt install --fix-broken
If that fails (and it might, depending on how old your operating system version is), try:
sudo apt install docker.io
...and if that fails, try downloading and installing docker-ce from https://hub.docker.com

Running DUE

Symptom: Docker containers don't run (or only run as root).

You'll see the error: Got permission denied while trying to connect to the Docker daemon socket
You are probably not a member of the docker group, so you'll need to:

Add yourself to the Docker group:

sudo usermod -a -G docker $(whoami)

You may have to log out and back in again for the group change to take effect. Running groups should show docker along with your other groups.
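The group check described above can be scripted. This is a minimal sketch, not part of DUE: has_group is a hypothetical helper name, and it assumes id -nG is available to print the current session's group names.

```shell
# Hypothetical helper: succeeds if group $1 appears in the
# space-separated group list $2.
has_group() {
    case " $2 " in
        *" $1 "*) return 0 ;;
        *) return 1 ;;
    esac
}

# id -nG prints the group names of the current session on one line,
# so a group added with usermod only shows up after logging in again.
if has_group docker "$(id -nG)"; then
    echo "docker group: OK"
else
    echo "docker group: missing (log out and back in after usermod)"
fi
```

Padding both strings with spaces keeps a group such as dockerx from being mistaken for docker.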

Symptom: Strange failures and permission errors in the container.

Check that the host directory the container is using is on a LOCAL file system. I've seen strange permission-related errors when Docker mounts a file system that is itself network mounted. If your home directory is NFS mounted on your build system, consider creating a work directory on the host system and using either /etc/due/due.conf or ~/.config/due/due.conf (generate this with ./due --manage --copy-config) to specify this local work directory as your "home" directory. You'll probably want to copy config files, etc. to the new "home" directory.
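One way to check whether a directory is network mounted is to ask for its file system type. This is a rough sketch that assumes GNU stat; is_network_fs is a hypothetical helper, and its list of type names is an assumption that is certainly not exhaustive.

```shell
# Hypothetical helper: succeeds if the file system type name $1
# looks network-backed. The list is an assumption, not exhaustive.
is_network_fs() {
    case "$1" in
        nfs*|cifs|smb*|afs) return 0 ;;
        *) return 1 ;;
    esac
}

# Directory to check: first argument, else $HOME, else /.
dir="${1:-${HOME:-/}}"

# GNU stat prints the file system type name with -f -c %T.
fstype="$(stat -f -c %T "$dir" 2>/dev/null)" || fstype="unknown"

if is_network_fs "$fstype"; then
    echo "$dir is on $fstype: consider a local work directory"
else
    echo "$dir is on $fstype"
fi
```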

Symptom: Can't mount file systems or missing dev entries in container.

Certain operations (like loopback mounting files) are restricted within the container because they would require root level access to the host file system. Docker containers can run with the --privileged option, which would allow this access, but it is then a false sense of security to assume that actions taken within the container won't trash the host system. Bottom line: this can be done, but it carries risks.

Symptom: DNS failures

In general, Docker will use the host system's network configuration within the container, so the contents of /etc/hosts and /etc/resolv.conf will be set at run time, which may not be what you want.

Overriding /etc/hosts

If a templates//filesystem/etc/hosts file is present for image creation, the container-create-user.sh script (being the first process run) will append its contents to the /etc/hosts file that is generated by Docker.
This is useful if you have static addresses to add.
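The append step can be pictured with the sketch below. It is not the actual container-create-user.sh code: append_hosts is a hypothetical helper, the host name and address are made up, and temporary files stand in for the real hosts files so nothing on the system is modified.

```shell
# Hypothetical helper: append template hosts file $1 to target
# hosts file $2, doing nothing if the template does not exist.
append_hosts() {
    if [ -f "$1" ]; then
        cat "$1" >> "$2"
    fi
}

work="$(mktemp -d)"
# Stands in for the /etc/hosts file generated by Docker.
printf '127.0.0.1 localhost\n' > "$work/hosts"
# Stands in for the template's hosts file (made-up entry).
printf '10.1.2.3 buildserver.example\n' > "$work/extra"

append_hosts "$work/extra" "$work/hosts"
cat "$work/hosts"
```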

Using a VPN with Docker

For image creation

If your image needs to access resources over the host's VPN during image creation and that access is failing, make sure the host VPN is up, restart Docker so that it becomes aware of the VPN, and then retry image creation. Example:
sudo systemctl stop docker
sudo systemctl start docker

In the container

VPN software (like openconnect) can work in a container, regardless of whether the host system is connected to a VPN.

Check the /etc/resolv.conf file

Double check that the container's /etc/resolv.conf file has been updated properly by any VPN software (like openconnect) running in the container or on the host. It may be prioritizing the host's primary network connection rather than the VPN. If in doubt, make sure the VPN's domain is the first one listed, so that it is searched first.
Example from /etc/resolv.conf:
search myVPNDomain myISPDomain
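To see which domain is searched first, a small script can read the search line. This is a sketch under stated assumptions: first_search_domain is a hypothetical helper, and it follows the resolv.conf convention that the last search line wins when there are several.

```shell
# Hypothetical helper: print the first domain listed on the
# effective (last) "search" line of resolv.conf-style file $1.
first_search_domain() {
    awk '$1 == "search" { d = $2 } END { if (d != "") print d }' "$1"
}

# File to inspect: first argument, else /etc/resolv.conf.
resolv="${1:-/etc/resolv.conf}"
if [ -r "$resolv" ]; then
    echo "first search domain: $(first_search_domain "$resolv")"
fi
```

Run inside the container, the printed domain should be the VPN's domain if the VPN is meant to be searched first.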

Symptom: Running emulated containers fails.

If QEMU is properly and fully installed, DUE should be able to run containers of other architectures seamlessly. If you're reading this, then you've found a seam and should file a bug at: https://github.com/CumulusNetworks/DUE/issues
Note: Be aware that newer host distributions may use systemd-binfmt to handle the running of non-native binaries, and as a result, the following binfmt-support suggestions may or may not apply in your environment.

Fails with: standard_init_linux.go:211: exec user process caused "exec format error"

So far this has been the only time I've seen this die, and I tracked it down to my system's binfmt-support not being configured to handle ARM binaries. Ideally, qemu should register the architectures it can run with binfmt-support, so that when non-native code is encountered, it can be passed off to qemu.

Other emulation-related failures to check:

Are there qemu-* entries under /proc/sys/fs/binfmt_misc/?

If ls -l /proc/sys/fs/binfmt_misc doesn't show them, then a few required packages may not be installed. Try:

sudo apt update ; sudo apt install qemu qemu-user-static binfmt-support

This should create the entries. If this fails, try reconfiguring qemu-user-static, with:
sudo dpkg-reconfigure qemu-user-static
This should configure binfmt-support to have the entries. I had to do this on one system, for reasons that aren't completely clear to me.
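The check for qemu-* entries can be wrapped in a small script. This is a minimal sketch: list_qemu_binfmt is a hypothetical helper, and the optional directory argument exists only so the function can be tried against a scratch directory.

```shell
# Hypothetical helper: list qemu-* entries in directory $1
# (default: /proc/sys/fs/binfmt_misc), one name per line.
list_qemu_binfmt() {
    dir="${1:-/proc/sys/fs/binfmt_misc}"
    for entry in "$dir"/qemu-*; do
        # An unmatched glob stays literal, so skip names that do not exist.
        [ -e "$entry" ] && basename "$entry"
    done
    return 0
}

entries="$(list_qemu_binfmt)"
if [ -n "$entries" ]; then
    echo "registered qemu handlers:"
    echo "$entries"
else
    echo "no qemu-* entries found under /proc/sys/fs/binfmt_misc"
fi
```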

Is the binfmt service running?

List the binfmt unit files:
systemctl list-unit-files | grep binfmt
Restart binfmt-support:
sudo systemctl restart binfmt-support.service

Debugging a failed image creation

If image creation does not complete, a partial image will have been created with the name <none>. Running due --manage --list-images will list all images on the system, with the most recently created ones listed first.

To get inside the failed container and debug it, run:

due --run --debug
and select the image.
Then: cd /due_configuration

Here you'll find all the configuration scripts that were run to create the container, so you can run them in the container as needed to track down the failure.

Your home directory will be mounted under /home/root, so any file changes you make can be persisted by copying them there.

Cleaning up failed images

Run due --manage --delete-matched none
This gets the IDs of all images that have 'none' in their name and generates a script named delete_these_docker_images.sh that can be run to delete all those images.

--delete-matched filters images with *term-supplied*, so you should check that the images listed in the script are, indeed, the ones you want to get rid of.
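The reason for that caution can be illustrated with a plain substring match. This sketch is not DUE's implementation: match_images is a hypothetical helper, and the image names and IDs are made up for the example.

```shell
# Hypothetical helper: print the input lines whose first field
# (the image name) contains the term $1 anywhere in it.
match_images() {
    awk -v term="$1" 'index($1, term) > 0'
}

# Made-up "name id" listing, for illustration only.
printf '%s\n' \
    '<none>:<none> 1a2b3c' \
    'debian-10:latest 4d5e6f' \
    'nonesuch-build:v1 7a8b9c' | match_images none
```

Here the term none selects the failed <none> image but also nonesuch-build, which is exactly the kind of unintended match worth looking for before running the generated script.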