Iotedge fails to make docker-proxy.sock after reboot (file exists) in Ubuntu Core #7255

Open
CharleeSF opened this issue Apr 1, 2024 · 12 comments


CharleeSF commented Apr 1, 2024

Expected Behavior

If I have installed and set up the azure-iot-edge snap, I should be able to reboot the device and azure-iot-edge should start without any issues.

Current Behavior

azure-iot-edge fails to start with the following logs:

2024-03-27T17:29:35Z systemd[1]: Started Service for snap application azure-iot-edge.docker-proxy.
2024-03-27T17:29:35Z azure-iot-edge.docker-proxy[1201]: 2024/03/27 17:29:35 socat[1201] E "/var/snap/azure-iot-edge/common/docker-proxy.sock" exists
2024-03-27T17:29:35Z systemd[1]: snap.azure-iot-edge.docker-proxy.service: Main process exited, code=exited, status=1/FAILURE
2024-03-27T17:29:35Z systemd[1]: snap.azure-iot-edge.docker-proxy.service: Failed with result 'exit-code'.
2024-03-27T17:29:35Z systemd[1]: snap.azure-iot-edge.docker-proxy.service: Scheduled restart job, restart counter is at 5.
2024-03-27T17:29:35Z systemd[1]: Stopped Service for snap application azure-iot-edge.docker-proxy.
2024-03-27T17:29:35Z systemd[1]: snap.azure-iot-edge.docker-proxy.service: Start request repeated too quickly.
2024-03-27T17:29:35Z systemd[1]: snap.azure-iot-edge.docker-proxy.service: Failed with result 'exit-code'.
2024-03-27T17:29:35Z systemd[1]: Failed to start Service for snap application azure-iot-edge.docker-proxy.
2024-03-27T17:29:38Z azure-iot-edge.aziot-edged[1078]: 2024-03-27T17:29:38Z [INFO] - Device is QemuDevice1 on clp-dev-iq-shared-8ccd7.azure-devices.net
2024-03-27T17:29:38Z azure-iot-edge.aziot-edged[1078]: 2024-03-27T17:29:38Z [INFO] - Initializing module runtime...
2024-03-27T17:29:38Z azure-iot-edge.aziot-edged[1078]: 2024-03-27T17:29:38Z [INFO] - Using runtime network id azure-iot-edge
2024-03-27T17:29:38Z azure-iot-edge.aziot-edged[1078]: 2024-03-27T17:29:38Z [WARN] - container runtime error
2024-03-27T17:29:38Z azure-iot-edge.aziot-edged[1078]: Caused by:
2024-03-27T17:29:38Z azure-iot-edge.aziot-edged[1078]:     0: error trying to connect: Connection refused (os error 111)
2024-03-27T17:29:38Z azure-iot-edge.aziot-edged[1078]:     1: Connection refused (os error 111)
2024-03-27T17:29:38Z azure-iot-edge.aziot-edged[1078]: 2024-03-27T17:29:38Z [ERR!] - Failed to initialize module runtime: runtime operation error: initialize module runtime
2024-03-27T17:29:38Z systemd[1]: snap.azure-iot-edge.aziot-edged.service: Main process exited, code=exited, status=1/FAILURE
2024-03-27T17:29:38Z systemd[1]: snap.azure-iot-edge.aziot-edged.service: Failed with result 'exit-code'.
2024-03-27T17:29:38Z systemd[1]: snap.azure-iot-edge.aziot-edged.service: Scheduled restart job, restart counter is at 1.
2024-03-27T17:29:38Z systemd[1]: Stopped Service for snap application azure-iot-edge.aziot-edged.

Steps to Reproduce


  1. Install azure-iot-edge and azure-iot-identity (both on latest/stable, revision 32 and 27 respectively).
  2. Set up azure-iot-edge with the primary connection string (FYI: for me this only works via raw-config; otherwise I get permission errors):

echo "[provisioning]
source = \"manual\"
connection_string = \"$MY_CONNECTION_STRING\"" > config.toml
sudo snap set azure-iot-edge raw-config="$(cat config.toml)"

I verify that it is working by checking the logs:

snap logs azure-iot-edge -f

  3. Reboot the device: azure-iot-edge now fails to start.

Context (Environment)

  • Ubuntu Core 22
  • azure-iot-edge revision 32
  • azure-iot-identity revision 27
  • docker revision 2915
  • No failures reported by snap changes

Output of iotedge check

Seems to be OK other than docker-proxy.sock not being accessible, which is the bug I'm reporting.

charlee@ubuntucore:~$ sudo iotedge check

Configuration checks (aziot-identity-service)
---------------------------------------------
√ keyd configuration is well-formed - OK
√ certd configuration is well-formed - OK
√ tpmd configuration is well-formed - OK
√ identityd configuration is well-formed - OK
√ daemon configurations up-to-date with config.toml - OK
√ identityd config toml file specifies a valid hostname - OK
√ aziot-identity-service package is up-to-date - OK
√ host time is close to reference time - OK
√ preloaded certificates are valid - OK
√ keyd is running - OK
√ certd is running - OK
√ identityd is running - OK
√ read all preloaded certificates from the Certificates Service - OK
√ read all preloaded key pairs from the Keys Service - OK
√ check all EST server URLs utilize HTTPS - OK
√ ensure all preloaded certificates match preloaded private keys with the same ID - OK

Connectivity checks (aziot-identity-service)
--------------------------------------------
√ host can connect to and perform TLS handshake with iothub AMQP port - OK
√ host can connect to and perform TLS handshake with iothub HTTPS / WebSockets port - OK
√ host can connect to and perform TLS handshake with iothub MQTT port - OK

Configuration checks
--------------------
√ aziot-edged configuration is well-formed - OK
√ configuration up-to-date with config.toml - OK
× container engine is installed and functional - Error
    Could not communicate with container engine at unix:///var/snap/azure-iot-edge/common/docker-proxy.sock.
    Please check your moby-engine installation and ensure the service is running.
√ aziot-edge package is up-to-date - OK
‼ DNS server - Warning
    Container engine is not configured with DNS server setting, which may impact connectivity to IoT Hub.
    Please see https://aka.ms/iotedge-prod-checklist-dns for best practices.
    You can ignore this warning if you are setting DNS server per module in the Edge deployment.
‼ production readiness: logs policy - Warning
    Container engine is not configured to rotate module logs which may cause it run out of disk space.
    Please see https://aka.ms/iotedge-prod-checklist-logs for best practices.
    You can ignore this warning if you are setting log policy per module in the Edge deployment.
√ proxy settings are consistent in aziot-edged, aziot-identityd, moby daemon and config.toml - OK

Connectivity checks
-------------------
23 check(s) succeeded.
2 check(s) raised warnings. Re-run with --verbose for more details.
1 check(s) raised errors. Re-run with --verbose for more details.
11 check(s) were skipped due to errors from other checks. Re-run with --verbose for more details.


Device Information

  • Host OS: Ubuntu Core 22.04
  • Architecture: amd64
  • Container OS: Linux (under QEMU, but also reproduced on hardware running Ubuntu Core directly)

Runtime Versions

  • aziot-edged: 1.4.33
  • Edge Agent: 1.4
  • Edge Hub: 1.4
  • Docker/Moby: 24.0.5

Workaround

sudo rm /var/snap/azure-iot-edge/common/docker-proxy.sock && sudo snap restart azure-iot-edge

Notes

Not sure if it is reproducible 100% of the time, but it occurs often enough to be a problem, especially since my device is supposed to be able to boot without human interaction.

CharleeSF (Author) commented

Also see the discussion here: https://forum.snapcraft.io/t/possible-to-run-scripts-as-root-on-startup-on-ubuntu-core/39580/7

And the comment by mborzecki1:

I briefly looked at the snap. The docker-proxy service should really be made a socket-activated service; then systemd starts listening on the socket and that nonsense of unlink/reuseaddr is gone. Unfortunately, you'd need to add code to hand over the file descriptor provided by systemd to socat. Other services which may use what docker-proxy provides should use after: [docker-proxy] in their declarations. This needs to be fixed in snap packaging, so there's not much you can do, except for opening some PRs or filing bugs if you're simply consuming the snap.
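
For illustration, socket activation here would mean systemd owns the listening socket and hands it to the service as fd 3; the handover inside socat.sh might then look roughly like this (a sketch only: it assumes the snap packaging declares a socket for the app, and that socat's ACCEPT-FD address is available, which is only present in newer socat releases):

#!/bin/bash
# Hypothetical socket-activated socat.sh. Assumes the snap packaging
# declares a socket for this app, so systemd binds
# $SNAP_COMMON/docker-proxy.sock itself and passes it in as fd 3.
if [ "${LISTEN_FDS:-0}" -ge 1 ]; then
    # Serve connections on the inherited, already-listening socket; the
    # script never creates the file itself, so the unlink/reuseaddr dance
    # (and the "file exists" failure) goes away entirely.
    exec socat ACCEPT-FD:3,fork UNIX-CONNECT:/var/run/docker.sock
else
    echo "expected a systemd-provided listening socket" >&2
    exit 1
fi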

damonbarry (Member) commented

Hi @CharleeSF, I tried to repro locally and couldn't, and we haven't seen this in our internal testing. But I see from the forum thread that the problem is fairly well understood. @alexclewontin, can you comment on the suggested approach in the forum thread?

CharleeSF (Author) commented

Hey @damonbarry, thanks for looking into this!

I am also not always able to reproduce it. The strange thing is that once I can reproduce it, it happens on every reboot, but not every setup has it.

Since the setup takes quite a lot of time, I haven't tried to get to a 100% reproduction scenario.

I have, however, also seen the docker-proxy fail because /var/run/docker.sock is not available yet.

May I ask why edged doesn't talk to /var/run/docker.sock directly?


CharleeSF commented Apr 17, 2024

Hey @damonbarry,

Is there any progress on this? I just wanted to mention that I also regularly see docker boot slower/later than azure-iot-edge, resulting in behavior like this:

2024-04-17T17:51:26Z systemd[1]: Started Service for snap application azure-iot-edge.aziot-edged.
2024-04-17T17:51:26Z azure-iot-edge.aziot-edged[1825]: Making /var/run/iotedge if it does not exist
2024-04-17T17:51:26Z azure-iot-edge.aziot-edged[1825]: Successfully made /var/run/iotedge if it did not exist
2024-04-17T17:51:26Z azure-iot-edge.aziot-edged[1828]: 2024-04-17T17:51:26Z [INFO] - Starting Azure IoT Edge Daemon
2024-04-17T17:51:26Z azure-iot-edge.aziot-edged[1828]: 2024-04-17T17:51:26Z [INFO] - Version - 1.4.33
2024-04-17T17:51:26Z azure-iot-edge.aziot-edged[1828]: 2024-04-17T17:51:26Z [INFO] - Obtaining Edge device provisioning data...
2024-04-17T17:51:26Z azure-iot-edge.aziot-edged[1828]: 2024-04-17T17:51:26Z [INFO] - Device is SnapeQemuAuto2 on iq-shared-0-8ccd7.azure-devices.net
2024-04-17T17:51:26Z azure-iot-edge.aziot-edged[1828]: 2024-04-17T17:51:26Z [INFO] - Initializing module runtime...
2024-04-17T17:51:26Z azure-iot-edge.aziot-edged[1828]: 2024-04-17T17:51:26Z [INFO] - Using runtime network id azure-iot-edge
2024-04-17T17:51:26Z azure-iot-edge.docker-proxy[1833]: 2024/04/17 17:51:26 socat[1833] E connect(5, AF=1 "/var/run/docker.sock", 22): No such file or directory
2024-04-17T17:51:26Z azure-iot-edge.aziot-edged[1828]: 2024-04-17T17:51:26Z [WARN] - container runtime error
2024-04-17T17:51:26Z azure-iot-edge.aziot-edged[1828]: Caused by:
2024-04-17T17:51:26Z azure-iot-edge.aziot-edged[1828]:     channel closed

This further supports revising the docker-proxy behavior (in the snapcraft forum thread a solution is suggested for this: waiting for /var/run/docker.sock to become available).

Restarting the docker-proxy daemon fixes the issue, but as mentioned before, I don't want to have to do anything on my device for it to boot properly. I think a longer interval between docker-proxy restart attempts would also help: it seems to retry 8 times, but all of the retries happen before docker has made the socket available.

I am testing azure-iot-edge with quite heavy workloads; maybe that's why docker boots slower?


alexclewontin commented Apr 17, 2024

I can weigh in more eventually, but quickly hopping in to provide some context on why the proxy exists and iotedge doesn't talk to docker directly:

The issue is that in all-snap environments (i.e. Ubuntu Core) docker is provided as a snap, and there is no docker group, so you essentially cannot talk to docker.sock unless you are running as UID 0. aziot-edged runs as user snap_aziotedge, so the docker proxy runs as root in the context of the iotedge snap but provides the proxy socket with snap_aziotedge ownership. This lets aziot-edged "escalate" its privileges here without opening a massive hole that would allow any user to talk to the docker socket.
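
Conceptually, the bridge is just this (a sketch of the idea, not the shipped command; the exact socat options in the snap's script may differ):

# Runs as root inside the snap: listen on a socket owned by snap_aziotedge,
# and forward each connection to the root-only /var/run/docker.sock.
socat \
    UNIX-LISTEN:"$SNAP_COMMON/docker-proxy.sock",fork,user=snap_aziotedge,mode=600 \
    UNIX-CONNECT:/var/run/docker.sock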


alexclewontin commented Apr 17, 2024

My naive suggestion would be that edgelet/contrib/snap/socat.sh could try rm -f $SNAP_COMMON/docker-proxy.sock before listening, which would clean up the failure to listen when the file already exists. Making aziot-edged after: docker-proxy might also help with the timing issue. Because it's a simple daemon this may or may not be comprehensive; the right way to handle that would probably be to make docker-proxy a notify-type daemon and use systemd-notify to indicate readiness after successfully listening. However, there are some issues with snap confinement/PID numbers when calling systemd-notify from a shell script, so I'd have to play with that to see if I could make it work.
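
Concretely, the cleanup could be as small as prepending this to socat.sh (a sketch; socat's own unlink-early option on the UNIX-LISTEN address might achieve the same thing):

# Drop a stale socket file left over from an unclean shutdown (e.g. a hard
# reboot), so UNIX-LISTEN doesn't fail with "file exists" on startup.
rm -f "$SNAP_COMMON/docker-proxy.sock"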

CharleeSF (Author) commented

@alexclewontin

Ah, thanks for the explanation about why the proxy exists! :) That makes more sense now.

Also, for the last problem I've had, would adding something like this to socat.sh work?

docker_socket="/var/run/docker.sock"

while [ ! -e "$docker_socket" ]; do
    echo "Docker socket ($docker_socket) does not exist yet. Sleeping"
    sleep 1
done

I think that, together with the after: docker-proxy ordering, might do the trick?

alexclewontin commented

My reservation there is that because docker-proxy is a simple daemon, systemd doesn't know the difference between the wait loop and actively listening on the socket, so even when it enters the wait loop systemd will consider the proxy ready and then try to start aziot-edged. I think I'd rather keep it so socat errors out, because then there's potential for systemd to catch the problem and wait on starting aziot-edged. However, that's still a bit racy, depending on how quickly socat errors out vs. how quickly systemd starts aziot-edged.

The systemd-notify approach would address that race by waiting for the script to actively affirm that it is indeed ready, after successfully listening on the socket.
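
For illustration, a notify-type socat.sh might look roughly like this (a sketch under the big assumption that systemd-notify works from a confined shell script, which is exactly the PID/confinement caveat above; the packaging would also need daemon: notify, and likely NotifyAccess=all since systemd-notify sends from a short-lived child PID):

#!/bin/bash
# Hypothetical notify-type variant: systemd treats the service as ready
# only after we explicitly say so, closing the race described above.
rm -f "$SNAP_COMMON/docker-proxy.sock"
socat UNIX-LISTEN:"$SNAP_COMMON/docker-proxy.sock",fork,user=snap_aziotedge,mode=600 \
      UNIX-CONNECT:/var/run/docker.sock &
# Declare readiness only once the listening socket actually exists, so
# systemd won't start aziot-edged against a socket that isn't there yet.
while [ ! -S "$SNAP_COMMON/docker-proxy.sock" ]; do sleep 0.1; done
systemd-notify --ready
wait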

CharleeSF (Author) commented

I see, I see.

Maybe we can consider making the daemon retry interval a little longer? It is currently very fast, so it stops after a few attempts and doesn't recover.

(Currently this retry is also triggered because aziot-edged fails, but it doesn't recover, since the socket only becomes available after systemd has given up on restarting it.)
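
As a local stopgap until the packaging changes, a systemd drop-in could slow the restarts down and remove the start limit (an untested sketch; I'm assuming /etc/systemd/system is writable on your Core image and that the drop-in survives snap refreshes):

# Hypothetical override: retry every 2 seconds and never hit the
# "start request repeated too quickly" limit.
sudo mkdir -p /etc/systemd/system/snap.azure-iot-edge.docker-proxy.service.d
sudo tee /etc/systemd/system/snap.azure-iot-edge.docker-proxy.service.d/override.conf >/dev/null <<'EOF'
[Unit]
StartLimitIntervalSec=0

[Service]
RestartSec=2
EOF
sudo systemctl daemon-reload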

CharleeSF (Author) commented

To give you an idea of the timeframe... I made a little daemon script that helps me recover from this.

The script:

#!/bin/bash
# This script removes a problematic file preventing snap from starting after reboot.
echo "Running remove-socket.sh"
ls -l /var/snap/azure-iot-edge/common/docker-proxy.sock
echo "Removing /var/snap/azure-iot-edge/common/docker-proxy.sock"
rm -f /var/snap/azure-iot-edge/common/docker-proxy.sock

docker_socket="/var/run/docker.sock"

while [ ! -e "$docker_socket" ]; do
    echo "Docker socket ($docker_socket) does not exist yet. Sleeping"
    sleep 1
done

echo "Sleeping to see if docker-proxy revives itself"
sleep 3

echo "Checking if docker-proxy is in failed state + restart everything if yes"
sudo snap logs azure-iot-edge.docker-proxy | tail -n1 | grep '"/var/run/docker.sock", 22): No such file or directory'
if [ $? -eq 0 ]; then
    echo "Restarting azure-iot-edge because it had failed due to no docker.sock"
    sudo snap restart azure-iot-edge
fi

The output after a reboot:

Apr 17 18:50:49 ubuntu remove-socket.sh[735]: Running remove-socket.sh
Apr 17 18:50:49 ubuntu remove-socket.sh[741]: ls: cannot access '/var/snap/azure-iot-edge/common/docker-proxy.sock': No such file or directory
Apr 17 18:50:49 ubuntu remove-socket.sh[735]: Removing /var/snap/azure-iot-edge/common/docker-proxy.sock
Apr 17 18:50:49 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:50:50 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:50:51 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:50:52 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:50:53 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:50:54 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:50:55 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:50:56 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:50:57 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:50:58 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:50:59 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:51:00 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:51:01 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:51:02 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:51:03 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:51:04 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:51:05 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:51:06 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:51:07 ubuntu remove-socket.sh[735]: Docker socket (/var/run/docker.sock) does not exist yet. Sleeping
Apr 17 18:51:08 ubuntu remove-socket.sh[735]: Sleeping to see if docker-proxy revives itself
Apr 17 18:51:11 snape remove-socket.sh[735]: Checking if docker-proxy is in failed state + restart everything if yes
Apr 17 18:51:11 snape remove-socket.sh[2059]: 2024-04-17T18:51:08Z azure-iot-edge.docker-proxy[1668]: 2024/04/17 18:51:08 socat[1668] E connect(5, AF=1 "/var/run/docker.sock", 22): No such file or directory
Apr 17 18:51:11 snape remove-socket.sh[735]: Restarting azure-iot-edge because it had failed due to no docker.sock
Apr 17 18:51:11 snape remove-socket.sh[2074]: Restarted.

alexclewontin commented

Yeah, certainly. I think setting the retry interval on at least the proxy, if not both daemons, to 1 or more seconds would be a helpful first step.

CharleeSF (Author) commented

Should I make a PR for that, or do you guys prefer to do it?
