Iotedge fails to make docker-proxy.sock after reboot (file exists) in Ubuntu Core #7255
Comments
Also see the discussion here: https://forum.snapcraft.io/t/possible-to-run-scripts-as-root-on-startup-on-ubuntu-core/39580/7 And the comment by mborzecki1:
Hi @CharleeSF, I tried to repro locally and couldn't, and we haven't seen this in our internal testing. But I see from the forum thread that the problem is fairly well understood. @alexclewontin can you comment on the suggested approach in the forum thread?
Hey @damonbarry, thanks for looking into this! I am also not always able to reproduce it. The strange thing is that once I can reproduce it on a setup, it happens on every reboot, but not every setup has it. Since the setup takes quite a lot of time, I haven't tried to find a 100% reproduction scenario. I have, however, also seen the docker-proxy fail because /var/run/docker.sock is not available yet. May I ask why edged doesn't talk to /var/run/docker.sock directly?
Hey @damonbarry, Is there any progress on this? I just wanted to mention that I also regularly see that docker seems slower/later in booting than azure-iot-edge, resulting in behavior like this:
This further supports revising the docker-proxy behavior (the snapcraft forum thread suggests a solution: wait for /var/run/docker.sock to become available). Restarting the docker-proxy daemon fixes the issue, but as mentioned before, I don't want to have to do anything on my device for it to boot properly. I think a longer interval between restarts of the docker-proxy daemon would also help. It seems to do 8 retries, but they all happen before docker has made the socket available. I am testing azure-iot-edge with quite heavy workloads; maybe that's why docker boots more slowly?
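For illustration, the "wait for /var/run/docker.sock" idea could look roughly like the sketch below. This is not the code from the forum thread or from the snap; the function name `wait_for_socket` and the timeout values are assumptions made up for this example.

```shell
#!/bin/sh
# Sketch only: block until a Unix socket exists, or give up after a timeout.
# wait_for_socket is a hypothetical helper, not part of the azure-iot-edge snap.
#
# Usage: wait_for_socket PATH [TIMEOUT_SECONDS]
wait_for_socket() {
    sock_path="$1"
    timeout="${2:-30}"   # assumed default; tune for slow-booting devices
    waited=0
    # -S is true only if the path exists and is a socket.
    while [ ! -S "$sock_path" ]; do
        if [ "$waited" -ge "$timeout" ]; then
            echo "timed out waiting for $sock_path" >&2
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    return 0
}

# Hypothetical use at the top of a proxy start script:
# wait_for_socket /var/run/docker.sock 60 || exit 1
```

Note that a loop like this only delays the proxy; on its own it does not tell systemd when the proxy is actually ready.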
I can weigh in more eventually, but quickly hopping in to provide some context on why the proxy exists and iotedge doesn't talk to docker directly: The issue is that in all-snap environments (i.e. Ubuntu Core) docker is provided as a snap, and there is no docker group, so you essentially cannot talk to docker.sock if you are running as UID != 0. aziot-edged runs as user snap_aziotedge, so the docker proxy runs as root in the context of the iotedge snap but provides the proxy socket with snap_aziotedge ownership. This lets aziot-edged "escalate" its privileges here without opening a massive hole that would allow any user to talk to the docker socket.
My naive suggestion would be that |
Ah, thanks for the explanation about why the proxy exists! :) That makes more sense now. Also, for the last problem I've had, would adding something like this to socat.sh work?
I think that together with the |
My reservation there is that because docker-proxy is a simple daemon, systemd doesn't know the difference between the wait loop and actively listening on the socket, so even when it enters the wait loop systemd will consider the proxy ready and then try to start aziot-edged. I think I'd rather keep it so socat errors out, because then there's potential for systemd to catch the problem and wait on starting aziot-edged. However, that's still a bit racy, depending on how quickly socat errors out vs. how quickly systemd starts aziot-edged. The systemd-notify approach would address that race by waiting for the script to actively affirm that it is indeed ready, after successfully listening on the socket.
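To make the systemd-notify idea concrete, here is a minimal sketch of what the readiness step could look like. It assumes the proxy would be switched to a notify-type daemon, which the thread has not confirmed; the helper name `notify_when_listening`, the `NOTIFY_CMD` override, and the socat invocation in the comments are all illustrative assumptions, not the snap's actual script.

```shell
#!/bin/sh
# Sketch only: report readiness to systemd once the listener's socket exists.
# NOTIFY_CMD is a made-up override so the function can be dry-run without systemd.
NOTIFY_CMD="${NOTIFY_CMD:-systemd-notify}"

# Usage: notify_when_listening SOCKET_PATH LISTENER_PID [TIMEOUT_SECONDS]
notify_when_listening() {
    sock_path="$1"
    listener_pid="$2"
    timeout="${3:-30}"   # assumed default
    waited=0
    while [ ! -S "$sock_path" ]; do
        # Give up early if the listener (e.g. socat) has already exited.
        if ! kill -0 "$listener_pid" 2>/dev/null; then
            return 1
        fi
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
    # A stale socket file from a previous boot would also pass the -S test,
    # so confirm the listener is still alive before claiming readiness.
    kill -0 "$listener_pid" 2>/dev/null || return 1
    "$NOTIFY_CMD" --ready
}

# Hypothetical use in a start script (socat options are illustrative):
# socat UNIX-LISTEN:"$PROXY_SOCK",fork UNIX-CONNECT:/var/run/docker.sock &
# notify_when_listening "$PROXY_SOCK" "$!" 30 || exit 1
# wait
```

One caveat: because `systemd-notify` runs as a child of the script rather than as the main process, the unit may need to accept notifications from child processes for the message to be honored.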
I see. Maybe we can make the daemon retry interval a little longer? I think the retries currently happen very quickly, so systemd stops after a few attempts and the service doesn't recover. (Currently this retry is also triggered because aziot-edged fails, but it doesn't recover because the socket becomes available after systemd has given up on restarting it.)
To give you an idea of the timeframe... I made a little daemon script that helps me recover from this. The script:
The output after a reboot:
Yeah, certainly. I think setting the retry interval on at least the proxy, if not both daemons, to 1 or more seconds would be a helpful first step.
Should I make a PR for that, or do you guys prefer to do it?
Expected Behavior
If I have set up and installed the azure-iot-edge snap, I should be able to reboot the device and azure-iot-edge should start without any issues.
Current Behavior
azure-iot-edge fails to start with the following logs:
Steps to Reproduce
Provide a detailed set of steps to reproduce the bug.
I verify if it is working by checking the logs
Context (Environment)
snap changes
Output of
iotedge check
Seems to be OK other than docker-proxy.sock not being accessible, which is the bug I'm reporting
Device Information
Runtime Versions
Workaround
sudo rm /var/snap/azure-iot-edge/common/docker-proxy.sock && sudo snap restart azure-iot-edge
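A slightly more defensive variant of the same workaround might look like the sketch below. It only removes the path if it really is a socket and does nothing if the file is already gone. The helper name `remove_stale_socket` is made up for this example, and the commands still need to run as root.

```shell
#!/bin/sh
# Sketch only: remove a leftover Unix socket, but refuse to delete anything else.
# remove_stale_socket is a hypothetical helper, not part of the snap.
#
# Usage: remove_stale_socket PATH
remove_stale_socket() {
    sock_path="$1"
    if [ -S "$sock_path" ]; then
        # Leftover socket from a previous boot: safe to remove.
        rm -f "$sock_path"
    elif [ -e "$sock_path" ]; then
        # Something exists at the path but it is not a socket; leave it alone.
        echo "$sock_path exists but is not a socket, not removing" >&2
        return 1
    fi
    return 0
}

# Hypothetical use, run as root:
# remove_stale_socket /var/snap/azure-iot-edge/common/docker-proxy.sock \
#     && snap restart azure-iot-edge
```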
Notes
Not sure if it is reproducible 100% of the time, but it occurs often enough to be a problem. Especially since my device is supposed to be able to boot without human interaction.