pihole doesn't respond to DNS queries without container restart #66
Comments
Thanks for the logs, that's super helpful! I see that on first boot it is unable to update any of the blocklists ( On its own this wouldn't be a problem but it indicates that on first start the pihole container is having connection issues. Have you modified the docker-compose file at all? Can you share some of your env vars?
Also could you try setting the |
Wow, good catch - I'm not sure how I missed that difference. I saw some "Connection Refused" messages but didn't notice that it was for every blocklist and kind of got used to seeing them. No modification to docker-compose, especially while testing these branches for you. Once I get a stable version, I do change some stuff in pihole/Dockerfile to pull in a hosts file and change the font size, but again I've been careful not to do that while testing. Environment variables:
Just added |
No luck after adding |
Just to add - there's nothing special about my network: The RPi3b is wired directly into a Cisco 9200 switch with no auth configured, so the port comes up instantly. |
https://github.com/pi-hole/docker-pi-hole#deprecated-environment-variables
https://github.com/pi-hole/docker-pi-hole#docker-pi-hole-v411 You're not blocking |
Oh, man - Their Docker Hub docs are a bit out of sync with the GitHub docs - https://hub.docker.com/r/pihole/pihole. Also, I'm reading this as
As for the DNS requirement - This is where my knowledge (or lack of) of Balena breaks down. Looking at
Now, I'm guessing that the environment variables in the GUI override these settings (where does that magic happen?). The thing is, I want to run OpenDNS as my upstream DNS (I work for Cisco). How do we reconcile that with the 127.0.0.1 requirement? With the new method, I guess it would be like this:
Now, do I have to wait until Balena Pihole gets an update before I can use that? (1.1.1.1 is fully accessible in my network, I use it all the time to test web redirect portals) |
Balena does not run in host mode by default, but this Pi-Hole stack does in order to avoid port conflicts. balena-pihole/docker-compose.yml Line 23 in 2c028ec
^^ So as you can see from the line above, balena-pihole/docker-compose.yml Lines 15 to 17 in 2c028ec
^^ These lines specify the container DNS, and aren't directly related to Pi-Hole, but according to their docs we want values similar to this to make sure Pi-Hole can start correctly (needs to resolve before it can resolve!). In fact, since these are only used at container startup and not to perform actual runtime queries, you could replace the balena-pihole/docker-compose.yml Lines 21 to 22 in 2c028ec
^^ These
The balena supervisor will apply dashboard environment variables to the services and override existing ones. You can go ahead and start using |
Thanks for the thorough explanation, @klutchell !
Done. So it looks like after all of this we've: a) taught Roddie a lot about how Balena works b) figured out that we need to update some environment variables and docs in the project Not a waste for me at all, but it seems like I'm still at square one and you ended up working this weekend anyway. |
Nah dude, I still got lots of gaming and dog walks in this weekend. We're all good. balena-pihole/docker-compose.yml Lines 15 to 17 in 2c028ec
What if you changed the second entry here to be something else, like google or opendns? Any difference?
Can you clarify these steps a bit? Anything that would impact the host OS that you would know of? I wonder if this would happen with a fresh image, or another device, to isolate the issue to the environment/network or the device. Pro Tip: if you want to learn a ton more about how balena works, these masterclasses are killer and actually pretty fun. |
Ha! As long as the wife isn't blaming me for gaming, we're ok.
No difference.
Harmless, I believe:
Before I even opened the initial display issue I started a new application with a new device to make sure it wasn't something left-over from the last year or three. Happy to try again and only start with the dtoverlay branch. It'll take me a half hour or so.
Excellent! Can't wait to go through them! |
@klutchell Fresh application and device, using the |
Okay, thanks for testing. You're correct that your production modifications are harmless as well. Though if you wanted you could also set For your lan hosts those can be added manually via the Pi-Hole dashboard under Local DNS -> DNS Records but your way is faster since you already have the list hosted somewhere. Let me think about your container startup issue some more, nothing obvious is coming to me right now. But remember that we aren't at square one cause we fixed your PADD display (and mine) and we noticed that some Pi-Hole env vars are being deprecated so we are tracking that. Also a learning experience isn't a waste in this line of work. |
Good to know! Is that documented somewhere? I only noticed those settings in For the LAN hosts there was a whole discussion about it in the comments section of the original balena-pihole blogpost from a few years ago and a few different options were thrown around. It's my entire network, which is about 90 hosts and this is just a hosts file that I have always maintained on my "main" server so that I can quickly find something if needed. It keeps things semi-automated, at least. Related: I learned what Docker was due to that blog post and the related comment thread, and ended up writing a whole blog series about it.
I appreciate it - Let me know if you want me to test anything. Two things are nagging at me:
True true - And I got my 3rd or 4th balena project PR submitted. |
If you'd like to try it without host mode you can replace the following line... balena-pihole/docker-compose.yml Line 22 in c2dd4c6
... with the below lines using your device's LAN IP.
ports:
  - "LAN_IP:53:53/tcp"
  - "LAN_IP:53:53/udp"
This is to prevent the balena engine from binding to 53 on ALL interfaces, because 53 is already in use on the balena VPN interface. Then you should also remove the following lines from the pihole Dockerfile: balena-pihole/pihole/Dockerfile Lines 17 to 20 in c2dd4c6
After that you'll no longer be running in host mode and we can see if the behaviour changes. You can also try resolving this error ...
... by adding |
So, I actually ended up trying this last night (I added 80/443 as well for the GUI), and I couldn't get it past this state: Just redid it using only the 53s and got the same results. I actually had to do a "Purge data" to get it to take another push and reload again. I don't want to spend too much time on the host mode thing. If anything it should make the network connectivity from the container better, not worse. Just a sticking-point with me since networking is actually my happy place.
Got rid of that last night, too - No change. That one has always irritated me. I just noticed that I didn't have dtoverlay defined (it's usually set to |
When you tried it did you specify the LAN IP in your ports? This part is very important to not conflict with balena services running on port 53 and may be why you ended up in a bad state?
ports:
  - "80:80/tcp"
  - "LAN_IP:53:53/tcp"
  - "LAN_IP:53:53/udp" |
I tried both ways with every combination - Maybe there's another port that balena is expecting? I may come back to the host mode thing later once we figure out the problem, but in my mind host mode should make things a bit more open here and I don't want to get too distracted with it. I'm currently in a non-working state and here's what I'm seeing:
Docker doesn't seem to think 53 is bound anywhere else:
Simply restarting the pihole container makes everything work. |
Very odd, sorry if it feels like we are spinning our wheels here but I really want to find out what changes in the container between restarts. Can you check the content of Also, I'm not sure what state you are in but if we are still using host networking what happens if you just remove the following line from your Dockerfile which will allow binding to all interfaces? balena-pihole/pihole/Dockerfile Lines 17 to 20 in c2dd4c6
|
Oh no, I absolutely appreciate this and since it does seem to be network-adjacent, I am enjoying it. Plus I only use two balena apps, so the more I get to tinker with them the better. I think we're getting closer. Check my logic here: ServerIP:53 isn't open to the container. Docker doesn't seem to see a problem, but because we're using host networking, we don't see the explicit port as open. ServerIP:80 works fine, so it's not an IP-binding issue. Reproducing this is just a matter of doing a shutdown and power-on, so it's easy for me to test whatever you can think of.
Commenting out that line puts the pihole container into a very unhappy loop:
|
Okay, that's the behaviour I was expecting and why that line exists. Can you confirm that the interface provided for It seems almost like dnsmasq is binding to the wrong interface on first startup (like localhost only), but finding the correct one on a container restart. That would be very weird though, and somehow related to a race condition on startup or a delayed network event? I don't suppose you have a secondary network where you could connect one of these devices? Or even try wifi to see if the behaviour changes? I know I'm grasping at straws now. |
Confirmed they're both You read my mind on dnsmasq (I think we got there at the same time) - I'm collecting some |
To save you some squinting, there are two differences here that I'm seeing (you may already be ahead of me with your thinking). One of these IPv4 TCP:53 listeners is missing when things are broken (I don't know why there are two):
More importantly, this UDP:53 listener is missing when things are broken:
In both cases, dnsmasq is listening on the If dnsmasq is getting in the way at some point, it's gone by the time I have access to the console. I also don't like that whatever is failing is doing so silently. Are there any other logs that I can collect that might help? |
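For comparing port 53 listeners between the broken and working states, a small filter like the one below may save some squinting. It is only a minimal sketch: it assumes `ss` from iproute2 is available wherever you run it, and `listeners_on_port` is a hypothetical helper name, not something from this project.

```sh
# Hypothetical helper: keep only the listeners on one port.
# Reads `ss -H -tuln` output on stdin and prints "proto local_address:port"
# for every socket whose local port matches the first argument.
listeners_on_port() {
    port="$1"
    awk -v port="$port" '
        {
            # Field 5 of `ss -H -tuln` is "Local Address:Port".
            # Splitting on ":" and taking the last piece also works
            # for IPv6 forms such as [::]:53.
            n = split($5, parts, ":")
            if (parts[n] == port) print $1, $5
        }
    '
}
```

Capturing `ss -H -tuln | listeners_on_port 53 | sort` into a file once in each state and running `diff` on the two files should show exactly which TCP/UDP binding is missing.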
Good findings, I think we are narrowing it down. I've never tried this before now, but there seems to be a lot of info in the debug logs tool in the Pi-hole dashboard. https://discourse.pi-hole.net/t/how-do-i-debug-my-pi-hole-installation/3104 Might be worth grabbing these before and after a restart and comparing them. The farther we get into FTL and dnsmasq territory the less help I'll be, but I'm happy to learn! |
Yeah, I think we're close. I've been trying various things while on calls, but I don't think I've learned anything. That page looks to be more about secure transport of the logs FWIW I have looked at Also nothing in Unsurprisingly, host networking is still nagging at me - I wonder if we can eliminate that, if we'd be able to see the binding in Docker (or an error)? The fact that we can see a difference between |
No need to submit with a token, just click the button to generate them and a surprising amount of information is collected. Not just logs but simple diagnostic tests. Then you can copy them somewhere and compare. |
The balena engine logs can be viewed with |
I bet that's covered in the Masterclass which is tab number 28 in my browser :-) Nothing interesting there - Everyone thinks everyone else is happy.
Duh - That makes a whole lot more sense than what I was thinking. Ok, I'll grab them and go through them tonight/tomorrow and let you know if anything jumps out. Thanks again for all of your help here! BTW - This is from within the container: Not working:
Working:
|
So, a lot of good info in these logs - Nothing too in the weeds and everything is pretty self-explanatory. The only difference is kind of disappointing, though. An extra port "in use" which we already know about. Not working:
Working:
The rest of the debug output is virtually identical. I'll try again tomorrow to make sure I'm not just really tired and missing something, but it seems like whatever is happening is being ignored by everything. A few ideas --
|
Adding the following lines to the bottom of your pihole service definition in the docker-compose file will not delay the container start, but it will delay any services running in the container.
entrypoint: /bin/bash
command:
  - -c
  - "sleep 60 && /s6-init"
Delaying the container start is a bit more tricky but we will see if that is even required.
The following changes worked for my device:
# network_mode: host
ports:
  - 192.168.1.160:53:53/tcp
  - 192.168.1.160:53:53/udp
  - 192.168.1.160:80:80/tcp
Just remove the |
Amazing data collection above, dude. Thanks for doing all that!
Yup, that's my bad. I don't think they ever changed that behaviour, I've just been using that field incorrectly from the start.
Yup.
That was my initial thought, but the eth0 interface exists on container start as we tested with
Do you mean host networking (non-working) for this last one? Feel free to use |
Well, I felt bad for yesterday's silliness, plus it is my problem, and obviously I'm not the type to let things go or take a break from it. It's in my brain until we get it fixed.
Haha - I shouldn't be laughing, but that's kind of funny. I'll get it cleaned up.
Ok, but note the difference in bindings above - pihole-FTL is the only process binding to 10.10.10.10:53. Everything else is binding to We run pihole-FTL with
At best, it would have an APIPA address, but I don't know for sure if Linux does that - I could modify
Yep!
I did - will fix in a sec.
Yeah... I mean, it's not like I wrote an entire blog post about this topic or anything hahahaha - Getting old sucks, man. |
Yeah, no v4 address to bind to, which might explain why there's no error - nothing is actually failing.
pihole-FTL binds to all active IPs on the container, my IP isn't active by the time it runs, and that's that. What would happen if you booted your pi up with the ethernet cable disconnected, connecting it after a couple of minutes? pihole-FTL may fail because the interface is down and then just restart when it comes up, so it might not work. But if you, say, disconnected your DHCP server for a couple of minutes and tested, I bet you'd get the same result. |
You're telling me. I have a wicked hangover today from 3 drinks last night. 3 drinks?! I'm only 35.
I was able to reproduce it by disconnecting my unmanaged switch from the LAN, so eth0 was still present but no LAN access or DHCP server. I even used the active wifi connection to verify that eth0 was present but had no IP, and FTL was not binding to anything on that interface. Upon reconnecting the switch to the LAN, eth0 got an address but FTL did not restart or retry bind or anything, cause how would it know to do that? So, you solved it! We now know what is happening behind the scenes to explain this behaviour and we have several workarounds for cases where it may take more than a few seconds for an IP to be assigned to the selected interface. I'm happy to put a permanent fix in to 10-custom.sh going forward, in case anyone else has network behavior similar to this. As a network guy yourself, what kind of command would be the most reliable, and fast, way to determine if a valid address has been assigned to
while ! [ condition ]
do
    echo "waiting for IPv4 address on ${INTERFACE}..."
    sleep 2
done
Once we have this script in place and I've tested it, I can open a PR and you can retest it from there. I'll also roll in the I expected a bit more fanfare at this point, but maybe I'm still in disbelief. Oh well 🍻 |
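One thing a wait loop like the one above may want is an upper bound, so the container still starts if an address never shows up. Below is a minimal sketch of that idea. `wait_until` is a hypothetical helper, not part of the project or of the eventual PR, and the timeout values are placeholders.

```sh
# wait_until TIMEOUT INTERVAL COMMAND [ARGS...]
# Hypothetical helper: re-run COMMAND every INTERVAL seconds until it
# succeeds or roughly TIMEOUT seconds have passed.
# Returns 0 as soon as COMMAND succeeds, 1 if the timeout is reached first.
wait_until() {
    timeout="$1"
    interval="$2"
    shift 2
    waited=0
    until "$@"; do
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 0
}
```

Usage could look like `wait_until 60 2 sh -c 'ip -o -4 addr show dev "$0" | grep -q .' "${INTERFACE}" || echo "no IPv4 address on ${INTERFACE} after 60s, starting anyway"`, where the `ip` check is just one possible condition.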
I'm 50 this year, I wish I could tell you that it gets better. Haha - You're hungover and I've been thinking about this issue all night.
I just took this a bit further and tried to reproduce it on my non-Balena Rpi running the Not worth worrying about right now, but I would expect it to behave the same way. I'll throw a monitor on it later and figure it out.
So, there are probably a few ways to work around this one, but first I want to say that this seems (to me) to be a pihole-FTL limitation. Why is it binding to active IPs instead of an interface? Is it because of dnsmasq? lighttpd is in the same container and isn't having a problem. Per the docs, in host mode we're feeding it This is partly why I want to reproduce this on my non-Balena Rpi. I looked at the pihole-FTL CLI options and there's really nothing helpful there. Ok... now to workarounds - We have a lot of options (I'm also going to cover the ones that aren't realistic for you or me to fix, just because it's been on my mind all night):
Off my soapbox now, sorry. More realistic workarounds:
There are probably a few others, but those were the first ones I could think of.
Hahaha - Same! I think we're both tired of this, and honestly the problem was in front of our faces the whole time. Nothing really obscure or complex or tricky. But we did it, and I'm kind of relieved that it wasn't really because of anything in the project or that either of us did. |
Possibly because your docker version is different than the one the current balena engine is based on. Perhaps the hand-off of interfaces in host mode has changed in ways I can't begin to understand.
I don't know why lighttpd works. It's binding to all interfaces so maybe
As far as I understand, pihole-FTL is a fork of dnsmasq so that's why we see config files for dnsmasq but pihole-FTL is the actual process. Also I think the binding behaviour is as follows:
Not a terrible idea, you could run it past the pi-hole/docker-pi-hole project and see what they think.
Up until now, I've been treating Interestingly, if we were running in host mode with proper docker and provided This has never impacted us because our bridge interface name is |
That may be a possible reason, but I'll try to confirm in a little bit. I've never run in host mode so I don't know what "normal" looks like. Now it's just a curiosity thing.
This is more of an OS thing than a network thing, so I'm only going by what I've seen over the years, but there seem to be four "levels" of bindings:
Just speculation above.
This is where I think they can optimize and bind to a specific interface. I'm the furthest thing from a developer, so I don't know what the exact possibilities are here when it comes to the code/OS. I guess from a Balena perspective there's always the possibility of conflicting with this:
So... I don't know if there is a "right" answer here that would work universally.
Yeah, I want to do that once I can get my other RPi to fail. I'd rather open an issue/feature request based on an actual use-case vs. a theoretical issue, and I'm trying to avoid any finger-pointing.
Ok, that makes perfect sense - I think what you've got in the PR so far looks great and should scale well since we're forcing Are we worried about
Oh that is interesting - Just tested it here. That's clever. It's really not that different than our new test :-) Thanks again for all of your help this (and last) week! Let me know when you want me to test the changes when you're happy with them. Feel free to roll them into |
Yeah, it fails in the main branch, but it's obvious in the logs why it fails. I do it all the time when I test on wifi and forget to change the interface name. Something along the lines of
I think that PR is ready to go, just leave a comment on the PR once you've tested it and I'll merge. Your screen won't work while testing this PR of course.
Once we merge
No thank YOU good sir. It's been a pleasure and don't be a stranger in my repos. |
The |
Postscript for you, @klutchell. While discussing another project's issue here: wouterdebie/locast2tuner#32 I started having flashbacks to this issue and decided to check how balena does things. Sure enough, balena will start without the network being online where Docker will wait. balena's systemd file looks like this:
I'm wondering if we had these two options along with everything else if we would have ever run into this problem:
This is what Docker's systemd file looks like for a comparison:
|
You are likely correct, but we don't want to use
|
Hmmm interesting - So, this may reflect my lack of understanding of some of the other stuff that balena does. There are user containers that would set/modify the host network settings? How typical is this? (All of these questions are just my curiosity.) Not that my scenario appears to be overly common, either, but would it make sense to have base images for "Internet" apps and base images for IoT/Other? |
We can optionally expose the host dbus, plus a number of other things like the host engine socket, to perform host OS tasks. I see it in support all the time, so it's not that rare.
I don't see the need for multiple images. If a user application requires certain features to be available at startup, it's usually trivial to add a condition to the container, vs maintaining an entire separate image for a minor change in a service file. |
You should do those masterclasses and ask questions like this on the forums so others can see your answers and we can improve our docs! |
Yeah, that makes perfect sense - Thanks for the explanation! I love this: "Warning: Making changes to the networking of a device is extremely dangerous and can lead to a device being unrecoverable, so exercise caution with any of the following."
I really should. Look man, I've had the tab open in my browser since you told me about it :-) |
Re-opening this for discussion and to possibly re-introduce the fix. This issue occurs when the pi-hole device is connected to a switch that is running a proper implementation of the spanning-tree protocol (STP.) With STP there is a delay (Forwarding Delay) when the interface comes up of ~15 seconds while STP factors in the new "path" and makes sure that it will not cause any bridging loops before putting the interface into forwarding mode. This delay impacts how quickly DHCP gets an IP address for the interface. Because we run in Docker "host" mode for scalability reasons, this results in pihole-FTL not being able to bind to the interface when it is launched because there is not yet an IP address on it. A workaround on these switches is to enable "STP Portfast" (if supported), which bypasses the STP forwarding delay on the interface and puts the port into a forwarding state immediately. While enabling portfast is a best practice when you know an interface is connected to a host (vs. another switch/potential loop), it is not enabled by default on most commercial switches, and a lot of people leave it alone unless the forwarding delay is causing problems. Consumer switches generally do run STP, but automatically set their ports to "portfast" mode. The assumption with these switches is that they will not be connected to other switches and therefore there should not be any bridging loops in the topology. In my case, I moved my pi-hole to a consumer switch a couple of years ago, so when this fix was removed, I wasn't impacted. I brought my pi-hole home yesterday and connected it to my Cisco switch and the issue has resurfaced. |
Any improvements over this shell to wait for IP assignments?
while [ -z "$(ip -o -4 addr show dev "${INTERFACE}")" ]
do
    echo "Waiting for IPv4 address on ${INTERFACE}..."
    sleep 5
done
Maybe something that also supports IPv6, or does some other network magic to determine when it's okay to bind? |
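On the IPv6 question, one possible approach is to accept any global-scope address, v4 or v6. A bare `ip -o addr show` test would pass too early on a v6-capable interface, because the link-local `fe80::` address appears without any DHCP or router involvement. The sketch below assumes the usual one-line-per-address format of `ip -o addr show`; `has_global_addr` is a hypothetical name, and whether "scope global" is the right readiness signal for FTL is an assumption, not something tested in this thread.

```sh
# Hypothetical helper: succeed if the `ip -o addr show dev IFACE` output
# on stdin contains at least one IPv4 or IPv6 address with global scope.
# Link-local addresses (IPv6 fe80::/10, IPv4 169.254.0.0/16) are reported
# with "scope link" and are therefore ignored.
has_global_addr() {
    awk '
        ($3 == "inet" || $3 == "inet6") && / scope global/ { found = 1 }
        END { exit !found }
    '
}
```

The loop condition would then become something like `while ! ip -o addr show dev "${INTERFACE}" | has_global_addr`.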
I don't run v6 at home myself, but it would make sense to have this work for both v4 and v6 if we can do it without too much complexity. I'm not sure how many people out there are running only v6, but I'm sure they exist. The only thing we can really test for is a proper IP address (v4/v6) being assigned to the interface, but maybe we can ping a hostname ( One of the other alternatives we tried above was to just force a 30 second wait before starting FTL which seemed to work and also wouldn't rely on anything or have any kinds of unpredictable failures. I just checked Adguard to see if I have the same problem there, but I think the fact that we bind with I can't believe it's been 2 1/2 years since this discussion. A lot of this is coming back to me now. |
I thought about the ping of a public URL but I think Pi-hole itself is designed to start even with no internet access. Maybe we can ping the default gateway? That should always exist and maybe fails when STP is pending? Do you want to try it? |
We don't have a default gateway if we don't have an IP address, so we don't know what to ping. |
Maybe that's the test then? Check routes for a default gateway until one exists? |
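A default-route check could be sketched roughly as follows. This assumes the usual `ip route show default` output, where each default route starts with the word `default` and names its interface after `dev`. `has_default_route` is a hypothetical helper, and the optional interface filter is there because in host mode another interface (wifi, for example) may already hold a default route.

```sh
# Hypothetical helper: succeed if the routing output on stdin contains a
# default route. With an interface name as the first argument, only a
# default route that goes out via that interface counts.
has_default_route() {
    iface="${1:-}"
    awk -v iface="$iface" '
        $1 == "default" {
            if (iface == "") found = 1
            for (i = 2; i < NF; i++)
                if ($i == "dev" && $(i + 1) == iface) found = 1
        }
        END { exit !found }
    '
}
```

Feeding it both address families, for example `{ ip -4 route show default; ip -6 route show default; } | has_default_route "${INTERFACE}"`, would cover v4-only and v6-only networks with the same test.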
That should work. Is it much different processing wise than what we're doing with Stepping-back: I'm the only one to ever report this as an issue, since I'm sure most people don't use $30k switches in their homes. Do you think this is a problem we need to solve? |
Honestly no, but I wasn't going to stand in your way if you wanted to pursue it. We can leave this issue open indefinitely, or close it with the STP workaround highlighted as the solution. |
Sounds fair - I might update the docs to put a caveat in as well instructing people to enable portfast (and pointing them to this novel) if they encounter issues. I figure if people have a switch capable of portfast then they'll know what it means. |
After a full restart of my RPi3 (shutdown and power off/on), pihole doesn't respond to DNS queries until I restart the pihole container again. Logs/testing output below:
Logs from power-on:
DNS query:
Logs following pihole container restart:
DNS query following pihole container restart: