-
**Steps to reproduce the problem**

Make any kind of external DNS request through Mastodon--through the iOS app or the web application, for example.

**Expected behaviour**

The 1491 domains my own instance has connected with should be crawled.

**Actual behaviour**

All 1491 are listed as having failed. Other symptomatic behavior:
**Detailed description**

I've been running my own instance for the past month and it's been working fine, up until one of the 5 nodes in the kubernetes cluster died. Since then (and even since rebooting the node), my instance has been unable to load attached media from other users, and the sidekiq logs show a huge crush of resolution errors.

Point being: it's not specific to the other instances, but rather a symptom of something wrong on mine. I even ran
I modified the coredns ConfigMap to include logging, and here's what I see from running
(Many, many more entries follow, all looking qualitatively the same as above.) The URLs it's hitting seem... odd. The first few in particular, which involve the local postgres database, are especially strange: the domain suffix is repeated. For reference, its actual FQDN looks like this:

On the mastodon-sidekiq pod, the
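If I understand the resolver behavior correctly, the repeated suffix is what glibc-style search-list expansion produces when a name has fewer dots than `ndots`. Here's a quick sketch of that candidate-generation logic; the search domains, namespace, and service name below are illustrative, not copied from my cluster:

```python
# Toy model of glibc-style search-list expansion. The search domains,
# namespace ("mastodon"), and service name are illustrative placeholders.
SEARCH = [
    "mastodon.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
]
NDOTS = 5  # the default kubelet writes into pod resolv.conf

def candidate_queries(name: str) -> list[str]:
    """Return the FQDNs a stub resolver would try, in order."""
    if name.endswith("."):           # absolute name: no expansion
        return [name.rstrip(".")]
    suffixed = [f"{name}.{s}" for s in SEARCH]
    if name.count(".") >= NDOTS:     # "enough dots": try as-is first
        return [name] + suffixed
    return suffixed + [name]         # otherwise search suffixes come first

for q in candidate_queries("mastodon-postgresql.mastodon.svc.cluster.local"):
    print(q)
```

Since even a full in-cluster FQDN has only four dots--below `ndots:5`--every search suffix gets appended before the name is tried as-is, which would explain doubled queries like `...svc.cluster.local.mastodon.svc.cluster.local` in the coredns log.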
So Mastodon doesn't seem able to resolve anything, BUT I can run basic curl/wget commands from within the Mastodon pods and... they work just fine. I picked an instance at random from that list of 1491 and ran
Here's what coredns sees when I run that command:
which, given the final entry, suggests it eventually figures out the correct FQDN.

Basically, it seems like the problem ABSOLUTELY is an inability to resolve anything outside of my instance, even though the pods themselves appear able to access things just fine.

**Specifications**

Mastodon 4.0.2 (via helm)
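As a side note on the curl/wget check above: the same C-library resolver path those tools use can be exercised from inside a pod with nothing but the Python stdlib. A minimal sketch (the hostname you pass in would be one of the remote instances; `localhost` is just a sanity check):

```python
import socket

def can_resolve(host: str) -> bool:
    """True if the libc resolver path (the one curl/wget use) finds an address."""
    try:
        socket.getaddrinfo(host, None)
        return True
    except socket.gaierror:
        return False

# "localhost" resolves via /etc/hosts, so it should succeed even with DNS
# broken; swapping in a remote instance's hostname tests the full DNS path.
print(can_resolve("localhost"))
```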
-
Significantly updated the ticket with more details on what coredns is seeing, including the iterations of FQDNs it attempts while resolving outgoing requests.
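For reference, turning on query logging generally means adding the `log` plugin to the Corefile held in the coredns ConfigMap. Roughly like this--the surrounding plugin set varies by distribution, so treat it as a sketch rather than my exact config:

```
.:53 {
    errors
    log        # added: emit one line per DNS query served
    health
    kubernetes cluster.local in-addr.arpa ip6.arpa {
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
    }
    forward . /etc/resolv.conf
    cache 30
}
```

After editing the ConfigMap, restarting the coredns pods (or waiting for the `reload` plugin, if it's enabled) picks up the change.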
-
This is about as much as I could boil it down to try and illustrate what's going on:
Here's what coredns saw:
And, again, what coredns saw:
In both cases, there's a cascading resolution through the search-domain suffixes. I have no explanation for this behavior, since it's taking place on the same pod. Any insights would be greatly appreciated.
-
I've no idea why, but: running

```
apt-get remove tailscale
```

and rebooting each node in sequence seems to have fixed things.

It's confusing to me, because everything was working fine for the past four weeks--tailscale was installed that whole period--and things only stopped working when one of the five nodes of the cluster died. But removing tailscale seems to have solved things, or at least allowed them to self-heal, inasmuch as kubernetes does.

There is some media that is more stubborn about re-loading--aggressive caching?--but the primary issue seems to have been resolved.
-
@magsol Recently I had a similar issue. After some investigation, I came to a preliminary conclusion: the DNS server returned the wrong RCode. This is the log from coredns when I use wget to request misskey.io:
When Mastodon requests misskey.io:

(Machine translation is used.)
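To illustrate why a wrong RCode would matter: if the upstream answers SERVFAIL where NXDOMAIN is expected, a client that aborts on SERVFAIL never gets past the search-list candidates, while one that keeps iterating reaches the bare name. A toy model--the per-name RCodes and both client policies are hypothetical, purely to show the mechanism:

```python
# Standard DNS response codes (RFC 1035 numbering).
NOERROR, SERVFAIL, NXDOMAIN = 0, 2, 3

def fake_dns(fqdn: str, upstream_broken: bool) -> int:
    """Hypothetical server: only the bare name resolves; a broken
    upstream returns SERVFAIL for everything else instead of NXDOMAIN."""
    if fqdn == "misskey.io":
        return NOERROR
    return SERVFAIL if upstream_broken else NXDOMAIN

def resolve(candidates, upstream_broken, strict):
    for fqdn in candidates:
        rcode = fake_dns(fqdn, upstream_broken)
        if rcode == NOERROR:
            return fqdn
        if strict and rcode == SERVFAIL:
            return None  # strict client gives up on SERVFAIL
        # lenient client treats any failure as "try the next candidate"
    return None

candidates = [
    "misskey.io.mastodon.svc.cluster.local",  # search-suffix candidates
    "misskey.io.svc.cluster.local",
    "misskey.io.cluster.local",
    "misskey.io",                              # bare name, tried last
]
print(resolve(candidates, upstream_broken=True, strict=False))  # misskey.io
print(resolve(candidates, upstream_broken=True, strict=True))   # None
```

Under this model, wget's lenient walk succeeds while a stricter client fails on the very first suffixed query--matching the symptom of curl/wget working while Mastodon cannot resolve anything.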