Linux test-dns resolveTxt failure #1148

Open
squeek502 opened this issue Jun 26, 2021 · 11 comments

@squeek502
Member

squeek502 commented Jun 26, 2021

I was getting this locally and now it's happening on the CI too:

# Starting Test: dns - resolveTxt
  /home/runner/work/luvit/luvit/tests/libs/tap.lua:81: /home/runner/work/luvit/luvit/tests/test-dns.lua:86: assertion failed!
  stack traceback:
  	[C]: in function 'error'
  	/home/runner/work/luvit/luvit/tests/libs/tap.lua:81: in function </home/runner/work/luvit/luvit/tests/libs/tap.lua:64>
  	[C]: in function 'xpcall'
  	/home/runner/work/luvit/luvit/tests/libs/tap.lua:64: in function 'run'
  	/home/runner/work/luvit/luvit/tests/libs/tap.lua:165: in function 'tap'
  	/home/runner/work/luvit/luvit/tests/run.lua:42: in function 'fn'
  	[string "bundle:deps/require.lua"]:310: in function 'require'
  	/home/runner/work/luvit/luvit/main.lua:128: in function </home/runner/work/luvit/luvit/main.lua:20>
not ok 21 dns - resolveTxt

EDIT: Locally the error I'm getting is Maximum attempts reached

@squeek502
Member Author

squeek502 commented Jun 26, 2021

Something weird is going on here:

  • The test right after also does a dns.resolveTxt('google.com') and that works fine
  • Changing the test from using google.com to using nodejs.org fixes it (this is the domain the node test-dns uses)
  • It only happens on Linux, not Mac (it's skipped on the Windows CI)
  • It also happens when using older luvi versions (tested with 2.7.6 and it fails there too)
  • EDIT: It gets fixed if I change the order of the resolveTxt test (i.e. move it to the bottom of the file)

I'm not sure I have the knowledge necessary for debugging this one properly. A quick fix would be to change the domain that it looks up the TXT records for.

@Bilal2453
Contributor

For me, I am getting the following when building Luvit on Linux:

Uncaught Error: /mnt/bilal/home/Desktop/luvit/deps/dns.lua:690: attempt to perform arithmetic on local 'len_lo' (a nil value)
stack traceback:
        /mnt/bilal/home/Desktop/luvit/deps/dns.lua:690: in function 'handler'
        /mnt/bilal/home/Desktop/luvit/deps/core.lua:248: in function 'emit'
        ...bilal/home/Desktop/luvit/deps/stream/stream_readable.lua:172: in function 'push'
        /mnt/bilal/home/Desktop/luvit/deps/net.lua:123: in function </mnt/bilal/home/Desktop/luvit/deps/net.lua:117>
        [builtin#37]: at 0x004e1840
        /mnt/bilal/home/Desktop/luvit/init.lua:49: in function </mnt/bilal/home/Desktop/luvit/init.lua:47>
        [C]: in function 'xpcall'
        /mnt/bilal/home/Desktop/luvit/init.lua:47: in function 'fn'
        [string "bundle:deps/require.lua"]:310: in function <[string "bundle:deps/require.lua"]:266>
make: *** [Makefile:12: test] Error 255

I have traced a tiny bit of this, and found that (line 114 from net):

function Socket:_read(n)
  local onRead

  function onRead(err, data)
    timer.active(self)
    if err then
      return self:destroy(err)
    elseif data then
      p(3, n, data) -- data = '\000'
      self:push(data)
    else
      self:push(nil)
      self:emit('_socketEnd')
    end
  end

Notice that the data being passed is \000. Now back to dns (line 685):

    function onData(msg)
      local len_hi, len_lo, len, answers

      len_hi = byte(msg, 1)
      len_lo = byte(msg, 2)
      len = lshift(len_hi, 8) + len_lo -- len_lo == nil

Since msg is \000, string.byte('\000', 2) == nil, so the arithmetic on len_lo fails. I have tested the Readable stream class a bit, and it works fine.
Now the data seems to be coming directly from luv: (net line 133):

uv.read_start(self._handle, onRead)

so I am not entirely sure why this single test is the one getting this kind of data.
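A minimal sketch of how a length-prefixed reader could tolerate a one-byte chunk like this, assuming the stream carries DNS-over-TCP messages (each preceded by a two-byte big-endian length). The names newFramer and onMessage are hypothetical and are not part of luvit's dns.lua; this only illustrates buffering until both length bytes and the full body have arrived, and it does not claim to be the root cause or the fix for this issue.

```lua
-- Sketch only: accumulate chunks until a whole length-prefixed message exists.
-- `newFramer` and `onMessage` are hypothetical names, not luvit APIs.
local function newFramer(onMessage)
  local buffer = ''
  return function(chunk)
    buffer = buffer .. chunk
    -- Only read the length once both prefix bytes are present,
    -- so string.byte never returns nil for the second byte.
    while #buffer >= 2 do
      local len_hi, len_lo = string.byte(buffer, 1, 2)
      local len = len_hi * 256 + len_lo
      if #buffer < 2 + len then
        break -- body not fully received yet; wait for more data
      end
      onMessage(buffer:sub(3, 2 + len))
      buffer = buffer:sub(3 + len)
    end
  end
end

-- Usage: the first chunk is a lone '\000', as seen in the trace above.
local messages = {}
local feed = newFramer(function(msg)
  messages[#messages + 1] = msg
end)
feed('\000')      -- only the high length byte; nothing is emitted
feed('\005hel')   -- low length byte plus part of the body
feed('lo')        -- rest of the body; one message is emitted
```

The trade-off is an extra string concatenation per chunk, which is usually acceptable for small DNS payloads.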

@Bilal2453
Contributor

I've also confirmed:

Changing the test from using google.com to using nodejs.org fixes it (this is the domain the node test-dns uses)

makes it somehow work just fine without getting this kind of weird chunk.

@Bilal2453
Contributor

Bilal2453 commented Jun 26, 2021

I have just noticed that it doesn't have to be a different domain; just changing it to www.google.com seems to work. That's quite weird, but I guess it is not totally broken.

Worth mentioning: requesting google.com without www will return a 301 - Moved. I think all tests that use google.com should be changed to www.google.com until we find the exact reason behind this weirdness.

Update: I've changed all tests to use www.google.com and that made Test dns - resolveTxtTimeout Order fail with

  /mnt/bilal/home/Desktop/luvit/tests/libs/tap.lua:81: /mnt/bilal/home/Desktop/luvit/tests/test-dns.lua:98: assertion failed!
  stack traceback:
        [C]: in function 'error'
        /mnt/bilal/home/Desktop/luvit/tests/libs/tap.lua:81: in function </mnt/bilal/home/Desktop/luvit/tests/libs/tap.lua:64>
        [C]: in function 'xpcall'
        /mnt/bilal/home/Desktop/luvit/tests/libs/tap.lua:64: in function 'run'
        /mnt/bilal/home/Desktop/luvit/tests/libs/tap.lua:165: in function 'tap'
        /mnt/bilal/home/Desktop/luvit/tests/run.lua:42: in function 'fn'
        [string "bundle:deps/require.lua"]:310: in function 'require'
        /mnt/bilal/home/Desktop/luvit/main.lua:128: in function </mnt/bilal/home/Desktop/luvit/main.lua:20>

... oh lol. The rest of the tests seem fine with it now, including resolveTxt.

Update 2: Changing the tests that use google.com, except resolveTxtTimeout, seems like a working workaround.

Update 3: With some help from Nameless, we found that the failing response (google.com) looks like:

XL\131\128\000\001\000\a\000\000\000\000\006google\003com\000\000\016\000\001\192\f\000\016\000\001\000\000\f\148\000<;facebook-domain-verification=22rm551cu4k0ab0bxsw536tlds4h95\192\f\000\016\000\001\000\000\f\148\000$#v=spf1 include:_spf.google.com ~all\192\f\000\016\000\001\000\000\f\148\000+*apple-domain-verification=30afIBcvSuDV2PLX\192\f\000\016\000\001\000\000\f\148\000EDgoogle-site-verification=TV9-DBe4R80X4v0M4U_bd_J9cpOJM0nikft0jAgjmsQ\192\f\000\016\000\001\000\000\f\148\000A@globalsign-smime-dv=CDYX+XFHUw2wml6/Gb8+59BsH31KzUr6c1l2BPvqKX8=\192\f\000\016\000\001\000\000\f\148\000EDgoogle-site-verification=wD8N7i1JTNTkezJ49swvWW48f8_9xveREV4oB-0Hf5o\192\f\000\016\000\001\000\000\f\148\000.-docusign=1b0a6754-49b1-4db5-8540-d2c12664b289

when a successful one (www.google.com) is similar to:

\175{\129\128\000\001\000\001\000\001\000\000\003www\006google\003com\000\000\016\000\001\192\f\000\005\000\001\000\000\000\000\000\018\015forcesafesearch\192\016\192\016\000\006\000\001\000\000\000<\000&\003ns1\192\016\tdns-admin\192\016\022\1885a\000\000\003\132\000\000\003\132\000\000\a\b\000\000\000<
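To make the two dumps easier to compare, here is a small sketch that decodes the fixed 12-byte DNS header (layout per RFC 1035) from the first bytes of each dump. parseHeader is a hypothetical helper, not something from luvit's dns.lua. By this reading, the google.com response has the TC (truncated) flag set with 7 answers, while the www.google.com response does not; whether that is what trips the test is not established here.

```lua
-- Sketch only: decode the fixed 12-byte DNS header (RFC 1035, section 4.1.1).
-- `parseHeader` is a hypothetical helper, not part of luvit.
local function parseHeader(packet)
  assert(#packet >= 12, 'a DNS header is 12 bytes')
  local b = { string.byte(packet, 1, 12) }
  -- Read a big-endian 16-bit value starting at byte index i.
  local function u16(i)
    return b[i] * 256 + b[i + 1]
  end
  local flags = u16(3)
  return {
    id = u16(1),
    qr = math.floor(flags / 0x8000) % 2 == 1, -- response bit
    tc = math.floor(flags / 0x0200) % 2 == 1, -- truncation bit
    rcode = flags % 16,                       -- response code
    qdcount = u16(5),
    ancount = u16(7),
    nscount = u16(9),
    arcount = u16(11),
  }
end

-- First 12 bytes of each dump quoted above.
local failing = parseHeader('XL\131\128\000\001\000\a\000\000\000\000')
local working = parseHeader('\175{\129\128\000\001\000\001\000\001\000\000')

print('google.com     tc:', failing.tc, 'answers:', failing.ancount)
print('www.google.com tc:', working.tc, 'answers:', working.ancount)
```

Plain arithmetic is used instead of a bit library so the sketch runs unchanged on LuaJIT and stock Lua.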

@Bilal2453
Contributor

@squeek502 do you think we can use that as a workaround for now? Or should we figure out the weirdness happening here first? Or maybe just change the resolveTxt domain to something else, perhaps www.google.com, and leave the rest on google.com?

@squeek502
Member Author

squeek502 commented Jun 26, 2021

Ok, so the servers being used are essentially global, so each test affects the next.

In its current place:

# Starting Test: resolveTxt
'udp_iter'	{ port = 53, host = '127.0.0.53' }
'udp_iter'	{ port = 53, host = '127.0.0.53' }
'udp_iter'	{ port = 53, host = '127.0.0.53' }
'udp_iter'	{ port = 53, host = '127.0.0.53' }
'udp_iter'	{ port = 53, host = '127.0.0.53' }
  Maximum attempts reached
not ok 9 resolveTxt

When moved to the bottom:

# Starting Test: resolveTxt
'udp_iter'	{ port = 53, tcp = false, host = '8.8.8.8' }
ok 15 resolveTxt

The servers get set to DEFAULT_SERVERS, but on Luvit init dns.loadResolvers() gets called which sets servers to the system's dns resolver (hence the 127.0.0.53:53 server).

So, one quick fix would be to just call dns.setDefaultServers() in that test. It's still strange that this is failing, but maybe that's a Linux bug? Or a Libuv bug? I'll look more into it.
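The order dependence can be illustrated with a stand-in module that keeps its server list in module-level state. This is a mock, not luvit's dns module: resolver, currentHost, and runTest are hypothetical names, and only the idea (shared state set at init, reset by one test, inherited by the next) mirrors what is described above.

```lua
-- Sketch only: a mock resolver with module-level server state.
-- This is NOT luvit's dns module; all names here are illustrative.
local resolver = {}
local DEFAULT_SERVERS = { { host = '8.8.8.8', port = 53 } }
local servers = DEFAULT_SERVERS

function resolver.setServers(list)
  servers = list
end

function resolver.setDefaultServers()
  servers = DEFAULT_SERVERS
end

function resolver.currentHost()
  return servers[1].host
end

-- Simulate init replacing the defaults with the system resolver.
resolver.setServers({ { host = '127.0.0.53', port = 53 } })

-- Run a test body and record which server it would have queried.
local seen = {}
local function runTest(name, body)
  body()
  seen[name] = resolver.currentHost()
end

runTest('early test', function() end)   -- inherits the init-time servers
runTest('pinned test', function()
  resolver.setDefaultServers()          -- pins servers, like the quick fix
end)
runTest('later test', function() end)   -- inherits the pinned servers

for _, name in ipairs({ 'early test', 'pinned test', 'later test' }) do
  print(name, seen[name])
end
```

Pinning the servers inside the test makes its result independent of where it sits in the file, at the cost of also changing what every later test sees.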

EDIT: In the meantime, #1149

squeek502 added a commit to squeek502/luvit that referenced this issue Jun 26, 2021
Still not totally sure why this test was failing, but this should fix it until we understand more about it.

See luvit#1148
@Bilal2453
Contributor

Sounds good, will test if this works on my machine now

@Bilal2453
Contributor

Bilal2453 commented Jun 26, 2021

This is getting weirder, resolveTxt is indeed now working on my machine, though it is failing at dns - resolveMx with:

oh sorry, that looks like my bad. I just refetched the test file and it successfully passed all tests.
This should do for now

@squeek502
Member Author

squeek502 commented Jun 26, 2021

Some more weirdness:

  • When I first boot I get Server fault (IIRC) as the error, but I can never get that again and instead get Maximum attempts reached. Clearing the DNS cache doesn't change anything either
  • systemd-resolve --type=TXT google.com gives google.com: resolve call failed: Query timed out after a long while, so this might be a more general systemd dns resolver issue

@Bilal2453
Contributor

This might be due to Google not handling DNS of type TXT well enough:

  • google.com seems to be always invalid, and always returns a 301:
systemd-resolve --type=TXT google.com
google.com: resolve call failed: Received invalid reply
  • www.google.com seems not to support TXT, at the very least:
systemd-resolve --type=TXT www.google.com
www.google.com: resolve call failed: Name 'forcesafesearch.google.com' does not have any RR of the requested type

I suggest we could change Google to something else everywhere in the tests maybe?

@creationix
Member

I think it's fine to work around whatever weirdness they are doing on the main domain. At their scale, it could easily be a custom DNS server that responds differently depending on any number of factors (time of day, geolocation, client version, etc.).

zhaozg pushed a commit to zhaozg/luvit that referenced this issue Oct 11, 2022
Still not totally sure why this test was failing, but this should fix it until we understand more about it.

See luvit#1148