Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Dane validity failed - use validating resolver for DANE existence subtest #1358

Open
je3f0o opened this issue Mar 25, 2024 · 14 comments
Open
Assignees
Labels
bug Unexpected or unwanted behaviour of current implementations

Comments

@je3f0o
Copy link

je3f0o commented Mar 25, 2024

Hello,

My domain's DANE validity has failed for both the website and email servers. In previous years, my domain consistently scored 100 and was even featured on the Hall of Fame page. However, this year, something may have changed, and I may have overlooked it or I forget something I changed. Unfortunately, my DANE validation continues to fail, and I'm unsure of the issue. While other tools shows me positive results, I'm perplexed by what's wrong with my domain. Below is the test result:
https://internet.nl/mail/teachmecoding.club/1194082/

Can you help point me in the right direction? I'm using Postfix BTW...

$ postconf mail_version
mail_version = 3.3.0
@bwbroersma
Copy link
Collaborator

bwbroersma commented Mar 25, 2024

Interesting, I also checked with https://www.huque.com/bin/danecheck-smtp and it fails on the TLSA DNS record:

GetTLSA: bad response code to TLSA query _25._tcp.teachmecoding.club.: SERVFAIL

I also had some issues (no response on $ dig +short _25._tcp.teachmecoding.club TLSA), and I see you already are on it, https://dnsviz.net/d/_25._tcp.teachmecoding.club/dnssec/:
_tcp.teachmecoding.club is a NXDOMAIN, while _25._tcp.teachmecoding.club exists, that is ❌ in terms of DNS.

The DANE existence is done with a direct query, while the DANE validity check is done with ldns, probably this one performs a different DNS check, resulting in this correct output, but indeed tricky to understand correctly.

with subprocess.Popen(
[
settings.LDNS_DANE,
"-c",
"/dev/stdin", # Read certificate chain from stdin
"-n", # Do not validate hostname
"-T", # Exit status 2 for PKIX without (secure) TLSA records
"-r",
settings.IPV4_IP_RESOLVER_INTERNAL_VALIDATING, # Use internal unbound resolver
"-f",
settings.CA_CERTIFICATES, # CA file
"verify",
hostname,
str(port),
],

So this is the test that done:

$ openssl s_client -showcerts -crlf -starttls smtp -connect teachmecoding.club:25 -verify_quiet </dev/null 2>/dev/null \
| ldns-dane -c /dev/stdin -n -T -r 9.9.9.9 verify teachmecoding.club 25
dane_query: The queried resource records were bogus

I'm not sure why, but I do sometimes see it succeeds (also on https://www.huque.com/bin/danecheck-smtp):

$ openssl s_client -showcerts -crlf -starttls smtp -connect teachmecoding.club:25 -verify_quiet </dev/null 2>/dev/null \
| ldns-dane -c /dev/stdin -n -T -r 9.9.9.9 verify teachmecoding.club 25
CN=teachmecoding.club dane-validated successfully

So it would help a lot if the output of ldns is included in the failure of DANE here. Note to first solve the DNS issue, and then maybe wait 300 seconds (this is your negative response caching TTL) because #1340 is not yet deployed.

@je3f0o
Copy link
Author

je3f0o commented Mar 25, 2024

Ok, thank you.

I will try to fix DNS issue by contacting to NS authorities. By the way I don't trust the tool https://www.huque.com/bin/danecheck-smtp. Because this site gives me random results. First time it always fail, then i try again DANE Authentication Successful. Basically 50/50 results.

@bwbroersma
Copy link
Collaborator

I'm seeing the same mixed results on my command line when running ldns-dane with 9.9.9.9, so it might be due to the resolver (and e.g. some caching).

@baknu
Copy link
Contributor

baknu commented Mar 25, 2024

The failure probably has to do with "qname minimization"(RFC 7816) that the Internet.nl-resolver for the "DANE validity" subtest does. In addition see RFC 8020 which states:

This document states clearly that when a DNS resolver receives a
response with a response code of NXDOMAIN, it means that the domain
name which is thus denied AND ALL THE NAMES UNDER IT do not exist.

The nameservers of teachmecoding.club do not handle "qname minimization" correctly. See DNSviz and the relevant below test result quote. This seems to be the reason why the subtest fails.

A query for _25._tcp.teachmecoding.club results in a NOERROR response, while a query for its ancestor,
_tcp.teachmecoding.club, returns a name error (NXDOMAIN), which indicates that subdomains of
_tcp.teachmecoding.club, including _25._tcp.teachmecoding.club, don't exist.

Besides, this case also makes clear that something seems to have changed (unintentionally) in the Internet.nl code base. Previously, such a domain namely would already fail on the "DANE existence" subtest. So, apparently the Internet.nl Unbound (that is used for the "DANE existence" subtest) does not do "qnmame minimization" anymore, while ldns (that is used for the "DANE validity" subtest) still does. For comparison:

I suggest to:

  1. Enable qname mimimization again on Unbound
  2. Make this kind of failure more transpartant/explicit for users which was already proposed in Making failure reason due to not supporting qname minimization explicit #534

Btw not sure why Internet.nl uses ldns. Alternatively, Internet.nl could probably also query DANE TLSA records with Unbound and validate these records with OpenSSL. See: https://www.openssl.org/docs/manmaster/man3/SSL_CTX_dane_enable.html

@bwbroersma
Copy link
Collaborator

bwbroersma commented Mar 25, 2024

I agree to fix #534.
The current unbound configuration for the 2 resolvers (resolver-permissive.conf.template and resolver-validating.conf.template) don't explicitly set qname-minimisation: yes, but it should default to yes, also the ldns uses the same (validating) resolver.

@baknu
Copy link
Contributor

baknu commented Mar 25, 2024

@gthess Can you please check what is happening here? Thanks!

@gthess
Copy link
Collaborator

gthess commented Mar 26, 2024

I believe this is because of the following change:

self._ub_ctx.set_fwd(settings.IPV4_IP_RESOLVER_INTERNAL_PERMISSIVE)
The Unbound context here used to be a validating resolver. When in validation mode, Unbound would end qname-minimization when a signed NXDOMAIN is encountered during resolution. This is the case with this domain where the end result is BOGUS and previous versions error out on DANE existence.

After the above change, queries are forwarded to a non-validating Unbound which upon encountering the NXDOMAIN, skips qname-minimization as best effort in order to try and resolve the query; which it does.

ldns-dane then is configured to use the validating resolver which provides the BOGUS answer.

@baknu
Copy link
Contributor

baknu commented Mar 26, 2024

Thanks @gthess

Interesting that batch is showing the same behaviour as single test. See: https://batch.internet.nl/mail/teachmecoding.club/7964015/#control-panel-28

I expected this change to be part of the Docker release, and thus batch to not show this behaviour (as it currently does not run on Docker). But maybe this change was already present in or backported/copied to the non-Docker. @mxsasha Could you check?

@bwbroersma
Copy link
Collaborator

bwbroersma commented Mar 26, 2024

Pre-docker there was:

if settings.ENABLE_BATCH and settings.CENTRAL_UNBOUND:
self._ub_ctx.set_fwd(f"{settings.CENTRAL_UNBOUND}")

Defined by:

UNBOUND_ADDRESS = getenv("UNBOUND_ADDRESS", "127.0.0.1@4321")

# Central unbound where all the pyunbounds forward to. Used for better
# cache performance while batch testing. Format is "ip@port".
# Leave empty ("") for disabling the feature; NOT recommended.
# By default unbound runs on port 53 (...), so there is a different unbound setup recommended, but not documented
# You can verify it running by: dig internet.nl @localhost
CENTRAL_UNBOUND = UNBOUND_ADDRESS

The CENTRAL_UNBOUND was dropped / simplified to IPV4_IP_RESOLVER_INTERNAL_PERMISSIVE:

Since batch currently has the same behavior as docker, probably the unbound conf was changed prior to the dockerization, and the dockerization (this change) itself is correct.

The thing to check is the (central) unbound config on batch, and when it was last changed.

From the documentation it reads the new IPV4_IP_RESOLVER_INTERNAL_VALIDATING it only used for ldns-dane (see code in comment above), and that seems to be the case indeed:

1. A non-validating resolver, used for DNS resolving by almost all tests. As we have our own DNSSEC validation test, we want to see bogus responses as well.
2. A validating resolver, used to validate DANE records through ldns-dane.

@mxsasha
Copy link
Collaborator

mxsasha commented Apr 2, 2024

Following from discussion with @bwbroersma: the DANE checks should use the validating resolver, and not report bogus records, as they do not "count" for DANE. More optimal: report any bogus response from any check in the DNSSEC result, but this is a big overhaul.

@bwbroersma bwbroersma changed the title Dane validity failed Dane validity failed - use validating resolver for DANE existence subtest Apr 2, 2024
@bwbroersma bwbroersma added the bug Unexpected or unwanted behaviour of current implementations label Apr 2, 2024
@bwbroersma
Copy link
Collaborator

Part of the larger overhaul is in:

Agreed to only fix the validating resolver for the DANE existence subtest for now. Only question is, can the resolver change be done as backport (for v1.8.6) or should it be v1.9?

@mxsasha
Copy link
Collaborator

mxsasha commented Apr 3, 2024

I have looked into the code, and all the resolving is shared, and the DANE lookup may be performed both as part of do_mail_get_servers() or by the dane check itself. I'm wondering if we should not use the non-validating resolver anyways. I know this was discussed, but I'm not sure about the reasoning to set it up this way. However, I would prefer to pick this up properly as part of #1378, so that we can have one clear way to interact with DNS.

@baknu
Copy link
Contributor

baknu commented Apr 3, 2024

I'm wondering if we should not use the non-validating resolver anyways. I know this was discussed, but I'm not sure about the reasoning to set it up this way.

I believe the reasoning was that even if a domain is DNSSEC bogus, Internet.nl can still run all the other subtests that need DNS lookups. See for example: https://internet.nl/site/servfail.nl/2719028/
@gthess: Do you remember more about the reasoning?

Maybe a bit comparable: note that with RPKI invalids Internet.nl is not able to run other subtests (that require connectivity) because our hosting provider is filtering. See for example: https://internet.nl/site/invalid.rpki.isbgpsafeyet.com/2719039/.

Since DNSSEC bogus is an exception that should be resolved first before checking other test results, it seems indeed a good idea to simplify our setup and use only a validating resolver,

@gthess
Copy link
Collaborator

gthess commented Apr 5, 2024

@gthess: Do you remember more about the reasoning?

The non-validating Unbound context was there to give answers skipping DNSSEC validation altogether.
This was only for the prechecks IIRC.
A validating Unbound context does validation and returns both the data and the secure status, which in some tests like the mail auth test the secure bit is not considered just the data IIRC.

If all the contexts will be replaced by a stub, the stub needs to make sure to activate the CD flag (Checking Disabled) on its queries so that it gets also the bogus data back for other tests to work (like the prechecks and the mail auth test). The AD flag on the response would indicate if DNSSEC validation succeeded or failed.
(I'll add this last comment also to the stub replacing issue.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Unexpected or unwanted behaviour of current implementations
Development

No branches or pull requests

5 participants