Make NetworkPolicy FQDN rule work with applications that cache DNS resolutions with hard-coded TTL #6229

Open
tnqn opened this issue Apr 16, 2024 · 2 comments
Labels
kind/design Categorizes issue or PR as related to design.

Comments

@tnqn
Member

tnqn commented Apr 16, 2024

Describe what you are trying to solve

I was working with @Scoobed to debug an issue with a NetworkPolicy FQDN rule in a cluster where a Pod intermittently failed to connect to the FQDN. After realizing the application was Java-based, I found that in many cases the JVM enables a DNS cache that uses a configured TTL, as described below, instead of respecting the TTL value in the DNS response.

networkaddress.cache.ttl

Specified in java.security to indicate the caching policy for successful name lookups from the name service. The value is specified as integer to indicate the number of seconds to cache the successful lookup.
A value of -1 indicates "cache forever". The default behavior is to cache forever when a security manager is installed, and to cache for an implementation specific period of time, when a security manager is not installed.

How the problem typically happened:

  1. The Pod made a DNS request for an FQDN.
  2. Antrea inspected the DNS response and associated the FQDN with the IPs in the response.
  3. The Pod connected to one of the IPs successfully because Antrea was aware of the IP.
  4. Antrea refreshed the FQDN resolution and found that the previous IPs were no longer present in the response, so it removed them once the TTL set in the previous response expired.
  5. The Pod tried to connect to the FQDN again but skipped the DNS query because of its own cache (with a fixed TTL). The connection failed because the cached IP was no longer allowed by the datapath.
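
The sequence above can be reproduced with a small toy model. This is only a sketch for illustration: `FqdnAllowList`, `AppDnsCache`, `simulate`, the IP addresses, and the TTL values are all hypothetical and do not correspond to Antrea's actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class FqdnAllowList:
    """Hypothetical stand-in for the datapath: maps IP -> expiry time (seconds)."""
    expiry: dict = field(default_factory=dict)

    def on_dns_response(self, now, ips, ttl):
        # Allow each IP in the response until the response TTL runs out.
        for ip in ips:
            self.expiry[ip] = now + ttl

    def is_allowed(self, now, ip):
        return self.expiry.get(ip, float("-inf")) > now


@dataclass
class AppDnsCache:
    """Application-side cache that ignores the TTL in the DNS response."""
    fixed_ttl: int
    ip: str = None
    expires_at: float = float("-inf")

    def resolve(self, now, current_ip):
        """Return (ip, queried); queried is False when the cache answered."""
        if now < self.expires_at:
            return self.ip, False
        self.ip, self.expires_at = current_ip, now + self.fixed_ttl
        return self.ip, True


def simulate(response_ttl=5, app_ttl=30):
    allow, app = FqdnAllowList(), AppDnsCache(fixed_ttl=app_ttl)

    # Steps 1-3: the Pod resolves the FQDN and connects.
    ip, queried = app.resolve(0, "10.0.0.1")
    if queried:
        allow.on_dns_response(0, [ip], response_ttl)
    first_ok = allow.is_allowed(0, ip)

    # Step 4: the refresh returns a different IP; the old one is left to expire.
    allow.on_dns_response(response_ttl, ["10.0.0.2"], response_ttl)

    # Step 5: the Pod connects again, possibly from its own cache.
    now = 2 * response_ttl
    ip, queried = app.resolve(now, "10.0.0.2")
    if queried:
        allow.on_dns_response(now, [ip], response_ttl)
    return first_ok, (ip, queried, allow.is_allowed(now, ip))


print(simulate())           # application caches for 30s
print(simulate(app_ttl=0))  # application does not cache
```

With a 30-second application cache, the second connection reuses the stale IP without issuing a query and is denied. With caching disabled, the Pod queries again and the new IP is allowed.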

@Scoobed also confirmed that the problem went away when using NodeLocal DNS. This is likely due to special handling in the buildpack, which disables the JVM DNS cache when it detects that the DNS server is a link-local address: https://github.com/paketo-buildpacks/libjvm/blob/79182aa17fa3e49424f511dd0070dd66bdc1a3ec/helper/link_local_dns.go#L34-L64

As this may affect many Java-based applications, and not all clusters enable NodeLocal DNS, I have been thinking about how to better support this scenario without requiring all application developers to disable their DNS cache or to respect the TTL in the DNS response (which is even harder than the former). One solution I came up with is to provide a configuration option like minTTL, which sets the minimum time for which DNS resolutions are cached. If a DNS response's TTL is less than minTTL, the actual TTL used in the datapath will be minTTL. Note that the TTL cache is not per Pod, so minTTL would be a global configuration that applies to all Pods (I can't think of any actual drawback other than slightly higher memory consumption). Even if different Pods have different hard-coded DNS cache TTLs, minTTL can simply be set to the maximum of them. Typically it could be set to the JVM's default DNS TTL or a larger value.
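
A minimal sketch of the proposed clamping, assuming TTLs are expressed in whole seconds. The function names `effective_ttl` and `still_allowed` and the example values are hypothetical and are not taken from the Antrea code base.

```python
def effective_ttl(response_ttl: int, min_ttl: int = 0) -> int:
    """TTL the datapath would use: the response TTL, raised to min_ttl if lower."""
    return max(response_ttl, min_ttl)


def still_allowed(now: int, learned_at: int, response_ttl: int, min_ttl: int = 0) -> bool:
    """Whether an IP learned at learned_at is still allowed at time now."""
    return now < learned_at + effective_ttl(response_ttl, min_ttl)


# An IP learned at t=0 from a response with a 5s TTL, checked at t=10:
print(still_allowed(10, 0, 5))              # no minTTL configured
print(still_allowed(10, 0, 5, min_ttl=30))  # minTTL=30 keeps the IP allowed
```

Responses whose TTL already exceeds minTTL are unaffected, so the option only lengthens short-lived entries.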

Note that this still requires that the application DNS cache does not cache forever.

Describe how your solution impacts user flows

The cluster admin should configure minTTL to be equal to or larger than the maximum TTL value of the application DNS caches.
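
As a rough illustration of how an admin might pick the value, the sketch below takes the application cache TTLs known for a cluster and returns the smallest minTTL that covers them. The helper name `recommend_min_ttl` and the sample TTLs are hypothetical. Negative values are treated as "cache forever", following the `networkaddress.cache.ttl` convention quoted above, and are rejected because no finite minTTL can cover them.

```python
def recommend_min_ttl(app_cache_ttls):
    """Smallest minTTL (seconds) that covers every application DNS cache TTL."""
    ttls = list(app_cache_ttls)
    if any(ttl < 0 for ttl in ttls):
        # A negative TTL means the application caches forever.
        raise ValueError("an application caches DNS forever; minTTL cannot cover it")
    return max(ttls, default=0)


print(recommend_min_ttl([30, 60, 10]))
```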

Alternative solutions that you considered

Require users to disable application-level DNS cache.

Test plan

e2e: validate that applications with a DNS cache can reliably access the target FQDN while the FQDN resolution changes frequently.

@tnqn tnqn added the kind/design Categorizes issue or PR as related to design. label Apr 16, 2024
@tnqn
Member Author

tnqn commented Apr 16, 2024

@jianjuns @antoninbas @Dyanngg please let me know what you think of the proposal.

@tnqn tnqn changed the title Make NetworkPolicy FQDN rule work with applications that cache DNS resolutions themselves Make NetworkPolicy FQDN rule work with applications that cache DNS resolutions with hard-coded TTL Apr 16, 2024
@jianjuns
Contributor

The proposal sounds good to me.

@tnqn tnqn added this to the Antrea v2.1 release milestone Apr 19, 2024