Make NetworkPolicy FQDN rule work with applications that cache DNS resolutions with hard-coded TTL #6229
Labels
kind/design
Describe what you are trying to solve
I was working with @Scoobed to debug an issue with a NetworkPolicy FQDN rule in a cluster where a Pod intermittently failed to connect to the FQDN. After realizing the application was Java-based, I found that in many cases the JVM enables a DNS cache that uses a configured TTL, as below, instead of respecting the TTL value in the DNS response.
How the problem typically happened:
@Scoobed also confirmed that the problem went away when using NodeLocal DNS, which is likely because the buildpack disables the JVM DNS cache when it detects that the DNS server is a link-local address: https://github.com/paketo-buildpacks/libjvm/blob/79182aa17fa3e49424f511dd0070dd66bdc1a3ec/helper/link_local_dns.go#L34-L64
As this may affect many Java-based applications, and not all clusters enable NodeLocal DNS, I have been thinking about how to better support this scenario without requiring all application developers to disable their DNS cache or to respect the TTL in DNS responses (which is even harder than the former). One solution I came up with is to provide a configuration like `minTTL`, which determines the minimum time for which DNS resolutions are cached. If a DNS response's TTL is less than `minTTL`, the actual TTL in the datapath will be `minTTL`. Note that the TTL cache is not per Pod, so `minTTL` will be a global configuration that applies to all Pods (I can't think of any actual drawback other than slightly higher memory consumption). Even if different Pods have different hard-coded DNS cache TTLs, `minTTL` can simply be set to the maximum of them. Typically it could just be set to the default JVM DNS TTL or a larger value.
Note that this still requires the application DNS cache not to cache forever.
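The clamping rule described above can be sketched as follows. This is only an illustration of the proposal, not existing code: the function name `effectiveTTL`, the use of seconds as the unit, and the 30-second value are assumptions made for the example.

```go
package main

import "fmt"

// effectiveTTL returns the TTL (in seconds) that the datapath would apply to a
// resolved address under the proposed setting: the TTL from the DNS response,
// raised to minTTL when the response TTL is lower.
// Hypothetical helper; a minTTL of 0 leaves the response TTL unchanged.
func effectiveTTL(responseTTL, minTTL uint32) uint32 {
	if responseTTL < minTTL {
		return minTTL
	}
	return responseTTL
}

func main() {
	// Assumed value: chosen to be at least as large as the longest
	// application-level DNS cache TTL in the cluster.
	const minTTL = 30

	for _, ttl := range []uint32{5, 30, 300} {
		fmt.Printf("response TTL %ds -> datapath TTL %ds\n", ttl, effectiveTTL(ttl, minTTL))
	}
}
```

With this rule, an address learned from a response with a 5-second TTL would stay allowed for 30 seconds, which covers an application that keeps using the cached address for up to 30 seconds. Responses with a TTL at or above `minTTL` are unaffected.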
Describe how your solution impacts user flows
The cluster admin should configure `minTTL` to be equal to or larger than the maximum TTL value of the application DNS caches.
Alternative solutions that you considered
Require users to disable application-level DNS cache.
Test plan
e2e: validate that applications with a DNS cache can stably access the target FQDN while the FQDN's resolution changes frequently.