Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Delayed dns resolve after upgrade to v1.60.0 #7186

Open
sebastianjonasson opened this issue May 3, 2024 · 5 comments
Open

Delayed dns resolve after upgrade to v1.60.0 #7186

sebastianjonasson opened this issue May 3, 2024 · 5 comments
Assignees

Comments

@sebastianjonasson
Copy link

sebastianjonasson commented May 3, 2024

NOTE: if you are reporting is a potential security vulnerability or a crash,
please follow our CVE process at
https://github.com/grpc/proposal/blob/master/P4-grpc-cve-process.md instead of
filing an issue here.

Please see the FAQ in our main README.md, then answer the questions below
before submitting your issue.

What version of gRPC are you using?

1.60.0

What version of Go are you using (go version)?

1.22

What operating system (Linux, Windows, …) and version?

Ubuntu 22.04

What did you do?

Upgrade from v1.59.0 to v1.60.0 for grpc client service.
Clients now take 30 seconds to re-resolve upstream services on connection error (server shutting down).

What did you expect to see?

During deploy of upstream grpc server I expect to see no request failing due to delayed DNS resolve in the grpc client. New IP should instantly be resolved after first connection refused error: transport.

Example (from v1.59.0):

2024/05/02 13:45:19 INFO: [core] [Channel #1 SubChannel #4] Subchannel Connectivity change to IDLE
2024/05/02 13:45:19 INFO: [roundrobin] roundrobinPicker: Build called with info: {map[]}
2024/05/02 13:45:19 INFO: [core] [Channel #1] Channel Connectivity change to IDLE
2024/05/02 13:45:19 INFO: [core] [Channel #1 SubChannel #4] Subchannel Connectivity change to CONNECTING
2024/05/02 13:45:19 INFO: [core] [Channel #1 SubChannel #4] Subchannel picks a new address "10.0.0.20:8000" to connect
2024/05/02 13:45:19 INFO: [core] [Channel #1] Channel Connectivity change to CONNECTING
2024/05/02 13:45:19 WARNING: [core] [Channel #1 SubChannel #4] grpc: addrConn.createTransport failed to connect to {Addr: "10.0.0.20:8000", ServerName: "our.interal.domain:8000", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 10.0.0.20:8000: connect: connection refused"
2024/05/02 13:45:19 INFO: [core] [Channel #1 SubChannel #4] Subchannel Connectivity change to TRANSIENT_FAILURE, last error: connection error: desc = "transport: Error while dialing: dial tcp 10.0.0.20:8000: connect: connection refused"
2024/05/02 13:45:19 INFO: [core] [Channel #1] Channel Connectivity change to TRANSIENT_FAILURE
2024/05/02 13:45:19 INFO: [core] [Channel #1] Resolver state updated: {
"Addresses": [
{
"Addr": "10.0.0.40:8000",
"ServerName": "",
"Attributes": null,
"BalancerAttributes": null,
"Metadata": null
}
],
"Endpoints": [
{
"Addresses": [
{
"Addr": "10.0.0.40:8000",
"ServerName": "",
"Attributes": null,
"BalancerAttributes": null,
"Metadata": null
}
],
"Attributes": null
}
],
"ServiceConfig": null,
"Attributes": null
} ()
2024/05/02 13:45:19 INFO: [core] [Channel #1 SubChannel #7] Subchannel created
2024/05/02 13:45:19 INFO: [core] [Channel #1 SubChannel #4] Subchannel Connectivity change to SHUTDOWN
2024/05/02 13:45:19 INFO: [core] [Channel #1 SubChannel #4] Subchannel deleted
2024/05/02 13:45:19 INFO: [core] [Channel #1] Channel Connectivity change to IDLE
2024/05/02 13:45:19 INFO: [core] [Channel #1 SubChannel #7] Subchannel Connectivity change to CONNECTING
2024/05/02 13:45:19 INFO: [core] [Channel #1 SubChannel #7] Subchannel picks a new address "10.0.0.40:8000" to connect
2024/05/02 13:45:19 INFO: [core] [Channel #1] Channel Connectivity change to CONNECTING
2024/05/02 13:45:19 INFO: [core] [Channel #1 SubChannel #7] Subchannel Connectivity change to READY
2024/05/02 13:45:19 INFO: [roundrobin] roundrobinPicker: Build called with info: {map[SubConn(id:7):{{Addr: "10.0.0.40:8000", ServerName: "", }}]}
2024/05/02 13:45:19 INFO: [core] [Channel #1] Channel Connectivity change to READY

What did you see instead?

Tested with v1.60.0 and v1.63.2.

During deploy of upstream grpc server it's a 30 second delay between the first failing request until the new IP is resolved (see logs below).

I've tested this with both single and multiple instance setup for both grpc client and server. We've seen this error for services using a grpc client.

By testing each commit in v1.60.0 I found that the patch from PR #6668 was the first commit that caused this behavior.

Logs output from during deploy of upstream grpc server (v1.60.0):

2024/05/02 12:42:42 INFO: [core] [Channel #1 SubChannel #2] Subchannel Connectivity change to IDLE
2024/05/02 12:42:42 INFO: [roundrobin] roundrobinPicker: Build called with info: {map[]}
2024/05/02 12:42:42 INFO: [core] [Channel #1] Channel Connectivity change to IDLE
2024/05/02 12:42:42 INFO: [core] [Channel #1 SubChannel #2] Subchannel Connectivity change to CONNECTING
2024/05/02 12:42:42 INFO: [core] [Channel #1 SubChannel #2] Subchannel picks a new address "10.0.0.140:8000" to connect
2024/05/02 12:42:42 INFO: [core] [Channel #1] Channel Connectivity change to CONNECTING
2024/05/02 12:42:42 WARNING: [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {Addr: "10.0.0.140:8000", ServerName: "our.interal.domain:8000", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 10.0.0.140:8000: connect: connection refused"
2024/05/02 12:42:42 INFO: [core] [Channel #1 SubChannel #2] Subchannel Connectivity change to TRANSIENT_FAILURE, last error: connection error: desc = "transport: Error while dialing: dial tcp 10.0.0.140:8000: connect: connection refused"
2024/05/02 12:42:42 INFO: [core] [Channel #1] Channel Connectivity change to TRANSIENT_FAILURE
2024/05/02 12:42:43 INFO: [core] [Channel #1 SubChannel #2] Subchannel Connectivity change to IDLE, last error: connection error: desc = "transport: Error while dialing: dial tcp 10.0.0.140:8000: connect: connection refused"
2024/05/02 12:42:43 INFO: [core] [Channel #1 SubChannel #2] Subchannel Connectivity change to CONNECTING
2024/05/02 12:42:43 INFO: [core] [Channel #1 SubChannel #2] Subchannel picks a new address "10.0.0.140:8000" to connect
2024/05/02 12:42:43 WARNING: [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {Addr: "10.0.0.140:8000", ServerName: "our.interal.domain:8000", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 10.0.0.140:8000: connect: connection refused"
2024/05/02 12:42:43 INFO: [core] [Channel #1 SubChannel #2] Subchannel Connectivity change to TRANSIENT_FAILURE, last error: connection error: desc = "transport: Error while dialing: dial tcp 10.0.0.140:8000: connect: connection refused"
2024/05/02 12:42:44 INFO: [core] [Channel #1 SubChannel #2] Subchannel Connectivity change to IDLE, last error: connection error: desc = "transport: Error while dialing: dial tcp 10.0.0.140:8000: connect: connection refused"
2024/05/02 12:42:44 INFO: [core] [Channel #1 SubChannel #2] Subchannel Connectivity change to CONNECTING
2024/05/02 12:42:44 INFO: [core] [Channel #1 SubChannel #2] Subchannel picks a new address "10.0.0.140:8000" to connect
2024/05/02 12:42:44 WARNING: [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {Addr: "10.0.0.140:8000", ServerName: "our.interal.domain:8000", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 10.0.0.140:8000: connect: connection refused"
2024/05/02 12:42:44 INFO: [core] [Channel #1 SubChannel #2] Subchannel Connectivity change to TRANSIENT_FAILURE, last error: connection error: desc = "transport: Error while dialing: dial tcp 10.0.0.140:8000: connect: connection refused"
2024/05/02 12:42:47 INFO: [core] [Channel #1 SubChannel #2] Subchannel Connectivity change to IDLE, last error: connection error: desc = "transport: Error while dialing: dial tcp 10.0.0.140:8000: connect: connection refused"
2024/05/02 12:42:47 INFO: [core] [Channel #1 SubChannel #2] Subchannel Connectivity change to CONNECTING
2024/05/02 12:42:47 INFO: [core] [Channel #1 SubChannel #2] Subchannel picks a new address "10.0.0.140:8000" to connect
2024/05/02 12:42:47 WARNING: [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {Addr: "10.0.0.140:8000", ServerName: "our.interal.domain:8000", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 10.0.0.140:8000: connect: connection refused"
2024/05/02 12:42:47 INFO: [core] [Channel #1 SubChannel #2] Subchannel Connectivity change to TRANSIENT_FAILURE, last error: connection error: desc = "transport: Error while dialing: dial tcp 10.0.0.140:8000: connect: connection refused"
2024/05/02 12:42:51 INFO: [core] [Channel #1 SubChannel #2] Subchannel Connectivity change to IDLE, last error: connection error: desc = "transport: Error while dialing: dial tcp 10.0.0.140:8000: connect: connection refused"
2024/05/02 12:42:51 INFO: [core] [Channel #1 SubChannel #2] Subchannel Connectivity change to CONNECTING
2024/05/02 12:42:51 INFO: [core] [Channel #1 SubChannel #2] Subchannel picks a new address "10.0.0.140:8000" to connect
2024/05/02 12:42:51 WARNING: [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {Addr: "10.0.0.140:8000", ServerName: "our.interal.domain:8000", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 10.0.0.140:8000: connect: connection refused"
2024/05/02 12:42:51 INFO: [core] [Channel #1 SubChannel #2] Subchannel Connectivity change to TRANSIENT_FAILURE, last error: connection error: desc = "transport: Error while dialing: dial tcp 10.0.0.140:8000: connect: connection refused"
2024/05/02 12:42:56 INFO: [core] [Channel #1 SubChannel #2] Subchannel Connectivity change to IDLE, last error: connection error: desc = "transport: Error while dialing: dial tcp 10.0.0.140:8000: connect: connection refused"
2024/05/02 12:42:56 INFO: [core] [Channel #1 SubChannel #2] Subchannel Connectivity change to CONNECTING
2024/05/02 12:42:56 INFO: [core] [Channel #1 SubChannel #2] Subchannel picks a new address "10.0.0.140:8000" to connect
2024/05/02 12:43:12 INFO: [core] [Channel #1] Resolver state updated: {
"Addresses": [
{
"Addr": "10.0.0.160:8000",
"ServerName": "",
"Attributes": null,
"BalancerAttributes": null,
"Metadata": null
}
],
"Endpoints": [
{
"Addresses": [
{
"Addr": "10.0.0.160:8000",
"ServerName": "",
"Attributes": null,
"BalancerAttributes": null,
"Metadata": null
}
],
"Attributes": null
}
],
"ServiceConfig": null,
"Attributes": null
} ()
2024/05/02 12:43:12 INFO: [core] [Channel #1 SubChannel #4] Subchannel created
2024/05/02 12:43:12 INFO: [core] [Channel #1 SubChannel #2] Subchannel Connectivity change to SHUTDOWN
2024/05/02 12:43:12 INFO: [core] [Channel #1 SubChannel #2] Subchannel deleted
2024/05/02 12:43:12 INFO: [core] [Channel #1] Channel Connectivity change to IDLE
2024/05/02 12:43:12 INFO: [core] [Channel #1 SubChannel #4] Subchannel Connectivity change to CONNECTING
2024/05/02 12:43:12 INFO: [core] [Channel #1 SubChannel #4] Subchannel picks a new address "10.0.0.160:8000" to connect
2024/05/02 12:43:12 INFO: [core] [Channel #1] Channel Connectivity change to CONNECTING
2024/05/02 12:43:12 WARNING: [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {Addr: "10.0.0.140:8000", ServerName: "our.interal.domain:8000", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 10.0.0.140:8000: operation was canceled"
2024/05/02 12:43:12 INFO: [core] [Channel #1 SubChannel #4] Subchannel Connectivity change to READY
2024/05/02 12:43:12 INFO: [roundrobin] roundrobinPicker: Build called with info: {map[SubConn(id:4):{{Addr: "10.0.0.160:8000", ServerName: "", }}]}
2024/05/02 12:43:12 INFO: [core] [Channel #1] Channel Connectivity change to READY

Common client setup for reference:

keepAliveParameters := keepalive.ClientParameters{
  Time:                10 * time.Second,
  Timeout:             time.Second,    
  PermitWithoutStream: true,          
}
defaultServiceConfig := `{"loadBalancingConfig": [{"round_robin":{}}]}`
resolver.SetDefaultScheme("dns")
ctx := context.Background()
ctx, cancel := context.WithTimeout(ctx, 10*time.Second)
defer cancel()
conn, err := grpc.DialContext(ctx,
  "our.interal.domain",
  grpc.WithTransportCredentials(insecure.NewCredentials()),
  grpc.WithBlock(),
  grpc.WithKeepaliveParams(grpchelper.keepAliveParameters),
  grpc.WithDefaultServiceConfig(defaultServiceConfig),
)
@purnesh42H
Copy link
Contributor

@easwars could you verify if the the patch from PR #6668 can cause this delay?

@easwars
Copy link
Contributor

easwars commented May 15, 2024

I can see how #6668 could be the problem maker. I understand the problem, and the fix required as well.

Is there a way to easily reproduce the problem though (without any complicated k8s setup)? That would be helpful to verify the fix. Thanks.

@sebastianjonasson
Copy link
Author

Thanks for the response @easwars!

Unfortunately I've not been able to reproduce this locally. All debugging have been done in our cloud based dev environment.

I've done some additional testing and found that the problem still remains on master. However, I've been able to get this to work by checking out master and reverting the watcher function to use a timer instead of time.After. Example here: https://github.com/sebastianjonasson/grpc-go/pull/1/files.

It seems like the issue is that the current implementation assigns the duration instead of starting the timer as the previous (<=v1.59.0) implementation did. This means that the current implementation will always wait 30s after receiving the <-rn message, while in the previous implementation the timer would have expired when receiving the <-rn message yielding an immediate resolve.

var waitTime time.Duration
if err == nil {
// Success resolving, wait for the next ResolveNow. However, also wait 30
// seconds at the very least to prevent constantly re-resolving.
backoffIndex = 1
waitTime = MinResolutionInterval
select {
case <-d.ctx.Done():
return
case <-d.rn:
}
} else {
// Poll on an error found in DNS Resolver or an error received from
// ClientConn.
waitTime = backoff.DefaultExponential.Backoff(backoffIndex)
backoffIndex++
}
select {
case <-d.ctx.Done():
return
case <-internal.TimeAfterFunc(waitTime):
}

To try to prove my case I've added the following debug logs for master and my test branch to highlight the behavior:

  • debug: start timer
  • debug: wait for timer
  • debug: timer done

As you can see in the logs for master we wait 30s upon receiving the <-rn message, while the logs for the test branch shows immediate resolution since the timer has already expired.

Logs master:

2024/05/17 09:30:48 INFO: [core] [Channel #1]Channel switches to new LB policy "round_robin"
2024/05/17 09:30:48 INFO: [core] [Channel #1 SubChannel #2]Subchannel created
2024/05/17 09:30:48 INFO: [roundrobin] roundrobinPicker: Build called with info: {map[]}
2024/05/17 09:30:48 INFO: [core] [Channel #1]Channel Connectivity change to CONNECTING
2024/05/17 09:30:48 INFO: [dns] debug: start timer
2024/05/17 09:30:48 INFO: [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to CONNECTING
2024/05/17 09:30:48 INFO: [core] [Channel #1 SubChannel #2]Subchannel picks a new address "10.10.10.40:8000" to connect
2024/05/17 09:30:48 INFO: [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to READY
2024/05/17 09:30:48 INFO: [roundrobin] roundrobinPicker: Build called with info: {map[SubConn(id:2):{{Addr: "10.10.10.40:8000", ServerName: "", }}]}
2024/05/17 09:30:48 INFO: [core] [Channel #1]Channel Connectivity change to READY
2024/05/17 09:32:05 INFO: [core] [Channel #1]Channel Connectivity change to SHUTDOWN
2024/05/17 09:32:05 INFO: [core] [Channel #1]Closing the name resolver
2024/05/17 09:32:05 INFO: [core] [Channel #1]ccBalancerWrapper: closing
2024/05/17 09:32:05 INFO: [core] [Channel #1 SubChannel #4]Subchannel Connectivity change to SHUTDOWN
2024/05/17 09:32:05 INFO: [core] [Channel #1 SubChannel #4]Subchannel deleted
2024/05/17 09:32:05 INFO: [core] [Channel #1]Channel deleted
2024/05/17 09:37:16 INFO: [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to IDLE
2024/05/17 09:37:16 INFO: [dns] debug: wait for timer
2024/05/17 09:37:16 INFO: [roundrobin] roundrobinPicker: Build called with info: {map[]}
2024/05/17 09:37:16 INFO: [core] [Channel #1]Channel Connectivity change to IDLE
2024/05/17 09:37:16 INFO: [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to CONNECTING
2024/05/17 09:37:16 INFO: [core] [Channel #1 SubChannel #2]Subchannel picks a new address "10.10.10.40:8000" to connect
2024/05/17 09:37:16 INFO: [core] [Channel #1]Channel Connectivity change to CONNECTING
2024/05/17 09:37:16 WARNING: [core] [Channel #1 SubChannel #2]grpc: addrConn.createTransport failed to connect to {Addr: "10.10.10.40:8000", ServerName: "our.internal.domain:8000", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 10.10.10.40:8000: connect: connection refused"
2024/05/17 09:37:16 INFO: [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to TRANSIENT_FAILURE, last error: connection error: desc = "transport: Error while dialing: dial tcp 10.10.10.40:8000: connect: connection refused"
2024/05/17 09:37:16 INFO: [core] [Channel #1]Channel Connectivity change to TRANSIENT_FAILURE
2024/05/17 09:37:17 INFO: [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to IDLE, last error: connection error: desc = "transport: Error while dialing: dial tcp 10.10.10.40:8000: connect: connection refused"
2024/05/17 09:37:17 INFO: [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to CONNECTING
2024/05/17 09:37:17 INFO: [core] [Channel #1 SubChannel #2]Subchannel picks a new address "10.10.10.40:8000" to connect
2024/05/17 09:37:17 WARNING: [core] [Channel #1 SubChannel #2]grpc: addrConn.createTransport failed to connect to {Addr: "10.10.10.40:8000", ServerName: "our.internal.domain:8000", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 10.10.10.40:8000: connect: connection refused"
2024/05/17 09:37:17 INFO: [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to TRANSIENT_FAILURE, last error: connection error: desc = "transport: Error while dialing: dial tcp 10.10.10.40:8000: connect: connection refused"
2024/05/17 09:37:18 INFO: [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to IDLE, last error: connection error: desc = "transport: Error while dialing: dial tcp 10.10.10.40:8000: connect: connection refused"
2024/05/17 09:37:18 INFO: [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to CONNECTING
2024/05/17 09:37:18 INFO: [core] [Channel #1 SubChannel #2]Subchannel picks a new address "10.10.10.40:8000" to connect
2024/05/17 09:37:18 WARNING: [core] [Channel #1 SubChannel #2]grpc: addrConn.createTransport failed to connect to {Addr: "10.10.10.40:8000", ServerName: "our.internal.domain:8000", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 10.10.10.40:8000: connect: connection refused"
2024/05/17 09:37:18 INFO: [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to TRANSIENT_FAILURE, last error: connection error: desc = "transport: Error while dialing: dial tcp 10.10.10.40:8000: connect: connection refused"
2024/05/17 09:37:21 INFO: [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to IDLE, last error: connection error: desc = "transport: Error while dialing: dial tcp 10.10.10.40:8000: connect: connection refused"
2024/05/17 09:37:21 INFO: [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to CONNECTING
2024/05/17 09:37:21 INFO: [core] [Channel #1 SubChannel #2]Subchannel picks a new address "10.10.10.40:8000" to connect
2024/05/17 09:37:21 WARNING: [core] [Channel #1 SubChannel #2]grpc: addrConn.createTransport failed to connect to {Addr: "10.10.10.40:8000", ServerName: "our.internal.domain:8000", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 10.10.10.40:8000: connect: connection refused"
2024/05/17 09:37:21 INFO: [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to TRANSIENT_FAILURE, last error: connection error: desc = "transport: Error while dialing: dial tcp 10.10.10.40:8000: connect: connection refused"
2024/05/17 09:37:25 INFO: [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to IDLE, last error: connection error: desc = "transport: Error while dialing: dial tcp 10.10.10.40:8000: connect: connection refused"
2024/05/17 09:37:25 INFO: [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to CONNECTING
2024/05/17 09:37:25 INFO: [core] [Channel #1 SubChannel #2]Subchannel picks a new address "10.10.10.40:8000" to connect
2024/05/17 09:37:25 WARNING: [core] [Channel #1 SubChannel #2]grpc: addrConn.createTransport failed to connect to {Addr: "10.10.10.40:8000", ServerName: "our.internal.domain:8000", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 10.10.10.40:8000: connect: connection refused"
2024/05/17 09:37:25 INFO: [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to TRANSIENT_FAILURE, last error: connection error: desc = "transport: Error while dialing: dial tcp 10.10.10.40:8000: connect: connection refused"
2024/05/17 09:37:32 INFO: [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to IDLE, last error: connection error: desc = "transport: Error while dialing: dial tcp 10.10.10.40:8000: connect: connection refused"
2024/05/17 09:37:32 INFO: [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to CONNECTING
2024/05/17 09:37:32 INFO: [core] [Channel #1 SubChannel #2]Subchannel picks a new address "10.10.10.40:8000" to connect
2024/05/17 09:37:46 INFO: [dns] debug: timer done
2024/05/17 09:37:46 INFO: [core] [Channel #1]Resolver state updated: {
"Addresses": [
{
"Addr": "10.10.10.50:8000",
"ServerName": "",
"Attributes": null,
"BalancerAttributes": null,
"Metadata": null
}
],
"Endpoints": [
{
"Addresses": [
{
"Addr": "10.10.10.50:8000",
"ServerName": "",
"Attributes": null,
"BalancerAttributes": null,
"Metadata": null
}
],
"Attributes": null
}
],
"ServiceConfig": null,
"Attributes": null
} ()

Logs test branch:

2024/05/17 11:35:54 INFO: [core] [Channel #1 SubChannel #2]Subchannel created
2024/05/17 11:35:54 INFO: [roundrobin] roundrobinPicker: Build called with info: {map[]}
2024/05/17 11:35:54 INFO: [core] [Channel #1]Channel Connectivity change to CONNECTING
2024/05/17 11:35:54 INFO: [dns] debug: start timer
2024/05/17 11:35:54 INFO: [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to CONNECTING
2024/05/17 11:35:54 INFO: [core] [Channel #1 SubChannel #2]Subchannel picks a new address "10.10.20.50:8000" to connect
2024/05/17 11:35:54 INFO: [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to READY
2024/05/17 11:35:54 INFO: [roundrobin] roundrobinPicker: Build called with info: {map[SubConn(id:2):{{Addr: "10.10.20.50:8000", ServerName: "", }}]}
2024/05/17 11:35:54 INFO: [core] [Channel #1]Channel Connectivity change to READY
2024/05/17 11:37:15 INFO: [core] [Channel #1]Channel Connectivity change to SHUTDOWN
2024/05/17 11:37:15 INFO: [core] [Channel #1]Closing the name resolver
2024/05/17 11:37:15 INFO: [core] [Channel #1]ccBalancerWrapper: closing
2024/05/17 11:37:15 INFO: [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to SHUTDOWN
2024/05/17 11:37:15 INFO: [core] [Channel #1 SubChannel #2]Subchannel deleted
2024/05/17 11:37:15 INFO: [core] [Channel #1]Channel deleted
2024/05/17 11:46:18 INFO: [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to IDLE
2024/05/17 11:46:18 INFO: [dns] debug: wait for timer
2024/05/17 11:46:18 INFO: [dns] debug: timer done
2024/05/17 11:46:18 INFO: [roundrobin] roundrobinPicker: Build called with info: {map[]}
2024/05/17 11:46:18 INFO: [core] [Channel #1]Channel Connectivity change to IDLE
2024/05/17 11:46:18 INFO: [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to CONNECTING
2024/05/17 11:46:18 INFO: [core] [Channel #1 SubChannel #2]Subchannel picks a new address "10.10.20.50:8000" to connect
2024/05/17 11:46:18 INFO: [core] [Channel #1]Channel Connectivity change to CONNECTING
2024/05/17 11:46:18 WARNING: [core] [Channel #1 SubChannel #2]grpc: addrConn.createTransport failed to connect to {Addr: "10.10.20.50:8000", ServerName: "our.interal.domain:8000", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 10.10.20.50:8000: connect: connection refused"
2024/05/17 11:46:18 INFO: [core] [Channel #1 SubChannel #2]Subchannel Connectivity change to TRANSIENT_FAILURE, last error: connection error: desc = "transport: Error while dialing: dial tcp 10.10.20.50:8000: connect: connection refused"
2024/05/17 11:46:18 INFO: [core] [Channel #1]Channel Connectivity change to TRANSIENT_FAILURE
2024/05/17 11:46:18 INFO: [core] [Channel #1]Resolver state updated: {
"Addresses": [
{
"Addr": "10.10.20.140:8000",
"ServerName": "",
"Attributes": null,
"BalancerAttributes": null,
"Metadata": null
}
],
"Endpoints": [
{
"Addresses": [
{
"Addr": "10.10.20.140:8000",
"ServerName": "",
"Attributes": null,
"BalancerAttributes": null,
"Metadata": null
}
],
"Attributes": null
}
],
"ServiceConfig": null,
"Attributes": null
} ()
2024/05/17 11:46:18 INFO: [core] [Channel #1 SubChannel #4]Subchannel created

@easwars
Copy link
Contributor

easwars commented May 17, 2024

@sebastianjonasson : You are correct in your assessment about when the timer is created in the DNS resolver's watcher goroutine. I was also going to fix it the same way you have attempted to fix it. But I wanted to have a test for it, which is why I asked you if you had an easy reproduction. Since you don't have an easy repro, I will work on a unit test for this scenario. Thank you for reporting this issue and all the information you have provided along the way.

@sebastianjonasson
Copy link
Author

Great! Thanks for looking into this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants