Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Proxy does not recover after "Stopping listening" #888

Open
nh2 opened this issue Sep 12, 2022 · 8 comments
Open

Proxy does not recover after "Stopping listening" #888

nh2 opened this issue Sep 12, 2022 · 8 comments
Labels

Comments

@nh2
Copy link
Contributor

nh2 commented Sep 12, 2022

What happened:

There was a temporary Consul downtime in my cluster. After Stopping listening, the stolon-proxy does not recover:

2022-09-11T22:34:54.630Z        INFO        cmd/proxy.go:304        check timeout timer 
2022-09-11T22:34:54.630Z        INFO        cmd/proxy.go:158        Stopping listening

No more output for multiple hours, even though Consul evidently recovered (Consul checks on the same machine succeeded).

What you expected to happen:

That after Stopping listening it eventually starts listening again, as written on #76 it says:

It will restart listening on successful etcd communication and a populated proxyConf.

How to reproduce it (as minimally and precisely as possible):

Not sure yet, I observed it after a temporary network hickup that temporarily made Consul connections fail.

Anything else we need to know?:

Environment:

  • Stolon version: 0.16.0
  • Stolon running environment (if useful to understand the bug): NixOS Linux

Related links

@nh2 nh2 added the bug label Sep 12, 2022
@nh2
Copy link
Contributor Author

nh2 commented Sep 12, 2022

I worked around it by simply restarting that stolon-proxy, in case that gives a hint.

@nh2
Copy link
Contributor Author

nh2 commented Sep 12, 2022

Another related question (extracted from #859 (comment)):

What happens after check timeout timer fired + Stopping listening occurred, should the proxy ever reconnect/recover (the quoted comment above suggests yes)?

If yes, does that depend on what --stop-listening is set to, or should it recover in both true/false cases?

@nh2
Copy link
Contributor Author

nh2 commented Apr 13, 2023

This occurred to me again today on Stolon master commit 4bb4107 after a short network downtime, and paged me out of bed tonight.


I sent SIGQUIT to the process to dump the goroutines:

SIGQUIT: quit
PC=0x470b60 m=0 sigcode=0

goroutine 0 [idle]:
runtime.epollwait(0x4, 0x7ffeb0f30950, 0x138300000080, 0x0, 0x1383, 0x2, 0x0, 0x7ffeb0f30e20, 0xffffffff, 0x0, ...)
        runtime/sys_linux_amd64.s:725 +0x20
runtime.netpoll(0x129be4881, 0x0)
        runtime/netpoll_epoll.go:126 +0x92
runtime.findrunnable(0xc000050800, 0x0)
        runtime/proc.go:2923 +0x3ee
runtime.schedule()
        runtime/proc.go:3169 +0x2d7
runtime.park_m(0xc000001800)
        runtime/proc.go:3318 +0x9d
runtime.mcall(0x0)
        runtime/asm_amd64.s:327 +0x5b

goroutine 1 [select, 16591 minutes]:
github.com/sorintlab/stolon/cmd/proxy/cmd.(*ClusterChecker).Start(0xc00003c500, 0x8, 0x0)
        github.com/sorintlab/stolon/cmd/proxy/cmd/proxy.go:338 +0x171
github.com/sorintlab/stolon/cmd/proxy/cmd.proxy(0x23262c0, 0xc000166460, 0x0, 0x7)
        github.com/sorintlab/stolon/cmd/proxy/cmd/proxy.go:435 +0x316
github.com/spf13/cobra.(*Command).execute(0x23262c0, 0xc00003c090, 0x7, 0x7, 0x23262c0, 0xc00003c090)
        github.com/spf13/cobra@v0.0.5/command.go:830 +0x2c2
github.com/spf13/cobra.(*Command).ExecuteC(0x23262c0, 0x17c27e4, 0x7, 0x0)
        github.com/spf13/cobra@v0.0.5/command.go:914 +0x30b
github.com/spf13/cobra.(*Command).Execute(...)
        github.com/spf13/cobra@v0.0.5/command.go:864
github.com/sorintlab/stolon/cmd/proxy/cmd.Execute()
        github.com/sorintlab/stolon/cmd/proxy/cmd/proxy.go:368 +0xd5
main.main()
        github.com/sorintlab/stolon/cmd/proxy/main.go:22 +0x25

goroutine 19 [chan receive]:
k8s.io/klog.(*loggingT).flushDaemon(0x233b0e0)
        k8s.io/klog@v1.0.0/klog.go:1010 +0x8b
created by k8s.io/klog.init.0
        k8s.io/klog@v1.0.0/klog.go:411 +0xd8

goroutine 7 [select, 16591 minutes]:
github.com/sorintlab/stolon/cmd/proxy/cmd.(*ClusterChecker).TimeoutChecker(0xc00003c500, 0xc000102180)
        github.com/sorintlab/stolon/cmd/proxy/cmd/proxy.go:302 +0xe5
created by github.com/sorintlab/stolon/cmd/proxy/cmd.(*ClusterChecker).Start
        github.com/sorintlab/stolon/cmd/proxy/cmd/proxy.go:335 +0xc5

goroutine 599312 [chan send, 37123 minutes]:
github.com/sorintlab/pollon.(*Proxy).accepter(0xc000097080)
        github.com/sorintlab/pollon@v0.0.0-20181009091703-248c68238c16/pollon.go:143 +0x1d5
created by github.com/sorintlab/pollon.(*Proxy).Start
        github.com/sorintlab/pollon@v0.0.0-20181009091703-248c68238c16/pollon.go:162 +0x67

goroutine 21 [chan send, 37203 minutes]:
github.com/sorintlab/pollon.(*Proxy).accepter(0xc00007f260)
        github.com/sorintlab/pollon@v0.0.0-20181009091703-248c68238c16/pollon.go:143 +0x1d5
created by github.com/sorintlab/pollon.(*Proxy).Start
        github.com/sorintlab/pollon@v0.0.0-20181009091703-248c68238c16/pollon.go:162 +0x67

goroutine 846743 [select, 16591 minutes]:
net/http.(*persistConn).roundTrip(0xc0004446c0, 0xc0000b1200, 0x0, 0x0, 0x0)
        net/http/transport.go:2610 +0x765
net/http.(*Transport).roundTrip(0x2324da0, 0xc0001f0700, 0x30, 0x17188a0, 0x1388)
        net/http/transport.go:592 +0xacb
net/http.(*Transport).RoundTrip(0x2324da0, 0xc0001f0700, 0x2324da0, 0x0, 0x0)
        net/http/roundtrip.go:17 +0x35
net/http.send(0xc0001f0700, 0x1995b60, 0x2324da0, 0x0, 0x0, 0x0, 0xc000010150, 0xf8, 0x1, 0x0)
        net/http/client.go:251 +0x454
net/http.(*Client).send(0x233a680, 0xc0001f0700, 0x0, 0x0, 0x0, 0xc000010150, 0x0, 0x1, 0x1718801)
        net/http/client.go:175 +0xff
net/http.(*Client).do(0x233a680, 0xc0001f0700, 0x0, 0x0, 0x0)
        net/http/client.go:717 +0x45f
net/http.(*Client).Do(...)
        net/http/client.go:585
github.com/hashicorp/consul/api.(*Client).doRequest(0xc000130a20, 0xc0004abb60, 0xc00046f728, 0xc00004a460, 0x41, 0xc0004abb60)
        github.com/hashicorp/consul/api@v1.4.0/api.go:873 +0xbe
github.com/hashicorp/consul/api.(*KV).getInternal(0xc00046f870, 0xc0001ba941, 0x3a, 0x0, 0x0, 0xc0001ba940, 0x40f71b, 0xc000180000, 0x0)
        github.com/hashicorp/consul/api@v1.4.0/kv.go:131 +0x2b4
github.com/hashicorp/consul/api.(*KV).Get(0xc00046f870, 0xc0001ba941, 0x3a, 0x0, 0x0, 0x0, 0x0, 0x0)
        github.com/hashicorp/consul/api@v1.4.0/kv.go:65 +0xa5
github.com/docker/libkv/store/consul.(*Consul).getActiveSession(0xc00000e390, 0xc0001ba941, 0x3a, 0x3b, 0x3a, 0x3a, 0x1)
        github.com/docker/libkv@v0.2.1/store/consul/consul.go:159 +0x65
github.com/docker/libkv/store/consul.(*Consul).renewSession(0xc00000e390, 0xc00046f9f0, 0x6fc23ac00, 0xc0001ba941, 0x3a)
        github.com/docker/libkv@v0.2.1/store/consul/consul.go:120 +0x5c
github.com/docker/libkv/store/consul.(*Consul).Put(0xc00000e390, 0xc0001ba8c0, 0x3a, 0xc000102de0, 0x52, 0x60, 0xc000457390, 0xc00046fad0, 0x40fe78)
        github.com/docker/libkv@v0.2.1/store/consul/consul.go:204 +0x136
github.com/sorintlab/stolon/internal/store.(*libKVStore).Put(0xc0000730e0, 0x19c8468, 0xc000040038, 0xc0001ba8c0, 0x3a, 0xc000102de0, 0x52, 0x60, 0xc000457388, 0xc0001c2008, ...)
        github.com/sorintlab/stolon/internal/store/libkv.go:50 +0x90
github.com/sorintlab/stolon/internal/store.(*KVBackedStore).SetProxyInfo(0xc00004c660, 0x19c8468, 0xc000040038, 0xc000235920, 0x6fc23ac00, 0x1, 0x1)
        github.com/sorintlab/stolon/internal/store/kvbacked.go:328 +0x1e4
github.com/sorintlab/stolon/cmd/proxy/cmd.(*ClusterChecker).SetProxyInfo(0xc00003c500, 0x19d89b8, 0xc00004c660, 0x2e, 0x37e11d600, 0x0, 0x0)
        github.com/sorintlab/stolon/cmd/proxy/cmd/proxy.go:183 +0x1d1
github.com/sorintlab/stolon/cmd/proxy/cmd.(*ClusterChecker).Check(0xc00003c500, 0xc0000a3450, 0xc0003dae40)
        github.com/sorintlab/stolon/cmd/proxy/cmd/proxy.go:269 +0x865
github.com/sorintlab/stolon/cmd/proxy/cmd.(*ClusterChecker).Start.func1(0xc0001021e0, 0xc00003c500)
        github.com/sorintlab/stolon/cmd/proxy/cmd/proxy.go:341 +0x2b
created by github.com/sorintlab/stolon/cmd/proxy/cmd.(*ClusterChecker).Start
        github.com/sorintlab/stolon/cmd/proxy/cmd/proxy.go:340 +0x4d5

goroutine 819159 [chan send, 16591 minutes]:
github.com/sorintlab/pollon.(*Proxy).accepter(0xc000495a40)
        github.com/sorintlab/pollon@v0.0.0-20181009091703-248c68238c16/pollon.go:143 +0x1d5
created by github.com/sorintlab/pollon.(*Proxy).Start
        github.com/sorintlab/pollon@v0.0.0-20181009091703-248c68238c16/pollon.go:162 +0x67

goroutine 694040 [chan send, 27545 minutes]:
github.com/sorintlab/pollon.(*Proxy).proxyConn.func5(0xc0001c3640, 0xc000102960)
        github.com/sorintlab/pollon@v0.0.0-20181009091703-248c68238c16/pollon.go:109 +0x45
created by github.com/sorintlab/pollon.(*Proxy).proxyConn
        github.com/sorintlab/pollon@v0.0.0-20181009091703-248c68238c16/pollon.go:107 +0x2e5

goroutine 819954 [IO wait, 16591 minutes]:
internal/poll.runtime_pollWait(0x7f7634f24278, 0x72, 0xffffffffffffffff)
        runtime/netpoll.go:222 +0x55
internal/poll.(*pollDesc).wait(0xc00003c798, 0x72, 0x1000, 0x1000, 0xffffffffffffffff)
        internal/poll/fd_poll_runtime.go:87 +0x45
internal/poll.(*pollDesc).waitRead(...)
        internal/poll/fd_poll_runtime.go:92
internal/poll.(*FD).Read(0xc00003c780, 0xc000024000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
        internal/poll/fd_unix.go:166 +0x1d5
net.(*netFD).Read(0xc00003c780, 0xc000024000, 0x1000, 0x1000, 0x43bb1c, 0xc00031bc38, 0x467040)
        net/fd_posix.go:55 +0x4f
net.(*conn).Read(0xc00019a168, 0xc000024000, 0x1000, 0x1000, 0x0, 0x0, 0x0)
        net/net.go:183 +0x91
net/http.(*persistConn).Read(0xc0004446c0, 0xc000024000, 0x1000, 0x1000, 0xc000216ae0, 0xc00031bd40, 0x4074d5)
        net/http/transport.go:1922 +0x77
bufio.(*Reader).fill(0xc0001ad200)
        bufio/bufio.go:101 +0x108
bufio.(*Reader).Peek(0xc0001ad200, 0x1, 0x0, 0x1, 0x4, 0x1, 0x3)
        bufio/bufio.go:139 +0x4f
net/http.(*persistConn).readLoop(0xc0004446c0)
        net/http/transport.go:2083 +0x1a8
created by net/http.(*Transport).dialConn
        net/http/transport.go:1743 +0xc77

goroutine 811246 [chan send, 19509 minutes]:
github.com/sorintlab/pollon.(*Proxy).proxyConn.func5(0xc0001c3140, 0xc000102840)
        github.com/sorintlab/pollon@v0.0.0-20181009091703-248c68238c16/pollon.go:109 +0x45
created by github.com/sorintlab/pollon.(*Proxy).proxyConn
        github.com/sorintlab/pollon@v0.0.0-20181009091703-248c68238c16/pollon.go:107 +0x2e5

goroutine 819955 [select, 16591 minutes]:
net/http.(*persistConn).writeLoop(0xc0004446c0)
        net/http/transport.go:2382 +0xf7
created by net/http.(*Transport).dialConn
        net/http/transport.go:1744 +0xc9c

goroutine 619321 [chan send, 35525 minutes]:
github.com/sorintlab/pollon.(*Proxy).proxyConn.func5(0xc0004bc2f0, 0xc000103020)
        github.com/sorintlab/pollon@v0.0.0-20181009091703-248c68238c16/pollon.go:109 +0x45
created by github.com/sorintlab/pollon.(*Proxy).proxyConn
        github.com/sorintlab/pollon@v0.0.0-20181009091703-248c68238c16/pollon.go:107 +0x2e5

goroutine 773507 [chan send, 22424 minutes]:
github.com/sorintlab/pollon.(*Proxy).proxyConn.func5(0xc000523ea0, 0xc0000c6ae0)
        github.com/sorintlab/pollon@v0.0.0-20181009091703-248c68238c16/pollon.go:109 +0x45
created by github.com/sorintlab/pollon.(*Proxy).proxyConn
        github.com/sorintlab/pollon@v0.0.0-20181009091703-248c68238c16/pollon.go:107 +0x2e5

goroutine 811460 [chan send, 18874 minutes]:
github.com/sorintlab/pollon.(*Proxy).accepter(0xc000097860)
        github.com/sorintlab/pollon@v0.0.0-20181009091703-248c68238c16/pollon.go:143 +0x1d5
created by github.com/sorintlab/pollon.(*Proxy).Start
        github.com/sorintlab/pollon@v0.0.0-20181009091703-248c68238c16/pollon.go:162 +0x67

goroutine 556673 [chan send, 37203 minutes]:
github.com/sorintlab/pollon.(*Proxy).proxyConn.func5(0xc0003df330, 0xc0000c7080)
        github.com/sorintlab/pollon@v0.0.0-20181009091703-248c68238c16/pollon.go:109 +0x45
created by github.com/sorintlab/pollon.(*Proxy).proxyConn
        github.com/sorintlab/pollon@v0.0.0-20181009091703-248c68238c16/pollon.go:107 +0x2e5

goroutine 715225 [chan send, 19509 minutes]:
github.com/sorintlab/pollon.(*Proxy).accepter(0xc000459680)
        github.com/sorintlab/pollon@v0.0.0-20181009091703-248c68238c16/pollon.go:143 +0x1d5
created by github.com/sorintlab/pollon.(*Proxy).Start
        github.com/sorintlab/pollon@v0.0.0-20181009091703-248c68238c16/pollon.go:162 +0x67

goroutine 819037 [chan send, 18874 minutes]:
github.com/sorintlab/pollon.(*Proxy).proxyConn.func5(0xc000514030, 0xc000103500)
        github.com/sorintlab/pollon@v0.0.0-20181009091703-248c68238c16/pollon.go:109 +0x45
created by github.com/sorintlab/pollon.(*Proxy).proxyConn
        github.com/sorintlab/pollon@v0.0.0-20181009091703-248c68238c16/pollon.go:107 +0x2e5

goroutine 599966 [chan send, 37123 minutes]:
github.com/sorintlab/pollon.(*Proxy).proxyConn.func5(0xc00056fc04, 0xc000102660)
        github.com/sorintlab/pollon@v0.0.0-20181009091703-248c68238c16/pollon.go:109 +0x45
created by github.com/sorintlab/pollon.(*Proxy).proxyConn
        github.com/sorintlab/pollon@v0.0.0-20181009091703-248c68238c16/pollon.go:107 +0x2e5

goroutine 600313 [chan send, 35525 minutes]:
github.com/sorintlab/pollon.(*Proxy).accepter(0xc0004ab560)
        github.com/sorintlab/pollon@v0.0.0-20181009091703-248c68238c16/pollon.go:143 +0x1d5
created by github.com/sorintlab/pollon.(*Proxy).Start
        github.com/sorintlab/pollon@v0.0.0-20181009091703-248c68238c16/pollon.go:162 +0x67

goroutine 819917 [chan send, 18812 minutes]:
github.com/sorintlab/pollon.(*Proxy).proxyConn.func5(0xc0004928f4, 0xc000102360)
        github.com/sorintlab/pollon@v0.0.0-20181009091703-248c68238c16/pollon.go:109 +0x45
created by github.com/sorintlab/pollon.(*Proxy).proxyConn
        github.com/sorintlab/pollon@v0.0.0-20181009091703-248c68238c16/pollon.go:107 +0x2e5

goroutine 619461 [chan send, 27545 minutes]:
github.com/sorintlab/pollon.(*Proxy).accepter(0xc00049a360)
        github.com/sorintlab/pollon@v0.0.0-20181009091703-248c68238c16/pollon.go:143 +0x1d5
created by github.com/sorintlab/pollon.(*Proxy).Start
        github.com/sorintlab/pollon@v0.0.0-20181009091703-248c68238c16/pollon.go:162 +0x67

goroutine 105872 [chan send, 78161 minutes]:
github.com/sorintlab/pollon.(*Proxy).proxyConn.func5(0xc000487060, 0xc000103260)
        github.com/sorintlab/pollon@v0.0.0-20181009091703-248c68238c16/pollon.go:109 +0x45
created by github.com/sorintlab/pollon.(*Proxy).proxyConn
        github.com/sorintlab/pollon@v0.0.0-20181009091703-248c68238c16/pollon.go:107 +0x2e5

goroutine 845736 [chan send, 16591 minutes]:
github.com/sorintlab/pollon.(*Proxy).proxyConn.func5(0xc00057cc50, 0xc0001025a0)
        github.com/sorintlab/pollon@v0.0.0-20181009091703-248c68238c16/pollon.go:109 +0x45
created by github.com/sorintlab/pollon.(*Proxy).proxyConn
        github.com/sorintlab/pollon@v0.0.0-20181009091703-248c68238c16/pollon.go:107 +0x2e5

rax    0xfffffffffffffffc
rbx    0x1383
rcx    0x470b60
rdx    0x80
rdi    0x4
rsi    0x7ffeb0f30950
rbp    0x7ffeb0f30f50
rsp    0x7ffeb0f30900
r8     0x0
r9     0x3
r10    0x1383
r11    0x246
r12    0x7ffeb0f30f40
r13    0x3d9a0
r14    0x0
r15    0x0
rip    0x470b60
rflags 0x246
cs     0x33
fs     0x0
gs     0x0

@nh2
Copy link
Contributor Author

nh2 commented Apr 13, 2023

I have a suspicion of what's going on.

Reducing the above goroutine dump to the 3 goroutines that I think are relevant:

goroutine 1 [select, 16591 minutes]:
github.com/sorintlab/stolon/cmd/proxy/cmd.(*ClusterChecker).Start(0xc00003c500, 0x8, 0x0)
        github.com/sorintlab/stolon/cmd/proxy/cmd/proxy.go:338 +0x171
	...

goroutine 7 [select, 16591 minutes]:
github.com/sorintlab/stolon/cmd/proxy/cmd.(*ClusterChecker).TimeoutChecker(0xc00003c500, 0xc000102180)
        github.com/sorintlab/stolon/cmd/proxy/cmd/proxy.go:302 +0xe5
created by github.com/sorintlab/stolon/cmd/proxy/cmd.(*ClusterChecker).Start
        github.com/sorintlab/stolon/cmd/proxy/cmd/proxy.go:335 +0xc5

goroutine 846743 [select, 16591 minutes]:
net/http.(*persistConn).roundTrip(0xc0004446c0, 0xc0000b1200, 0x0, 0x0, 0x0)
        net/http/transport.go:2610 +0x765
net/http.(*Transport).roundTrip(0x2324da0, 0xc0001f0700, 0x30, 0x17188a0, 0x1388)
        net/http/transport.go:592 +0xacb
net/http.(*Transport).RoundTrip(0x2324da0, 0xc0001f0700, 0x2324da0, 0x0, 0x0)
        net/http/roundtrip.go:17 +0x35
net/http.send(0xc0001f0700, 0x1995b60, 0x2324da0, 0x0, 0x0, 0x0, 0xc000010150, 0xf8, 0x1, 0x0)
        net/http/client.go:251 +0x454
net/http.(*Client).send(0x233a680, 0xc0001f0700, 0x0, 0x0, 0x0, 0xc000010150, 0x0, 0x1, 0x1718801)
        net/http/client.go:175 +0xff
net/http.(*Client).do(0x233a680, 0xc0001f0700, 0x0, 0x0, 0x0)
        net/http/client.go:717 +0x45f
net/http.(*Client).Do(...)
        net/http/client.go:585
github.com/hashicorp/consul/api.(*Client).doRequest(0xc000130a20, 0xc0004abb60, 0xc00046f728, 0xc00004a460, 0x41, 0xc0004abb60)
        github.com/hashicorp/consul/api@v1.4.0/api.go:873 +0xbe
github.com/hashicorp/consul/api.(*KV).getInternal(0xc00046f870, 0xc0001ba941, 0x3a, 0x0, 0x0, 0xc0001ba940, 0x40f71b, 0xc000180000, 0x0)
        github.com/hashicorp/consul/api@v1.4.0/kv.go:131 +0x2b4
github.com/hashicorp/consul/api.(*KV).Get(0xc00046f870, 0xc0001ba941, 0x3a, 0x0, 0x0, 0x0, 0x0, 0x0)
        github.com/hashicorp/consul/api@v1.4.0/kv.go:65 +0xa5
github.com/docker/libkv/store/consul.(*Consul).getActiveSession(0xc00000e390, 0xc0001ba941, 0x3a, 0x3b, 0x3a, 0x3a, 0x1)
        github.com/docker/libkv@v0.2.1/store/consul/consul.go:159 +0x65
github.com/docker/libkv/store/consul.(*Consul).renewSession(0xc00000e390, 0xc00046f9f0, 0x6fc23ac00, 0xc0001ba941, 0x3a)
        github.com/docker/libkv@v0.2.1/store/consul/consul.go:120 +0x5c
github.com/docker/libkv/store/consul.(*Consul).Put(0xc00000e390, 0xc0001ba8c0, 0x3a, 0xc000102de0, 0x52, 0x60, 0xc000457390, 0xc00046fad0, 0x40fe78)
        github.com/docker/libkv@v0.2.1/store/consul/consul.go:204 +0x136
github.com/sorintlab/stolon/internal/store.(*libKVStore).Put(0xc0000730e0, 0x19c8468, 0xc000040038, 0xc0001ba8c0, 0x3a, 0xc000102de0, 0x52, 0x60, 0xc000457388, 0xc0001c2008, ...)
        github.com/sorintlab/stolon/internal/store/libkv.go:50 +0x90
github.com/sorintlab/stolon/internal/store.(*KVBackedStore).SetProxyInfo(0xc00004c660, 0x19c8468, 0xc000040038, 0xc000235920, 0x6fc23ac00, 0x1, 0x1)
        github.com/sorintlab/stolon/internal/store/kvbacked.go:328 +0x1e4
github.com/sorintlab/stolon/cmd/proxy/cmd.(*ClusterChecker).SetProxyInfo(0xc00003c500, 0x19d89b8, 0xc00004c660, 0x2e, 0x37e11d600, 0x0, 0x0)
        github.com/sorintlab/stolon/cmd/proxy/cmd/proxy.go:183 +0x1d1
github.com/sorintlab/stolon/cmd/proxy/cmd.(*ClusterChecker).Check(0xc00003c500, 0xc0000a3450, 0xc0003dae40)
        github.com/sorintlab/stolon/cmd/proxy/cmd/proxy.go:269 +0x865
github.com/sorintlab/stolon/cmd/proxy/cmd.(*ClusterChecker).Start.func1(0xc0001021e0, 0xc00003c500)
        github.com/sorintlab/stolon/cmd/proxy/cmd/proxy.go:341 +0x2b
created by github.com/sorintlab/stolon/cmd/proxy/cmd.(*ClusterChecker).Start
        github.com/sorintlab/stolon/cmd/proxy/cmd/proxy.go:340 +0x4d5

The code is written with the assumption that Check() will eventually complete; but is permanently stuck.

Thus deadlock.

@nh2
Copy link
Contributor Author

nh2 commented Apr 13, 2023

What is the value of proxyTimeout is passed to c.SetProxyInfo(..., proxyTimeout)?

From

github.com/sorintlab/stolon/internal/store.(*KVBackedStore).SetProxyInfo(0xc00004c660, 0x19c8468, 0xc000040038, 0xc000235920, 0x6fc23ac00, 0x1, 0x1)
github.com/sorintlab/stolon/cmd/proxy/cmd.(*ClusterChecker).SetProxyInfo(0xc00003c500, 0x19d89b8, 0xc00004c660, 0x2e, 0x37e11d600, 0x0, 0x0)

I conclude that the value is 0xc00004c660 == 15000000000, and that's supported by the fact that KVBackedStore.SetProxyInfo() has 0x6fc23ac00 which is 30000000000, which is 2*proxyTimeout, which is exactly what's done here.

So that means the outer call is c.SetProxyInfo(..., proxyTimeout = 15 seconds).

But it is stuck for 16591 minutes == 11.5 days, which is also what I observed from the logs (it's been this long between the Stopping listening and the time I took the stack trace).

@nh2
Copy link
Contributor Author

nh2 commented Apr 13, 2023

I blieve this issue is already tangentially mentioned in this comment:

// TODO(sgotti) TimeoutCecker is needed to forcefully close connection also
// if the Check method is blocked somewhere.
// The idomatic/cleaner solution will be to use a context instead of this
// TimeoutChecker but we have to change the libkv stores to support contexts.
go c.TimeoutChecker(checkOkCh)

if the Check method is blocked somewhere -- yes.

But as the comment says, the only thing that TimeoutChecker achieves is to forcefully close connection.
Afterwards, the stolon-proxy will never recover when Check() is blocked.

@nh2
Copy link
Contributor Author

nh2 commented Apr 13, 2023

PR to work around the issue: #907

@nh2
Copy link
Contributor Author

nh2 commented Apr 13, 2023

I managed to reproduce this issue by introducing artificial latency on the VPN network interface over which my consul runs:

tc qdisc add dev nebula.servers root netem delay 5s

# To remove it afterwards:
tc qdisc del dev nebula.servers root netem

After a while this would result in check timeout timer fired.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

1 participant