Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible deadlock issue in logrus, or deadlock between the data writer and the DataIterator #280

Open
shuhaowu opened this issue Apr 30, 2021 · 0 comments

Comments

@shuhaowu
Copy link
Contributor

While testing, I've observed an intermittent test failure within the ruby test InterruptResumeTest#test_interrupt_resume_idempotence_with_writes_to_source. When debugging that, I can see that all the Ghostferry goroutine to be stuck at some logrus mutex methods[1]. The ruby process is always trying to modify a row a the same time via the data writer at the same time. This modification query appears to be always stuck waiting for lock, presumably because the data iterator just went over that row and issued a FOR UPDATE against that row (BEFORE_ROW_COPY and AFTER_ROW_COPY in the integrationferry is sent while the FOR UPDATE is held, which makes the situation worse).

I'm not quite sure why this is the case. It's possible that the problem lies in an upstream logrus issue (sirupsen/logrus#1201) or it's somehow a race between DataIterator which holds the lock while trying to send BEFORE_ROW_COPY and AFTER_ROW_COPY to the integration test server (reader should check integrationferry.go to see that, and check how the listeners on the data iterator + cursor works with the FOR UPDATE lock). I'm not sure which one is more likely, as maybe we should see the same deadlock in production if it is purely a problem with logrus.

I've found that by reducing the frequency of the data writer, the deadlock appears to be gone. However, it seems that even a very small sleep (5ms) solves this issue. Such a small sleep only reduces the total number of modification query from about 30 to 22 (for the duration over which the datawriter is active). I'm not sure why this make such a big difference, either.

A failure grinding script is included below, at [2]


[1] Ghostferry per go routine stack trace:

goroutine profile: total 19
3 @ 0x43c2e5 0x44d297 0x76ce25 0x472bc1
#	0x76ce24	github.com/go-sql-driver/mysql.(*mysqlConn).startWatcher.func1+0xa4	/home/shuhao/src/ghostferry/vendor/github.com/go-sql-driver/mysql/connection_go18.go:178

2 @ 0x43c2e5 0x43493b 0x46d175 0x4c9385 0x4ca135 0x4ca117 0x58020f 0x58c111 0x5543a2 0x47e067 0x721347 0x72130a 0x7210db 0x730bdc 0x472bc1
#	0x46d174	internal/poll.runtime_pollWait+0x54						/nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/runtime/netpoll.go:222
#	0x4c9384	internal/poll.(*pollDesc).wait+0x44						/nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/internal/poll/fd_poll_runtime.go:87
#	0x4ca134	internal/poll.(*pollDesc).waitRead+0x1d4					/nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/internal/poll/fd_poll_runtime.go:92
#	0x4ca116	internal/poll.(*FD).Read+0x1b6							/nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/internal/poll/fd_unix.go:166
#	0x58020e	net.(*netFD).Read+0x4e								/nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/net/fd_posix.go:55
#	0x58c110	net.(*conn).Read+0x90								/nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/net/net.go:183
#	0x5543a1	bufio.(*Reader).Read+0x221							/nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/bufio/bufio.go:227
#	0x47e066	io.ReadAtLeast+0x86								/nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/io/io.go:328
#	0x721346	io.ReadFull+0x86								/nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/io/io.go:347
#	0x721309	github.com/siddontang/go-mysql/packet.(*Conn).ReadPacketTo+0x49			/home/shuhao/src/ghostferry/vendor/github.com/siddontang/go-mysql/packet/conn.go:96
#	0x7210da	github.com/siddontang/go-mysql/packet.(*Conn).ReadPacket+0xda			/home/shuhao/src/ghostferry/vendor/github.com/siddontang/go-mysql/packet/conn.go:85
#	0x730bdb	github.com/siddontang/go-mysql/replication.(*BinlogSyncer).onStream+0x2fb	/home/shuhao/src/ghostferry/vendor/github.com/siddontang/go-mysql/replication/binlogsyncer.go:647

2 @ 0x43c2e5 0x44d297 0x700e35 0x472bc1
#	0x700e34	database/sql.(*DB).connectionOpener+0xb4	/nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/database/sql/sql.go:1133

1 @ 0x40dfb4 0x46f3c5 0x779d65 0x472bc1
#	0x46f3c4	os/signal.signal_recv+0xa4	/nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/runtime/sigqueue.go:168
#	0x779d64	os/signal.loop+0x24		/nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/os/signal/signal_unix.go:23

1 @ 0x43c2e5 0x40824f 0x407e8b 0x77e2e5 0x7b2d77 0x472bc1
#	0x77e2e4	github.com/Shopify/ghostferry.(*BinlogWriter).Run+0x364	/home/shuhao/src/ghostferry/binlog_writer.go:36
#	0x7b2d76	github.com/Shopify/ghostferry.(*Ferry).Run.func7+0x56	/home/shuhao/src/ghostferry/ferry.go:705

1 @ 0x43c2e5 0x40824f 0x407e8b 0x7b29eb 0x472bc1
#	0x7b29ea	github.com/Shopify/ghostferry.(*Ferry).Run.func5+0xea	/home/shuhao/src/ghostferry/ferry.go:673

1 @ 0x43c2e5 0x43493b 0x46d175 0x4c9385 0x4cac12 0x4cabf4 0x5810c5 0x594f12 0x593ea5 0x67d765 0x67d47a 0x7c0938 0x7c0902 0x472bc1
#	0x46d174	internal/poll.runtime_pollWait+0x54		/nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/runtime/netpoll.go:222
#	0x4c9384	internal/poll.(*pollDesc).wait+0x44		/nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/internal/poll/fd_poll_runtime.go:87
#	0x4cac11	internal/poll.(*pollDesc).waitRead+0x211	/nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/internal/poll/fd_poll_runtime.go:92
#	0x4cabf3	internal/poll.(*FD).Accept+0x1f3		/nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/internal/poll/fd_unix.go:401
#	0x5810c4	net.(*netFD).accept+0x44			/nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/net/fd_unix.go:172
#	0x594f11	net.(*TCPListener).accept+0x31			/nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/net/tcpsock_posix.go:139
#	0x593ea4	net.(*TCPListener).Accept+0x64			/nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/net/tcpsock.go:261
#	0x67d764	net/http.(*Server).Serve+0x284			/nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/net/http/server.go:2981
#	0x67d479	net/http.(*Server).ListenAndServe+0xb9		/nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/net/http/server.go:2910
#	0x7c0937	net/http.ListenAndServe+0x57			/nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/net/http/server.go:3164
#	0x7c0901	main.main.func1+0x21				/home/shuhao/src/ghostferry/test/lib/go/integrationferry.go:337

1 @ 0x43c2e5 0x44d297 0x72dc65 0x7afd6c 0x77b49e 0x7b2eda 0x472bc1
#	0x72dc64	github.com/siddontang/go-mysql/replication.(*BinlogStreamer).GetEvent+0x104	/home/shuhao/src/ghostferry/vendor/github.com/siddontang/go-mysql/replication/binlogstreamer.go:29
#	0x7afd6b	github.com/Shopify/ghostferry.(*BinlogStreamer).Run.func2+0x8b			/home/shuhao/src/ghostferry/binlog_streamer.go:164
#	0x77b49d	github.com/Shopify/ghostferry.(*BinlogStreamer).Run+0x19d			/home/shuhao/src/ghostferry/binlog_streamer.go:170
#	0x7b2ed9	github.com/Shopify/ghostferry.(*Ferry).Run.func9+0x59				/home/shuhao/src/ghostferry/ferry.go:721

1 @ 0x43c2e5 0x44dd65 0x44dd4e 0x46eba5 0x47db85 0x792e57 0x7c0896 0x472bc1
#	0x46eba4	sync.runtime_Semacquire+0x44				/nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/runtime/sema.go:56
#	0x47db84	sync.(*WaitGroup).Wait+0x64				/nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/sync/waitgroup.go:130
#	0x792e56	github.com/Shopify/ghostferry.(*Ferry).Run+0x536	/home/shuhao/src/ghostferry/ferry.go:785
#	0x7c0895	main.(*IntegrationFerry).Main.func1+0x55		/home/shuhao/src/ghostferry/test/lib/go/integrationferry.go:157

1 @ 0x43c2e5 0x44dd65 0x44dd4e 0x46eba5 0x47db85 0x7bef68 0x7c0133 0x43beb6 0x472bc1
#	0x46eba4	sync.runtime_Semacquire+0x44		/nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/runtime/sema.go:56
#	0x47db84	sync.(*WaitGroup).Wait+0x64		/nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/sync/waitgroup.go:130
#	0x7bef67	main.(*IntegrationFerry).Main+0x247	/home/shuhao/src/ghostferry/test/lib/go/integrationferry.go:169
#	0x7c0132	main.main+0x2d2				/home/shuhao/src/ghostferry/test/lib/go/integrationferry.go:340
#	0x43beb5	runtime.main+0x255			/nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/runtime/proc.go:225

1 @ 0x43c2e5 0x44dd65 0x44dd4e 0x46ecc7 0x47c0e5 0x70e8ba 0x70e8a9 0x70e735 0x70ef66 0x796076 0x79601e 0x795d67 0x794865 0x7b2694 0x472bc1
#	0x46ecc6	sync.runtime_SemacquireMutex+0x46				/nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/runtime/sema.go:71
#	0x47c0e4	sync.(*Mutex).lockSlow+0x104					/nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/sync/mutex.go:138
#	0x70e8b9	sync.(*Mutex).Lock+0x259					/nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/sync/mutex.go:81
#	0x70e8a8	github.com/sirupsen/logrus.(*MutexWrap).Lock+0x248		/home/shuhao/src/ghostferry/vendor/github.com/sirupsen/logrus/logger.go:58
#	0x70e734	github.com/sirupsen/logrus.(*Entry).log+0xd4			/home/shuhao/src/ghostferry/vendor/github.com/sirupsen/logrus/entry.go:233
#	0x70ef65	github.com/sirupsen/logrus.(*Entry).Log+0x85			/home/shuhao/src/ghostferry/vendor/github.com/sirupsen/logrus/entry.go:293
#	0x796075	github.com/sirupsen/logrus.(*Entry).Debug+0x2b5			/home/shuhao/src/ghostferry/vendor/github.com/sirupsen/logrus/entry.go:302
#	0x79601d	github.com/Shopify/ghostferry.postCallback+0x25d		/home/shuhao/src/ghostferry/http_callback.go:41
#	0x795d66	github.com/Shopify/ghostferry.HTTPCallback.Post+0xe6		/home/shuhao/src/ghostferry/http_callback.go:25
#	0x794864	github.com/Shopify/ghostferry.(*Ferry).ReportProgress+0x2a4	/home/shuhao/src/ghostferry/ferry.go:1028
#	0x7b2693	github.com/Shopify/ghostferry.(*Ferry).Run.func3+0x93		/home/shuhao/src/ghostferry/ferry.go:644

1 @ 0x43c2e5 0x44dd65 0x44dd4e 0x46ecc7 0x47c0e5 0x70e8ba 0x70e8a9 0x70e735 0x70ef66 0x796076 0x79601e 0x795d67 0x794b54 0x7b2814 0x472bc1
#	0x46ecc6	sync.runtime_SemacquireMutex+0x46				/nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/runtime/sema.go:71
#	0x47c0e4	sync.(*Mutex).lockSlow+0x104					/nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/sync/mutex.go:138
#	0x70e8b9	sync.(*Mutex).Lock+0x259					/nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/sync/mutex.go:81
#	0x70e8a8	github.com/sirupsen/logrus.(*MutexWrap).Lock+0x248		/home/shuhao/src/ghostferry/vendor/github.com/sirupsen/logrus/logger.go:58
#	0x70e734	github.com/sirupsen/logrus.(*Entry).log+0xd4			/home/shuhao/src/ghostferry/vendor/github.com/sirupsen/logrus/entry.go:233
#	0x70ef65	github.com/sirupsen/logrus.(*Entry).Log+0x85			/home/shuhao/src/ghostferry/vendor/github.com/sirupsen/logrus/entry.go:293
#	0x796075	github.com/sirupsen/logrus.(*Entry).Debug+0x2b5			/home/shuhao/src/ghostferry/vendor/github.com/sirupsen/logrus/entry.go:302
#	0x79601d	github.com/Shopify/ghostferry.postCallback+0x25d		/home/shuhao/src/ghostferry/http_callback.go:41
#	0x795d66	github.com/Shopify/ghostferry.HTTPCallback.Post+0xe6		/home/shuhao/src/ghostferry/http_callback.go:25
#	0x794b53	github.com/Shopify/ghostferry.(*Ferry).ReportState+0x133	/home/shuhao/src/ghostferry/ferry.go:1044
#	0x7b2813	github.com/Shopify/ghostferry.(*Ferry).Run.func4+0x93		/home/shuhao/src/ghostferry/ferry.go:662

1 @ 0x46cd7d 0x6e5dae 0x6e5b85 0x6e2712 0x6f3fe5 0x6f5837 0x67a584 0x67c40d 0x67d383 0x6792cd 0x472bc1
#	0x46cd7c	runtime/pprof.runtime_goroutineProfileWithLabels+0x5c	/nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/runtime/mprof.go:716
#	0x6e5dad	runtime/pprof.writeRuntimeProfile+0xcd			/nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/runtime/pprof/pprof.go:724
#	0x6e5b84	runtime/pprof.writeGoroutine+0xa4			/nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/runtime/pprof/pprof.go:684
#	0x6e2711	runtime/pprof.(*Profile).WriteTo+0x3f1			/nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/runtime/pprof/pprof.go:331
#	0x6f3fe4	net/http/pprof.handler.ServeHTTP+0x384			/nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/net/http/pprof/pprof.go:253
#	0x6f5836	net/http/pprof.Index+0x8d6				/nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/net/http/pprof/pprof.go:371
#	0x67a583	net/http.HandlerFunc.ServeHTTP+0x43			/nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/net/http/server.go:2069
#	0x67c40c	net/http.(*ServeMux).ServeHTTP+0x1ac			/nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/net/http/server.go:2448
#	0x67d382	net/http.serverHandler.ServeHTTP+0xa2			/nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/net/http/server.go:2887
#	0x6792cc	net/http.(*conn).serve+0x8cc				/nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/net/http/server.go:1952

1 @ 0x4b2845 0x4b0bfa 0x4ca0a5 0x4ca07b 0x4ca061 0x58020f 0x58c111 0x672ad8 0x472bc1
#	0x4b2844	syscall.Syscall+0x4				/nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/syscall/asm_linux_amd64.s:20
#	0x4b0bf9	syscall.read+0x59				/nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/syscall/zsyscall_linux_amd64.go:686
#	0x4ca0a4	syscall.Read+0x144				/nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/syscall/syscall_unix.go:187
#	0x4ca07a	internal/poll.ignoringEINTRIO+0x11a		/nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/internal/poll/fd_unix.go:581
#	0x4ca060	internal/poll.(*FD).Read+0x100			/nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/internal/poll/fd_unix.go:162
#	0x58020e	net.(*netFD).Read+0x4e				/nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/net/fd_posix.go:55
#	0x58c110	net.(*conn).Read+0x90				/nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/net/net.go:183
#	0x672ad7	net/http.(*connReader).backgroundRead+0x57	/nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/net/http/server.go:692

1 @ 0x4b2845 0x4b0dfa 0x4ca65a 0x4ca642 0x4ca744 0x4cf42e 0x4cf3fe 0x70edb8 0x70e7eb 0x70ef66 0x77b896 0x77b83e 0x7b2e1b 0x472bc1
#	0x4b2844	syscall.Syscall+0x4						/nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/syscall/asm_linux_amd64.s:20
#	0x4b0df9	syscall.write+0x59						/nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/syscall/zsyscall_linux_amd64.go:914
#	0x4ca659	syscall.Write+0x139						/nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/syscall/syscall_unix.go:212
#	0x4ca641	internal/poll.ignoringEINTRIO+0x121				/nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/internal/poll/fd_unix.go:581
#	0x4ca743	internal/poll.(*FD).Write+0x223					/nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/internal/poll/fd_unix.go:274
#	0x4cf42d	os.(*File).write+0x8d						/nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/os/file_posix.go:48
#	0x4cf3fd	os.(*File).Write+0x5d						/nix/store/39r82b5cjnikrib8a2p9fp2i039qkwxi-go-1.16.3/share/go/src/os/file.go:174
#	0x70edb7	github.com/sirupsen/logrus.(*Entry).write+0x157			/home/shuhao/src/ghostferry/vendor/github.com/sirupsen/logrus/entry.go:286
#	0x70e7ea	github.com/sirupsen/logrus.(*Entry).log+0x18a			/home/shuhao/src/ghostferry/vendor/github.com/sirupsen/logrus/entry.go:251
#	0x70ef65	github.com/sirupsen/logrus.(*Entry).Log+0x85			/home/shuhao/src/ghostferry/vendor/github.com/sirupsen/logrus/entry.go:293
#	0x77b895	github.com/sirupsen/logrus.(*Entry).Debug+0x595			/home/shuhao/src/ghostferry/vendor/github.com/sirupsen/logrus/entry.go:302
#	0x77b83d	github.com/Shopify/ghostferry.(*BinlogStreamer).Run+0x53d	/home/shuhao/src/ghostferry/binlog_streamer.go:187
#	0x7b2e1a	github.com/Shopify/ghostferry.(*Ferry).Run.func8+0x5a		/home/shuhao/src/ghostferry/ferry.go:712

[2] Failure grinding script

Before running this, you should comment out the sleep in test/helpers/data_writer_helper.rb, after the call to write_data.

#!/bin/bash

i=0
while true; do
  i=$((i+1))
  echo "testing run $i"
  DEBUG=1 bundle exec ruby test/main.rb -v -n "InterruptResumeTest#test_interrupt_resume_idempotence_with_writes_to_source" 2>&1 >test.log
  if [ "$?" == "1" ]; then
    tail -n15 test.log
    echo "bad test found!"
    exit 0
  else
    grep "data writes" test.log
    tail -n15 test.log
    echo "test run $i succeeded, trying again"
  fi
done
shuhaowu added a commit that referenced this issue Apr 30, 2021
Slower data writer might reduce the race condition in logrus. See #280.
shuhaowu added a commit that referenced this issue Apr 30, 2021
Slower data writer might reduce the race condition in logrus. See #280.
shuhaowu added a commit that referenced this issue Apr 30, 2021
Slower data writer might reduce the race condition in logrus. See #280.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant