
Poor multithreaded performance #343

Open
pikrzysztof opened this issue Nov 23, 2022 · 24 comments

@pikrzysztof

pikrzysztof commented Nov 23, 2022

I have an application which was reading capnproto structures from the network and feeding them to ~600 goroutines, which would analyze them (read-only) in parallel. This caused massive lock contention: according to perf, the ReadLimiter's spinlock alone was responsible for 40% of cumulative CPU usage; please see flamegraph.html.gz (compressed to keep GitHub from stripping the embedded JS).

Once we adopted 1b552dd, the application's CPU usage dropped more than 10x, from 60 cores to <6 cores.

I think this library might benefit from replacing spinlocks with sleeping locks. Spinlocks are generally not the best choice in userspace, where you can't guarantee having the CPU while holding the lock. If your goroutine takes the lock and then gets descheduled, other application threads will burn CPU cycles until both the kernel and the Go scheduler agree to give the CPU back to the goroutine holding the lock. Another disadvantage is that CPU usage correlates poorly with the amount of actual work done and can grow without bound.

@lthibault
Collaborator

lthibault commented Nov 23, 2022

@pikrzysztof You have impeccable timing. @zenhack and I are scheduled to pair up today and dig into performance-tuning for the RPC system. Would you be willing to share your application/benchmark code?

In the meantime, can you confirm which code revision you are using?

@pikrzysztof
Author

pikrzysztof commented Nov 23, 2022

@lthibault Thank you for the prompt response. I cannot share the original code, but I can try writing a synthetic benchmark; that may take a while, though.

I was using the latest v2 release, i.e. zombiezen.com/go/capnproto2 v2.18.2+incompatible.

@ocdtrekkie

I think this is when Kenton hit the same thing in the C++ implementation: https://twitter.com/kentonvarda/status/1553120336018870272

@lthibault
Collaborator

@pikrzysztof I can't recall whether this specific issue has evolved between v2 and v3, but I would strongly encourage you to switch to v3. Any improvements are going to happen there. It's labeled as alpha, but I'm running it in production and haven't had any major issues. YMMV, of course.

@pikrzysztof
Author

@lthibault
Collaborator

So it is. Thanks!

We're focusing our initial perf-tuning efforts on memory allocations in the rpc package, but this spinlock issue definitely seems like something worth investigating afterwards. I don't have an ETA, unfortunately, but I am more than happy to provide support if you want to take a stab at a PR. That would likely be the fastest way of getting this change incorporated.

A synthetic benchmark would indeed be helpful in the meantime.

@zenhack
Contributor

zenhack commented Nov 23, 2022

That's not a spinlock; the compare & swap isn't implementing a lock, it's just retrying if the value was modified out from underneath it. We could probably make this better by using atomic.AddInt64 instead of a loop that retries (and changing the limit to be signed, and checking for negatives).
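
A minimal sketch of that alternative, with hypothetical names (the real counter currently lives on Message and is unsigned):

import "sync/atomic"

// readLimiter tracks the remaining traversal budget as a signed counter.
type readLimiter struct {
    limit int64
}

// canRead charges sz against the budget with a single atomic add instead of
// a CAS retry loop; a negative result means the limit has been exceeded.
func (rl *readLimiter) canRead(sz int64) bool {
    return atomic.AddInt64(&rl.limit, -sz) >= 0
}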

...but it is quite sad that we even need to do atomic operations for reads in the first place. The more invasive change that I think might make sense would be, instead of having a single global limit attached to the message, to have a limit that is tracked separately and can then just be per-goroutine. Maybe store a pointer to it in types like Ptr, Struct, etc., and allocate it when you call Message.Root() and such. There are a few API design questions to be worked out, but I think this would be worth doing before we tag 3.0 so we have the option of making breaking changes.

@lthibault lthibault added this to the 3.0 milestone Nov 23, 2022
@pikrzysztof
Author

pikrzysztof commented Nov 23, 2022

Here's a quick & dirty benchmark that hits the hot path: https://github.com/pikrzysztof/gocapnp/blob/master/main_test.go

Please note that this is a benchmark on a laptop CPU. Scaling across sockets can be even worse.
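
For context, this is roughly what the hot loop looks like — a simplified sketch, not the exact code (ReadRootBook and TitleBytes are the generated accessors from the repo's books.go; the helper name is paraphrased):

import (
    "sync"
    "testing"

    "capnproto.org/go/capnp/v3"
)

// spinGoroutines has n goroutines concurrently read the same message
// (read-only), so every access charges the message's shared read limiter.
func spinGoroutines(b *testing.B, n int, msg *capnp.Message) {
    var wg sync.WaitGroup
    for i := 0; i < n; i++ {
        wg.Add(1)
        go func() {
            defer wg.Done()
            for j := 0; j < b.N/n; j++ {
                book, err := ReadRootBook(msg) // Message.Root -> canRead
                if err != nil {
                    b.Error(err)
                    return
                }
                _, _ = book.TitleBytes() // Struct.Ptr -> Segment.readPtr -> canRead
            }
        }()
    }
    wg.Wait()
}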

-> % go test -bench=. -cpuprofile profile.out
goos: linux
goarch: amd64
pkg: github.com/pikrzysztof/capnp-benchmark
cpu: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
BenchmarkOne-8    	16024563	      139.0 ns/op
BenchmarkFour-8   	5696427	      322.9 ns/op
BenchmarkTwo-8    	10138142	      213.8 ns/op
PASS
ok  	github.com/pikrzysztof/capnp-benchmark	6.830s

-> % go tool pprof profile.out
(pprof) top
Showing nodes accounting for 10360ms, 68.75% of 15070ms total
Dropped 150 nodes (cum <= 75.35ms)
Showing top 10 nodes out of 72
      flat  flat%   sum%        cum   cum%
    3460ms 22.96% 22.96%     3490ms 23.16%  capnproto.org/go/capnp/v3.(*Message).canRead
    1350ms  8.96% 31.92%     3830ms 25.41%  runtime.mallocgc
    1070ms  7.10% 39.02%    13020ms 86.40%  capnproto.org/go/capnp/v3.Struct.Ptr
     980ms  6.50% 45.52%     1370ms  9.09%  fmt.(*pp).doPrintf
     780ms  5.18% 50.70%    11780ms 78.17%  capnproto.org/go/capnp/v3.(*Segment).readPtr
     710ms  4.71% 55.41%      710ms  4.71%  runtime.nextFreeFast (inline)
     610ms  4.05% 59.46%      670ms  4.45%  capnproto.org/go/capnp/v3.Ptr.text
     550ms  3.65% 63.11%      800ms  5.31%  runtime.heapBitsSetType
     510ms  3.38% 66.49%    14290ms 94.82%  github.com/pikrzysztof/capnp-benchmark.Book.TitleBytes
     340ms  2.26% 68.75%      340ms  2.26%  runtime.memmove

You can see that the more goroutines we create, the worse the benchmark score gets, with the ReadLimiter showing up as the main source of our pain.

@ocdtrekkie

@pikrzysztof Is your benchmark repo set to private? (I am not going to run it, but since it 404s out, I figured I'd mention it to see if that was intentional or not.)

@pikrzysztof
Author

@ocdtrekkie thanks for noticing, fixed

@zenhack
Contributor

zenhack commented Nov 24, 2022

You can probably work around this for the moment by creating a separate message with the same backing arena:

msg2 := &capnp.Message{Arena: msg1.Arena, CapTable: msg1.CapTable}

...which will then have its own read limit and thus not have to contend with the other threads.

@pikrzysztof
Author

To illustrate the scaling characteristics, I ran the benchmark on my server. I limited it to 14 threads to keep cross-socket chatter and hyperthreading from kicking in.

Here's the data fitted to a Universal Scalability Law (USL) curve:

  • x axis is the number of goroutines concurrently accessing capnp structures
  • y is the benchmark throughput, i.e. the number of accesses per second (note the 1e6 multiplier)

The specific values are not very important; the point is that this particular benchmark does not scale well beyond 4 cores, and performance actually degrades if you throw more than 12 cores at it.
[figure: slowcapnpusl — USL fit of benchmark throughput vs. goroutine count]

And here's the same curve, extrapolated to 28 threads, without taking multiple sockets and hyperthreading into consideration.
[figure: extrapolated — the same USL curve extrapolated out to 28 threads]

Optimized parameters: [gamma = 4.68456920e+06, alpha = 4.68893905e-01, beta = 3.84146869e-03]
Estimated covariance: [[ 4.62262018e+10 8.83418242e+03 -2.47746327e+02]
[ 8.83418242e+03 2.04440998e-03 -8.13796353e-05]
[-2.47746327e+02 -8.13796353e-05 4.96395899e-06]]
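
For reference, these parameters plug into the standard USL formula; a small Go helper to show the mapping (gamma: ideal per-goroutine throughput, alpha: contention penalty, beta: coherency/crosstalk penalty):

// usl returns the predicted throughput for n concurrent goroutines. A
// non-zero beta is what makes throughput peak and then fall off as more
// cores are added.
func usl(n, gamma, alpha, beta float64) float64 {
    return gamma * n / (1 + alpha*(n-1) + beta*n*(n-1))
}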

@zenhack
Contributor

zenhack commented Nov 24, 2022

Yeah, that's more or less what I would expect. Have you tried the workaround I suggested above?

@pikrzysztof
Author

pikrzysztof commented Nov 24, 2022

Yeah, that's more or less what I would expect. Have you tried the workaround I suggested above?

yes, I did try it, but the code crashes. Please see the code:

pikrzysztof/gocapnp@84da80e

please find the error message below:

-> % go test -bench=.
goos: linux
goarch: amd64
pkg: github.com/pikrzysztof/capnp-benchmark
cpu: 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz
BenchmarkOne-8               	15248163	      134.3 ns/op
BenchmarkTwo-8               	10032820	      209.6 ns/op
BenchmarkThree-8             	5946781	      257.8 ns/op
BenchmarkFour-8              	4530513	      320.1 ns/op
BenchmarkFive-8              	3031620	      391.8 ns/op
BenchmarkSix-8               	2592050	      462.1 ns/op
BenchmarkSeven-8             	2481145	      508.9 ns/op
BenchmarkEight-8             	1889131	      630.7 ns/op
BenchmarkNine-8              	1687453	      683.0 ns/op
BenchmarkTen-8               	1551306	      834.5 ns/op
BenchmarkEleven-8            	1220730	      895.1 ns/op
BenchmarkTwelve-8            	1253991	      964.9 ns/op
BenchmarkThirteen-8          	1000000	     1031 ns/op
BenchmarkFourteen-8          	1000000	     1115 ns/op
BenchmarkFifteen-8           	1000000	     1170 ns/op
BenchmarkSixteen-8           	 903970	     1256 ns/op
BenchmarkSeventeen-8         	 985557	     1296 ns/op
BenchmarkEighteen-8          	 844633	     1330 ns/op
BenchmarkNineteen-8          	 871327	     1441 ns/op
BenchmarkTwenty-8            	 763418	     1544 ns/op
BenchmarkTwentyOne-8         	 757081	     1563 ns/op
BenchmarkTwentyTwo-8         	 718891	     1632 ns/op
BenchmarkTwentyThree-8       	 663918	     1738 ns/op
BenchmarkTwentyFour-8        	 601849	     1794 ns/op
BenchmarkTwentyFive-8        	 581594	     2022 ns/op
BenchmarkTwentySix-8         	 593205	     2006 ns/op
BenchmarkTwentySeven-8       	 570664	     2034 ns/op
BenchmarkTwentyEight-8       	 587677	     2099 ns/op
BenchmarkOneFast-8           	panic: list element out of bounds

goroutine 1814 [running]:
capnproto.org/go/capnp/v3.List.primitiveElem({0x0, 0x0, 0x0, {0x0, 0x0}, 0x0, 0x0}, 0xc00013a0f0?, {0x0, 0x1})
	/home/kris/go/pkg/mod/capnproto.org/go/capnp/v3@v3.0.0-alpha.9/list.go:189 +0x18d
capnproto.org/go/capnp/v3.PointerList.primitiveElem(...)
	/home/kris/go/pkg/mod/capnproto.org/go/capnp/v3@v3.0.0-alpha.9/list-gen.go:226
capnproto.org/go/capnp/v3.PointerList.At({0x0, 0x0, 0x0, {0x0, 0x0}, 0x0, 0x0}, 0xe00000018?)
	/home/kris/go/pkg/mod/capnproto.org/go/capnp/v3@v3.0.0-alpha.9/list.go:350 +0xcb
capnproto.org/go/capnp/v3.(*Message).Root(0x0?)
	/home/kris/go/pkg/mod/capnproto.org/go/capnp/v3@v3.0.0-alpha.9/message.go:185 +0xf6
github.com/pikrzysztof/capnp-benchmark.ReadRootBook(0x380?)
	/home/kris/dev/org/chfagg-cpuusage/benchmark/books.go:27 +0x3d
github.com/pikrzysztof/capnp-benchmark.spinGoroutinesFast(0x1, 0xc000152240)
	/home/kris/dev/org/chfagg-cpuusage/benchmark/main_test.go:59 +0x1b5
github.com/pikrzysztof/capnp-benchmark.BenchmarkOneFast(0xc000152240?)
	/home/kris/dev/org/chfagg-cpuusage/benchmark/main_test.go:164 +0x25
testing.(*B).runN(0xc000152240, 0x1)
	/usr/lib/go/src/testing/benchmark.go:193 +0x102
testing.(*B).run1.func1()
	/usr/lib/go/src/testing/benchmark.go:233 +0x59
created by testing.(*B).run1
	/usr/lib/go/src/testing/benchmark.go:226 +0x9c
exit status 2
FAIL	github.com/pikrzysztof/capnp-benchmark	42.847s

@pikrzysztof
Author

@zenhack I have investigated the crash, and it looks like the first segment in the cloned message is unset:

(dlv) print msg1
*capnproto.org/go/capnp/v3.Message {
	rlimit: 9223372036854775808,
	rlimitInit: sync.Once {
		done: 1,
		m: (*sync.Mutex)(0xc00013a09c),},
	Arena: capnproto.org/go/capnp/v3.Arena(*capnproto.org/go/capnp/v3.SingleSegmentArena) *[],
	CapTable: []capnproto.org/go/capnp/v3.Client len: 0, cap: 0, nil,
	TraverseLimit: 0,
	DepthLimit: 0,
	mu: sync.Mutex {state: 0, sema: 0},
	segs: map[capnproto.org/go/capnp/v3.SegmentID]*capnproto.org/go/capnp/v3.Segment [
		0: *(*"capnproto.org/go/capnp/v3.Segment")(0xc00013a0f0),
	],
	firstSeg: capnproto.org/go/capnp/v3.Segment {
		msg: *(*"capnproto.org/go/capnp/v3.Message")(0xc00013a090),
		id: 0,
		data: []uint8 len: 40, cap: 1024, [0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,114,0,0,0,87,97,114,32,97,110,100,32,80,101,97,99,101,0,0,0],},}
(dlv) print msg2
*capnproto.org/go/capnp/v3.Message {
	rlimit: 0,
	rlimitInit: sync.Once {
		done: 0,
		m: (*sync.Mutex)(0xc00013a12c),},
	Arena: capnproto.org/go/capnp/v3.Arena(*capnproto.org/go/capnp/v3.SingleSegmentArena) *[],
	CapTable: []capnproto.org/go/capnp/v3.Client len: 0, cap: 0, nil,
	TraverseLimit: 0,
	DepthLimit: 0,
	mu: sync.Mutex {state: 0, sema: 0},
	segs: map[capnproto.org/go/capnp/v3.SegmentID]*capnproto.org/go/capnp/v3.Segment nil,
	firstSeg: capnproto.org/go/capnp/v3.Segment {
		msg: *capnproto.org/go/capnp/v3.Message nil,
		id: 0,
		data: []uint8 len: 0, cap: 0, nil,},}

running

(dlv) set msg2.firstSeg =msg1.firstSeg

allows me to read the cloned message without any problems. AIUI the issue is that Message.setSegment is private so I can't do this trick outside of the debugger.

@zenhack
Contributor

zenhack commented Dec 24, 2022

Ah, that makes sense. Maybe instead try:

msg2, _, _ := NewMessage(msg1.Arena)
msg2.CapTable = msg1.CapTable

...which should properly initialize firstSeg.

@pikrzysztof
Author

...which should properly initialize firstSeg.

Yes, firstSeg is initialized, but does not have the same data as the original message.

firstSeg gets initialized to an array of zeroes:

(dlv) p msg2
*capnproto.org/go/capnp/v3.Message {
	rlimit: 9223372036854775808,
	rlimitInit: sync.Once {
		done: 1,
		m: (*sync.Mutex)(0xc0001da12c),},
	Arena: capnproto.org/go/capnp/v3.Arena(*capnproto.org/go/capnp/v3.SingleSegmentArena) *[],
	CapTable: []capnproto.org/go/capnp/v3.Client len: 0, cap: 0, nil,
	TraverseLimit: 0,
	DepthLimit: 0,
	mu: sync.Mutex {state: 0, sema: 0},
	segs: map[capnproto.org/go/capnp/v3.SegmentID]*capnproto.org/go/capnp/v3.Segment nil,
	firstSeg: capnproto.org/go/capnp/v3.Segment {
		msg: *(*"capnproto.org/go/capnp/v3.Message")(0xc0001da120),
		id: 0,
		data: []uint8 len: 8, cap: 1024, [0,0,0,0,0,0,0,0],},}

but the original object has the book title stored in there:

(dlv) p msg1
*capnproto.org/go/capnp/v3.Message {
	rlimit: 9223372036854775794,
	rlimitInit: sync.Once {
		done: 1,
		m: (*sync.Mutex)(0xc0001da09c),},
	Arena: capnproto.org/go/capnp/v3.Arena(*capnproto.org/go/capnp/v3.SingleSegmentArena) *[],
	CapTable: []capnproto.org/go/capnp/v3.Client len: 0, cap: 0, nil,
	TraverseLimit: 0,
	DepthLimit: 0,
	mu: sync.Mutex {state: 0, sema: 0},
	segs: map[capnproto.org/go/capnp/v3.SegmentID]*capnproto.org/go/capnp/v3.Segment [
		0: *(*"capnproto.org/go/capnp/v3.Segment")(0xc0001da0f0),
	],
	firstSeg: capnproto.org/go/capnp/v3.Segment {
		msg: *(*"capnproto.org/go/capnp/v3.Message")(0xc0001da090),
		id: 0,
		data: []uint8 len: 40, cap: 1024, [0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,114,0,0,0,87,97,114,32,97,110,100,32,80,101,97,99,101,0,0,0],},}

which means that the new Book object has an empty title.

I don't think focusing on this particular workaround is an efficient way of spending our time - let's focus on fixing the general use case. My application is currently running with a forked capnp that does not have the ReadLimiter in performance-critical sections.

@zenhack
Contributor

zenhack commented Dec 24, 2022

Yeah, probably not worth continuing to fight with it.

My current thinking is to add a pointer to the limiter as a field of each relevant datatype (Struct, Ptr, etc) along with a method:

func (Struct) WithReadLimit(rlimit *uint64) Struct

...and then stop using atomic operations on the read limiter, instead documenting that you need to use these methods to get a shallow copy if you're going to access a message from multiple threads.
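
For illustration, a call site under that proposal might look like this (a sketch of the proposed API only; WithReadLimit does not exist yet, and the limit value is made up):

// s is a capnp.Struct shared read-only across goroutines.
go func() {
    limit := uint64(64 << 20) // per-goroutine traversal budget
    s2 := s.WithReadLimit(&limit)
    ptr, _ := s2.Ptr(0) // charges against limit only; no shared atomics
    _ = ptr
}()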

@lthibault, thoughts?

Also tangentially related, @zombiezen, I have some questions re: the Message struct:

  • What was the rationale behind firstSeg, vs. just accessing segment 0 through segs/arena? Was this the result of some performance analysis?
  • What is the segs map doing for us? It seems like we could instead pass Segment rather than *Segment in most places and then just re-create the segment struct after calling Arena.Data.

@zenhack
Contributor

zenhack commented Dec 24, 2022

The other thing I am tempted to do, though I think it is maybe way too invasive: just take the read limit as a parameter to accessor methods, instead of storing it in the data types at all, e.g.

func (Struct) Ptr(rlim *uint64, i uint16) (Ptr, error)

It'd be somewhat analogous to a context.Context in that way.
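
And for comparison, a call site under that more invasive alternative (again hypothetical; today's Struct.Ptr takes only the index):

// The caller owns the limit and threads it through each accessor,
// much like a context.Context.
var rlim uint64 = 64 << 20
ptr, err := s.Ptr(&rlim, 0)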

@zombiezen
Contributor

What was the rationale behind firstSeg, vs. just accessing segment 0 through segs/arena? Was this the result of some performance analysis?

My recollection is that it was to reduce allocations when reading a single-segment message: fb3c35a

And I believe 1229c29 and 8dfabb4 have the benchmarks that I was using to justify that.

What is the segs map doing for us? It seems like we could instead pass Segment rather than *Segment in most places and then just re-create the segment struct after calling Arena.Data.

The intent was to preserve the identity of the segment once it is introduced, so msg.Segment(0) == msg.Segment(0). A different API may be justified.

@zenhack
Contributor

zenhack commented Dec 24, 2022

The intent was to preserve the identity of the segment once it is introduced, so msg.Segment(0) == msg.Segment(0). A different API may be justified.

I suppose we could add an .Is method or the like

@zombiezen
Contributor

I don't know that it's super important to be able to do the equality comparison, but I definitely didn't want to be allocating a new segment object on the heap every time the method was called.

@zenhack
Contributor

zenhack commented Dec 24, 2022

Yeah, my thinking is that if we can kill the pointer indirection, it can just be passed by value and not allocate.
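
A sketch of what that could look like (hypothetical; the field names match the Segment struct shown in the debugger output above, and Arena.Data is the accessor mentioned earlier):

// segmentValue rebuilds the Segment value on demand from the arena instead
// of caching a heap-allocated *Segment, so callers can pass it by value.
func (m *Message) segmentValue(id SegmentID) (Segment, error) {
    data, err := m.Arena.Data(id)
    if err != nil {
        return Segment{}, err
    }
    return Segment{msg: m, id: id, data: data}, nil
}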

@lthibault
Collaborator

Yeah, probably not worth continuing to fight with it.

My current thinking is to add a pointer to the limiter as a field of each relevant datatype (Struct, Ptr, etc) along with a method:

func (Struct) WithReadLimit(rlimit *uint64) Struct

...and then stop using atomic operations on the read limiter, instead documenting that you need to use these methods to get a shallow copy if you're going to access a message from multiple threads.

@lthibault, thoughts?

Sounds reasonable to me 👍
