Help with profiling OSX M1 #142

cjappl · 2023-02-26T16:41:23Z

Hi!

I saw

Note that it's only been tested on x86(-64); if someone has access to other processors I'd love to run some tests on anything that's not x86-based.

In the readme.

I have a mac m1 (arm) processor. If there is a suite of tests I can run for you I'd be thrilled to help.

Let me know

cameron314 · 2023-02-27T04:21:29Z

Neat, thanks! Do the tests and benchmarks compile using the makefiles in those directories? (May need to replace g++ with clang++.)

cjappl · 2023-02-27T17:52:10Z

Compiled with no issue, running says

Note: SPSC queue not supported on this platform, discount its timings

Full output:

> ./benchmarks
Note: SPSC queue not supported on this platform, discount its timings
                  |----------------  Min -----------------|----------------- Max -----------------|----------------- Avg -----------------|
Benchmark         |   RWQ   |  BRWCB  |  SPSC   |  Folly  |   RWQ   |  BRWCB  |  SPSC   |  Folly  |   RWQ   |  BRWCB  |  SPSC   |  Folly  | xSPSC | xFolly
------------------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+-------+-------
Raw add           | 0.0003s | 0.0009s | 0.0000s | 0.0001s | 0.0003s | 0.0009s | 0.0000s | 0.0001s | 0.0003s | 0.0009s | 0.0000s | 0.0001s | 0.00x | 0.41x
Raw remove        | 0.0003s | 0.0007s | 0.0000s | 0.0001s | 0.0003s | 0.0007s | 0.0000s | 0.0001s | 0.0003s | 0.0007s | 0.0000s | 0.0001s | 0.00x | 0.28x
Raw empty remove  | 0.0021s | 0.0007s | 0.0000s | 0.0012s | 0.0021s | 0.0007s | 0.0000s | 0.0012s | 0.0021s | 0.0007s | 0.0000s | 0.0012s | 0.00x | 0.59x
Single-threaded   | 0.0028s | 0.0029s | 0.0000s | 0.0014s | 0.0028s | 0.0029s | 0.0000s | 0.0015s | 0.0028s | 0.0029s | 0.0000s | 0.0015s | 0.00x | 0.51x
Mostly add        | 0.0029s | 0.0120s | 0.0000s | 0.0641s | 0.0029s | 0.0120s | 0.0000s | 0.0735s | 0.0029s | 0.0120s | 0.0000s | 0.0685s | 0.00x | 23.68x
Mostly remove     | 0.0027s | 0.0032s | 0.0000s | 0.0027s | 0.0027s | 0.0032s | 0.0000s | 0.0027s | 0.0027s | 0.0032s | 0.0000s | 0.0027s | 0.00x | 0.98x
Heavy concurrent  | 0.0025s | 0.0143s | 0.0000s | 0.0399s | 0.0025s | 0.0148s | 0.0000s | 0.0428s | 0.0025s | 0.0145s | 0.0000s | 0.0417s | 0.00x | 16.61x
Random concurrent | 0.0061s | 0.0083s | 0.0000s | 0.0066s | 0.0061s | 0.0083s | 0.0000s | 0.0067s | 0.0061s | 0.0083s | 0.0000s | 0.0067s | 0.00x | 1.08x

Average ops/s:
    ReaderWriterQueue:                  371.80 million
    BlockingReaderWriterCircularBuffer: 481.26 million
    SPSC queue:                         0.00 million
    Folly queue:                        513.61 million

I modified the code to allow me to run, but crash when running in -O3:

> lldb benchmarks
(lldb) target create "benchmarks"
Current executable set to '/Users/chrisapple/code/readerwriterqueue/benchmarks/benchmarks' (arm64).
(lldb) run
Process 69685 launched: '/Users/chrisapple/code/readerwriterqueue/benchmarks/benchmarks' (arm64)
Process 69685 stopped
* thread #2, stop reason = EXC_BAD_ACCESS (code=1, address=0x8)
    frame #0: 0x000000010000a52c benchmarks`SimpleThread::CallbackWrapper<double runBenchmark<spsc_queue<int> >(BenchmarkType, unsigned int, double&)::'lambda3'(), details::ArgWrapper<void, void, void> >::callAndDelete(void*) [inlined] spsc_queue<int>::try_dequeue(this=0x000000016fdfb7d8, v=0x000000016fdfb7b4) at spscqueue.h:82:29 [opt]
   79     {
   80         if (load_consume(&tail_->next_))
   81         {
-> 82             v = tail_->next_->value_;
   83             store_release(&tail_, tail_->next_);
   84             return true;
   85         }
Target 0: (benchmarks) stopped.
warning: benchmarks was compiled with optimization - stepping may behave oddly; variables may not be available.
(lldb) bt
* thread #2, stop reason = EXC_BAD_ACCESS (code=1, address=0x8)
  * frame #0: 0x000000010000a52c benchmarks`SimpleThread::CallbackWrapper<double runBenchmark<spsc_queue<int> >(BenchmarkType, unsigned int, double&)::'lambda3'(), details::ArgWrapper<void, void, void> >::callAndDelete(void*) [inlined] spsc_queue<int>::try_dequeue(this=0x000000016fdfb7d8, v=0x000000016fdfb7b4) at spscqueue.h:82:29 [opt]
    frame #1: 0x000000010000a518 benchmarks`SimpleThread::CallbackWrapper<double runBenchmark<spsc_queue<int> >(BenchmarkType, unsigned int, double&)::'lambda3'(), details::ArgWrapper<void, void, void> >::callAndDelete(void*) at bench.cpp:373:7 [opt]
    frame #2: 0x000000010000a50c benchmarks`SimpleThread::CallbackWrapper<double runBenchmark<spsc_queue<int> >(BenchmarkType, unsigned int, double&)::'lambda3'(), details::ArgWrapper<void, void, void> >::callAndDelete(void*) [inlined] void details::ArgWrapper<void, void, void>::callCallback<double runBenchmark<spsc_queue<int> >(this=0x0000600000201290, callback=0x0000600000201280)::'lambda3'()>(double runBenchmark<spsc_queue<int> >(BenchmarkType, unsigned int, double&)::'lambda3'()&&) const at simplethread.h:49:80 [opt]
    frame #3: 0x000000010000a50c benchmarks`SimpleThread::CallbackWrapper<double runBenchmark<spsc_queue<int> >(BenchmarkType, unsigned int, double&)::'lambda3'(), details::ArgWrapper<void, void, void> >::callAndDelete(wrapper=0x0000600000201280) at simplethread.h:71:23 [opt]
    frame #4: 0x000000010000b2f0 benchmarks`void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, void (*)(SimpleThread::ThreadRef*), SimpleThread::ThreadRef*> >(void*) [inlined] decltype(__f=<unavailable>, __args=<unavailable>)(static_cast<SimpleThread::ThreadRef*>(fp0))) std::__1::__invoke<void (*)(SimpleThread::ThreadRef*), SimpleThread::ThreadRef*>(void (*&&)(SimpleThread::ThreadRef*), SimpleThread::ThreadRef*&&) at type_traits:3918:1 [opt]
    frame #5: 0x000000010000b2ec benchmarks`void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, void (*)(SimpleThread::ThreadRef*), SimpleThread::ThreadRef*> >(void*) [inlined] void std::__1::__thread_execute<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, void (*)(SimpleThread::ThreadRef*), SimpleThread::ThreadRef*, 2ul>(__t=<unavailable>, (null)=<unavailable>)(SimpleThread::ThreadRef*), SimpleThread::ThreadRef*>&, std::__1::__tuple_indices<2ul>) at thread:287:5 [opt]
    frame #6: 0x000000010000b2ec benchmarks`void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, void (*)(SimpleThread::ThreadRef*), SimpleThread::ThreadRef*> >(__vp=0x0000600000201220) at thread:298:5 [opt]
    frame #7: 0x00000001a46b826c libsystem_pthread.dylib`_pthread_start + 148
(lldb) (lldb)

The specific crashing frame more up close

* thread #2, stop reason = EXC_BAD_ACCESS (code=1, address=0x8)
    frame #0: 0x000000010000a52c benchmarks`SimpleThread::CallbackWrapper<double runBenchmark<spsc_queue<int> >(BenchmarkType, unsigned int, double&)::'lambda3'(), details::ArgWrapper<void, void, void> >::callAndDelete(void*) [inlined] spsc_queue<int>::try_dequeue(this=0x000000016fdfb7d8, v=0x000000016fdfb7b4) at spscqueue.h:82:29 [opt]
   79     {
   80         if (load_consume(&tail_->next_))
   81         {
-> 82             v = tail_->next_->value_;
   83             store_release(&tail_, tail_->next_);
   84             return true;
   85         }

This does not happen when compiling the benchmarks with -O0. Let me see if I can poke around and find anything satisfying

cjappl · 2023-02-27T17:55:12Z

Some output from ubsan and asan:

bench.cpp:277:10: runtime error: signed integer overflow: 2147450880 + 65536 cannot be represented in type 'int'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior bench.cpp:277:10 in 
bench.cpp:277:10: runtime error: signed integer overflow: 2147450880 + 65536 cannot be represented in type 'int'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior bench.cpp:277:10 in 
bench.cpp:277:10: runtime error: signed integer overflow: 2147450880 + 65536 cannot be represented in type 'int'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior bench.cpp:277:10 in 
bench.cpp:277:10: runtime error: signed integer overflow: 2147450880 + 65536 cannot be represented in type 'int'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior bench.cpp:277:10 in 
ext/1024cores/spscqueue.h:82:29: runtime error: member access within null pointer of type 'spsc_queue<int>::node'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior ext/1024cores/spscqueue.h:82:29 in 
AddressSanitizer:DEADLYSIGNAL
=================================================================
==70167==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000008 (pc 0x0001006461c8 bp 0x00016f85ee80 sp 0x00016f85ee40 T501)
==70167==The signal is caused by a UNKNOWN memory access.
==70167==Hint: address points to the zero page.
    #0 0x1006461c8 in spsc_queue<int>::try_dequeue(int&) spscqueue.h:82
    #1 0x100647b1c in SimpleThread::CallbackWrapper<double runBenchmark<spsc_queue<int> >(BenchmarkType, unsigned int, double&)::'lambda3'(), details::ArgWrapper<void, void, void> >::callAndDelete(void*) simplethread.h:71
    #2 0x10064fa14 in void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, void (*)(SimpleThread::ThreadRef*), SimpleThread::ThreadRef*> >(void*) thread:298
    #3 0x1a46b8268 in _pthread_start+0x90 (libsystem_pthread.dylib:arm64e+0x7268)
    #4 0x1a46b3088 in thread_start+0x4 (libsystem_pthread.dylib:arm64e+0x2088)

==70167==Register values:
 x[0] = 0x000000016f85ed58   x[1] = 0x000000016f7d2d70   x[2] = 0x0000000000000000   x[3] = 0x0000000135f007a0  
 x[4] = 0x0000000063000000   x[5] = 0x0000000000000000   x[6] = 0x0000000000000000   x[7] = 0x0000000000000000  
 x[8] = 0x000000016f7d2cf0   x[9] = 0x0000000000000008  x[10] = 0x0000007000020000  x[11] = 0x000000002defa59e  
x[12] = 0x0000000126816310  x[13] = 0x0000000000000000  x[14] = 0x0000000000000000  x[15] = 0x0000000000000000  
x[16] = 0x0000000301413400  x[17] = 0x0000000100c4c0a8  x[18] = 0x0000000000000000  x[19] = 0x0000000102b02d70  
x[20] = 0x0000000102b02d78  x[21] = 0x00000001006624b0  x[22] = 0x0000000100662490  x[23] = 0x0000007000020000  
x[24] = 0x0000000000000000  x[25] = 0x0000000000124ddb  x[26] = 0x00000000205605ae  x[27] = 0x00000000205605af  
x[28] = 0x0000000000000000     fp = 0x000000016f85ee80     lr = 0x00000001006463f8     sp = 0x000000016f85ee40  
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV spscqueue.h:82 in spsc_queue<int>::try_dequeue(int&)
Thread T501 created by T0 here:
    #0 0x100bf8c5c in wrap_pthread_create+0x54 (libclang_rt.asan_osx_dynamic.dylib:arm64e+0x38c5c)
    #1 0x10064f46c in std::__1::thread::thread<void (*)(SimpleThread::ThreadRef*), SimpleThread::ThreadRef*&, void>(void (*&&)(SimpleThread::ThreadRef*), SimpleThread::ThreadRef*&) thread:314
    #2 0x10064eb64 in SimpleThread::startThread(void*, void (*)(void*)) simplethread.cpp:65
    #3 0x1006479f4 in SimpleThread::SimpleThread<double runBenchmark<spsc_queue<int> >(BenchmarkType, unsigned int, double&)::'lambda3'()>(double runBenchmark<spsc_queue<int> >(BenchmarkType, unsigned int, double&)::'lambda3'()&&) simplethread.h:115
    #4 0x100635f94 in double runBenchmark<spsc_queue<int> >(BenchmarkType, unsigned int, double&) bench.cpp:371
    #5 0x10062bc2c in main bench.cpp:124
    #6 0x1006e1088 in start+0x204 (dyld:arm64e+0x5088)

==70167==ABORTING

cameron314 · 2023-02-27T17:55:35Z

Don't waste your time, the spsc implementation depends specifically on x86 semantics and is not portable. I guess we'd have to disable it completely to be able to run the benchmarks in O3.

cameron314 · 2023-02-27T17:56:32Z

asan is also known to give false positives with lock-free code.

cameron314 · 2023-02-27T17:57:26Z

Curious where those ubsan overflows are from though.

cjappl · 2023-02-27T17:58:23Z

The ubsan overflows are pretty easy to fix, just changing:

		unsigned long long total = 0;

on line 270 of the bench file fixed it up (long long may be overkill, but this silenced it)

cjappl · 2023-02-27T17:59:42Z

Don't waste your time, the spsc implementation depends specifically on x86 semantics and is not portable. I guess we'd have to disable it completely to be able to run the benchmarks in O3.

Does this mean MPMC (concurrentqueue) is portable? Should I try and run some benchmarks on that? Happy to help in any way I can.

cameron314 · 2023-02-27T18:00:06Z

Ah I see, it's a dummy variable just to disable dead code removal. Yeah switching to any unsigned type should fix it.

cameron314 · 2023-02-27T18:01:26Z

"spsc" here in the benchmarks is not my code, but a different SPSC queue implementation from https://www.1024cores.net/home/lock-free-algorithms/queues/unbounded-spsc-queue just for comparison.

cameron314 · 2023-02-27T18:03:41Z

Would be worth getting the benchmarks to run in O3 (without spsc) to see the results.

cjappl · 2023-02-27T18:03:46Z

Ah, got it! Sorry, I thought that was referring to the code in this repo (readerwriterqueue)

Here are the afformentioned results, now that we know SPSC is dead and we don't care about it:

> ./benchmarks
Note: SPSC queue not supported on this platform, discount its timings
                  |----------------  Min -----------------|----------------- Max -----------------|----------------- Avg -----------------|
Benchmark         |   RWQ   |  BRWCB  |  SPSC   |  Folly  |   RWQ   |  BRWCB  |  SPSC   |  Folly  |   RWQ   |  BRWCB  |  SPSC   |  Folly  | xSPSC | xFolly
------------------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+-------+-------
Raw add           | 0.0003s | 0.0009s | 0.0000s | 0.0001s | 0.0003s | 0.0009s | 0.0000s | 0.0001s | 0.0003s | 0.0009s | 0.0000s | 0.0001s | 0.00x | 0.42x
Raw remove        | 0.0003s | 0.0007s | 0.0000s | 0.0001s | 0.0003s | 0.0007s | 0.0000s | 0.0001s | 0.0003s | 0.0007s | 0.0000s | 0.0001s | 0.00x | 0.28x
Raw empty remove  | 0.0021s | 0.0007s | 0.0000s | 0.0012s | 0.0021s | 0.0007s | 0.0000s | 0.0012s | 0.0021s | 0.0007s | 0.0000s | 0.0012s | 0.00x | 0.59x
Single-threaded   | 0.0028s | 0.0029s | 0.0000s | 0.0014s | 0.0028s | 0.0029s | 0.0000s | 0.0015s | 0.0028s | 0.0029s | 0.0000s | 0.0015s | 0.00x | 0.51x
Mostly add        | 0.0029s | 0.0119s | 0.0000s | 0.0373s | 0.0029s | 0.0120s | 0.0000s | 0.0594s | 0.0029s | 0.0120s | 0.0000s | 0.0487s | 0.00x | 16.86x
Mostly remove     | 0.0027s | 0.0035s | 0.0000s | 0.0026s | 0.0027s | 0.0035s | 0.0000s | 0.0027s | 0.0027s | 0.0035s | 0.0000s | 0.0027s | 0.00x | 0.97x
Heavy concurrent  | 0.0025s | 0.0151s | 0.0000s | 0.0415s | 0.0025s | 0.0153s | 0.0000s | 0.0432s | 0.0025s | 0.0152s | 0.0000s | 0.0427s | 0.00x | 17.09x
Random concurrent | 0.0061s | 0.0082s | 0.0000s | 0.0066s | 0.0061s | 0.0083s | 0.0000s | 0.0067s | 0.0061s | 0.0082s | 0.0000s | 0.0067s | 0.00x | 1.08x

Average ops/s:
    ReaderWriterQueue:                  363.73 million
    BlockingReaderWriterCircularBuffer: 473.82 million
    SPSC queue:                         0.00 million
    Folly queue:                        509.72 million

cameron314 · 2023-02-27T18:04:15Z

That's O3?

cjappl · 2023-02-27T18:05:24Z

Yep, this is no changes to the repo, just tip of main. the compile line being used is:

benchmarks$(EXT): bench.cpp ../readerwriterqueue.h ../readerwritercircularbuffer.h ../atomicops.h ext/1024cores/spscqueue.h ext/folly/ProducerConsumerQueue.h ../tests/common/simplethread.h ../tests/common/simplethread.cpp systemtime.h systemtime.cpp makefile
	g++ -std=c++11 -Wpedantic -Wall -DNDEBUG -O3 -g bench.cpp ../tests/common/simplethread.cpp systemtime.cpp -o benchmarks$(EXT) -pthread $(PLATFORM_OPTS)

cjappl · 2023-02-27T18:06:18Z

(to answer your previous concern about clang++ macosx "helpfully" aliases g++ to clang++)

> g++
clang: error: no input files

cameron314 · 2023-02-27T18:09:38Z

Nice. Thanks for the results. Would have to dig into the assembly to see why the base operations with Folly are twice as fast.

cjappl · 2023-02-27T18:10:39Z

Sounds good. Please don't hesitate to reach out if I can assist. This project has helped me a ton so happy to contribute in any way I can.

cjappl changed the title ~~Help with profiling~~ Help with profiling OSX M1 Feb 27, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Help with profiling OSX M1 #142

Help with profiling OSX M1 #142

cjappl commented Feb 26, 2023

cameron314 commented Feb 27, 2023

cjappl commented Feb 27, 2023

cjappl commented Feb 27, 2023

cameron314 commented Feb 27, 2023

cameron314 commented Feb 27, 2023

cameron314 commented Feb 27, 2023

cjappl commented Feb 27, 2023

cjappl commented Feb 27, 2023

cameron314 commented Feb 27, 2023

cameron314 commented Feb 27, 2023 •

edited

cameron314 commented Feb 27, 2023

cjappl commented Feb 27, 2023

cameron314 commented Feb 27, 2023

cjappl commented Feb 27, 2023

cjappl commented Feb 27, 2023

cameron314 commented Feb 27, 2023

cjappl commented Feb 27, 2023

Help with profiling OSX M1 #142

Help with profiling OSX M1 #142

Comments

cjappl commented Feb 26, 2023

cameron314 commented Feb 27, 2023

cjappl commented Feb 27, 2023

cjappl commented Feb 27, 2023

cameron314 commented Feb 27, 2023

cameron314 commented Feb 27, 2023

cameron314 commented Feb 27, 2023

cjappl commented Feb 27, 2023

cjappl commented Feb 27, 2023

cameron314 commented Feb 27, 2023

cameron314 commented Feb 27, 2023 • edited

cameron314 commented Feb 27, 2023

cjappl commented Feb 27, 2023

cameron314 commented Feb 27, 2023

cjappl commented Feb 27, 2023

cjappl commented Feb 27, 2023

cameron314 commented Feb 27, 2023

cjappl commented Feb 27, 2023

cameron314 commented Feb 27, 2023 •

edited