Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Help with profiling OSX M1 #142

Open
cjappl opened this issue Feb 26, 2023 · 17 comments
Open

Help with profiling OSX M1 #142

cjappl opened this issue Feb 26, 2023 · 17 comments

Comments

@cjappl
Copy link
Contributor

cjappl commented Feb 26, 2023

Hi!

I saw

Note that it's only been tested on x86(-64); if someone has access to other processors I'd love to run some tests on anything that's not x86-based.

In the readme.

I have a mac m1 (arm) processor. If there is a suite of tests I can run for you I'd be thrilled to help.

Let me know

@cameron314
Copy link
Owner

Neat, thanks! Do the tests and benchmarks compile using the makefiles in those directories? (May need to replace g++ with clang++.)

@cjappl cjappl changed the title Help with profiling Help with profiling OSX M1 Feb 27, 2023
@cjappl
Copy link
Contributor Author

cjappl commented Feb 27, 2023

Compiled with no issue, running says

Note: SPSC queue not supported on this platform, discount its timings

Full output:

> ./benchmarks
Note: SPSC queue not supported on this platform, discount its timings
                  |----------------  Min -----------------|----------------- Max -----------------|----------------- Avg -----------------|
Benchmark         |   RWQ   |  BRWCB  |  SPSC   |  Folly  |   RWQ   |  BRWCB  |  SPSC   |  Folly  |   RWQ   |  BRWCB  |  SPSC   |  Folly  | xSPSC | xFolly
------------------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+-------+-------
Raw add           | 0.0003s | 0.0009s | 0.0000s | 0.0001s | 0.0003s | 0.0009s | 0.0000s | 0.0001s | 0.0003s | 0.0009s | 0.0000s | 0.0001s | 0.00x | 0.41x
Raw remove        | 0.0003s | 0.0007s | 0.0000s | 0.0001s | 0.0003s | 0.0007s | 0.0000s | 0.0001s | 0.0003s | 0.0007s | 0.0000s | 0.0001s | 0.00x | 0.28x
Raw empty remove  | 0.0021s | 0.0007s | 0.0000s | 0.0012s | 0.0021s | 0.0007s | 0.0000s | 0.0012s | 0.0021s | 0.0007s | 0.0000s | 0.0012s | 0.00x | 0.59x
Single-threaded   | 0.0028s | 0.0029s | 0.0000s | 0.0014s | 0.0028s | 0.0029s | 0.0000s | 0.0015s | 0.0028s | 0.0029s | 0.0000s | 0.0015s | 0.00x | 0.51x
Mostly add        | 0.0029s | 0.0120s | 0.0000s | 0.0641s | 0.0029s | 0.0120s | 0.0000s | 0.0735s | 0.0029s | 0.0120s | 0.0000s | 0.0685s | 0.00x | 23.68x
Mostly remove     | 0.0027s | 0.0032s | 0.0000s | 0.0027s | 0.0027s | 0.0032s | 0.0000s | 0.0027s | 0.0027s | 0.0032s | 0.0000s | 0.0027s | 0.00x | 0.98x
Heavy concurrent  | 0.0025s | 0.0143s | 0.0000s | 0.0399s | 0.0025s | 0.0148s | 0.0000s | 0.0428s | 0.0025s | 0.0145s | 0.0000s | 0.0417s | 0.00x | 16.61x
Random concurrent | 0.0061s | 0.0083s | 0.0000s | 0.0066s | 0.0061s | 0.0083s | 0.0000s | 0.0067s | 0.0061s | 0.0083s | 0.0000s | 0.0067s | 0.00x | 1.08x

Average ops/s:
    ReaderWriterQueue:                  371.80 million
    BlockingReaderWriterCircularBuffer: 481.26 million
    SPSC queue:                         0.00 million
    Folly queue:                        513.61 million

I modified the code to allow me to run, but crash when running in -O3:

Screen Shot 2023-02-27 at 9 49 37 AM

> lldb benchmarks
(lldb) target create "benchmarks"
Current executable set to '/Users/chrisapple/code/readerwriterqueue/benchmarks/benchmarks' (arm64).
(lldb) run
Process 69685 launched: '/Users/chrisapple/code/readerwriterqueue/benchmarks/benchmarks' (arm64)
Process 69685 stopped
* thread #2, stop reason = EXC_BAD_ACCESS (code=1, address=0x8)
    frame #0: 0x000000010000a52c benchmarks`SimpleThread::CallbackWrapper<double runBenchmark<spsc_queue<int> >(BenchmarkType, unsigned int, double&)::'lambda3'(), details::ArgWrapper<void, void, void> >::callAndDelete(void*) [inlined] spsc_queue<int>::try_dequeue(this=0x000000016fdfb7d8, v=0x000000016fdfb7b4) at spscqueue.h:82:29 [opt]
   79     {
   80         if (load_consume(&tail_->next_))
   81         {
-> 82             v = tail_->next_->value_;
   83             store_release(&tail_, tail_->next_);
   84             return true;
   85         }
Target 0: (benchmarks) stopped.
warning: benchmarks was compiled with optimization - stepping may behave oddly; variables may not be available.
(lldb) bt
* thread #2, stop reason = EXC_BAD_ACCESS (code=1, address=0x8)
  * frame #0: 0x000000010000a52c benchmarks`SimpleThread::CallbackWrapper<double runBenchmark<spsc_queue<int> >(BenchmarkType, unsigned int, double&)::'lambda3'(), details::ArgWrapper<void, void, void> >::callAndDelete(void*) [inlined] spsc_queue<int>::try_dequeue(this=0x000000016fdfb7d8, v=0x000000016fdfb7b4) at spscqueue.h:82:29 [opt]
    frame #1: 0x000000010000a518 benchmarks`SimpleThread::CallbackWrapper<double runBenchmark<spsc_queue<int> >(BenchmarkType, unsigned int, double&)::'lambda3'(), details::ArgWrapper<void, void, void> >::callAndDelete(void*) at bench.cpp:373:7 [opt]
    frame #2: 0x000000010000a50c benchmarks`SimpleThread::CallbackWrapper<double runBenchmark<spsc_queue<int> >(BenchmarkType, unsigned int, double&)::'lambda3'(), details::ArgWrapper<void, void, void> >::callAndDelete(void*) [inlined] void details::ArgWrapper<void, void, void>::callCallback<double runBenchmark<spsc_queue<int> >(this=0x0000600000201290, callback=0x0000600000201280)::'lambda3'()>(double runBenchmark<spsc_queue<int> >(BenchmarkType, unsigned int, double&)::'lambda3'()&&) const at simplethread.h:49:80 [opt]
    frame #3: 0x000000010000a50c benchmarks`SimpleThread::CallbackWrapper<double runBenchmark<spsc_queue<int> >(BenchmarkType, unsigned int, double&)::'lambda3'(), details::ArgWrapper<void, void, void> >::callAndDelete(wrapper=0x0000600000201280) at simplethread.h:71:23 [opt]
    frame #4: 0x000000010000b2f0 benchmarks`void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, void (*)(SimpleThread::ThreadRef*), SimpleThread::ThreadRef*> >(void*) [inlined] decltype(__f=<unavailable>, __args=<unavailable>)(static_cast<SimpleThread::ThreadRef*>(fp0))) std::__1::__invoke<void (*)(SimpleThread::ThreadRef*), SimpleThread::ThreadRef*>(void (*&&)(SimpleThread::ThreadRef*), SimpleThread::ThreadRef*&&) at type_traits:3918:1 [opt]
    frame #5: 0x000000010000b2ec benchmarks`void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, void (*)(SimpleThread::ThreadRef*), SimpleThread::ThreadRef*> >(void*) [inlined] void std::__1::__thread_execute<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, void (*)(SimpleThread::ThreadRef*), SimpleThread::ThreadRef*, 2ul>(__t=<unavailable>, (null)=<unavailable>)(SimpleThread::ThreadRef*), SimpleThread::ThreadRef*>&, std::__1::__tuple_indices<2ul>) at thread:287:5 [opt]
    frame #6: 0x000000010000b2ec benchmarks`void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, void (*)(SimpleThread::ThreadRef*), SimpleThread::ThreadRef*> >(__vp=0x0000600000201220) at thread:298:5 [opt]
    frame #7: 0x00000001a46b826c libsystem_pthread.dylib`_pthread_start + 148
(lldb) (lldb)

The specific crashing frame more up close

* thread #2, stop reason = EXC_BAD_ACCESS (code=1, address=0x8)
    frame #0: 0x000000010000a52c benchmarks`SimpleThread::CallbackWrapper<double runBenchmark<spsc_queue<int> >(BenchmarkType, unsigned int, double&)::'lambda3'(), details::ArgWrapper<void, void, void> >::callAndDelete(void*) [inlined] spsc_queue<int>::try_dequeue(this=0x000000016fdfb7d8, v=0x000000016fdfb7b4) at spscqueue.h:82:29 [opt]
   79     {
   80         if (load_consume(&tail_->next_))
   81         {
-> 82             v = tail_->next_->value_;
   83             store_release(&tail_, tail_->next_);
   84             return true;
   85         }

This does not happen when compiling the benchmarks with -O0. Let me see if I can poke around and find anything satisfying

@cjappl
Copy link
Contributor Author

cjappl commented Feb 27, 2023

Some output from ubsan and asan:

bench.cpp:277:10: runtime error: signed integer overflow: 2147450880 + 65536 cannot be represented in type 'int'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior bench.cpp:277:10 in 
bench.cpp:277:10: runtime error: signed integer overflow: 2147450880 + 65536 cannot be represented in type 'int'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior bench.cpp:277:10 in 
bench.cpp:277:10: runtime error: signed integer overflow: 2147450880 + 65536 cannot be represented in type 'int'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior bench.cpp:277:10 in 
bench.cpp:277:10: runtime error: signed integer overflow: 2147450880 + 65536 cannot be represented in type 'int'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior bench.cpp:277:10 in 
ext/1024cores/spscqueue.h:82:29: runtime error: member access within null pointer of type 'spsc_queue<int>::node'
SUMMARY: UndefinedBehaviorSanitizer: undefined-behavior ext/1024cores/spscqueue.h:82:29 in 
AddressSanitizer:DEADLYSIGNAL
=================================================================
==70167==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000008 (pc 0x0001006461c8 bp 0x00016f85ee80 sp 0x00016f85ee40 T501)
==70167==The signal is caused by a UNKNOWN memory access.
==70167==Hint: address points to the zero page.
    #0 0x1006461c8 in spsc_queue<int>::try_dequeue(int&) spscqueue.h:82
    #1 0x100647b1c in SimpleThread::CallbackWrapper<double runBenchmark<spsc_queue<int> >(BenchmarkType, unsigned int, double&)::'lambda3'(), details::ArgWrapper<void, void, void> >::callAndDelete(void*) simplethread.h:71
    #2 0x10064fa14 in void* std::__1::__thread_proxy<std::__1::tuple<std::__1::unique_ptr<std::__1::__thread_struct, std::__1::default_delete<std::__1::__thread_struct> >, void (*)(SimpleThread::ThreadRef*), SimpleThread::ThreadRef*> >(void*) thread:298
    #3 0x1a46b8268 in _pthread_start+0x90 (libsystem_pthread.dylib:arm64e+0x7268)
    #4 0x1a46b3088 in thread_start+0x4 (libsystem_pthread.dylib:arm64e+0x2088)

==70167==Register values:
 x[0] = 0x000000016f85ed58   x[1] = 0x000000016f7d2d70   x[2] = 0x0000000000000000   x[3] = 0x0000000135f007a0  
 x[4] = 0x0000000063000000   x[5] = 0x0000000000000000   x[6] = 0x0000000000000000   x[7] = 0x0000000000000000  
 x[8] = 0x000000016f7d2cf0   x[9] = 0x0000000000000008  x[10] = 0x0000007000020000  x[11] = 0x000000002defa59e  
x[12] = 0x0000000126816310  x[13] = 0x0000000000000000  x[14] = 0x0000000000000000  x[15] = 0x0000000000000000  
x[16] = 0x0000000301413400  x[17] = 0x0000000100c4c0a8  x[18] = 0x0000000000000000  x[19] = 0x0000000102b02d70  
x[20] = 0x0000000102b02d78  x[21] = 0x00000001006624b0  x[22] = 0x0000000100662490  x[23] = 0x0000007000020000  
x[24] = 0x0000000000000000  x[25] = 0x0000000000124ddb  x[26] = 0x00000000205605ae  x[27] = 0x00000000205605af  
x[28] = 0x0000000000000000     fp = 0x000000016f85ee80     lr = 0x00000001006463f8     sp = 0x000000016f85ee40  
AddressSanitizer can not provide additional info.
SUMMARY: AddressSanitizer: SEGV spscqueue.h:82 in spsc_queue<int>::try_dequeue(int&)
Thread T501 created by T0 here:
    #0 0x100bf8c5c in wrap_pthread_create+0x54 (libclang_rt.asan_osx_dynamic.dylib:arm64e+0x38c5c)
    #1 0x10064f46c in std::__1::thread::thread<void (*)(SimpleThread::ThreadRef*), SimpleThread::ThreadRef*&, void>(void (*&&)(SimpleThread::ThreadRef*), SimpleThread::ThreadRef*&) thread:314
    #2 0x10064eb64 in SimpleThread::startThread(void*, void (*)(void*)) simplethread.cpp:65
    #3 0x1006479f4 in SimpleThread::SimpleThread<double runBenchmark<spsc_queue<int> >(BenchmarkType, unsigned int, double&)::'lambda3'()>(double runBenchmark<spsc_queue<int> >(BenchmarkType, unsigned int, double&)::'lambda3'()&&) simplethread.h:115
    #4 0x100635f94 in double runBenchmark<spsc_queue<int> >(BenchmarkType, unsigned int, double&) bench.cpp:371
    #5 0x10062bc2c in main bench.cpp:124
    #6 0x1006e1088 in start+0x204 (dyld:arm64e+0x5088)

==70167==ABORTING

@cameron314
Copy link
Owner

Don't waste your time, the spsc implementation depends specifically on x86 semantics and is not portable. I guess we'd have to disable it completely to be able to run the benchmarks in O3.

@cameron314
Copy link
Owner

asan is also known to give false positives with lock-free code.

@cameron314
Copy link
Owner

Curious where those ubsan overflows are from though.

@cjappl
Copy link
Contributor Author

cjappl commented Feb 27, 2023

The ubsan overflows are pretty easy to fix, just changing:

		unsigned long long total = 0;

on line 270 of the bench file fixed it up (long long may be overkill, but this silenced it)

@cjappl
Copy link
Contributor Author

cjappl commented Feb 27, 2023

Don't waste your time, the spsc implementation depends specifically on x86 semantics and is not portable. I guess we'd have to disable it completely to be able to run the benchmarks in O3.

Does this mean MPMC (concurrentqueue) is portable? Should I try and run some benchmarks on that? Happy to help in any way I can.

@cameron314
Copy link
Owner

Ah I see, it's a dummy variable just to disable dead code removal. Yeah switching to any unsigned type should fix it.

@cameron314
Copy link
Owner

cameron314 commented Feb 27, 2023

"spsc" here in the benchmarks is not my code, but a different SPSC queue implementation from https://www.1024cores.net/home/lock-free-algorithms/queues/unbounded-spsc-queue just for comparison.

@cameron314
Copy link
Owner

Would be worth getting the benchmarks to run in O3 (without spsc) to see the results.

@cjappl
Copy link
Contributor Author

cjappl commented Feb 27, 2023

Ah, got it! Sorry, I thought that was referring to the code in this repo (readerwriterqueue)

Here are the afformentioned results, now that we know SPSC is dead and we don't care about it:

> ./benchmarks
Note: SPSC queue not supported on this platform, discount its timings
                  |----------------  Min -----------------|----------------- Max -----------------|----------------- Avg -----------------|
Benchmark         |   RWQ   |  BRWCB  |  SPSC   |  Folly  |   RWQ   |  BRWCB  |  SPSC   |  Folly  |   RWQ   |  BRWCB  |  SPSC   |  Folly  | xSPSC | xFolly
------------------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+-------+-------
Raw add           | 0.0003s | 0.0009s | 0.0000s | 0.0001s | 0.0003s | 0.0009s | 0.0000s | 0.0001s | 0.0003s | 0.0009s | 0.0000s | 0.0001s | 0.00x | 0.42x
Raw remove        | 0.0003s | 0.0007s | 0.0000s | 0.0001s | 0.0003s | 0.0007s | 0.0000s | 0.0001s | 0.0003s | 0.0007s | 0.0000s | 0.0001s | 0.00x | 0.28x
Raw empty remove  | 0.0021s | 0.0007s | 0.0000s | 0.0012s | 0.0021s | 0.0007s | 0.0000s | 0.0012s | 0.0021s | 0.0007s | 0.0000s | 0.0012s | 0.00x | 0.59x
Single-threaded   | 0.0028s | 0.0029s | 0.0000s | 0.0014s | 0.0028s | 0.0029s | 0.0000s | 0.0015s | 0.0028s | 0.0029s | 0.0000s | 0.0015s | 0.00x | 0.51x
Mostly add        | 0.0029s | 0.0119s | 0.0000s | 0.0373s | 0.0029s | 0.0120s | 0.0000s | 0.0594s | 0.0029s | 0.0120s | 0.0000s | 0.0487s | 0.00x | 16.86x
Mostly remove     | 0.0027s | 0.0035s | 0.0000s | 0.0026s | 0.0027s | 0.0035s | 0.0000s | 0.0027s | 0.0027s | 0.0035s | 0.0000s | 0.0027s | 0.00x | 0.97x
Heavy concurrent  | 0.0025s | 0.0151s | 0.0000s | 0.0415s | 0.0025s | 0.0153s | 0.0000s | 0.0432s | 0.0025s | 0.0152s | 0.0000s | 0.0427s | 0.00x | 17.09x
Random concurrent | 0.0061s | 0.0082s | 0.0000s | 0.0066s | 0.0061s | 0.0083s | 0.0000s | 0.0067s | 0.0061s | 0.0082s | 0.0000s | 0.0067s | 0.00x | 1.08x

Average ops/s:
    ReaderWriterQueue:                  363.73 million
    BlockingReaderWriterCircularBuffer: 473.82 million
    SPSC queue:                         0.00 million
    Folly queue:                        509.72 million

@cameron314
Copy link
Owner

That's O3?

@cjappl
Copy link
Contributor Author

cjappl commented Feb 27, 2023

Yep, this is no changes to the repo, just tip of main. the compile line being used is:

benchmarks$(EXT): bench.cpp ../readerwriterqueue.h ../readerwritercircularbuffer.h ../atomicops.h ext/1024cores/spscqueue.h ext/folly/ProducerConsumerQueue.h ../tests/common/simplethread.h ../tests/common/simplethread.cpp systemtime.h systemtime.cpp makefile
	g++ -std=c++11 -Wpedantic -Wall -DNDEBUG -O3 -g bench.cpp ../tests/common/simplethread.cpp systemtime.cpp -o benchmarks$(EXT) -pthread $(PLATFORM_OPTS)

@cjappl
Copy link
Contributor Author

cjappl commented Feb 27, 2023

(to answer your previous concern about clang++ macosx "helpfully" aliases g++ to clang++)

> g++
clang: error: no input files

@cameron314
Copy link
Owner

Nice. Thanks for the results. Would have to dig into the assembly to see why the base operations with Folly are twice as fast.

@cjappl
Copy link
Contributor Author

cjappl commented Feb 27, 2023

Sounds good. Please don't hesitate to reach out if I can assist. This project has helped me a ton so happy to contribute in any way I can.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants