clang performance #119

Open
mipac opened this issue Apr 20, 2021 · 3 comments

mipac commented Apr 20, 2021

On x86_64 I benchmarked gcc 9 and clang 8/9, and clang performs poorly in comparison.
Is this a known behaviour? Are there any options I should add when building with clang?

clang++-9 -std=c++11    -Wpedantic -Wall -DNDEBUG -O3 -g bench.cpp ../tests/common/simplethread.cpp systemtime.cpp -o benchmarks -pthread -Wl,--no-as-needed -lrt

$ ./benchmarks 
                  |----------------  Min -----------------|----------------- Max -----------------|----------------- Avg -----------------|
Benchmark         |   RWQ   |  BRWCB  |  SPSC   |  Folly  |   RWQ   |  BRWCB  |  SPSC   |  Folly  |   RWQ   |  BRWCB  |  SPSC   |  Folly  | xSPSC | xFolly
------------------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+-------+-------
Raw add           | 0.0005s | 0.0011s | 0.0002s | 0.0001s | 0.0005s | 0.0011s | 0.0002s | 0.0001s | 0.0005s | 0.0011s | 0.0002s | 0.0001s | 0.43x | 0.17x
Raw remove        | 0.0002s | 0.0010s | 0.0003s | 0.0001s | 0.0002s | 0.0010s | 0.0003s | 0.0001s | 0.0002s | 0.0010s | 0.0003s | 0.0001s | 1.97x | 0.41x
Raw empty remove  | 0.0027s | 0.0011s | 0.0016s | 0.0015s | 0.0027s | 0.0012s | 0.0017s | 0.0016s | 0.0027s | 0.0012s | 0.0017s | 0.0016s | 0.62x | 0.59x
Single-threaded   | 0.0043s | 0.0047s | 0.0039s | 0.0038s | 0.0043s | 0.0047s | 0.0039s | 0.0039s | 0.0043s | 0.0047s | 0.0039s | 0.0039s | 0.91x | 0.90x
Mostly add        | 0.0067s | 0.0176s | 0.0053s | 0.0056s | 0.0068s | 0.0195s | 0.0060s | 0.0057s | 0.0068s | 0.0187s | 0.0058s | 0.0057s | 0.85x | 0.84x
Mostly remove     | 0.0041s | 0.0059s | 0.0038s | 0.0043s | 0.0042s | 0.0060s | 0.0040s | 0.0044s | 0.0042s | 0.0059s | 0.0039s | 0.0043s | 0.93x | 1.03x
Heavy concurrent  | 0.0092s | 0.0171s | 0.0046s | 0.0044s | 0.0263s | 0.0721s | 0.0047s | 0.0079s | 0.0179s | 0.0557s | 0.0047s | 0.0069s | 0.26x | 0.39x
Random concurrent | 0.0103s | 0.0133s | 0.0101s | 0.0103s | 0.0103s | 0.0135s | 0.0101s | 0.0104s | 0.0103s | 0.0134s | 0.0101s | 0.0103s | 0.98x | 1.00x

Average ops/s:
    ReaderWriterQueue:                  260.27 million
    BlockingReaderWriterCircularBuffer: 275.78 million
    SPSC queue:                         295.60 million
    Folly queue:                        562.96 million


g++ -std=c++11    -Wpedantic -Wall -DNDEBUG -O3 -g bench.cpp ../tests/common/simplethread.cpp systemtime.cpp -o benchmarks -pthread -Wl,--no-as-needed -lrt

$ ./benchmarks 
                  |----------------  Min -----------------|----------------- Max -----------------|----------------- Avg -----------------|
Benchmark         |   RWQ   |  BRWCB  |  SPSC   |  Folly  |   RWQ   |  BRWCB  |  SPSC   |  Folly  |   RWQ   |  BRWCB  |  SPSC   |  Folly  | xSPSC | xFolly
------------------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+---------+-------+-------
Raw add           | 0.0001s | 0.0013s | 0.0002s | 0.0002s | 0.0001s | 0.0013s | 0.0003s | 0.0002s | 0.0001s | 0.0013s | 0.0003s | 0.0002s | 1.76x | 1.10x
Raw remove        | 0.0002s | 0.0010s | 0.0002s | 0.0002s | 0.0002s | 0.0010s | 0.0003s | 0.0002s | 0.0002s | 0.0010s | 0.0003s | 0.0002s | 1.63x | 1.24x
Raw empty remove  | 0.0022s | 0.0009s | 0.0016s | 0.0011s | 0.0022s | 0.0009s | 0.0017s | 0.0011s | 0.0022s | 0.0009s | 0.0017s | 0.0011s | 0.76x | 0.49x
Single-threaded   | 0.0046s | 0.0054s | 0.0045s | 0.0045s | 0.0046s | 0.0054s | 0.0046s | 0.0045s | 0.0046s | 0.0054s | 0.0045s | 0.0045s | 0.99x | 0.99x
Mostly add        | 0.0022s | 0.0170s | 0.0046s | 0.0048s | 0.0023s | 0.0170s | 0.0055s | 0.0049s | 0.0023s | 0.0170s | 0.0050s | 0.0049s | 2.23x | 2.16x
Mostly remove     | 0.0042s | 0.0046s | 0.0041s | 0.0033s | 0.0042s | 0.0053s | 0.0044s | 0.0034s | 0.0042s | 0.0048s | 0.0043s | 0.0034s | 1.02x | 0.80x
Heavy concurrent  | 0.0018s | 0.0150s | 0.0048s | 0.0115s | 0.0019s | 0.0256s | 0.0050s | 0.0168s | 0.0018s | 0.0190s | 0.0049s | 0.0149s | 2.68x | 8.09x
Random concurrent | 0.0127s | 0.0158s | 0.0130s | 0.0130s | 0.0128s | 0.0161s | 0.0130s | 0.0131s | 0.0128s | 0.0160s | 0.0130s | 0.0130s | 1.02x | 1.02x

Average ops/s:
    ReaderWriterQueue:                  504.36 million
    BlockingReaderWriterCircularBuffer: 330.05 million
    SPSC queue:                         293.11 million
    Folly queue:                        452.94 million

@cameron314
Owner

Interesting. Maybe try -Os -fomit-frame-pointer instead of -O3? I'd have to look at the disassembly to see what's different.

@cameron314
Owner

No difference with -Os -fomit-frame-pointer. I looked a little more into it, but it's hard to tell exactly what's going on in the context of the full benchmark. In isolation, a simple "raw add" test performs very similarly between clang and gcc. The overall benchmark seems to vary between runs as well, although that might be because I'm running on a VM in the cloud.

I would suggest benchmarking a mock-up of your particular use case with both clang and gcc to see if in your particular case there's a stark difference in performance or not. In my experience, clang's optimizations tend to be hit-or-miss, and can vary depending on the surrounding context.
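A minimal sketch of what such a mock-up could look like, under stated assumptions: `MockSpscQueue`, `MockupResult`, and `run_mockup` are hypothetical names invented for this illustration, and the ring buffer is only a stand-in, not this library's implementation. The idea is to replace the stand-in with the queue and access pattern you actually use, then build the same file with both compilers.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical stand-in queue (NOT this library's implementation): a bounded
// single-producer/single-consumer ring buffer. Swap in the queue you really use.
template <typename T>
class MockSpscQueue {
public:
    explicit MockSpscQueue(std::size_t capacity)
        : buf_(capacity + 1), head_(0), tail_(0) {
        assert(capacity > 0);
    }

    // Producer side: returns false when the queue is full.
    bool try_enqueue(const T& v) {
        std::size_t t = tail_.load(std::memory_order_relaxed);
        std::size_t next = (t + 1) % buf_.size();
        if (next == head_.load(std::memory_order_acquire)) return false;
        buf_[t] = v;
        tail_.store(next, std::memory_order_release);
        return true;
    }

    // Consumer side: returns false when the queue is empty.
    bool try_dequeue(T& out) {
        std::size_t h = head_.load(std::memory_order_relaxed);
        if (h == tail_.load(std::memory_order_acquire)) return false;
        out = buf_[h];
        head_.store((h + 1) % buf_.size(), std::memory_order_release);
        return true;
    }

private:
    std::vector<T> buf_;  // one slot is left unused to tell full from empty
    std::atomic<std::size_t> head_;
    std::atomic<std::size_t> tail_;
};

struct MockupResult {
    double seconds;      // wall-clock time for the whole transfer
    long long checksum;  // sum of consumed values, keeps the work observable
};

// Pushes the integers 0..n-1 from a producer thread to a consumer thread
// through the queue and measures the elapsed wall-clock time.
MockupResult run_mockup(int n, std::size_t capacity) {
    MockSpscQueue<int> q(capacity);
    long long sum = 0;
    auto start = std::chrono::steady_clock::now();

    std::thread producer([&] {
        for (int i = 0; i < n; ++i)
            while (!q.try_enqueue(i)) std::this_thread::yield();
    });
    std::thread consumer([&] {
        int v = 0;
        for (int i = 0; i < n; ++i) {
            while (!q.try_dequeue(v)) std::this_thread::yield();
            sum += v;
        }
    });
    producer.join();
    consumer.join();

    std::chrono::duration<double> elapsed =
        std::chrono::steady_clock::now() - start;
    return MockupResult{elapsed.count(), sum};
}
```

Calling `run_mockup` from a small `main` and printing `seconds` for a build from each compiler (with the same flags, plus `-pthread`) gives a first comparison. Since results vary between runs, repeating the measurement several times and comparing minimum and average times, as the benchmark tables above do, is likely more informative than a single number.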

@mipac
Author

mipac commented Apr 26, 2021

Thanks for the reply.
I can't investigate at the moment; I'll try later.
