Benchmarks #80

htot · 2022-05-18T20:39:49Z

I did some automated benchmarking on my i7-10700 and on an Edison (Merrifield, a dual-core Silvermont Atom without cache memory, similar to Baytrail) that I want to share here. Strictly speaking, this issue is for reference only. It might be useful for finding the commits that caused substantial performance increases or decreases. All data were taken without OpenMP (1 thread only) and in x86_64 mode. On the i7 you will see some deviation, probably caused by frequency scaling / turbo boost. Don't let that disturb you.
The data can be found in benchmarks.ods if you want to play with it yourself.

Below I filter out the most interesting commits.
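
A minimal sketch of how this kind of filtering could be automated, assuming the per-commit throughput has been exported from the spreadsheet as a list of (commit number, hash, MB/s) tuples. The function name `flag_commits`, the 10% threshold, and the sample numbers are all hypothetical and are not taken from benchmarks.ods:

```python
def flag_commits(results, threshold=0.10):
    """Return commits whose throughput differs from the previous commit
    by more than `threshold` (as a fraction of the previous value).

    `results` is a list of (number, hash, mb_per_s) tuples ordered by
    commit number. Each returned item is (number, hash, relative_change).
    """
    flagged = []
    for prev, cur in zip(results, results[1:]):
        prev_speed, cur_speed = prev[2], cur[2]
        if prev_speed <= 0:
            continue  # cannot compute a relative change
        change = (cur_speed - prev_speed) / prev_speed
        if abs(change) > threshold:
            flagged.append((cur[0], cur[1], change))
    return flagged


# Made-up numbers, purely for illustration.
sample = [
    (1, "aaaaaaa", 100.0),
    (2, "bbbbbbb", 102.0),
    (3, "ccccccc", 150.0),
    (4, "ddddddd", 120.0),
]
for number, commit_hash, change in flag_commits(sample):
    print(f"{number:4d} {commit_hash} {change:+.0%}")
```

The threshold should sit above the run-to-run noise, which on the i7 includes the frequency scaling deviation mentioned above.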

Encoding

Note that on Edison SSSE3 encoding took a hit with 9a0d1b2.

#	Hash	Commit message
24	`3f3f31c`	Fix build under Xcode
30	`67ee3fd`	SSSE3->AVX2 encoding optimization
76	`a5b6739`	SSSE3: enc: factor encoding loop into inline function
79	`99977db`	Generic64: enc: factor encoding loop into inline function
92	`e2c6687`	AVX2: enc: unroll inner loop
93	`9a0d1b2`	SSSE3: enc: unroll inner loop
96	`bf7341f`	Generic64: enc: unroll inner loop
114	`b8b3c58`	Generic64: enc: use 12-bit lookup table

Decoding

Especially for Edison it has been a bumpy ride, with great improvements (3f3f31c) and regressions (0a69845) on SSSE3, and likewise for plain (cfa8bf7 and f538baa).

#	Hash	Commit message
24	`3f3f31c`	Fix build under Xcode
29	`cfa8bf7`	Plain decoding optimization
35	`0a69845`	SSSE3->AVX2, NEON32 decoding optimization
85	`6310c1f`	SSSE3: dec: factor decoding loop into inline function
88	`f538baa`	Generic32: dec: factor decoding loop into inline function
100	`495414b`	AVX2: dec: unroll inner loop
101	`5874921`	SSSE3: dec: unroll inner loop


htot · 2022-05-26T20:47:50Z

@aklomp are these in any way useful?

aklomp · 2022-06-02T11:29:58Z

@htot Thanks for your work. It's interesting to see that not all "improvements" to the library have led to actual improvements in real-world benchmarks. Which proves that we need to be careful when introducing new tricks, because some users may be worse off. That said, apart from SSSE3, the trend seems to be upward.

The SSSE3 thing could be due to register pressure. I think I saw the same degradation happen on my Atom N270 (a super weird processor, a 32-bit core with up to SSSE3 support), and when I tried to hand-optimize with inline assembler, I found out that that architecture has far fewer SSE registers available to it than big-boy x86s. That results in lots of register moves and slow code. I didn't bother much with it because I considered that use case so niche...

I think these benchmarks are cool and might be useful as a jumping-off point for analyzing performance degradations in past commits, but apart from that I don't see a major use for them. The idea of graphing out performance over time is very powerful though, and I'll try to remember it for my toolbox.

htot · 2022-06-02T13:03:54Z

I think the Atom is, like Baytrail (and Edison), an x86_64 CPU; they support SSSE3 but not AVX. The core is Silvermont (SLM), which has a penalty for long 64-bit instructions (complicated story), and that might be the case here too (I have not tested in i686 mode). If so, Goldmont / Airmont may behave completely differently (but I don't have those here).

By the way, my i7-10700 appears to have a 16MB L3 cache, so the benchmarks above are not really representative (typically nobody would ever encode the same string twice). I patched the benchmark to add a 100MB string and found that in all cases except "plain" we are near the bandwidth limit of the DDR. Even "plain" reaches the bandwidth limit with OpenMP.
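
To illustrate the cache effect, here is a rough sketch that times the encoding of one buffer that fits in a 16MB L3 cache and one that does not. It uses Python's standard `base64` module purely as a stand-in, not this library or its benchmark program; the function name `encode_throughput` and the buffer sizes are assumptions chosen for illustration:

```python
import base64
import os
import time


def encode_throughput(size_bytes, repeats=3):
    """Return the best observed encoding throughput in MB/s for a
    random buffer of `size_bytes` bytes."""
    data = os.urandom(size_bytes)
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        base64.b64encode(data)
        elapsed = time.perf_counter() - start
        best = min(best, elapsed)
    # Guard against a zero reading from a coarse timer.
    return size_bytes / max(best, 1e-9) / 1e6


if __name__ == "__main__":
    # 1 MB fits in a 16 MB L3 cache; 100 MB does not.
    for size in (1_000_000, 100_000_000):
        print(f"{size / 1e6:6.0f} MB: {encode_throughput(size):8.1f} MB/s")
```

Whether the two numbers differ noticeably depends on the machine and on how close the encoder gets to memory bandwidth in the first place.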

All these optimizations are particularly useful on the slow Atoms, but that is where we had a degradation. I'll add some improvements here; maybe you can label this "not a bug"?

aklomp · 2022-06-02T20:32:54Z

The Intel Atom N270 really is a 32-bit Diamondville core with SSSE3 extensions, as you can see on Intel's site. It's a very low power (2.5W TDP), passively cooled mobile processor with this weird feature mix for some reason. I've been using it in my home server for the last 12 years. Bit slow at times but gets the job done. Anyway.

I created a "benchmarking" label and added it to this issue. I'll leave it open for the time being, then.

htot · 2022-06-02T21:08:14Z

This is with a 100MB buffer and OpenMP.

Here you can see that the earlier i7 results were optimistic due to the L3 cache (Edison is not affected, as it has no cache).

And something strange is going on here with AVX2 decoding.

Nevertheless, looking at the history:

We know that the specialized encoders are much faster than plain, but because they are mostly limited by DDR bandwidth, we don't see that here.
However, plain has seen some nice improvements over time.

After the early decoding improvements, not much changed on the i7. The decoding performance hit on Edison (Atom, or maybe even unique to Silvermont) is not present on the i7. It would be worth reviving the old decoder for use when an Atom is detected.
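
The idea of falling back to the old decoder on Atom could be sketched as a dispatch on the CPUID family/model signature. This is only an illustration in Python, not the library's codec selection code: `decode_old` and `decode_new` are hypothetical placeholders, and the Silvermont model numbers are an assumption (believed to match the Linux kernel's `intel-family.h`) that should be verified before use:

```python
# Family 6 model numbers assumed to identify Silvermont cores
# (Bay Trail, Merrifield, Avoton/Rangeley). Verify before relying on them.
SILVERMONT_MODELS = {0x37, 0x4A, 0x4D}


def family_model(eax):
    """Extract (family, model) from the EAX value of CPUID leaf 1."""
    base_family = (eax >> 8) & 0xF
    base_model = (eax >> 4) & 0xF
    ext_family = (eax >> 20) & 0xFF
    ext_model = (eax >> 16) & 0xF
    family = base_family + ext_family if base_family == 0xF else base_family
    # The extended model bits only apply to families 6 and 15.
    if base_family in (0x6, 0xF):
        model = (ext_model << 4) | base_model
    else:
        model = base_model
    return family, model


def choose_decoder(eax, decode_old, decode_new):
    """Pick the old decoder on Silvermont, the new one elsewhere."""
    family, model = family_model(eax)
    if family == 6 and model in SILVERMONT_MODELS:
        return decode_old
    return decode_new
```

For example, `choose_decoder(0x30673, decode_old, decode_new)` would select the old decoder if 0x37 is indeed the Bay Trail model number, while a signature such as 0xA0655 (family 6, model 0xA5) would keep the new one.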

htot · 2022-06-03T09:52:27Z

> The Intel Atom N270 really is a 32-bit Diamondville core with SSSE3 extensions, as you can see on Intel's site. It's a very low power (2.5W TDP), passively cooled mobile processor with this weird feature mix for some reason. I've been using it in my home server for the last 12 years. Bit slow at times but gets the job done. Anyway.
>
> I created a "benchmarking" label and added it to this issue. I'll leave it open for the time being, then.

I see. That's confusing, since there are also 64-bit Diamondville CPUs (Atom 230).

aklomp added the benchmarking label Jun 2, 2022
