Performance benchmarks #70

Open
vince- opened this issue Nov 16, 2022 · 6 comments
Labels: enhancement (New feature or request)

Comments

vince- commented Nov 16, 2022

Hello,

I'm wondering if any benchmarking has been done for the elements in the library. I'm trying to run the NEON-enabled version of arm_biquad_cascade_df2T_f32 on a Cortex-A57, and I'm seeing performance that is roughly 80% slower than my standard optimized C code.

I've gone over the documentation and setup articles a bunch of times. I'm using GCC 7.3, and my Makefile sets the following options specifically for CMSIS-DSP:

  • CFLAGS += -D__GNUC_PYTHON__
  • CFLAGS += -DARM_MATH_NEON
  • CFLAGS += -fshort-enums
  • CFLAGS += -fshort-wchar
  • CFLAGS += -Ofast

Looking at the generated assembly, it appears that the compiler is indeed producing vectorized code, but it isn't yielding faster execution. Is it possible that the NEON routines are just comparatively slower on aarch64?

I have also tried the same test code compiled for the A57 using GCC 8.3, and again natively on an Apple M1 machine using GCC 11 and Clang 14; the CMSIS-DSP cascaded biquad is always measurably slower.
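
For reference, here is roughly the call pattern I'm benchmarking in all of these tests (a minimal sketch of the generic CMSIS-DSP init/process API; the buffer contents and stage/block counts come from my test harness and are placeholders here):

```c
#include "arm_math.h"

/* Sketch of the init + block-processing calls being timed; the coefficient,
 * state and sample buffers are supplied by the test harness (placeholders). */
void run_biquad(const float32_t *coeffs,  /* 5 * num_stages: b0, b1, b2, a1, a2 */
                float32_t *state,         /* 2 * num_stages state words (DF2T)  */
                const float32_t *in,
                float32_t *out,
                uint8_t num_stages,
                uint32_t block_size)
{
    arm_biquad_cascade_df2T_instance_f32 S;

    arm_biquad_cascade_df2T_init_f32(&S, num_stages, coeffs, state);
    arm_biquad_cascade_df2T_f32(&S, in, out, block_size);
}
```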

Thanks!

@christophe0606 (Contributor)

@vince- We have not run any benchmarks on aarch64 so far. Our focus has been Cortex-M, and we are slowly adding Neon support, starting with aarch32.

So, unfortunately, some functions may not show the expected performance improvement because we haven't tested them yet.

Your compilation options look right.

Is your biquad long (several stages)? What blockSize value are you using?

I suspect that the filter implementation may add too much overhead (compared to a C version) for small values of blockSize or a low number of stages, and that it may only become efficient for bigger values.
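
For context, the kind of straightforward scalar loop I have in mind when comparing against "a C version" looks roughly like the sketch below (for illustration only, not the CMSIS-DSP implementation; coefficients follow the CMSIS ordering b0, b1, b2, a1, a2, with the a coefficients already negated):

```c
/* Illustrative scalar DF2T cascade (not the CMSIS-DSP source).
 * Per stage: 5 coefficients {b0, b1, b2, a1, a2} and 2 state words {d1, d2}. */
static void biquad_df2T_ref(const float *coeffs, float *state,
                            const float *in, float *out,
                            unsigned num_stages, unsigned block_size)
{
    for (unsigned n = 0; n < block_size; n++) {
        float x = in[n];
        for (unsigned s = 0; s < num_stages; s++) {
            const float *c = &coeffs[5 * s];
            float *d = &state[2 * s];
            float y = c[0] * x + d[0];          /* y  = b0*x + d1         */
            d[0] = c[1] * x + c[3] * y + d[1];  /* d1 = b1*x + a1*y + d2  */
            d[1] = c[2] * x + c[4] * y;         /* d2 = b2*x + a2*y       */
            x = y;                              /* feed the next stage    */
        }
        out[n] = x;
    }
}
```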

christophe0606 added the "review" (Under review) label on Nov 16, 2022
@llefaucheur

Hi Vince, please also note that the DF1 is more "vector friendly" than the DF2. Regards, Laurent.

vince- commented Nov 17, 2022

@christophe0606 Thanks for the reply. My current test biquad has 4 stages, and I'm running a 32-sample block size. It makes sense that the overhead might cancel out any potential gains for a small filter like this.

@llefaucheur I did try the DF1; on the A57 it appears to be even worse than the DF2.
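
For what it's worth, my measurement loop is essentially the following (a sketch with the same init/process calls as my earlier snippet; the iteration count and zeroed placeholder buffers are just illustrative):

```c
#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include "arm_math.h"

#define NUM_STAGES 4
#define BLOCK_SIZE 32
#define ITERATIONS 100000

static float32_t coeffs[5 * NUM_STAGES];  /* placeholder coefficient set */
static float32_t state[2 * NUM_STAGES];
static float32_t input[BLOCK_SIZE];
static float32_t output[BLOCK_SIZE];

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

int main(void)
{
    arm_biquad_cascade_df2T_instance_f32 S;
    arm_biquad_cascade_df2T_init_f32(&S, NUM_STAGES, coeffs, state);

    /* Warm up caches/branch predictors before measuring. */
    for (int i = 0; i < 1000; i++)
        arm_biquad_cascade_df2T_f32(&S, input, output, BLOCK_SIZE);

    uint64_t t0 = now_ns();
    for (int i = 0; i < ITERATIONS; i++)
        arm_biquad_cascade_df2T_f32(&S, input, output, BLOCK_SIZE);
    uint64_t t1 = now_ns();

    printf("%.1f ns per %d-sample block\n",
           (double)(t1 - t0) / ITERATIONS, BLOCK_SIZE);
    return 0;
}
```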

christophe0606 added the "enhancement" (New feature or request) label and removed the "review" (Under review) label on Nov 18, 2022
@christophe0606 (Contributor)

@vince- I tagged this issue as an enhancement so that we look at it when we have the bandwidth to start aarch64 benchmarking (not soon, unfortunately).

llefaucheur commented Nov 21, 2022

@vince- We have another repo for experiments with a DF1 for NEON, which achieves approximately Cycles(size, casc) = 4.125 · casc · size + 75 · casc (measured on a Cortex-A55). At some point this code will come into the mainline.
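
For the 4-stage, 32-sample configuration mentioned above, that works out to roughly 4.125 × 4 × 32 + 75 × 4 = 528 + 300 = 828 cycles per block, i.e. about 26 cycles per sample on the Cortex-A55.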

vince- commented Dec 12, 2022

Thank you, @llefaucheur!

The DF1 for NEON indeed shows good gains. I appreciate both of your help and responsiveness on this issue.
