Performance benchmarks #70

Open
vince- opened this issue Nov 16, 2022 · 6 comments
Labels: enhancement (New feature or request)

Comments

vince- commented Nov 16, 2022

Hello,

I'm wondering if any benchmarking has been done for the elements in the library. I'm trying to run the NEON-enabled version of arm_biquad_cascade_df2T_f32 on a Cortex-A57, and I'm seeing performance that is roughly 80% slower than my standard optimized C code.

I've gone over the documentation and setup articles a bunch of times. I'm using GCC 7.3, and my Makefile sets the following options specifically for CMSIS-DSP:

  • CFLAGS += -D__GNUC_PYTHON__
  • CFLAGS += -DARM_MATH_NEON
  • CFLAGS += -fshort-enums
  • CFLAGS += -fshort-wchar
  • CFLAGS += -Ofast

Looking at the generated assembly, it appears that the compiler is indeed producing vectorized code, but it isn't yielding faster execution. Is it possible that the NEON routines are just comparatively slower on aarch64?

I have also tried the same test code compiled for the A57 using GCC 8.3, and again natively on an Apple M1 machine using GCC 11 and Clang 14; the CMSIS-DSP cascaded biquad is always measurably slower.
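
For reference, here is roughly the call pattern I'm benchmarking in all of these tests (a minimal sketch of the generic CMSIS-DSP init/process API; the buffer contents and stage/block counts come from my test harness and are placeholders here):

```c
#include "arm_math.h"

/* Sketch of the init + block-processing calls being timed; the coefficient,
 * state and sample buffers are supplied by the test harness (placeholders). */
void run_biquad(const float32_t *coeffs,  /* 5 * num_stages: b0, b1, b2, a1, a2 */
                float32_t *state,         /* 2 * num_stages state words (DF2T)  */
                const float32_t *in,
                float32_t *out,
                uint8_t num_stages,
                uint32_t block_size)
{
    arm_biquad_cascade_df2T_instance_f32 S;

    arm_biquad_cascade_df2T_init_f32(&S, num_stages, coeffs, state);
    arm_biquad_cascade_df2T_f32(&S, in, out, block_size);
}
```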

Thanks!

@christophe0606 (Contributor)

@vince- We have not run any benchmarks on aarch64 so far. Our focus has been Cortex-M, and we are slowly adding Neon support, starting with aarch32.

So, unfortunately, some functions may not show the expected performance improvement because we haven't tested them yet.

Your compilation options look right.

Is your biquad long (several stages)? What blockSize value are you using?

I suspect that the filter implementation may add too much overhead (compared to a C version) for small values of blockSize or a low number of stages, and that it may only become efficient for bigger values.
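
For context, the kind of straightforward scalar loop I have in mind when comparing against "a C version" looks roughly like the sketch below (for illustration only, not the CMSIS-DSP implementation; coefficients follow the CMSIS ordering b0, b1, b2, a1, a2, with the a coefficients already negated):

```c
/* Illustrative scalar DF2T cascade (not the CMSIS-DSP source).
 * Per stage: 5 coefficients {b0, b1, b2, a1, a2} and 2 state words {d1, d2}. */
static void biquad_df2T_ref(const float *coeffs, float *state,
                            const float *in, float *out,
                            unsigned num_stages, unsigned block_size)
{
    for (unsigned n = 0; n < block_size; n++) {
        float x = in[n];
        for (unsigned s = 0; s < num_stages; s++) {
            const float *c = &coeffs[5 * s];
            float *d = &state[2 * s];
            float y = c[0] * x + d[0];          /* y  = b0*x + d1         */
            d[0] = c[1] * x + c[3] * y + d[1];  /* d1 = b1*x + a1*y + d2  */
            d[1] = c[2] * x + c[4] * y;         /* d2 = b2*x + a2*y       */
            x = y;                              /* feed the next stage    */
        }
        out[n] = x;
    }
}
```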

christophe0606 added the "review" (Under review) label on Nov 16, 2022
@llefaucheur

Hi Vince, please also note that the DF1 is more "vector friendly" than the DF2. Regards, Laurent.

vince- commented Nov 17, 2022

@christophe0606 Thanks for the reply. My current test biquad has 4 stages, and I'm running a 32-sample block size. It makes sense that the overhead might cancel out any potential gains for a small filter like this.

@llefaucheur I did try the DF1; on the A57 it appears to be even worse than the DF2.
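
For what it's worth, my measurement loop is essentially the following (a sketch with the same init/process calls as my earlier snippet; the iteration count and zeroed placeholder buffers are just illustrative):

```c
#include <stdint.h>
#include <stdio.h>
#include <time.h>
#include "arm_math.h"

#define NUM_STAGES 4
#define BLOCK_SIZE 32
#define ITERATIONS 100000

static float32_t coeffs[5 * NUM_STAGES];  /* placeholder coefficient set */
static float32_t state[2 * NUM_STAGES];
static float32_t input[BLOCK_SIZE];
static float32_t output[BLOCK_SIZE];

static uint64_t now_ns(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000ull + (uint64_t)ts.tv_nsec;
}

int main(void)
{
    arm_biquad_cascade_df2T_instance_f32 S;
    arm_biquad_cascade_df2T_init_f32(&S, NUM_STAGES, coeffs, state);

    /* Warm up caches/branch predictors before measuring. */
    for (int i = 0; i < 1000; i++)
        arm_biquad_cascade_df2T_f32(&S, input, output, BLOCK_SIZE);

    uint64_t t0 = now_ns();
    for (int i = 0; i < ITERATIONS; i++)
        arm_biquad_cascade_df2T_f32(&S, input, output, BLOCK_SIZE);
    uint64_t t1 = now_ns();

    printf("%.1f ns per %d-sample block\n",
           (double)(t1 - t0) / ITERATIONS, BLOCK_SIZE);
    return 0;
}
```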

christophe0606 added the "enhancement" (New feature or request) label and removed the "review" (Under review) label on Nov 18, 2022
@christophe0606 (Contributor)

@vince- I tagged this issue as an enhancement so that we look at it when we have the bandwidth to start aarch64 benchmarking (not soon, unfortunately).

llefaucheur commented Nov 21, 2022

@vince- We have another repo for experiments with a DF1 for NEON, which achieves approximately Cycles(size, casc) = 4.125 · casc · size + 75 · casc (measured on a Cortex-A55). At some point this code will come into the mainline.
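
For the 4-stage, 32-sample configuration mentioned above, that works out to roughly 4.125 × 4 × 32 + 75 × 4 = 528 + 300 = 828 cycles per block, i.e. about 26 cycles per sample on the Cortex-A55.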

vince- commented Dec 12, 2022

Thank you, @llefaucheur!

The DF1 for NEON indeed shows good gains. I appreciate both of your help and responsiveness on this issue.
