very high cpu usage #66

Open
falkTX opened this issue Aug 28, 2022 · 35 comments

Comments

@falkTX
Collaborator

falkTX commented Aug 28, 2022

opening a ticket to generate a discussion around this.
currently the plugin is quite heavy, ideas for optimizing its cpu usage would be quite welcome.

we could try a few compiler optimization flags and see what works best.
also reducing the gui-oriented calls on the dsp side, as mentioned in other tickets.

this can be a blocker for some people doing live-streams, as capturing + recording already takes a significant amount of cpu. if audio processing does too, the system might not be very responsive when all parts are running.

@trummerschlunk
Owner

yes, the lighter on cpu, the better.
this is quite out of my expertise, so your ideas are very welcome.

@magnetophon
Collaborator

we could try a few compiler optimization flags and see what works best.

Faust comes with a script that automates that.

@falkTX
Collaborator Author

falkTX commented Aug 29, 2022

Put in place some benchmarks, here come results.
Will post the data points for -scal (which is often considered default/normal optimization) and whichever ends up being best from all the faust provided options.
All of these have the default build flags from DPF (-O3 -ffast-math -mtune=generic etc) with LTO enabled.
Tests were run on a mac-mini M1.
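For readers unfamiliar with the "DSP CPU %" figure in the results: a common way to express DSP load is the time spent processing divided by the duration of the audio processed. The sketch below shows that arithmetic only. The function name and parameters are made up for illustration, and it is an assumption that the benchmark tool reports something along these lines; it has not been checked against the tool's source.

```cpp
#include <cassert>

// Share of real time spent processing, as a percentage.
// elapsedSeconds: wall-clock time spent processing (hypothetical measurement)
// frames:         number of audio frames processed in that time
// sampleRate:     audio frames per second, e.g. 44100
double dspLoadPercent(double elapsedSeconds, double frames, double sampleRate) {
    assert(frames > 0.0 && sampleRate > 0.0);
    const double audioSeconds = frames / sampleRate; // duration of the audio itself
    return 100.0 * elapsedSeconds / audioSeconds;
}
```

With this definition, taking half a second to process one second of audio is a 50% load, and anything at or above 100% cannot keep up in real time.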

Test 0: default flags, nothing extra added (cold)

-scal : 9.63834 MBytes/sec (DSP CPU % : 7.08506 at 44100 Hz), DSP struct memory size in bytes : 53743544
Best value is : 10.1962 MBytes/sec (DSP CPU % : 6.66458 at 44100 Hz) with -vec -fun -lv 0 -vs 8

Test 0: default flags, nothing extra added (warm)

-scal : 9.63295 MBytes/sec (DSP CPU % : 7.10624 at 44100 Hz), DSP struct memory size in bytes : 53743544
Best value is : 10.1784 MBytes/sec (DSP CPU % : 6.69842 at 44100 Hz) with -vec -lv 0 -vs 8

Test 1: using -Ofast

-scal : 9.63711 MBytes/sec (DSP CPU % : 7.10111 at 44100 Hz), DSP struct memory size in bytes : 53743544
Best value is : 10.1819 MBytes/sec (DSP CPU % : 6.70025 at 44100 Hz) with -vec -lv 0 -vs 8

Test 2: using -fprefetch-loop-arrays

-scal : 9.64519 MBytes/sec (DSP CPU % : 7.0923 at 44100 Hz), DSP struct memory size in bytes : 53743544
Best value is : 10.1949 MBytes/sec (DSP CPU % : 6.68685 at 44100 Hz) with -vec -fun -lv 0 -vs 8

Test 3: using -fsingle-precision-constant

-scal : 9.64692 MBytes/sec (DSP CPU % : 7.08903 at 44100 Hz), DSP struct memory size in bytes : 53743544
Best value is : 10.1922 MBytes/sec (DSP CPU % : 6.6952 at 44100 Hz) with -vec -fun -lv 0 -vs 8

Test 4: using -ftree-vectorize

-scal : 9.63839 MBytes/sec (DSP CPU % : 7.0859 at 44100 Hz), DSP struct memory size in bytes : 53743544
Best value is : 10.1845 MBytes/sec (DSP CPU % : 6.70669 at 44100 Hz) with -vec -fun -lv 0 -vs 8

Test 5: using -funroll-loops

-scal : 9.64829 MBytes/sec (DSP CPU % : 7.08453 at 44100 Hz), DSP struct memory size in bytes : 53743544
Best value is : 10.1973 MBytes/sec (DSP CPU % : 6.67201 at 44100 Hz) with -vec -fun -lv 0 -vs 8

Test 6: using -fprefetch-loop-arrays -funroll-loops -funsafe-loop-optimizations combo

-scal : 9.64283 MBytes/sec (DSP CPU % : 7.09134 at 44100 Hz), DSP struct memory size in bytes : 53743544
Best value is : 10.1936 MBytes/sec (DSP CPU % : 6.67486 at 44100 Hz) with -vec -fun -lv 0 -vs 8

Final test: enabling ALL the flags, that is, -Ofast -fomit-frame-pointer -fprefetch-loop-arrays -fsingle-precision-constant -ftree-vectorize -funroll-loops -funsafe-loop-optimizations

-scal : 9.6323 MBytes/sec (DSP CPU % : 7.07991 at 44100 Hz), DSP struct memory size in bytes : 53743544
Best value is : 10.1931 MBytes/sec (DSP CPU % : 6.78578 at 44100 Hz) with -vec -fun -lv 0 -vs 8

Hopefully now we can see some patterns.

@falkTX
Collaborator Author

falkTX commented Aug 29, 2022

Best seems to usually be -vec -fun -lv 0 -vs 8 for faust options.

Sadly some of these tests were invalid.
clang does not support -fprefetch-loop-arrays or -fsingle-precision-constant, so I will have to run these tests with a different compiler or system.

@falkTX
Collaborator Author

falkTX commented Aug 29, 2022

Doing same tests now on a x64 cpu, "Intel(R) Core(TM) i7-6500U CPU @ 2.50GHz" reported by /proc/cpuinfo
Running each 5 times, to get an average

because this laptop takes a seriously long time to run these, I will do 1 post per type, so I don't accidentally lose precious data

@falkTX
Collaborator Author

falkTX commented Aug 29, 2022

none/default

-scal : 4.2968 MBytes/sec (DSP CPU % : 16.3113 at 44100 Hz), DSP struct memory size in bytes : 53743544
-scal : 4.29377 MBytes/sec (DSP CPU % : 16.4532 at 44100 Hz), DSP struct memory size in bytes : 53743544
-scal : 4.27541 MBytes/sec (DSP CPU % : 16.3647 at 44100 Hz), DSP struct memory size in bytes : 53743544
-scal : 4.24756 MBytes/sec (DSP CPU % : 16.3387 at 44100 Hz), DSP struct memory size in bytes : 53743544
-scal : 4.27696 MBytes/sec (DSP CPU % : 16.3153 at 44100 Hz), DSP struct memory size in bytes : 53743544

Best value is : 5.34468 MBytes/sec (DSP CPU % : 13.2309 at 44100 Hz) with -vec -lv 0 -g -vs 8
Best value is : 5.34045 MBytes/sec (DSP CPU % : 12.9906 at 44100 Hz) with -vec -lv 1 -vs 8
Best value is : 5.33833 MBytes/sec (DSP CPU % : 13.0843 at 44100 Hz) with -vec -lv 1 -vs 8
Best value is : 5.33762 MBytes/sec (DSP CPU % : 13.1468 at 44100 Hz) with -vec -lv 0 -g -vs 8
Best value is : 5.35286 MBytes/sec (DSP CPU % : 12.9168 at 44100 Hz) with -vec -lv 0 -g -vs 8

@falkTX
Collaborator Author

falkTX commented Aug 29, 2022

using -Ofast

-scal : 4.27438 MBytes/sec (DSP CPU % : 16.8819 at 44100 Hz), DSP struct memory size in bytes : 53743544
-scal : 4.28457 MBytes/sec (DSP CPU % : 16.3201 at 44100 Hz), DSP struct memory size in bytes : 53743544
-scal : 4.29002 MBytes/sec (DSP CPU % : 16.3388 at 44100 Hz), DSP struct memory size in bytes : 53743544
-scal : 4.30806 MBytes/sec (DSP CPU % : 16.159 at 44100 Hz), DSP struct memory size in bytes : 53743544
-scal : 4.30075 MBytes/sec (DSP CPU % : 16.1666 at 44100 Hz), DSP struct memory size in bytes : 53743544

Best value is : 5.31578 MBytes/sec (DSP CPU % : 13.2509 at 44100 Hz) with -vec -lv 0 -g -vs 8
Best value is : 5.35012 MBytes/sec (DSP CPU % : 13.0298 at 44100 Hz) with -vec -lv 0 -g -vs 8
Best value is : 5.32876 MBytes/sec (DSP CPU % : 13.9076 at 44100 Hz) with -vec -lv 0 -g -vs 8
Best value is : 5.35508 MBytes/sec (DSP CPU % : 12.9712 at 44100 Hz) with -vec -lv 0 -g -vs 8
Best value is : 5.34027 MBytes/sec (DSP CPU % : 13.09 at 44100 Hz) with -vec -lv 0 -g -vs 8

So using -Ofast vs -O3 (default) doesn't really lead to any gain here; everything is within the margin of error.
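One way to make the "within margin of error" judgement concrete is to compare the difference between the two means with the run-to-run spread. The sketch below does this for the five -scal "DSP CPU %" readings posted above for the default build and the -Ofast build. The helper names and the simple "difference smaller than the larger standard deviation" rule are assumptions chosen for illustration, not part of the benchmark tooling.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Arithmetic mean of a set of readings.
double mean(const std::vector<double>& v) {
    assert(!v.empty());
    double sum = 0.0;
    for (double x : v) sum += x;
    return sum / static_cast<double>(v.size());
}

// Sample standard deviation (n - 1 in the denominator).
double sampleStddev(const std::vector<double>& v) {
    assert(v.size() > 1);
    const double m = mean(v);
    double acc = 0.0;
    for (double x : v) acc += (x - m) * (x - m);
    return std::sqrt(acc / static_cast<double>(v.size() - 1));
}

// Crude noise check: the means differ by less than the larger spread.
bool withinNoise(const std::vector<double>& a, const std::vector<double>& b) {
    const double diff = std::fabs(mean(a) - mean(b));
    return diff < std::max(sampleStddev(a), sampleStddev(b));
}

// -scal "DSP CPU %" readings copied from the posts above (x64 laptop).
const std::vector<double> kDefaultScal = {16.3113, 16.4532, 16.3647, 16.3387, 16.3153};
const std::vector<double> kOfastScal   = {16.8819, 16.3201, 16.3388, 16.159, 16.1666};
```

Calling `withinNoise(kDefaultScal, kOfastScal)` returns true for these readings, which matches the conclusion above. Five runs per configuration is a small sample, so this is a sanity check rather than a statistical test.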

@falkTX
Collaborator Author

falkTX commented Aug 29, 2022

using -fsingle-precision-constant (this one is actually reliable, as gcc has that option while clang does not)

-scal : 4.2803 MBytes/sec (DSP CPU % : 16.9046 at 44100 Hz), DSP struct memory size in bytes : 53743544
-scal : 4.28916 MBytes/sec (DSP CPU % : 16.1392 at 44100 Hz), DSP struct memory size in bytes : 53743544
-scal : 4.26237 MBytes/sec (DSP CPU % : 16.5876 at 44100 Hz), DSP struct memory size in bytes : 53743544
-scal : 4.26309 MBytes/sec (DSP CPU % : 16.3797 at 44100 Hz), DSP struct memory size in bytes : 53743544
-scal : 4.27941 MBytes/sec (DSP CPU % : 16.3108 at 44100 Hz), DSP struct memory size in bytes : 53743544

Best value is : 5.33902 MBytes/sec (DSP CPU % : 13.5098 at 44100 Hz) with -vec -lv 0 -g -vs 8
Best value is : 5.33681 MBytes/sec (DSP CPU % : 13 at 44100 Hz) with -vec -lv 0 -g -vs 8
Best value is : 5.31593 MBytes/sec (DSP CPU % : 13.019 at 44100 Hz) with -vec -lv 1 -vs 8
Best value is : 5.32571 MBytes/sec (DSP CPU % : 13.0862 at 44100 Hz) with -vec -lv 0 -g -vs 8
Best value is : 5.33375 MBytes/sec (DSP CPU % : 13.1037 at 44100 Hz) with -vec -lv 1 -vs 8

not much difference here. likely faust is declaring the float vs double variables properly and thus this specific optimization is not needed.

@falkTX
Collaborator Author

falkTX commented Aug 29, 2022

To make sure I am not testing things that have no benefit, I jumped to run the last test with all the flags. same deal as before. these are the results:

-scal : 4.34399 MBytes/sec (DSP CPU % : 15.9678 at 44100 Hz), DSP struct memory size in bytes : 53743544
-scal : 4.34678 MBytes/sec (DSP CPU % : 15.9518 at 44100 Hz), DSP struct memory size in bytes : 53743544
-scal : 4.37782 MBytes/sec (DSP CPU % : 15.9254 at 44100 Hz), DSP struct memory size in bytes : 53743544
-scal : 4.36811 MBytes/sec (DSP CPU % : 15.9093 at 44100 Hz), DSP struct memory size in bytes : 53743544
-scal : 4.36748 MBytes/sec (DSP CPU % : 16.1839 at 44100 Hz), DSP struct memory size in bytes : 53743544

Best value is : 5.48767 MBytes/sec (DSP CPU % : 12.6423 at 44100 Hz) with -vec -lv 0 -g -vs 8
Best value is : 5.49236 MBytes/sec (DSP CPU % : 12.8233 at 44100 Hz) with -vec -lv 0 -g -vs 8
Best value is : 5.48057 MBytes/sec (DSP CPU % : 12.6647 at 44100 Hz) with -vec -lv 0 -g -vs 8
Best value is : 5.44528 MBytes/sec (DSP CPU % : 13.0698 at 44100 Hz) with -vec -lv 1 -vs 8
Best value is : 5.4862 MBytes/sec (DSP CPU % : 12.7427 at 44100 Hz) with -vec -lv 1 -vs 8

this shows small, but definitive improvements on average.
so some of the flags are not just placebo but are doing something.
we just need to find which ones now

@falkTX
Collaborator Author

falkTX commented Aug 29, 2022

Using prefetch:

-scal : 4.30744 MBytes/sec (DSP CPU % : 16.3677 at 44100 Hz), DSP struct memory size in bytes : 53743544
-scal : 4.36694 MBytes/sec (DSP CPU % : 15.9433 at 44100 Hz), DSP struct memory size in bytes : 53743544
-scal : 4.35577 MBytes/sec (DSP CPU % : 16.2389 at 44100 Hz), DSP struct memory size in bytes : 53743544
-scal : 4.36447 MBytes/sec (DSP CPU % : 18.2705 at 44100 Hz), DSP struct memory size in bytes : 53743544
-scal : 4.34609 MBytes/sec (DSP CPU % : 16.1036 at 44100 Hz), DSP struct memory size in bytes : 53743544

Best value is : 5.43295 MBytes/sec (DSP CPU % : 13.4085 at 44100 Hz) with -vec -lv 1 -vs 8
Best value is : 5.45406 MBytes/sec (DSP CPU % : 12.7562 at 44100 Hz) with -vec -lv 1 -vs 8
Best value is : 5.45127 MBytes/sec (DSP CPU % : 17.2963 at 44100 Hz) with -vec -lv 1 -vs 8
Best value is : 5.44205 MBytes/sec (DSP CPU % : 12.7868 at 44100 Hz) with -vec -lv 1 -vs 8
Best value is : 5.45081 MBytes/sec (DSP CPU % : 12.6932 at 44100 Hz) with -vec -lv 0 -g -vs 8

Using tree-vectorize:

-scal : 4.26637 MBytes/sec (DSP CPU % : 16.5673 at 44100 Hz), DSP struct memory size in bytes : 53743544
-scal : 4.28826 MBytes/sec (DSP CPU % : 16.2644 at 44100 Hz), DSP struct memory size in bytes : 53743544
-scal : 4.3046 MBytes/sec (DSP CPU % : 16.1537 at 44100 Hz), DSP struct memory size in bytes : 53743544
-scal : 4.3165 MBytes/sec (DSP CPU % : 16.1184 at 44100 Hz), DSP struct memory size in bytes : 53743544
-scal : 4.28343 MBytes/sec (DSP CPU % : 16.6418 at 44100 Hz), DSP struct memory size in bytes : 53743544

Best value is : 5.35331 MBytes/sec (DSP CPU % : 12.9901 at 44100 Hz) with -vec -lv 0 -g -vs 8
Best value is : 5.35234 MBytes/sec (DSP CPU % : 14.1635 at 44100 Hz) with -vec -lv 0 -g -vs 8
Best value is : 5.32448 MBytes/sec (DSP CPU % : 13.0306 at 44100 Hz) with -vec -lv 0 -g -vs 8
Best value is : 5.32559 MBytes/sec (DSP CPU % : 13.0401 at 44100 Hz) with -vec -lv 1 -vs 8
Best value is : 5.34056 MBytes/sec (DSP CPU % : 13.2805 at 44100 Hz) with -vec -lv 0 -g -vs 8

@sletz

sletz commented Aug 29, 2022

we could try a few compiler optimization flags and see what works best.

Faust comes with a script that automates that.

faustbench-llvm explores a bit more: https://github.com/grame-cncm/faust/tree/master-dev/tools/benchmark#faustbench-llvm

@falkTX
Collaborator Author

falkTX commented Aug 30, 2022

we could try a few compiler optimization flags and see what works best.

Faust comes with a script that automates that.

faustbench-llvm explores a bit more: https://github.com/grame-cncm/faust/tree/master-dev/tools/benchmark#faustbench-llvm

thanks, I saw it but didn't think it was too relevant here. I am not looking just at what faust flags can do, but at compiler flags too, some of them unsupported by llvm/clang.

@falkTX
Collaborator Author

falkTX commented Aug 30, 2022

Using -funroll-loops

-scal : 4.2748 MBytes/sec (DSP CPU % : 16.1979 at 44100 Hz), DSP struct memory size in bytes : 53743544
-scal : 4.3259 MBytes/sec (DSP CPU % : 16.1251 at 44100 Hz), DSP struct memory size in bytes : 53743544
-scal : 4.30736 MBytes/sec (DSP CPU % : 16.1814 at 44100 Hz), DSP struct memory size in bytes : 53743544
-scal : 4.33969 MBytes/sec (DSP CPU % : 16.0493 at 44100 Hz), DSP struct memory size in bytes : 53743544
-scal : 4.31645 MBytes/sec (DSP CPU % : 16.0979 at 44100 Hz), DSP struct memory size in bytes : 53743544

Best value is : 5.37112 MBytes/sec (DSP CPU % : 12.9637 at 44100 Hz) with -vec -lv 1 -vs 8
Best value is : 5.38857 MBytes/sec (DSP CPU % : 13.0337 at 44100 Hz) with -vec -lv 0 -g -vs 8
Best value is : 5.37493 MBytes/sec (DSP CPU % : 12.9628 at 44100 Hz) with -vec -lv 0 -g -vs 8
Best value is : 5.37612 MBytes/sec (DSP CPU % : 12.9511 at 44100 Hz) with -vec -lv 1 -vs 8
Best value is : 5.37365 MBytes/sec (DSP CPU % : 12.8953 at 44100 Hz) with -vec -lv 1 -vs 8

this seems to be one of the good flags :)

@falkTX
Collaborator Author

falkTX commented Aug 30, 2022

Using -fprefetch-loop-arrays -funroll-loops -funsafe-loop-optimizations combo

-scal : 4.36762 MBytes/sec (DSP CPU % : 16.2816 at 44100 Hz), DSP struct memory size in bytes : 53743544
-scal : 4.3862 MBytes/sec (DSP CPU % : 16.0541 at 44100 Hz), DSP struct memory size in bytes : 53743544
-scal : 4.38827 MBytes/sec (DSP CPU % : 15.8699 at 44100 Hz), DSP struct memory size in bytes : 53743544
-scal : 4.38965 MBytes/sec (DSP CPU % : 15.872 at 44100 Hz), DSP struct memory size in bytes : 53743544
-scal : 4.38993 MBytes/sec (DSP CPU % : 15.7911 at 44100 Hz), DSP struct memory size in bytes : 53743544

Best value is : 5.46754 MBytes/sec (DSP CPU % : 12.6809 at 44100 Hz) with -vec -lv 1 -vs 8
Best value is : 5.48121 MBytes/sec (DSP CPU % : 12.8436 at 44100 Hz) with -vec -lv 0 -g -vs 8
Best value is : 5.49011 MBytes/sec (DSP CPU % : 12.6356 at 44100 Hz) with -vec -lv 1 -vs 8
Best value is : 5.48365 MBytes/sec (DSP CPU % : 12.7177 at 44100 Hz) with -vec -lv 0 -g -vs 8
Best value is : 5.48365 MBytes/sec (DSP CPU % : 12.7177 at 44100 Hz) with -vec -lv 0 -g -vs 8
Best value is : 5.48838 MBytes/sec (DSP CPU % : 12.6292 at 44100 Hz) with -vec -lv 0 -g -vs 8

Seems a good combo, but I am not too confident about using -funsafe-loop-optimizations.

Still, with this it seems obvious the best choice is between -vec -lv 0 -g -vs 8 and -vec -lv 1 -vs 8.
So we can optimize the benchmarks to pick only these 2, and run more specific ones for the compiler flags.
The issue running these before was that I had all the faust modes enabled, so it took a while to build and run.

@falkTX
Collaborator Author

falkTX commented Sep 1, 2022

I tried the fastmath.cpp stuff, and while performance is better with it, sound is also affected.
We end up with a sorta high-pass eq filter on at all times, and random noise bursts. Really unusable.

So after a few more tests here, I decided to go with the -vec -lv 1 -vs 8 faust flags. The -exp10 seemed promising at first glance so I tried it too, but it gave the best result only about half the time compared to not having it, so it seems to do little or nothing.

faust crashes when using those flags on the macOS CI machine though :(
so I pregenerated the plugin C++ files and added them to the repo. There are no details to debug, but the build failure can be seen in https://github.com/trummerschlunk/master_me/runs/8129699558?check_suite_focus=true

@sletz

sletz commented Sep 1, 2022

Not clear at all what happens...

@falkTX
Collaborator Author

falkTX commented Sep 1, 2022

just a crash/segfault.
I can't reproduce it locally, so it is hard to investigate.
could be related to limited RAM on the github VMs too, as macOS will kill processes that hog ram/cpu too much.

@x42
Contributor

x42 commented Sep 2, 2022

@sletz is there a way to separate interpolation of coefficients into a dedicated function and only call it every e.g. 48 samples?

e.g. gain si.smoo is perfectly fine to do small jumps at ~ 20 Hz intervals. Likewise many other coefficients that use pow() could only be updated lazily.
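A minimal hand-written sketch of the idea, assuming a simple gain stage: the pow() call and the smoothing step run once per 48-sample chunk, and the per-sample loop is reduced to a multiply. The struct, its members and the smoothing coefficient are hypothetical; this is not Faust output and not how si.smoo is implemented.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>

// Control interval in samples (the 48 comes from the comment above).
constexpr std::size_t kControlInterval = 48;

// Hypothetical gain stage with control-rate coefficient updates.
struct ChunkedGain {
    float smoothing = 0.5f; // illustrative one-pole coefficient, applied per chunk
    float current = 1.0f;   // linear gain currently applied
    float targetDb = 0.0f;  // requested gain in dB, set from the control side

    void process(float* buf, std::size_t n) {
        assert(buf != nullptr || n == 0);
        for (std::size_t start = 0; start < n; start += kControlInterval) {
            // Control-rate work: one pow() and one smoothing step per chunk.
            const float target = std::pow(10.0f, 0.05f * targetDb);
            current += smoothing * (target - current);
            // Audio-rate work: a plain multiply over the chunk.
            const std::size_t end = std::min(start + kControlInterval, n);
            for (std::size_t i = start; i < end; ++i) buf[i] *= current;
        }
    }
};
```

Note that this sketch restarts its chunk at every call, so with block sizes that are not a multiple of 48 the update interval is shorter at block boundaries. A real implementation would carry a sample counter across calls.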

Is it possible to separate metering from the main DSP? Input meters can run before doing any processing, and output meters after. This is often done to avoid conditionals in the inner loop and make use of L1/L2 caches.
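As a rough illustration of that structure, the sketch below runs an input peak meter as its own pass, then the processing, then an output peak meter. The processing function is a stand-in (a fixed gain) and all names are hypothetical; only the ordering of the three passes is the point.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>

// Peak of one block: a tight loop with no dependency on the processing code.
float blockPeak(const float* buf, std::size_t n) {
    float peak = 0.0f;
    for (std::size_t i = 0; i < n; ++i) peak = std::max(peak, std::fabs(buf[i]));
    return peak;
}

struct Meters {
    float inPeak = 0.0f;
    float outPeak = 0.0f;
};

// Stand-in for the real DSP: a fixed gain, purely for illustration.
void processBlock(float* buf, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) buf[i] *= 0.5f;
}

Meters runBlock(float* buf, std::size_t n) {
    assert(buf != nullptr || n == 0);
    Meters m;
    m.inPeak = blockPeak(buf, n);  // input meter: before any processing
    processBlock(buf, n);          // main DSP, free of metering code
    m.outPeak = blockPeak(buf, n); // output meter: after processing
    return m;
}
```

Each pass touches the same block while it is still likely to be in cache, and the processing loop no longer contains metering branches.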

Then there are lines like

          fZec582[i] = std::pow(10.0f, 0.0250000004f * (0.0f - 0.333333343f * fZec581[i]));
          fZec583[i] = std::sqrt(fZec582[i]);

The sqrt can be avoided by making a single call to pow (10, 1/2 * ...). Then you can even use
exp (a * log(b)) = pow (b, a) ; b is constant here and can be evaluated at compile-time. exp() is significantly faster than pow().

@sletz

sletz commented Sep 2, 2022

Read this for general info on optimisations

@x42
Contributor

x42 commented Sep 2, 2022

Read this for general info on optimisations

Yes, but can control-rate updates be done at a given interval in samples (or time unit) rather than being considered constant during the block?

Is there a way to merge maths operations like sqrt (pow()) into a single call to exp() or is that left to the user?

@sletz

sletz commented Sep 2, 2022

Read this for general info on optimisations

Yes, but can control-rate updates be done at a given interval in samples (or time unit) rather than being considered constant during the block?

Not easily for now. But we could add an option to separate the control code (done in the compute function before the actual DSP loop) in a separated "control" function, to be called when needed by the architecture file.

Is there a way to merge maths operations like sqrt (pow()) into a single call to exp() or is that left to the user?

The compiler currently implements some simplifications like exp(log(x)) = x. We could add some more. Can you list which one would be the more useful?

@x42
Contributor

x42 commented Sep 2, 2022

Can you list which one would be the more useful?

  • sqrt(pow(a, b)) -- move the sqrt into the exponent: pow (a, b * 0.5f)
  • exp (a * log(b)) = pow (b, a) -- if b is constant (e.g 10) , use constexpr float l10 = log (10); return exp (a * l10);
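The two rewrites in the list can be checked side by side with a small sketch. The function names are made up for illustration. One deviation from the suggestion: std::log is not guaranteed to be usable in a constant expression in the C++ standards most toolchains default to, so the sketch uses a function-local static constant that is computed once instead of constexpr.

```cpp
#include <cassert>
#include <cmath>

// Shape of the generated code: pow() followed by sqrt().
float viaPowSqrt(float x) {
    return std::sqrt(std::pow(10.0f, x));
}

// sqrt folded into the exponent: sqrt(10^x) == 10^(x / 2).
float viaPow(float x) {
    return std::pow(10.0f, 0.5f * x);
}

// pow(b, a) == exp(a * log(b)); with b == 10 the log is a constant.
float viaExp(float x) {
    static const float kHalfLn10 = 0.5f * std::log(10.0f); // computed once
    return std::exp(x * kHalfLn10);
}
```

The three functions agree to within float rounding, but not necessarily bit for bit, so a swap like this can change the last digits of the output.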

Oddly enough, the generated C++ code contains many of those constructs, while the .dsp file has only 2 calls to sqrt, both for the correlation meter. Where does the 10^x come from? A partial list (from faust -vec -lv 0 -g -vs 8 master_me.dsp using c03ce7d):

          fZec275[i] = std::pow(10.0f, 0.00833333377f * fZec254[i]);
          fZec276[i] = std::sqrt(fZec275[i]); 
...
          fZec308[i] = std::pow(10.0f, 0.0250000004f * (0.0f - 0.333333343f * fZec307[i]));
          fZec309[i] = std::sqrt(fZec308[i]);
...
          fZec328[i] = std::pow(10.0f, 0.00833333377f * fZec307[i]); 
          fZec329[i] = std::sqrt(fZec328[i]);   
...
          fZec361[i] = std::pow(10.0f, 0.0250000004f * (0.0f - 0.333333343f * fZec360[i]));
          fZec362[i] = std::sqrt(fZec361[i]);  
...
          fZec381[i] = std::pow(10.0f, 0.00833333377f * fZec360[i]);
          fZec382[i] = std::sqrt(fZec381[i]);

Another example is

          fZec711[i] = fSlow99 / fZec709[i];
          fZec712[i] = fSlow99 * (fZec711[i] + 2.0f) / fZec709[i] + 1.0f; 

In this case (fSlow99 / fZec709[i]) would only need to be computed once. One division can be saved. There are many such instances in the compute loop (I stopped counting at 20). Saving 20+ divisions per sample is already a lot.


Then there's code from e.g. "Vectorizable loop 123"

fZec41[i] = fSlow35 * (fSlow35 / fZec39[i] + 1.42857146f) / fZec39[i] + 1.0f;

I doubt that this can be vectorized: two divisions, two sums and a multiplication. Which intrinsic function would provide for this?
This can be expanded to:

 float const tmp = fSlow35 / fZec39[i];
 fZec41[i] = tmp * tmp  + 1.42857146f * tmp  + 1.f;

All in all the generated code is however pretty impressive. Writing a complex project like master_me directly in (hand optimized) C would be a significantly more complex task, and it would certainly not be as easy to have it evolve.

@sletz

sletz commented Sep 3, 2022

fZec275[i] = std::pow(10.0f, 0.00833333377f * fZec254[i]);
fZec276[i] = std::sqrt(fZec275[i]);

Well here fZec275[i] is used several times in the code later on (which is the default behaviour: shared sub-expressions are computed once in a variable, then reused), so not sure it will help...

fZec711[i] = fSlow99 / fZec709[i];

I don't see the pattern here, where is the saved division?

"Vectorizable loop 123"

When the compiler writes this, it means no recursive dependency exists in the loop. Then we assume the auto-vectoriser can do something efficient here. Have you checked what the compiler produces here? Is your version better?

@x42
Contributor

x42 commented Sep 3, 2022

Well here fZec275[i] is used several times in the code later on

Indeed. I missed that. It is used to update the SVF coefficients every sample. That's a different story.

I don't see the pattern here, where is the saved division?

fZec711[i] = fSlow99 / fZec709[i];
fZec712[i] = fSlow99 * (fZec711[i] + 2.0f) / fZec709[i] + 1.0f; 

can be written as

float const tmp = fSlow99 / fZec709[i];
fZec711[i] = tmp;
fZec712[i] = tmp * (fZec711[i] + 2.0f) + 1.0f; 

"Vectorizable loop 123"

Is your version better?

Apparently so: https://godbolt.org/z/r88dszdzP
It needs only 2 registers (not 3), and performs only one division.

On most CPUs divss is 4-5 times slower than mulss (which usually takes only 1 or 2 cycles). Apple's M1 (fdiv) is a notable exception.

@sletz

sletz commented Sep 3, 2022

float const tmp = fSlow99 / fZec709[i];
fZec711[i] = tmp;
fZec712[i] = tmp * (fZec711[i] + 2.0f) + 1.0f;

OK it seems some optimisations are missed here. Possibly in the way we compute the signals "normal form" where + and * operations are supposed to be sorted in an optimal way. @orlarey any idea here?

Is your version better?

With auto-vectorisation we could expect SIMD operations to be used, so we should probably compare complete loops

@x42
Contributor

x42 commented Sep 3, 2022

I missed one substitution:

float const tmp = fSlow99 / fZec709[i];
fZec711[i] = tmp;
fZec712[i] = tmp * (tmp + 2.0f) + 1.0f; 

With auto-vectorisation we could expect SIMD operations to be used

Perhaps with AVX or FMA, but there's no SSE intrinsic that would work here; but even then multiplication is faster.
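To compare complete loops rather than single statements, the two variants can be written as stand-alone functions and pasted into a compiler explorer. This is a sketch under assumptions: `in`, `a`, `b` and `k` stand in for fZec709, fZec711, fZec712 and fSlow99, and the function names and signatures are invented for the comparison rather than taken from the generated code.

```cpp
#include <cassert>
#include <cstddef>

// Loop body as it appears in the generated code: two divisions per sample.
void loopOriginal(const float* in, float* a, float* b, float k, std::size_t n) {
    assert(in != nullptr && a != nullptr && b != nullptr);
    for (std::size_t i = 0; i < n; ++i) {
        a[i] = k / in[i];
        b[i] = k * (a[i] + 2.0f) / in[i] + 1.0f;
    }
}

// Refactored body: the quotient is computed once and reused.
void loopRefactored(const float* in, float* a, float* b, float k, std::size_t n) {
    assert(in != nullptr && a != nullptr && b != nullptr);
    for (std::size_t i = 0; i < n; ++i) {
        const float q = k / in[i];
        a[i] = q;
        b[i] = q * (q + 2.0f) + 1.0f;
    }
}
```

Both functions compute the same values up to float rounding. What the compiler makes of each loop depends on the target and flags, so the generated assembly is the thing to compare.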

@x42
Contributor

x42 commented Sep 3, 2022

FMA Fused Multiply-Add can be used with the refactored code:
https://godbolt.org/z/hraMPEY9f
but it's still not SIMD.

Edit: gcc can vectorize it https://godbolt.org/z/x8466WPMY

@orlarey

orlarey commented Sep 4, 2022

In principle, one can safely use pow() and let the compiler do the optimization (with -ffast-math) as in these examples: https://godbolt.org/z/sdvxor3r3.

Concerning the division not factorized, we will have to see if we can solve the problem by improving the normal form.

Concerning the expressions in bargraphs, we need to improve the type system to take into account the case of a control rate expression built on top of a sample rate expression.

@galileo-pkm

Not sure if this is the proper place, but just running master_me disabled, with no in/out connected, makes the DSP load in Carla go from 3% to around 50%.

@x42
Contributor

x42 commented Oct 13, 2022

Sounds like a denormal issue (https://en.wikipedia.org/wiki/Subnormal_number#Performance_issues)

Could be avoided either via compiler options, or by adding a tiny number to the input (of each stage).
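The "tiny number" approach can be illustrated with a one-pole decay fed with silence. Without an offset the state passes through the subnormal range on its way to zero; with a small constant added it settles at a normal value instead. The function name and the 1e-20 offset are arbitrary choices for this sketch, and it assumes a build where subnormals are not already flushed to zero (for example, no -ffast-math).

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Runs y = 0.5 * y + offset starting from y = 1 (silent input) and counts
// how many of the resulting states are subnormal.
std::size_t countSubnormalStates(float offset, std::size_t steps) {
    float y = 1.0f;
    std::size_t count = 0;
    for (std::size_t i = 0; i < steps; ++i) {
        y = 0.5f * y + offset;
        if (std::fpclassify(y) == FP_SUBNORMAL) ++count;
    }
    return count;
}
```

Comparing `countSubnormalStates(0.0f, 200)` with `countSubnormalStates(1e-20f, 200)` shows the effect: the first passes through subnormal values, the second does not. The offset adds a constant far below audibility, and some implementations alternate its sign to avoid accumulating DC.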

@sletz

sletz commented Oct 13, 2022

This code can be used: https://github.com/grame-cncm/faust/blob/master-dev/architecture/faust/dsp/dsp.h#L236, so adding the AVOIDDENORMALS macro before the call to compute.

@falkTX
Collaborator Author

falkTX commented Oct 13, 2022

issue should be pushed to carla. I intentionally set up the audio threads so that denormals are not a thing. if they appear, something is wrong..

@galileo-pkm

I don't see how that would be related to Carla as master_me is running standalone.

@falkTX
Collaborator Author

falkTX commented Oct 13, 2022

ah you mentioned carla in your post, so it led me to think it was running there.

@falkTX
Collaborator Author

falkTX commented Oct 13, 2022

Denormal issue likely fixed in f9992cd, as I just pushed DISTRHO/DPF@48eb450 to DPF side.

7 participants