very high cpu usage #66
yes, the lighter on cpu, the better. |
Faust comes with a script that automates that. |
Put in place some benchmarks; here are the results.
Test 0: default flags, nothing extra added (cold)
Test 0: default flags, nothing extra added (warm)
Test 1: using
Test 2: using
Test 3: using
Test 4: using
Test 5: using
Test 6: using
Final test: enabling ALL the flags, that is,
Hopefully now we can see some patterns. |
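The cold/warm split in Test 0 can be reproduced with a small timing harness. The following is a minimal sketch, not the script used for the numbers in this thread: `medianRunMs` is a hypothetical helper, and the callable passed to it is assumed to perform one full processing run of the plugin.

```cpp
#include <algorithm>
#include <chrono>
#include <vector>

// Hypothetical helper: calls `run` `repeats` times (repeats must be >= 1)
// and returns the median wall-clock time of a single call in milliseconds.
template <typename F>
double medianRunMs(F&& run, int repeats) {
    std::vector<double> ms;
    ms.reserve(static_cast<std::size_t>(repeats));
    for (int i = 0; i < repeats; ++i) {
        const auto t0 = std::chrono::steady_clock::now();
        run();  // assumed: one complete processing pass
        const auto t1 = std::chrono::steady_clock::now();
        ms.push_back(std::chrono::duration<double, std::milli>(t1 - t0).count());
    }
    std::sort(ms.begin(), ms.end());
    return ms[ms.size() / 2];
}
```

A single call with `repeats = 1` right after start-up roughly corresponds to a cold measurement; the median over several repeats roughly corresponds to a warm one.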
Best seems to usually be Sadly some of these tests were invalid. |
Doing the same tests now on an x64 CPU, "Intel(R) Core(TM) i7-6500U CPU @ 2.50GHz" as reported by /proc/cpuinfo. Because this laptop takes a seriously long time to run these, I will do 1 post per type, so I don't accidentally lose precious data. |
none/default
|
using
So using |
using
not much difference here. likely faust is declaring the float vs double variables properly and thus this specific optimization is not needed. |
To make sure I am not testing things that have no benefit, I jumped to run the last test with all the flags. same deal as before. these are the results:
this shows small but definite improvements on average. |
Using prefetch:
Using tree-vectorize:
|
faustbench-llvm explores a bit more: https://github.com/grame-cncm/faust/tree/master-dev/tools/benchmark#faustbench-llvm |
thanks, I saw it but didn't think it was too relevant here. I am not looking just for what faust flags can do, but compiler flags too, some of them unsupported by llvm/clang. |
Using
this seems to be one of the good flags :) |
Using
Seems a good combo, but I am not too confident about using Still, with this it seems obvious the best choice is between |
I tried the fastmath.cpp stuff, and while performance is better with it, sound is also affected. So after a few more tests here, I decided to go with
faust crashes when using those flags on the macOS CI machine though :( |
Not clear at all what happens... |
just a crash/segfault. |
@sletz is there a way to separate interpolation of coefficients into a dedicated function and only call it every e.g. 48 samples? e.g. gain
Is it possible to separate metering from the main DSP? Input meters can run before doing any processing, and output meters after. This is often done to avoid conditionals in the inner loop and make use of L1/L2 caches.
Then there are lines like
The |
Read this for general info on optimisations |
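The metering idea raised above (input meters before processing, output meters after) can be sketched in plain C++ as follows. This is an illustration under stated assumptions, not code generated by Faust or taken from master_me: `blockPeak`, `Meters` and `processWithMeters` are hypothetical names, and a simple gain stands in for the real DSP.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>

// Peak of one block, computed in its own pass so the processing loop
// carries no metering code or conditionals.
float blockPeak(const float* buf, std::size_t n) {
    float peak = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        peak = std::max(peak, std::fabs(buf[i]));
    }
    return peak;
}

struct Meters {
    float in;   // input peak, measured before processing
    float out;  // output peak, measured after processing
};

// Hypothetical wrapper: meter the input, run the DSP, meter the output.
Meters processWithMeters(const float* in, float* out, std::size_t n, float gain) {
    Meters m;
    m.in = blockPeak(in, n);
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = in[i] * gain;  // placeholder for the real processing
    }
    m.out = blockPeak(out, n);
    return m;
}
```

Each pass walks a buffer that is likely still in cache, and the two meter values are produced once per block rather than once per sample.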
Yes, but can control-rate updates be done at a given interval in samples (or time unit) rather than being considered constant during the block?
Is there a way to merge maths operations like |
Not easily for now. But we could add an option to separate the control code (done in the
The compiler currently implements some simplifications like |
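The request for control-rate updates at a fixed sample interval can be illustrated with a hand-written sketch. It assumes a smoothed gain as the only control; `SmoothedGain`, its members and the 48-sample default are hypothetical and do not correspond to an existing Faust option.

```cpp
#include <algorithm>
#include <cstddef>

struct SmoothedGain {
    float current = 0.0f;  // gain actually applied to the audio
    float target = 1.0f;   // gain requested by the control
    float coeff = 0.5f;    // smoothing factor applied once per control tick
    int updates = 0;       // number of control ticks, kept for inspection

    // Control-rate code: one smoothing step towards the target.
    void updateControl() {
        current += coeff * (target - current);
        ++updates;
    }

    // Audio-rate code: the block is split into chunks of `interval` samples
    // (interval must be > 0) and the control is updated once per chunk.
    void process(const float* in, float* out, std::size_t n,
                 std::size_t interval = 48) {
        for (std::size_t start = 0; start < n; start += interval) {
            updateControl();
            const std::size_t end = std::min(start + interval, n);
            const float g = current;  // constant within the chunk
            for (std::size_t i = start; i < end; ++i) {
                out[i] = in[i] * g;
            }
        }
    }
};
```

The inner loop stays free of coefficient maths. Note that this sketch restarts the interval at every call to `process`, so the update rate only matches the interval exactly when the block size is a multiple of it.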
Oddly enough in the generated C++ code there are many of those constructs. A partial list (from
Another example is
In this case
Then there's code from e.g. "Vectorizable loop 123"
I doubt that this can be vectorized: two divisions, two sums and a multiplication. Which intrinsic function provides this?
All in all the generated code is however pretty impressive. Writing a complex project like master_me directly in (hand optimized) C would be a significantly more complex task, and it would certainly not be as easy to have it evolve. |
Well here
I don't see the pattern here, where are the saved divisions?
When the compiler writes this, it means no recursive dependency exists in the loop. Then we assume the auto-vectoriser can do something efficient here. Have you checked what the compiler produces here? Is your version better? |
Indeed. I missed that . It is used to update the SVF coefficients every sample. That's a different story.
can be written as
Apparently so: https://godbolt.org/z/r88dszdzP On most CPUs |
OK it seems some optimisations are missed here. Possibly in the way we compute the signals "normal form" where
With auto-vectorisation we could expect SIMD operations to be used, so we should probably compare complete loops |
I missed one substitution:
Perhaps with AVX or FMA, but there's no SSE intrinsic that would work here; but even then multiplication is faster. |
FMA Fused Multiply-Add can be used with the refactored code: Edit: gcc can vectorize it https://godbolt.org/z/x8466WPMY |
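The kind of refactoring discussed here, replacing a per-sample division by a multiplication so that a fused multiply-add applies, can be sketched on a made-up expression. The expression `(x + c) / d` and the function names are assumptions chosen for illustration; they are not the lines from the generated master_me code.

```cpp
#include <cmath>
#include <cstddef>

// Straightforward form: one addition and one division per sample.
void scaleDiv(const float* x, float* y, std::size_t n, float c, float d) {
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = (x[i] + c) / d;
    }
}

// Refactored form: the division is done once outside the loop, and
// (x + c) / d becomes x * inv + c * inv, i.e. one multiply-add per sample.
void scaleFma(const float* x, float* y, std::size_t n, float c, float d) {
    const float inv = 1.0f / d;
    const float cInv = c * inv;
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = std::fma(x[i], inv, cInv);
    }
}
```

The two forms are not bit-identical in general, because multiplying by a rounded reciprocal rounds differently from dividing. Whether `std::fma` maps to a hardware instruction depends on the target; on x86 the compiler has to be allowed to use the FMA extension.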
In principle, one can safely use pow() and let the compiler do the optimization (with -ffast-math) as in these examples: https://godbolt.org/z/sdvxor3r3. Concerning the division not factorized, we will have to see if we can solve the problem by improving the normal form. Concerning the expressions in bargraphs, we need to improve the type system to take into account the case of a control rate expression built on top of a sample rate expression. |
Not sure if this is the proper place but just running master_me disabled, and no in/out connected, the DSP load in Carla goes from 3% to around 50%. |
Sounds like a denormal issue (https://en.wikipedia.org/wiki/Subnormal_number#Performance_issues) Could be avoided either via compiler options, or by adding a tiny number to the input (of each stage). |
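The "tiny number" approach can be illustrated with a decaying feedback state. This is a minimal sketch assuming default IEEE 754 behaviour (no flush-to-zero, no `-ffast-math`); `countSubnormalSteps` is a hypothetical helper and `1e-20f` is an arbitrary illustrative offset.

```cpp
#include <cmath>

// Runs the recursion y = 0.5 * y + offset for `steps` iterations, starting
// from y = 1, and counts how often the state is a subnormal number.
int countSubnormalSteps(float offset, int steps) {
    float y = 1.0f;
    int count = 0;
    for (int i = 0; i < steps; ++i) {
        y = 0.5f * y + offset;
        if (std::fpclassify(y) == FP_SUBNORMAL) {
            ++count;
        }
    }
    return count;
}
```

With `offset = 0.0f` the state decays through the subnormal range before reaching zero, which is where the slow path is hit on many CPUs. With a small positive offset the state settles near twice the offset and stays a normal number, at the cost of a tiny DC component.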
This code can be used: https://github.com/grame-cncm/faust/blob/master-dev/architecture/faust/dsp/dsp.h#L236, so adding the |
issue should be pushed to carla. I intentionally set up the audio threads so that denormals are not a thing. if they appear, something is wrong.. |
I don't see how that would be related to Carla as master_me is running standalone. |
ah you mentioned carla in your post, so it led me to think it was running there. |
Denormal issue likely fixed in f9992cd, as I just pushed DISTRHO/DPF@48eb450 to DPF side. |
opening a ticket to generate a discussion around this.
currently the plugin is quite heavy, ideas for optimizing its cpu usage would be quite welcome.
we could try a few compiler optimization flags and see what works best.
also reducing the gui-oriented calls on the dsp side, as mentioned in other tickets.
this can be a blocker for some people doing live-streams, as the capturing + recording takes a significant amount of cpu. if audio processing does too, the system might not be very responsive when all parts are on.