Improve resource allocation and scheduling #171
Hi Raul, to generate the architecture you expect (with multiple multiplier units), the internal loop must be unrolled. You can do this using pragmas for the frontend compiler you selected. You should also consider the number of memory channels your design can exploit for parallel load/store operations: you may fully unroll the two for loops, but if you have only one memory channel this will not give you any benefit in execution latency.
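As an illustration of the frontend-pragma approach, here is a minimal sketch; the kernel shape and the name `mac4` are assumptions for the example, not code from this thread. `#pragma GCC unroll N` is honored by GCC 8+; Clang accepts `#pragma unroll`.

```c
#include <stdint.h>

/* Hypothetical fixed-trip-count multiply-accumulate. Once the loop is
 * unrolled by the frontend, the four multiplications have no mutual data
 * dependency, so the HLS scheduler can in principle map them to parallel
 * multiplier units, memory bandwidth permitting. */
int32_t mac4(const int32_t a[4], const int32_t b[4]) {
    int32_t sum = 0;
#pragma GCC unroll 4
    for (int i = 0; i < 4; ++i)
        sum += a[i] * b[i];
    return sum;
}
```

Whether the unrolled multiplies actually execute in parallel still depends on how many memory operations per cycle the chosen channel configuration allows.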
Thanks Michele. Loop unrolling is done by the compiler optimization flag.
I could not obtain a lower latency, even after modifying the number of channels. Also, the number of DSPs remains the same, so I guess I'm missing something and no parallel multiplications are being performed. Any other ideas or suggestions?
Hi Raul, this is quite unusual to hear. What is the full command line that you are using to call bambu? Which frontend compiler have you chosen?
which gave me a latency of 2462 cycles.
Thanks for the explanation, Michele. First, I am using Bambu version 0.9.6 (I modified this version to work with additional floating-point units, but I am not using that modification in this example, so I don't think it is the source of the problem).
which gave me a latency of 4462 cycles. Do you think this is related to the version of Bambu/compiler I am using?
Have you looked at the BB_FCFG.dot file generated by Bambu when --print-dot is passed? There you can see what the compiler does once you add the unrolling pragma; we rely on the GCC/Clang frontend for such transformations. Another observation: --channels-type=MEM_ACC_11 allows only one memory operation per cycle. This may limit performance even if you fully unroll your design.
Just an additional note: try the option --disable-function-proxy to allow as many floating-point units as needed by the unrolling.
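Putting the suggestions from this thread together, a hedged sketch of a bambu invocation follows. The flag spellings match the ones quoted above; `--top-fname` and the `MEM_ACC_NN` channel configuration are assumptions to verify against the option list of your Bambu version.

```shell
# Sketch only (check each flag against `bambu --help` for your version):
#  --channels-type=MEM_ACC_NN : more memory ops per cycle than MEM_ACC_11
#  --disable-function-proxy   : duplicate FP units instead of sharing one
#  --print-dot                : dump BB_FCFG.dot and friends for inspection
bambu kernel.c --top-fname=kernel -O3 \
  --channels-type=MEM_ACC_NN \
  --disable-function-proxy \
  --print-dot
```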
Let me clarify a few things so that we are all on the same page:
Under this setup, I did multiple trials (without real success). Here are some findings:
I still don't know why loop unrolling is not getting results as good as the ones @Ansaya got (a 50% cycle reduction just by unrolling, and much more with memory improvements).
Hi Raul,
Hi Michele, may I ask for the full command line you are using to obtain the low count of 346 cycles? Did you make any modifications to the source code apart from the unroll pragma?
Another problem I found is when including the option
Hello everyone,
I am trying to improve the performance of the following kernel:
I am synthesizing with `-O3` for better performance. The report says it is using 2 DSP units. The problem is that when I synthesize just a single multiplication as follows, it also uses 2 DSPs. So I guess Bambu is not instantiating multiple multipliers in parallel for the previous kernel, although there is no data dependency between the computation of `tmp_a` and `tmp_b`, nor between successive loop iterations. My intuition is that, if more multiplier instances were used in parallel (which would require more DSPs), the kernel latency should improve dramatically.
Is there any way (by using directives, pragmas, Bambu options, etc.) to do this?
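For readers without access to the attached code, here is a hypothetical reconstruction of the kernel shape described above. Only the names `tmp_a` and `tmp_b` come from the thread; everything else is an assumption for illustration.

```c
#include <stdint.h>

/* Hypothetical shape of the kernel under discussion: in each iteration,
 * tmp_a and tmp_b are computed from independent inputs, so there is no
 * dependency between them nor across iterations. In principle the two
 * multiplications could therefore use two DSP-mapped multipliers in
 * parallel, given enough memory channels to feed them. */
void kernel(const int32_t *a, const int32_t *b, int32_t *out, int n) {
    for (int i = 0; i < n; ++i) {
        int32_t tmp_a = a[i] * a[i];  /* independent of tmp_b */
        int32_t tmp_b = b[i] * b[i];  /* independent of tmp_a */
        out[i] = tmp_a + tmp_b;
    }
}
```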