Grace / V2 #863
Comments
No, right now nobody is working on SVE128 or V2 support, simply because we don't have access to the hardware. Any help on this is welcome :-)
This will likely require some work on the
Do we know the FMA issue rate and the latency for back-to-back accumulations? That would tell us how many accumulators we need, and we could compare that against what the routine currently uses.
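To make the accumulator question concrete, here is a minimal sketch of the usual rule of thumb: to hide the FMA latency, the kernel needs at least latency times issue rate independent accumulation chains. The function name and the example arguments are hypothetical; the real latency and issue rate have to come from the software optimization guide for the core in question.

```c
#include <stddef.h>

/* Rule-of-thumb sketch (not LIBXSMM code): the number of independent
 * accumulator registers needed to keep all FMA pipes busy equals
 *   (back-to-back accumulation latency in cycles) * (FMAs issued per cycle).
 * Both inputs are assumptions to be read from the optimization guide. */
static size_t min_accumulators(size_t fma_latency_cycles, size_t fma_per_cycle)
{
  return fma_latency_cycles * fma_per_cycle;
}
```

For example, with placeholder values of a 4-cycle latency and 4 FMAs per cycle, `min_accumulators(4, 4)` gives 16, which can then be compared against the number of accumulators the current kernel uses.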
I am only aware of the numbers in the software optimization guides: https://developer.arm.com/documentation/PJDOC-466751330-593177/latest/
I believe the ASIMD kernel uses in-register broadcast, doesn't it? Perhaps the x86-style explicit broadcast followed by a full-width SIMD FMA is faster?
I currently have an assembly kernel for NEON on the Grace CPU that achieves 102 GFLOPS on a small fixed-size matrix, close to the peak performance of 105 GFLOPS.
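For anyone who wants to reproduce that comparison, a small sketch of the arithmetic follows. The helper names are hypothetical, and the peak formula (clock times FMA pipes times SIMD lanes times 2 flops per FMA) is the generic one for a single core; the clock, pipe count, and lane count are assumptions to be filled in for the actual machine and precision.

```c
/* Generic per-core peak: each FMA counts as 2 floating-point operations.
 * All three parameters are assumptions supplied by the caller. */
static double peak_gflops(double clock_ghz, unsigned fma_pipes, unsigned simd_lanes)
{
  return clock_ghz * (double)fma_pipes * (double)simd_lanes * 2.0;
}

/* Fraction of peak reached by a measured kernel. */
static double efficiency(double measured_gflops, double peak)
{
  return measured_gflops / peak;
}
```

With the figures quoted above, `efficiency(102.0, 105.0)` is roughly 0.97, i.e. the kernel runs at about 97% of peak.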
On A64FX I observed a heavy penalty for using in-register broadcast. However, the Arm-designed cores seem to handle it quite well.
I haven't seen broadcasting issues on Neoverse either, but it might be worth double-checking for V2.
NVIDIA Grace is not discovered properly:
I believe that 128-bit SVE is not operational right now. Is somebody working on that?
In either case, discovering NEON as the target should be a simple fix; setting
LIBXSMM_TARGET=aarch64
works perfectly fine. Is somebody working on that already?