Grace / V2 #863

Open · breuera opened this issue Feb 21, 2024 · 9 comments

Comments

@breuera
Contributor

breuera commented Feb 21, 2024

NVIDIA Grace is not discovered properly:

LIBXSMM ERROR (libxsmm_generator_mateltwise_aarch64_update_micro_kernel_config_vectorlength): unknown architecture (error #90005)!
M blocking found is 2
LIBXSMM ERROR (libxsmm_generator_gemm_kernel): unknown architecture or unsupported precision (error #90004)!
LIBXSMM ERROR (libxsmm_generator_mateltwise_aarch64_update_micro_kernel_config_vectorlength): unknown architecture (error #90005)!
M blocking found is 3
LIBXSMM ERROR (libxsmm_generator_gemm_kernel): unknown architecture or unsupported precision (error #90004)!

I believe that 128-bit SVE is not operational right now. Is somebody working on that?

In either case, discovering Neon as the target should be a simple fix; setting LIBXSMM_TARGET=aarch64 works perfectly fine. Is somebody working on that already?
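
For reference, a minimal sketch of the workaround; the environment variable and the programmatic override should be equivalent, and the GEMM code that would follow the override is omitted here:

```c
/* Sketch: force the Neon/ASIMD code path until Grace/Neoverse V2 is
 * auto-detected. Either export LIBXSMM_TARGET=aarch64 before running,
 * or set the target programmatically before the first dispatch. */
#include <libxsmm.h>
#include <stdio.h>

int main(void) {
  libxsmm_init();
  libxsmm_set_target_arch("aarch64");  /* same effect as LIBXSMM_TARGET=aarch64 */
  printf("LIBXSMM target: %s\n", libxsmm_get_target_arch());
  /* ... dispatch/JIT the small GEMM kernels as usual ... */
  libxsmm_finalize();
  return 0;
}
```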

@alheinecke
Collaborator

No, right now nobody is working on SVE128 or V2 support, simply because we don't have access to the hardware. Any help on this is welcome :-)

@FreddieWitherden
Contributor

This will likely require some work on the asparse_reg side too, as we would want to consider both NEON and SVE128 kernels (it is not clear that 128-bit SVE will outperform NEON, whereas 256-bit SVE is quite clear-cut).

@breuera
Contributor Author

breuera commented Mar 4, 2024

Here's a full sweep for matrix-matrix mult. using ASIMD/Neon and FP32 for M,N,K in [1, ..., 64].
In the plots K is fixed to K=64 (raw data below).
[Figure 1: GFLOPS over the full M, N sweep at K=64]
Fixing N=6:
[Figure 2]
Fixing N=32:
[Figure 3]
Fixing N=64:
[Figure 4]

Overall the performance seems too low.
The achievable peak is about 105 FP32 GFLOPS.
NVPL BLAS GEMM reaches that, but only for much larger matrices (M, N, K >~ 1024).
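
(For context: assuming Grace's Neoverse V2 cores indeed have four 128-bit FMA-capable SIMD pipes and run at roughly 3.3 GHz, the single-core peak works out to 4 pipes × 4 FP32 lanes × 2 FLOPs per FMA × 3.3 GHz ≈ 105 GFLOPS, consistent with the number above.)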

We'll now have a look at how to write a fast assembly kernel for a single case.
This should help us optimize the 128-bit JITters.
I also doubt that SVE128 will help much over Neon.

perf_grid_grace_sgemm_xsmm.csv
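
In case it is useful, here is a rough sketch of how a single (M, N, K) point of such a sweep can be timed. It assumes the classic libxsmm_smmdispatch() interface (column-major, alpha=beta=1 defaults; newer trees expose the same functionality through the libxsmm_gemm_shape based dispatch), and the sizes and repetition count are placeholders rather than the exact harness used for the CSV above:

```c
/* Sketch of timing one (M, N, K) point of the sweep. */
#include <libxsmm.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double wtime(void) {
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return ts.tv_sec + 1e-9 * ts.tv_nsec;
}

int main(void) {
  const libxsmm_blasint m = 16, n = 6, k = 64;  /* one point of the sweep */
  const int reps = 100000;
  float* a = (float*)malloc(sizeof(float) * m * k);
  float* b = (float*)malloc(sizeof(float) * k * n);
  float* c = (float*)calloc((size_t)m * n, sizeof(float));
  for (int i = 0; i < m * k; ++i) a[i] = (float)rand() / (float)RAND_MAX;
  for (int i = 0; i < k * n; ++i) b[i] = (float)rand() / (float)RAND_MAX;

  /* JIT a small SGEMM (C += A * B); NULL arguments pick the defaults
   * (lda=m, ldb=k, ldc=m, alpha=beta=1, no prefetch). */
  const libxsmm_smmfunction kernel = libxsmm_smmdispatch(m, n, k,
    NULL, NULL, NULL, NULL, NULL, NULL, NULL);
  if (NULL == kernel) { fprintf(stderr, "dispatch failed\n"); return 1; }

  const double t0 = wtime();
  for (int r = 0; r < reps; ++r) kernel(a, b, c);
  const double duration = wtime() - t0;

  printf("M=%i N=%i K=%i: %.1f GFLOPS\n", (int)m, (int)n, (int)k,
    2.0 * m * n * k * reps / duration * 1e-9);
  free(a); free(b); free(c);
  return 0;
}
```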

@FreddieWitherden
Contributor

FreddieWitherden commented Mar 4, 2024 via email

@breuera
Contributor Author

breuera commented Mar 8, 2024

I am only aware of the numbers in the software optimization guides: https://developer.arm.com/documentation/PJDOC-466751330-593177/latest/

@alheinecke
Collaborator

I believe the ASIMD kernel is using in-register broadcast, isn't it? Perhaps the x86-style explicit broadcast and full SIMD FMA is faster?

@stefan0re

Currently I have an assembly kernel for NEON on the Grace CPU that achieves 102 GFLOPS on a small fixed-size matrix, closely approaching the peak performance of 105 GFLOPS.

@FreddieWitherden
Contributor

> I believe the ASIMD kernel is using in-register broadcast, isn't it? Perhaps the x86-style explicit broadcast and full SIMD FMA is faster?

On A64FX I observed a heavy penalty for using register broadcasting. However, the ARM-designed cores seem to handle it quite well.
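
For illustration, the two styles under discussion map to the following ACLE Neon intrinsics (the helper names and the fixed lane index are made up for the example; the JITter would emit the corresponding FMLA/DUP instructions directly):

```c
#include <arm_neon.h>

/* (a) in-register broadcast: FMLA by element, no separate DUP needed */
static inline float32x4_t fma_lane_bcast(float32x4_t c, float32x4_t a, float32x4_t b) {
  return vfmaq_laneq_f32(c, a, b, 0);            /* c += a * b[0] */
}

/* (b) x86-style explicit broadcast: DUP the scalar, then a plain vector FMLA */
static inline float32x4_t fma_dup_bcast(float32x4_t c, float32x4_t a, float32x4_t b) {
  const float32x4_t b0 = vdupq_laneq_f32(b, 0);  /* splat b[0] across all lanes */
  return vfmaq_f32(c, a, b0);                    /* c += a * b0 */
}
```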

@breuera
Contributor Author

breuera commented Mar 8, 2024

Haven't seen broadcasting issues on Neoverse either. But it might be worth double-checking for V2.
