Grace / V2 #863

Open · breuera opened this issue Feb 21, 2024 · 9 comments

Comments

@breuera
Contributor

breuera commented Feb 21, 2024

NVIDIA Grace is not discovered properly:

LIBXSMM ERROR (libxsmm_generator_mateltwise_aarch64_update_micro_kernel_config_vectorlength): unknown architecture (error #90005)!
M blocking found is 2
LIBXSMM ERROR (libxsmm_generator_gemm_kernel): unknown architecture or unsupported precision (error #90004)!
LIBXSMM ERROR (libxsmm_generator_mateltwise_aarch64_update_micro_kernel_config_vectorlength): unknown architecture (error #90005)!
M blocking found is 3
LIBXSMM ERROR (libxsmm_generator_gemm_kernel): unknown architecture or unsupported precision (error #90004)!

I believe that 128-bit SVE is not operational right now. Is somebody working on that?

In either case, discovering Neon as the target should be a simple fix; setting LIBXSMM_TARGET=aarch64 works perfectly fine. Is somebody working on that already?
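
For reference, a minimal sketch of the workaround; the environment variable and the programmatic override should be equivalent, and the GEMM code that would follow the override is omitted here:

```c
/* Sketch: force the Neon/ASIMD code path until Grace/Neoverse V2 is
 * auto-detected. Either export LIBXSMM_TARGET=aarch64 before running,
 * or set the target programmatically before the first dispatch. */
#include <libxsmm.h>
#include <stdio.h>

int main(void) {
  libxsmm_init();
  libxsmm_set_target_arch("aarch64");  /* same effect as LIBXSMM_TARGET=aarch64 */
  printf("LIBXSMM target: %s\n", libxsmm_get_target_arch());
  /* ... dispatch/JIT the small GEMM kernels as usual ... */
  libxsmm_finalize();
  return 0;
}
```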

@alheinecke
Collaborator

No, right now nobody is working on SVE128 or V2 support, simply because we don't have access to the hardware. Any help on this is welcome :-)

@FreddieWitherden
Contributor

This will likely require some work on the asparse_reg side too, as we would want to consider both NEON and SVE128 kernels (it is not clear that 128-bit SVE will outperform NEON, whereas 256-bit SVE is quite clear-cut).

@breuera
Contributor Author

breuera commented Mar 4, 2024

Here's a full sweep for matrix-matrix mult. using ASIMD/Neon and FP32 for M,N,K in [1, ..., 64].
In the plots K is fixed to K=64 (raw data below).
[Figure 1: GFLOPS over the full M, N sweep at K=64]
Fixing N=6:
[Figure 2]
Fixing N=32:
[Figure 3]
Fixing N=64:
[Figure 4]

Overall the performance seems too low.
The achievable peak is about 105 FP32 GFLOPS.
NVPL BLAS GEMM reaches that, but only for much larger matrices (M, N, K >~ 1024).
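
(For context: assuming Grace's Neoverse V2 cores indeed have four 128-bit FMA-capable SIMD pipes and run at roughly 3.3 GHz, the single-core peak works out to 4 pipes × 4 FP32 lanes × 2 FLOPs per FMA × 3.3 GHz ≈ 105 GFLOPS, consistent with the number above.)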

We'll now have a look at how to write a fast assembly kernel for a single case.
This should help us optimize the 128-bit JITters.
I also doubt that SVE128 will help much over Neon.

perf_grid_grace_sgemm_xsmm.csv
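
In case it is useful, here is a rough sketch of how a single (M, N, K) point of such a sweep can be timed. It assumes the classic libxsmm_smmdispatch() interface (column-major, alpha=beta=1 defaults; newer trees expose the same functionality through the libxsmm_gemm_shape based dispatch), and the sizes and repetition count are placeholders rather than the exact harness used for the CSV above:

```c
/* Sketch of timing one (M, N, K) point of the sweep. */
#include <libxsmm.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static double wtime(void) {
  struct timespec ts;
  clock_gettime(CLOCK_MONOTONIC, &ts);
  return ts.tv_sec + 1e-9 * ts.tv_nsec;
}

int main(void) {
  const libxsmm_blasint m = 16, n = 6, k = 64;  /* one point of the sweep */
  const int reps = 100000;
  float* a = (float*)malloc(sizeof(float) * m * k);
  float* b = (float*)malloc(sizeof(float) * k * n);
  float* c = (float*)calloc((size_t)m * n, sizeof(float));
  for (int i = 0; i < m * k; ++i) a[i] = (float)rand() / (float)RAND_MAX;
  for (int i = 0; i < k * n; ++i) b[i] = (float)rand() / (float)RAND_MAX;

  /* JIT a small SGEMM (C += A * B); NULL arguments pick the defaults
   * (lda=m, ldb=k, ldc=m, alpha=beta=1, no prefetch). */
  const libxsmm_smmfunction kernel = libxsmm_smmdispatch(m, n, k,
    NULL, NULL, NULL, NULL, NULL, NULL, NULL);
  if (NULL == kernel) { fprintf(stderr, "dispatch failed\n"); return 1; }

  const double t0 = wtime();
  for (int r = 0; r < reps; ++r) kernel(a, b, c);
  const double duration = wtime() - t0;

  printf("M=%i N=%i K=%i: %.1f GFLOPS\n", (int)m, (int)n, (int)k,
    2.0 * m * n * k * reps / duration * 1e-9);
  free(a); free(b); free(c);
  return 0;
}
```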

@FreddieWitherden
Contributor

FreddieWitherden commented Mar 4, 2024 via email

@breuera
Contributor Author

breuera commented Mar 8, 2024

I am only aware of the numbers in the software optimization guides: https://developer.arm.com/documentation/PJDOC-466751330-593177/latest/

@alheinecke
Collaborator

I believe the ASIMD kernel is using in-register broadcast, isn't it? Perhaps the x86-style explicit broadcast and full SIMD FMA is faster?

@stefan0re

Currently I have an assembly kernel for NEON on the Grace CPU that achieves 102 GFLOPS on a small fixed-size matrix, closely approaching the peak performance of 105 GFLOPS.

@FreddieWitherden
Contributor

> I believe the ASIMD kernel is using in-register broadcast, isn't it? Perhaps the x86-style explicit broadcast and full SIMD FMA is faster?

On A64FX I observed a heavy penalty for using register broadcasting. However, the ARM-designed cores seem to handle it quite well.
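
For illustration, the two styles under discussion map to the following ACLE Neon intrinsics (the helper names and the fixed lane index are made up for the example; the JITter would emit the corresponding FMLA/DUP instructions directly):

```c
#include <arm_neon.h>

/* (a) in-register broadcast: FMLA by element, no separate DUP needed */
static inline float32x4_t fma_lane_bcast(float32x4_t c, float32x4_t a, float32x4_t b) {
  return vfmaq_laneq_f32(c, a, b, 0);            /* c += a * b[0] */
}

/* (b) x86-style explicit broadcast: DUP the scalar, then a plain vector FMLA */
static inline float32x4_t fma_dup_bcast(float32x4_t c, float32x4_t a, float32x4_t b) {
  const float32x4_t b0 = vdupq_laneq_f32(b, 0);  /* splat b[0] across all lanes */
  return vfmaq_f32(c, a, b0);                    /* c += a * b0 */
}
```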

@breuera
Contributor Author

breuera commented Mar 8, 2024

Haven't seen broadcasting issues on Neoverse either. But it might be worth double-checking for V2.
