blas_ API: for sgemm of armv8a, only 4x4 microkernel can be used? #133

AnonymousYWL opened this issue Oct 21, 2020 · 9 comments

No description provided.

giaf commented Oct 24, 2020

Hi, where did you see that only the 4x4 microkernel can be used?
Actually, in the BLAS_API sgemm algorithm for ARM Cortex A53 and A57, the 8x8 kernel is implemented and used
https://github.com/giaf/blasfeo/blob/master/blasfeo_hp_cm/sgemm.c#L431
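For reference, a minimal sketch of what a call through the BLAS API looks like from C (standard Fortran-style `sgemm_`, column-major storage, arguments by pointer; the matrix sizes and the link against libblasfeo are only illustrative assumptions):

```c
// Minimal sketch: single-precision GEMM through the standard BLAS interface
// (sgemm_), which the BLAS_API build of BLASFEO provides.
// Illustrative build line (an assumption): gcc example.c -lblasfeo
#include <stdio.h>

// Standard Fortran BLAS prototype: column-major storage, arguments by pointer.
void sgemm_(char *transa, char *transb, int *m, int *n, int *k,
            float *alpha, float *A, int *lda,
            float *B, int *ldb,
            float *beta, float *C, int *ldc);

int main(void)
{
    int m = 8, n = 8, k = 8; // small sizes, chosen only for illustration
    float A[8*8], B[8*8], C[8*8];
    for (int i = 0; i < m*k; i++) A[i] = 1.0f;
    for (int i = 0; i < k*n; i++) B[i] = 2.0f;
    for (int i = 0; i < m*n; i++) C[i] = 0.0f;

    float alpha = 1.0f, beta = 0.0f;
    char no = 'N';

    // C = alpha*A*B + beta*C
    sgemm_(&no, &no, &m, &n, &k, &alpha, A, &m, B, &k, &beta, C, &m);

    printf("C[0] = %f\n", C[0]); // 16.0 for these inputs
    return 0;
}
```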

@AnonymousYWL (Author)

Does code size affect the performance of small GEMMs?

giaf commented Nov 14, 2020

In general I don't expect code size to affect the performance of small GEMMs much, at least once it is loaded into the instruction cache, as is the case with multiple calls to GEMM routines.

@AnonymousYWL (Author)

What if it runs only once?

giaf commented Nov 14, 2020

Then, for small matrices, it may be that the overhead of loading data and code from main memory is the limiting factor.
But it is difficult to say a priori; you should benchmark/profile your application.
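As a rough sketch of such a one-shot measurement (POSIX clock_gettime; the sizes and the sgemm_ prototype are illustrative assumptions, as in the example above):

```c
// Rough sketch: timing a single "cold" call, where code and data still have to
// come from main memory. Reuses the illustrative sgemm_ prototype from above.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

void sgemm_(char *transa, char *transb, int *m, int *n, int *k,
            float *alpha, float *A, int *lda, float *B, int *ldb,
            float *beta, float *C, int *ldc);

int main(void)
{
    int m = 8, n = 8, k = 8;
    float *A = malloc(m*k*sizeof(float));
    float *B = malloc(k*n*sizeof(float));
    float *C = malloc(m*n*sizeof(float));
    for (int i = 0; i < m*k; i++) A[i] = 1.0f;
    for (int i = 0; i < k*n; i++) B[i] = 1.0f;
    for (int i = 0; i < m*n; i++) C[i] = 0.0f;

    float alpha = 1.0f, beta = 0.0f;
    char no = 'N';

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    // first call: both code and data are cold
    sgemm_(&no, &no, &m, &n, &k, &alpha, A, &m, B, &k, &beta, C, &m);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec)*1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("one-shot call: %.0f ns\n", ns);

    free(A); free(B); free(C);
    return 0;
}
```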

hfp commented Nov 16, 2020

( Some of the answers given here are applicable here as well. )

giaf commented Nov 16, 2020

@hfp thanks for sharing the link to your issue, interesting reading!

@AnonymousYWL (Author)

Thank you for your previous reply. I would like to ask: is it reasonable to benchmark small-scale GEMM by running it multiple times and averaging the performance?

giaf commented Nov 16, 2020

IMO it is, as this is a rather common case in practice.
In many cases, the (small) matrices are already in cache as the result of some previous operation, and the same applies to the code.
In particular, in BLASFEO the "nano-kernels" are special functions shared between several linear algebra routines, so it is very likely that they keep being used and stay in cache.

On the other hand, you can always build an example where both code and data are cold.
At the end of the day, it depends on your specific application, and since you didn't share much information about it, it's up to you to judge it.
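As a sketch of such an averaged measurement, assuming the same illustrative sgemm_ prototype, sizes and repetition count as above, with a warm-up call so that code and data start in cache:

```c
// Sketch: warm-up call followed by many back-to-back calls, so that code and
// (small) data are already in cache; the average approximates the warm case
// discussed above. Prototype, sizes and repetition count are illustrative.
#include <stdio.h>
#include <time.h>

void sgemm_(char *transa, char *transb, int *m, int *n, int *k,
            float *alpha, float *A, int *lda, float *B, int *ldb,
            float *beta, float *C, int *ldc);

int main(void)
{
    int m = 8, n = 8, k = 8, reps = 100000;
    static float A[8*8], B[8*8], C[8*8];
    for (int i = 0; i < m*k; i++) A[i] = 1.0f;
    for (int i = 0; i < k*n; i++) B[i] = 1.0f;

    float alpha = 1.0f, beta = 0.0f;
    char no = 'N';

    // warm-up: brings the kernel code and the matrices into cache
    sgemm_(&no, &no, &m, &n, &k, &alpha, A, &m, B, &k, &beta, C, &m);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < reps; r++)
        sgemm_(&no, &no, &m, &n, &k, &alpha, A, &m, B, &k, &beta, C, &m);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec)*1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("average warm call: %.1f ns\n", ns / reps);
    return 0;
}
```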
