blas_ API: for sgemm of armv8a, only 4x4 microkernel can be used? #133

AnonymousYWL opened this issue Oct 21, 2020 · 9 comments

No description provided.

giaf commented Oct 24, 2020

Hi, where did you see that only the 4x4 microkernel can be used?
Actually, in the BLAS_API sgemm algorithm for ARM Cortex A53 and A57, the 8x8 kernel is implemented and used
https://github.com/giaf/blasfeo/blob/master/blasfeo_hp_cm/sgemm.c#L431
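For reference, a minimal sketch of what a call through the BLAS API looks like from C (standard Fortran-style `sgemm_`, column-major storage, arguments by pointer; the matrix sizes and the link against libblasfeo are only illustrative assumptions):

```c
// Minimal sketch: single-precision GEMM through the standard BLAS interface
// (sgemm_), which the BLAS_API build of BLASFEO provides.
// Illustrative build line (an assumption): gcc example.c -lblasfeo
#include <stdio.h>

// Standard Fortran BLAS prototype: column-major storage, arguments by pointer.
void sgemm_(char *transa, char *transb, int *m, int *n, int *k,
            float *alpha, float *A, int *lda,
            float *B, int *ldb,
            float *beta, float *C, int *ldc);

int main(void)
{
    int m = 8, n = 8, k = 8; // small sizes, chosen only for illustration
    float A[8*8], B[8*8], C[8*8];
    for (int i = 0; i < m*k; i++) A[i] = 1.0f;
    for (int i = 0; i < k*n; i++) B[i] = 2.0f;
    for (int i = 0; i < m*n; i++) C[i] = 0.0f;

    float alpha = 1.0f, beta = 0.0f;
    char no = 'N';

    // C = alpha*A*B + beta*C
    sgemm_(&no, &no, &m, &n, &k, &alpha, A, &m, B, &k, &beta, C, &m);

    printf("C[0] = %f\n", C[0]); // 16.0 for these inputs
    return 0;
}
```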

@AnonymousYWL (Author)

Does code size affect the performance of small GEMMs?

giaf commented Nov 14, 2020

In general I don't expect code size to affect the performance of small GEMMs much, at least once it is loaded into the instruction cache, as is the case with multiple calls to GEMM routines.

@AnonymousYWL (Author)

What if it runs only once?

giaf commented Nov 14, 2020

Then, for small matrices, it may be that the overhead of loading data and code from main memory is the limiting factor.
But it is difficult to say a priori; you should benchmark/profile your application.
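As a rough sketch of such a one-shot measurement (POSIX clock_gettime; the sizes and the sgemm_ prototype are illustrative assumptions, as in the example above):

```c
// Rough sketch: timing a single "cold" call, where code and data still have to
// come from main memory. Reuses the illustrative sgemm_ prototype from above.
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

void sgemm_(char *transa, char *transb, int *m, int *n, int *k,
            float *alpha, float *A, int *lda, float *B, int *ldb,
            float *beta, float *C, int *ldc);

int main(void)
{
    int m = 8, n = 8, k = 8;
    float *A = malloc(m*k*sizeof(float));
    float *B = malloc(k*n*sizeof(float));
    float *C = malloc(m*n*sizeof(float));
    for (int i = 0; i < m*k; i++) A[i] = 1.0f;
    for (int i = 0; i < k*n; i++) B[i] = 1.0f;
    for (int i = 0; i < m*n; i++) C[i] = 0.0f;

    float alpha = 1.0f, beta = 0.0f;
    char no = 'N';

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    // first call: both code and data are cold
    sgemm_(&no, &no, &m, &n, &k, &alpha, A, &m, B, &k, &beta, C, &m);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec)*1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("one-shot call: %.0f ns\n", ns);

    free(A); free(B); free(C);
    return 0;
}
```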

hfp commented Nov 16, 2020

( Some of the answers given here are applicable here as well. )

giaf commented Nov 16, 2020

@hfp thanks for sharing the link to your issue, interesting reading!

@AnonymousYWL (Author)

Thank you for your previous reply. I would like to ask: is it reasonable to benchmark small-scale GEMM by running it multiple times and averaging the performance?

giaf commented Nov 16, 2020

IMO it is, as this is a rather common case in practice.
In many cases, the (small) matrices are already in cache as the result of some previous operation, and the same applies to the code.
In particular, in BLASFEO the "nano-kernels" are special functions shared between several linear algebra routines, so it is very likely that they keep being used and stay in cache.

On the other hand, you can always build an example where both code and data are cold.
At the end of the day, it depends on your specific application, and since you didn't share much information about it, it's up to you to judge it.
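As a sketch of such an averaged measurement, assuming the same illustrative sgemm_ prototype, sizes and repetition count as above, with a warm-up call so that code and data start in cache:

```c
// Sketch: warm-up call followed by many back-to-back calls, so that code and
// (small) data are already in cache; the average approximates the warm case
// discussed above. Prototype, sizes and repetition count are illustrative.
#include <stdio.h>
#include <time.h>

void sgemm_(char *transa, char *transb, int *m, int *n, int *k,
            float *alpha, float *A, int *lda, float *B, int *ldb,
            float *beta, float *C, int *ldc);

int main(void)
{
    int m = 8, n = 8, k = 8, reps = 100000;
    static float A[8*8], B[8*8], C[8*8];
    for (int i = 0; i < m*k; i++) A[i] = 1.0f;
    for (int i = 0; i < k*n; i++) B[i] = 1.0f;

    float alpha = 1.0f, beta = 0.0f;
    char no = 'N';

    // warm-up: brings the kernel code and the matrices into cache
    sgemm_(&no, &no, &m, &n, &k, &alpha, A, &m, B, &k, &beta, C, &m);

    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (int r = 0; r < reps; r++)
        sgemm_(&no, &no, &m, &n, &k, &alpha, A, &m, B, &k, &beta, C, &m);
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double ns = (t1.tv_sec - t0.tv_sec)*1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("average warm call: %.1f ns\n", ns / reps);
    return 0;
}
```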
