Poor DGEMM performance for armsve build on Neoverse N2 #641

Open
chrisgoodyer opened this issue Jul 8, 2022 · 4 comments

@chrisgoodyer

Hi.

Whilst doing some comparative benchmarking on the Alibaba Cloud g8m instances I've run into some BLIS performance issues. g8m is based on Arm's Neoverse N2 technology and has 2x128-bit SVE vectors.

When I build for the target "armsve", I get a peak performance of between 5 and 6 GFLOPS on a single core, rather than the 20 GFLOPS I get from the Neon implementation.

There seems to be an awful lot of time spent in the function "bli_dpackm_mrxk_armsve_ref" which makes me think it is packing incorrectly for the 128-bit vector length. Running on AWS Graviton3 instances (with a 256-bit vector length) does not show these issues.
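For readers comparing numbers, a minimal sketch of how a DGEMM GFLOPS figure like the ones above is conventionally derived: an m x n x k matrix multiply is counted as 2*m*n*k floating-point operations, divided by the wall-clock time. The helper names `dgemm_gflops` and `best_time` are hypothetical, and this is not the benchmark harness used in the report.

```python
import time


def dgemm_gflops(m: int, n: int, k: int, seconds: float) -> float:
    """Convert a DGEMM wall-clock time to GFLOPS.

    Uses the conventional count of 2*m*n*k floating-point operations
    (one multiply and one add per inner-product term).
    """
    return 2.0 * m * n * k / seconds / 1e9


def best_time(fn, repeats: int = 3) -> float:
    """Return the fastest of `repeats` timed calls to fn(), in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


# Example: a 1000x1000x1000 DGEMM that takes 0.1 s (an assumed timing).
example = dgemm_gflops(1000, 1000, 1000, 0.1)
```

In practice `fn` would wrap the actual DGEMM call under test; taking the best of several runs reduces noise from warm-up effects.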

Thanks.

Chris

@devinamatthews
Member

I think, of the currently-available configs, that ThunderX2 should perform best on N2. The SVE kernels are tuned for 256+ bit so I think you really want a neon kernel. A "real" Neoverse N1 kernel/configuration should be in master shortly.

@jlinford

jlinford commented Jul 8, 2022

Good to hear about the N1 kernel coming to master. I also suggest building a 4x128 NEON kernel on the Neoverse V1 (AWS Graviton3). For GEMM, I don't see SVE128 having a significant advantage over NEON128. If you build a kernel that can feed four NEON SIMD units it should run very well on all known Arm server-class CPUs, even if they don't have wide SVE units.
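A back-of-the-envelope sketch of the reasoning here: peak double-precision throughput depends on the number of SIMD units times the lanes per unit, so four 128-bit units and two 256-bit units come out the same on paper. The function names are hypothetical, the model assumes every unit can issue one FMA per cycle, and the 2.5 GHz clock is an illustrative assumption rather than a measured frequency for any of the machines discussed.

```python
def flops_per_cycle(simd_units: int, vector_bits: int, element_bits: int = 64) -> int:
    """Peak flops per cycle, assuming each unit issues one FMA per cycle."""
    lanes = vector_bits // element_bits  # doubles per vector register
    return simd_units * lanes * 2        # an FMA counts as 2 flops


def peak_gflops(simd_units: int, vector_bits: int, clock_ghz: float) -> float:
    """Theoretical single-core peak in GFLOPS for an assumed clock."""
    return flops_per_cycle(simd_units, vector_bits) * clock_ghz


four_neon128 = flops_per_cycle(4, 128)  # 4 units x 2 lanes x 2 = 16
two_sve256 = flops_per_cycle(2, 256)    # 2 units x 4 lanes x 2 = 16
two_simd128 = flops_per_cycle(2, 128)   # 2 units x 2 lanes x 2 = 8

# Assumed clock of 2.5 GHz, purely for illustration.
assumed_peak = peak_gflops(2, 128, 2.5)
```

Under this model a 128-bit SVE kernel has no width advantage over a 128-bit NEON kernel, so any difference has to come from instruction selection and scheduling.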

@jdiamondGitHub
Member

jdiamondGitHub commented Jul 8, 2022 via email

@xrq-phys
Collaborator

Apologies for this late response.

For Graviton 3, 2xSVE256 does better than 4xNEON by about 2% or so.

armsve is not suitable for 128-bit vectors because it lacks indexed FMA, which reduces the assembly's capacity to hide instruction latency. Even so, 5~6 GFLOPS is unexpected (it should be ~15). A possible reason is that your Neoverse N2 core does not implement hardware prefetching, which kernels/armsve presumes. I do not know how Alibaba Cloud differs from, say, Amazon C7g and Oracle Ampere, but using the NEON kernels should be good for your machine.
