Matrix-Multiply-using-Arm-Neon

This repo implements NEON-based matrix multiplication for several data types: int16, int32, float32 and float64. Performance results on a Raspberry Pi 4 (arm64) are shown along with the code.
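To illustrate what a NEON-based multiply can look like, here is a minimal sketch for float32. It is not taken from float32mul.c: the function name `matmul_f32`, the row-major layout and the four-column blocking are assumptions made for the example. On targets without NEON the scalar loop at the bottom handles everything, so the function compiles anywhere.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* C = A * B for row-major n x n float32 matrices.
 * Hypothetical helper for illustration, not the repo's implementation. */
void matmul_f32(const float *A, const float *B, float *C, size_t n)
{
    assert(A != NULL && B != NULL && C != NULL);
    for (size_t i = 0; i < n; i++) {
        size_t j = 0;
#if defined(__ARM_NEON)
        /* Compute four output columns of row i per iteration. */
        for (; j + 4 <= n; j += 4) {
            float32x4_t acc = vdupq_n_f32(0.0f);
            for (size_t k = 0; k < n; k++) {
                /* Load B[k][j..j+3] and accumulate A[i][k] * B[k][j..j+3]. */
                float32x4_t b = vld1q_f32(&B[k * n + j]);
                acc = vmlaq_n_f32(acc, b, A[i * n + k]);
            }
            vst1q_f32(&C[i * n + j], acc);
        }
#endif
        /* Scalar tail for leftover columns (full fallback without NEON). */
        for (; j < n; j++) {
            float sum = 0.0f;
            for (size_t k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
    }
}
```

On arm64, NEON is enabled by default in GCC; on 32-bit Arm targets a flag such as `-mfpu=neon` is typically needed.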

Requirements

 1. GCC >= 10
 2. Arm architecture or QEMU emulator

Results

On Arm NEON the code runs about 10x faster when auto-vectorization is enabled (e.g. gcc int16mul.c -O3 -o int16_result). The gap between the plain and the NEON multiplication grows with the matrix size.
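Whether GCC can auto-vectorize a multiply at -O3 depends a lot on the loop shape. The sketch below is one arrangement that usually gives the compiler a good chance: the innermost loop walks contiguous memory and the pointers are marked `restrict`. The function name `matmul_i16` and the choice to accumulate into int32 are assumptions for the example, not the code in int16mul.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* C = A * B for row-major n x n matrices: int16 inputs, int32 outputs.
 * Hypothetical helper for illustration. The i-k-j loop order keeps the
 * innermost accesses to B and C contiguous, which helps the vectorizer. */
void matmul_i16(const int16_t *restrict A, const int16_t *restrict B,
                int32_t *restrict C, size_t n)
{
    assert(A != NULL && B != NULL && C != NULL);
    memset(C, 0, n * n * sizeof *C);
    for (size_t i = 0; i < n; i++) {
        for (size_t k = 0; k < n; k++) {
            const int32_t a = A[i * n + k];
            /* Candidate loop for auto-vectorization. */
            for (size_t j = 0; j < n; j++)
                C[i * n + j] += a * B[k * n + j];
        }
    }
}
```

Assuming the code is saved as matmul_i16.c (a hypothetical file name), `gcc -O3 -fopt-info-vec -c matmul_i16.c` makes GCC report which loops it vectorized.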

The code was tested on a Raspberry Pi and on Linux x86_64 using the QEMU emulator. To learn more about running QEMU this way, see https://github.com/multiarch/qemu-user-static

It is better to run on real Arm hardware than in QEMU, because performance in QEMU is quite low since it is an emulated platform (e.g. for a 1000 x 1000 float32 multiplication, QEMU takes 49s whereas Arm hardware takes 10.4s).

When the code is auto-vectorized, the time for the 1000 x 1000 case on Arm NEON drops to 1.4s.

The timings, in seconds, are given below.

Matrix size  int16 (s)  int16 NEON (s)  int32 (s)  int32 NEON (s)  float32 (s)  float32 NEON (s)  float64 (s)  float64 NEON (s)
8 x 8        0.000042   0.000039        0.000035   0.000022        0.000054     0.000027          0.000030     0.000027
32 x 32      0.001439   0.001305        0.001502   0.001403        0.001559     0.000730          0.001686     0.008837
50 x 50      0.005361   0.003472        0.005773   0.002039        0.005832     0.002198          0.005926     0.002301
100 x 100    0.023119   0.016946        0.021783   0.007522        0.022988     0.008413          0.023099     0.011161
200 x 200    0.158035   0.123986        0.149948   0.059083        0.159806     0.067440          0.163524     0.089471
300 x 300    0.532597   0.435758        0.545451   0.199766        0.542208     0.228595          0.582677     0.319158
400 x 400    1.314986   1.004088        1.367670   0.471891        1.397881     0.547197          1.725430     0.787738
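Timings like these can be collected with a monotonic clock wrapped around the multiply call. The sketch below assumes POSIX `clock_gettime` and is not necessarily how the numbers above were measured; `seconds_between`, `time_once` and `example_work` are hypothetical names introduced for the example.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L /* for clock_gettime under strict -std= modes */
#endif
#include <assert.h>
#include <time.h>

/* Elapsed time from start to end, in seconds. */
double seconds_between(struct timespec start, struct timespec end)
{
    return (double)(end.tv_sec - start.tv_sec)
         + (double)(end.tv_nsec - start.tv_nsec) / 1e9;
}

/* Run work() once and return how long it took, in seconds. */
double time_once(void (*work)(void))
{
    struct timespec t0, t1;
    assert(work != NULL);
    clock_gettime(CLOCK_MONOTONIC, &t0);
    work();
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return seconds_between(t0, t1);
}

/* Stand-in workload; replace it with a call to the multiply under test. */
void example_work(void)
{
    volatile double x = 0.0;
    for (int i = 0; i < 100000; i++)
        x += (double)i;
}
```

Calling `time_once(example_work)` returns the elapsed seconds for one run. For small matrices such as 8 x 8, averaging many repetitions gives steadier numbers than a single run.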

For more results and graph comparisons, see the result.pdf file in this repo.


ruthreshx/Matrix_Multiply_using_Arm_Neon_and_Avx
