
Hexagon HVX implementation of F32 vbinary #6386

Open · wants to merge 16 commits into master

Conversation

@ejparkqc (Contributor) commented May 7, 2024

We needed to increase XNN_ALLOCATION_ALIGNMENT to 128 bytes for HVX and use a predicated store for the tail elements. CTest passes, but performance is not good yet; the natural next step is performance improvement.
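The predicated-store tail pattern described above can be sketched in portable C. This is not the actual HVX kernel: the function name and the 32-float (128-byte) vector width are illustrative, and the scalar tail loop stands in for a real HVX vector predicate store.

```c
#include <stddef.h>

/* Sketch: process full 128-byte (32-float) "vectors", then handle the
 * remaining elements with a bounded tail loop that emulates a predicated
 * store, writing only the valid lanes instead of a full vector. */
static void vadd_f32_sketch(size_t n, const float* a, const float* b,
                            float* out) {
  const size_t kLanes = 32;  /* one 128-byte HVX vector holds 32 floats */
  size_t i = 0;
  for (; i + kLanes <= n; i += kLanes) {
    for (size_t l = 0; l < kLanes; l++) {
      out[i + l] = a[i + l] + b[i + l];
    }
  }
  /* Tail: "predicated" store of the remaining n - i elements only. */
  for (; i < n; i++) {
    out[i] = a[i] + b[i];
  }
}
```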

@ejparkqc changed the title from "F32 vbinary" to "Hexagon HVX implementation of F32 vbinary" on May 7, 2024
@@ -591,4 +591,18 @@ tools/xngen src/f32-vbinary/vopc-avx512f.c.in -D OP=SQRDIFF -D BATCH_TILE=32 -D
tools/xngen src/f32-vbinary/vopc-avx512f.c.in -D OP=SUB -D BATCH_TILE=16 -D ACTIVATION=MINMAX -o src/f32-vbinary/gen/f32-vsubc-minmax-avx512f-u16.c &
tools/xngen src/f32-vbinary/vopc-avx512f.c.in -D OP=SUB -D BATCH_TILE=32 -D ACTIVATION=MINMAX -o src/f32-vbinary/gen/f32-vsubc-minmax-avx512f-u32.c &

################################### ARM NEON ##################################
Collaborator:

Stale comment?

Contributor Author:

Yes, when I copied the command-line examples from the ARM NEON section, I missed changing this line to HEXAGON HVX. Done!

@@ -591,4 +591,18 @@ tools/xngen src/f32-vbinary/vopc-avx512f.c.in -D OP=SQRDIFF -D BATCH_TILE=32 -D
tools/xngen src/f32-vbinary/vopc-avx512f.c.in -D OP=SUB -D BATCH_TILE=16 -D ACTIVATION=MINMAX -o src/f32-vbinary/gen/f32-vsubc-minmax-avx512f-u16.c &
tools/xngen src/f32-vbinary/vopc-avx512f.c.in -D OP=SUB -D BATCH_TILE=32 -D ACTIVATION=MINMAX -o src/f32-vbinary/gen/f32-vsubc-minmax-avx512f-u32.c &

################################### HEXAGON HVX ##################################
tools/xngen src/f32-vbinary/vop-hexagon.c.in -D OP=ADD -D BATCH_TILE=32 -D ACTIVATION=MINMAX -o src/f32-vbinary/gen/f32-vadd-minmax-hexagon-u32.c &
Collaborator:

Based on the other kernels nearby, it seems like the kernel should be suffixed hvx. If I understand correctly, hexagon:x86::hvx:avx512f (for example).

Contributor Author:

Good point; I will commit after making all the related changes.

HVX_Vector *vptr_o = (HVX_Vector*) output;

for (; batch >= 64 * sizeof(float); batch -= 64 * sizeof(float)) {
HVX_Vector va0 = *vptr_a++;
Contributor:

Is this the best way to load on HVX?
Would it be better to load all va values and then all vb values?
Would it be better to use offsets and then increment the pointers after?
Are there LDP or LD4 instructions like ARM?

Contributor Author:

I changed it to load all va values and then all vb values for now. I will improve this part later using special instructions or offsets.
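The reordering described here can be sketched in portable C: within one unrolled iteration, issue all the "a" loads, then all the "b" loads, then the arithmetic, so the independent loads can overlap. On HVX the same idea applies to 128-byte HVX_Vector loads; this scalar version, the function name, and the unroll factor of 4 are only illustrative.

```c
#include <stddef.h>

/* Sketch: group all va loads, then all vb loads, then the adds, per
 * unrolled iteration; leftover elements fall through to a scalar tail. */
static void vadd_grouped_sketch(size_t n, const float* a, const float* b,
                                float* o) {
  size_t i = 0;
  for (; i + 4 <= n; i += 4) {
    const float va0 = a[i + 0], va1 = a[i + 1], va2 = a[i + 2], va3 = a[i + 3];
    const float vb0 = b[i + 0], vb1 = b[i + 1], vb2 = b[i + 2], vb3 = b[i + 3];
    o[i + 0] = va0 + vb0;
    o[i + 1] = va1 + vb1;
    o[i + 2] = va2 + vb2;
    o[i + 3] = va3 + vb3;
  }
  for (; i < n; i++) {
    o[i] = a[i] + b[i];
  }
}
```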

@fbarchard (Contributor) left a comment:

Easy! Looks good overall.

const float* input_a,
const float* input_b,
float* output,
const union xnn_f32_default_params params[restrict XNN_MIN_ELEMENTS(1)]) XNN_OOB_READS
Contributor:

Could you mask the remainder read so you can remove XNN_OOB_READS (out-of-bounds reads)?
HVX has a large over-read, and the masked remainder won't normally be a performance bottleneck; the main loop will be. On AVX512, RISC-V, and SVE I expect we'll move toward ASan-friendly reads using predicates.
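One ASan-friendly way to avoid the over-read, sketched in portable C under the assumption that a small scratch buffer is acceptable (real HVX code would instead use a predicated load; the function name and 32-float width are illustrative):

```c
#include <stddef.h>
#include <string.h>

/* Sketch: copy exactly `remaining` floats into a zero-filled full-width
 * buffer, so subsequent full-vector arithmetic never reads bytes past
 * the end of the input buffer. */
static void tail_safe_load(size_t remaining /* < 32 */, const float* src,
                           float dst[32]) {
  memset(dst, 0, 32 * sizeof(float));          /* clear unused lanes */
  memcpy(dst, src, remaining * sizeof(float)); /* read only valid bytes */
}
```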

@@ -591,4 +591,18 @@ tools/xngen src/f32-vbinary/vopc-avx512f.c.in -D OP=SQRDIFF -D BATCH_TILE=32 -D
tools/xngen src/f32-vbinary/vopc-avx512f.c.in -D OP=SUB -D BATCH_TILE=16 -D ACTIVATION=MINMAX -o src/f32-vbinary/gen/f32-vsubc-minmax-avx512f-u16.c &
tools/xngen src/f32-vbinary/vopc-avx512f.c.in -D OP=SUB -D BATCH_TILE=32 -D ACTIVATION=MINMAX -o src/f32-vbinary/gen/f32-vsubc-minmax-avx512f-u32.c &

################################### HEXAGON HVX ##################################
Contributor:

missing OP=DIV
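Presumably the missing generator line mirrors the OP=ADD line shown above; the input template and output file name below are inferred from that pattern and are not verified against the PR:

```shell
tools/xngen src/f32-vbinary/vop-hexagon.c.in -D OP=DIV -D BATCH_TILE=32 -D ACTIVATION=MINMAX -o src/f32-vbinary/gen/f32-vdiv-minmax-hexagon-u32.c &
```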

@fbarchard (Contributor) left a comment:

I would ask you how fast it is, but we seem to be missing benchmarks for vbinary.

I'm adding x86 kernels for FP16, which you could refer to for what needs to be done: #6394

For benchmarking I used a real model and tflite benchmark_model with perf record/report to measure the improvement:

Was (AVX with VCVT): 7.15% xnn_f16_vmulc_minmax_ukernel__f16c_u16
Now (AVX512 FP16): 2.94% xnn_f16_rmax_ukernel__avx512fp16_u128_acc4

But I'll try to make a benchmark we can use for F16 and F32 vbinary.

@ejparkqc (Contributor Author):
The current PR should work with unaligned load as well without changing tester. I will work on the related issue now.

@ejparkqc force-pushed the f32-vbinary branch 2 times, most recently from cd84b0a to 579fb80, on May 23, 2024 at 20:17
@dsharlet (Collaborator):

This PR still contains the 10k+ line additions of microkernels.bzl and microkernels.cmake. You need to revert those changes, and I also think you're missing the microkernels_hexagon.bzl changes. You shouldn't have to make these changes manually, I think you need to run:

python3 tools/update-microkernels.py -a

@ejparkqc (Contributor Author):

Yep, I realized that I modified a generated file which clearly says "Do not edit" :) I am working on this. I will also revert the changes to microkernels.bzl and microkernels.cmake.

One related question: we currently have 'hexagon' in ISA_LIST in update-microkernels.py, but the upcoming kernels use 'hvx' as the isa_name instead of 'hexagon'. cs16-vsquareabs has Hexagon kernels that use scalar code instead of HVX vectors, so renaming the cs16-vsquareabs kernels to include 'hvx' does not seem right.

The simplest solution is to have both 'hvx' and 'hexagon' in ISA_LIST and make the related changes. This will generate kernels separately for HVX and Hexagon (non-HVX). If I add 'hexagon' to ARCH_LIST, it tries to generate asm_kernels and jit_kernels, so I want to avoid that for now. Do you think this is okay?
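A hypothetical sketch of that ISA_LIST change: the list contents and the helper function below are assumptions made for illustration, not the actual update-microkernels.py code.

```python
# Hypothetical sketch: keep 'hexagon' for scalar Hexagon kernels and add
# 'hvx' for HVX-vector kernels, so the two are classified separately.
ISA_LIST = frozenset([
    'scalar',
    'neon',
    'avx512f',
    'hexagon',  # non-HVX Hexagon kernels, e.g. cs16-vsquareabs
    'hvx',      # HVX-vector kernels, e.g. the f32-vbinary kernels here
])

def isa_of(kernel_filename):
    # Match ISA tokens against the dash-separated parts of the file name,
    # so 'f32-vadd-minmax-hvx-u32.c' maps to 'hvx', not 'hexagon'.
    parts = kernel_filename.replace('.c', '').split('-')
    matches = [isa for isa in ISA_LIST if isa in parts]
    return max(matches, key=len) if matches else None
```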

…crokernels.py to have hvx in isa_list as well.