New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Hexagon HVX implementation of F32 vbinary #6386
base: master
Are you sure you want to change the base?
Conversation
scripts/generate-f32-vbinary.sh
Outdated
@@ -591,4 +591,18 @@ tools/xngen src/f32-vbinary/vopc-avx512f.c.in -D OP=SQRDIFF -D BATCH_TILE=32 -D | |||
tools/xngen src/f32-vbinary/vopc-avx512f.c.in -D OP=SUB -D BATCH_TILE=16 -D ACTIVATION=MINMAX -o src/f32-vbinary/gen/f32-vsubc-minmax-avx512f-u16.c & | |||
tools/xngen src/f32-vbinary/vopc-avx512f.c.in -D OP=SUB -D BATCH_TILE=32 -D ACTIVATION=MINMAX -o src/f32-vbinary/gen/f32-vsubc-minmax-avx512f-u32.c & | |||
|
|||
################################### ARM NEON ################################## |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Stale comment?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, after I copied the command line examples from ARM NEON section, missed this line to change to HEXAGON HVX. Done!
scripts/generate-f32-vbinary.sh
Outdated
@@ -591,4 +591,18 @@ tools/xngen src/f32-vbinary/vopc-avx512f.c.in -D OP=SQRDIFF -D BATCH_TILE=32 -D | |||
tools/xngen src/f32-vbinary/vopc-avx512f.c.in -D OP=SUB -D BATCH_TILE=16 -D ACTIVATION=MINMAX -o src/f32-vbinary/gen/f32-vsubc-minmax-avx512f-u16.c & | |||
tools/xngen src/f32-vbinary/vopc-avx512f.c.in -D OP=SUB -D BATCH_TILE=32 -D ACTIVATION=MINMAX -o src/f32-vbinary/gen/f32-vsubc-minmax-avx512f-u32.c & | |||
|
|||
################################### HEXAGON HVX ################################## | |||
tools/xngen src/f32-vbinary/vop-hexagon.c.in -D OP=ADD -D BATCH_TILE=32 -D ACTIVATION=MINMAX -o src/f32-vbinary/gen/f32-vadd-minmax-hexagon-u32.c & |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on the other kernels nearby, it seems like the kernel should be suffixed hvx
. If I understand correctly, hexagon
:x86
::hvx
:avx512f
(for example).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point and I will commit after making all related changes.
HVX_Vector *vptr_o = (HVX_Vector*) output; | ||
|
||
for (; batch >= 64 * sizeof(float); batch -= 64 * sizeof(float)) { | ||
HVX_Vector va0 = *vptr_a++; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this the best way to load on HVX?
Would it be better to load all va values and then all vb values?
Would it be better to use offsets and then increment the pointers after?
Are there LDP or LD4 instructions like ARM?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed to load all va values then vb values for now. I will improve this part later using special instructions or offset.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Easy! Looks good overall.
const float* input_a, | ||
const float* input_b, | ||
float* output, | ||
const union xnn_f32_default_params params[restrict XNN_MIN_ELEMENTS(1)]) XNN_OOB_READS |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you mask the remainder read so you can remove OOB_READs? (Out of Bounds)
HVX has a large over read, and the masked remainder wont normally be a performance bottleneck... the main loop will be. On AVX512, RiSCV and SVE I expect we'll move towards asan friendly reads using predicates.
@@ -591,4 +591,18 @@ tools/xngen src/f32-vbinary/vopc-avx512f.c.in -D OP=SQRDIFF -D BATCH_TILE=32 -D | |||
tools/xngen src/f32-vbinary/vopc-avx512f.c.in -D OP=SUB -D BATCH_TILE=16 -D ACTIVATION=MINMAX -o src/f32-vbinary/gen/f32-vsubc-minmax-avx512f-u16.c & | |||
tools/xngen src/f32-vbinary/vopc-avx512f.c.in -D OP=SUB -D BATCH_TILE=32 -D ACTIVATION=MINMAX -o src/f32-vbinary/gen/f32-vsubc-minmax-avx512f-u32.c & | |||
|
|||
################################### HEXAGON HVX ################################## |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
missing OP=DIV
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would ask you how fast it is, but we seem to be missing benchmarks for vbinary.
I'm adding x86 kernels for FP16, which you could refer to for what needs to be done.
#6394
For benchmark I used a real model and tflite benchmark_model with perf record/report to measure improvement:
Was AVX with VCVT
7.15% xnn_f16_vmulc_minmax_ukernel__f16c_u16
Now AVX512 FP16
2.94% xnn_f16_rmax_ukernel__avx512fp16_u128_acc4
But I'll try to make a benchmark we can use for F16 and F32 vbinary
The current PR should work with unaligned load as well without changing tester. I will work on the related issue now. |
…d other related code changes
…NEON to HEXAGON HVX in scripts/generate-f32-vbinary.sh
… several locations.
cd84b0a
to
579fb80
Compare
This PR still contains the 10k+ line additions of microkernels.bzl and microkernels.cmake. You need to revert those changes, and I also think you're missing the microkernels_hexagon.bzl changes. You shouldn't have to make these changes manually, I think you need to run:
|
Yeap, I realized that I modified the generated file which is clearly saying "Do not edit" :) I am working on this. One related question: We currently have 'hexagon' in ISA_LIST in update-microkernels.py but coming kernels having 'hvx' as isa_name instead of 'hexagon'. cs16-vsquareabs has hexagon kernels using scalar vectors instead of HVX vectors, so renaming cs16-vsqaureabs kernels with including 'hvx' seems not right. The simplest solution is to have both 'hvx' and 'hexagon' in the ISA_LIST and make related changes. This will generate kernels separately for HVX and Hexagon (non-HVX). If I add 'hexagon' in ARCH_LIST, it tries to generate asm_kernels and jit_kernels, so I want avoid this for now. Do you think this is okay? |
…crokernels.py to have hvx in isa_list as well.
We needed to adjust XNN_ALLOCATION_ALIGNMENT to 128 Byte for HVX and use predicate store for the tail part. Ctest passed but performance is not good yet. The next step naturally becomes the performance improvements.