Added SVE128 support for GEMMs #873
base: main
stefan0re
commented
Mar 21, 2024
- Introduced SVE128 support for GEMM operations
- Added support for FP32, FP64, BF16, I8
- Performance currently matches the NEON implementation (as expected; this will be improved in the future)
- The hash.c test fails (only tests/hash.c:84); the same failure occurs with the LIBXSMM_TARGET=aarch64 flag
- some xgemm tests fail due to lack of support for some data types
Failing test: tests/hash.c:84
Thanks for the PR. I still see some important work before we can merge, e.g. auto detection of V2 and fixing all the issues you raised. Can you please add some functionality that we can unit test on GVT3? Then we can create a branch and go from there, e.g. by using:
This new implementation is faster than the previous ASIMD/NEON kernel on NVIDIA Grace (FP32, FP64). The main changes are:
- used k unrolling with element access on B
- added st1 instruction
- with LIBXSMM_TARGET=aarch64 there is fast NEON code for Neoverse V2 (FP32 and FP64)
This is a plot of the FP32 performance on a single core, with K fixed to 48 and N fixed to 40.
Since the old version blocks N in multiples of 6 and the new one in multiples of 5, can you also share results for N=36?
Please see comments
src/generator_aarch64_instructions.c
Outdated
@@ -552,7 +560,7 @@ void libxsmm_aarch64_instruction_asimd_struct_r_move( libxsmm_generated_code*
code[code_head] |= (unsigned int)((0x1 & (unsigned int)i_tupletype) << 30);

/* load/store with offset register */
if ( (i_vmove_instr & 0x3) == 0x3 ) {
if ( (i_vmove_instr & 0x3) == 0x3 && ((i_vmove_instr == LIBXSMM_AARCH64_INSTR_ASIMD_LD1R) || (i_vmove_instr == LIBXSMM_AARCH64_INSTR_ASIMD_LD1R_R_POST))) {
Can you please rework this so that we don't test for full instruction types but rather for classes of instructions? This keeps the code generator from becoming too convoluted and performance-bottlenecked by instruction-specific if conditions.
src/generator_aarch64_instructions.h
Outdated
@@ -260,6 +260,14 @@
#define LIBXSMM_AARCH64_INSTR_ASIMD_LD1R_R_POST 0x0dc0c003
#define LIBXSMM_AARCH64_INSTR_ASIMD_LD1_I_POST 0x0ddf8002
#define LIBXSMM_AARCH64_INSTR_ASIMD_LD1_R_POST 0x0dc08003
#define LIBXSMM_AARCH64_INSTR_ASIMD_LD1_4 0x0c402000 // loads 4 values to vector register
Please use C and not C++ comments
@@ -1384,14 +1411,36 @@ void libxsmm_generator_store_2dregblock_aarch64_asimd( libxsmm_generated_code* i

/* start register of accumulator */
l_vec_reg_acc_start = i_vec_reg_count - (i_n_blocking * l_m_total_blocks);
/* set store instruction */
if( l_m_blocks[0] == 4 ){
l_a_store_instruction = LIBXSMM_AARCH64_INSTR_ASIMD_ST1_4;
Does this still work with v1 GEMM kernels?
src/generator_gemm_aarch64.c
Outdated
@@ -1113,16 +1246,17 @@ void libxsmm_generator_gemm_aarch64_kloop( libxsmm_generated_code* io
l_k_stride = 4;
}
}

// TODO: implement new neoverse_v2 kernel
Please use a C comment.
src/generator_gemm_common_aarch64.c
Outdated
@@ -344,7 +344,7 @@ void libxsmm_generator_gemm_vnni_store_C_from_scratch_aarch64( libxsmm_generated
libxsmm_aarch64_instruction_alu_compute_imm12( io_generated_code, LIBXSMM_AARCH64_INSTR_GP_ADD_I, LIBXSMM_AARCH64_GP_REG_XSP, LIBXSMM_AARCH64_GP_REG_X0, 0, 0 );
libxsmm_aarch64_instruction_alu_move( io_generated_code, LIBXSMM_AARCH64_INSTR_GP_STR_I_OFF, LIBXSMM_AARCH64_GP_REG_XSP, LIBXSMM_AARCH64_GP_REG_XZR, 64, i_gp_reg_mapping->gp_reg_c);
libxsmm_aarch64_instruction_alu_move( io_generated_code, LIBXSMM_AARCH64_INSTR_GP_STR_I_OFF, LIBXSMM_AARCH64_GP_REG_XSP, LIBXSMM_AARCH64_GP_REG_XZR, 32, l_gp_reg_in);
if ( libxsmm_cpuid_arm_use_bfdot() == 0 ) {
if ( libxsmm_cpuid_arm_use_bfdot() == 0 ) { // TODO: check for SVE128
Please use a C-style comment.
src/generator_mateltwise_aarch64.c
Outdated
@@ -230,7 +232,7 @@ libxsmm_blasint libxsmm_generator_mateltwise_aarch64_valid_arch_precision( libxs
LIBXSMM_DATATYPE_I64 == libxsmm_meltw_getenum_precision(i_mateltwise_desc, LIBXSMM_MELTW_FIELD_COMP) ) {
is_valid_arch_prec = 0;
}
}
} // TODO: check for SVE128 Support!! add -> (&& (io_generated_code->arch != LIBXSMM_AARCH64_SVE128)
Please use a C-style comment.
src/generator_matequation_aarch64.c
Outdated
@@ -416,7 +416,7 @@ libxsmm_blasint libxsmm_generator_matequation_aarch64_valid_arch_precision( libx
/* Binary not supported for fp64 */
libxsmm_meltw_binary_type non_fp64_binary[2] = { LIBXSMM_MELTW_TYPE_BINARY_MUL_AND_REDUCE_TO_SCALAR_OP_ADD,
LIBXSMM_MELTW_TYPE_BINARY_ZIP };

// TODO: check for SVE128!
Please use a C-style comment.