
Enable A^T GEMM for BF16 #781

Open
itaraban opened this issue Jun 14, 2023 · 3 comments

Comments

@itaraban

Could you please enable this case (A-transposed, i.e., TN) in GEMM for BF16?
This is related to optimizations in the DGL project.

@hfp
Collaborator

hfp commented Jun 14, 2023

Is there a source file in DGL referring to what's needed or a model/case that runs into this gap?

@itaraban
Author

In the RGCN model, DGL uses the segment_mm operation, which is currently implemented via torch (so it is very slow on CPU right now).
I prepared a branch with a LIBXSMM implementation: itaraban/dgl@cc03905
It uses the same algorithm as the CUDA version: https://github.com/dmlc/dgl/blob/master/src/array/cuda/gather_mm.cu#L202

It works well for float and double; the performance gain is up to 3x for the full model.
But I cannot use it for BF16: I hit https://github.com/libxsmm/libxsmm/blob/main_stable/src/generator_gemm.c#L344.
In that case we still have to use torch for BF16, and the model is 2x slower than the float version.

@hfp
Collaborator

hfp commented Jun 14, 2023

I checked: we do not have a BF16 TN case, or at least it is not tested/exercised, i.e., this seems to be a valid issue (besides TN/A-transpose being an unfortunate case in the hot path ;-).
