Performance Gap between Neural Speed Matmul Operator and Llama.cpp Operator #174

aciddelgado · 2024-03-13T15:26:31Z

I've found a performance gap between the Neural Speed matmul operator from this repository and the llama.cpp operator. I identified it while running a benchmark with the ONNXRuntime-GenAI tool.
We used ONNXRuntime-GenAI to run a CPU-based int4 version of Phi-2 that uses the MatMulNBits operator, then compared its performance with llama.cpp.
The GenAI token generation throughput was measured at 13.699070483881153 tokens per second (tps), while llama.cpp achieved a higher throughput of 22.75 tps. Both figures are end-to-end.
Profiling identified the MatMulNBits operator as the bottleneck for the model. The metrics below were acquired with the onnxruntime profiling tool.

Past sequence length 29, Total sequence length 30

Name	Duration	Pct	Count	Cumulative Pct	Cumulative Dur
MatMulNBits	4132089	87.39	15826	87.39	4132089
MultiHeadAttention	289191	6.12	2624	93.50	4421280
Add	131205	2.77	16072	96.28	4552485
FastGelu	67065	1.42	2624	97.69	4619550

Past sequence length 128, Total sequence length 129

Name	Duration	Pct	Count	Cumulative Pct	Cumulative Dur
MatMulNBits	3882211	81.92	15440	81.92	3882211
MultiHeadAttention	576563	12.17	2560	94.08	4458774
Add	118635	2.50	15680	96.59	4577409
FastGelu	60107	1.27	2560	97.86	4637516

Past sequence length 512, Total sequence length 513

Name	Duration	Pct	Count	Cumulative Pct	Cumulative Dur
MatMulNBits	3054838	62.79	11773	62.79	3054838
MultiHeadAttention	1582324	32.53	1952	95.32	4637162
Add	98730	2.03	11956	97.35	4735892
FastGelu	48359	0.99	1952	98.34	4784251

This issue needs to be addressed to improve the performance of the Neural Speed Matmul operator and bring it up to par with the Llama.cpp operator.
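
As a rough reading of the three tables above, dividing each Duration by its Count gives an average cost per call. This is only a sketch: the figures are copied from the tables, the unit is whatever the profiler's Duration column uses (it is not stated here), and the `profiles` dictionary and `average_per_call` helper are illustrative names, not part of any tool.

```python
# Duration and Count values copied from the profiling tables above.
# Keys are the total sequence length of each run.
profiles = {
    30: {"MatMulNBits": (4132089, 15826), "MultiHeadAttention": (289191, 2624)},
    129: {"MatMulNBits": (3882211, 15440), "MultiHeadAttention": (576563, 2560)},
    513: {"MatMulNBits": (3054838, 11773), "MultiHeadAttention": (1582324, 1952)},
}


def average_per_call(duration: int, count: int) -> float:
    """Average duration of one operator call, in the table's Duration unit."""
    return duration / count


# Per-run, per-operator averages derived from the table figures.
averages = {
    total_len: {op: average_per_call(d, c) for op, (d, c) in ops.items()}
    for total_len, ops in profiles.items()
}

for total_len, ops in averages.items():
    for op, avg in ops.items():
        print(f"total_len={total_len:4d} {op:20s} avg per call = {avg:.1f}")
```

With these numbers the per-call cost of MatMulNBits stays roughly flat across sequence lengths, while the per-call cost of MultiHeadAttention grows with the total sequence length, which matches its rising share in the third table.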

luoyu-intel · 2024-03-14T01:10:21Z

Thanks for your report!
What's the accuracy level of this model's MatMulNBits?

yufenglee · 2024-03-14T18:15:44Z

> Thanks for your report! What's the accuracy level of this model's MatMulNBits?

We use fp32.

luoyu-intel · 2024-03-15T05:19:21Z

I will measure the performance with NeuralSpeed and llama.cpp. BTW, are you aware that llama.cpp uses AVX_VNNI for computation, which is equivalent to accuracy_level=COMP_INT8?

luoyu-intel · 2024-03-15T07:48:28Z

I've run some tests on a 12900K. The latency results show that NeuralSpeed(weight_dtype=int4, group_size=32, compute_dtype=int8) beats llama.cpp(phi-2.Q4_0.gguf).

> The GenAI token generation throughput was measured at 13.699070483881153 tokens per second (tps), while llama.cpp achieved a higher throughput of 22.75 tps. Both figures are end-to-end.

How did you measure the 13.699070483881153 tps? Can you provide steps to reproduce it?

yufenglee · 2024-03-27T18:30:25Z

This is the tool to get the benchmark number: https://github.com/microsoft/onnxruntime-genai/tree/main/benchmark/python

yufenglee · 2024-03-27T18:31:11Z

> I will measure the performance with NeuralSpeed and llama.cpp. BTW, are you aware that llama.cpp uses AVX_VNNI for computation, which is equivalent to accuracy_level=COMP_INT8?

The target machine doesn't have AVX_VNNI. We also tested int8+int4, and the perf is similar to fp32+int4.

luoyu-intel · 2024-04-02T09:15:34Z

We will plan it as a client target enhancement.

luoyu-intel · 2024-04-09T06:21:33Z

This issue will be fixed in this PR: #209

yufenglee · 2024-04-09T19:59:24Z

[like] Yufeng Li reacted to your message:


luoyu-intel · 2024-04-17T04:07:01Z

@yufenglee For AVX2 devices without AVX_VNNI instructions, GGML uses _mm256_maddubs_epi16 as a replacement. But this instruction carries an overflow risk: the sum int8 * int8 + int8 * int8 may exceed the maximum value of int16, in which case the result is clipped (saturated). Are you willing to accept this instruction as a replacement for AVX_VNNI, even though it could decrease accuracy?

For NBits lower than 8, it won't be a problem.
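
The saturation concern can be illustrated with a small scalar model of one 16-bit lane of _mm256_maddubs_epi16, which multiplies unsigned 8-bit values by signed 8-bit values and adds adjacent products with signed saturation. This is a Python sketch, not GGML or Neural Speed code: the helper names are hypothetical, and it assumes the n-bit quantized values occupy the unsigned operand in the range 0 to 2^n - 1.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def saturate_i16(x: int) -> int:
    """Clamp an integer to the signed 16-bit range."""
    return max(INT16_MIN, min(INT16_MAX, x))


def maddubs_lane(a0: int, b0: int, a1: int, b1: int) -> int:
    """Scalar model of one 16-bit output lane of _mm256_maddubs_epi16.

    a0, a1 are unsigned 8-bit values; b0, b1 are signed 8-bit values.
    """
    assert 0 <= a0 <= 255 and 0 <= a1 <= 255
    assert -128 <= b0 <= 127 and -128 <= b1 <= 127
    return saturate_i16(a0 * b0 + a1 * b1)


def can_saturate(nbits: int) -> bool:
    """True if n-bit unsigned values (assumed operand) can hit the int16 limits."""
    a_max = 2**nbits - 1
    most_positive = 2 * a_max * 127
    most_negative = 2 * a_max * (-128)
    return most_positive > INT16_MAX or most_negative < INT16_MIN


# Full 8-bit worst case: the exact sum is 255 * 127 * 2 = 64770, which is clipped.
exact = 255 * 127 + 255 * 127
clipped = maddubs_lane(255, 127, 255, 127)
print(exact, clipped)

# Check which bit widths can saturate under the assumption above.
for nbits in (4, 7, 8):
    print(nbits, can_saturate(nbits))
```

Under that assumption, widths up to 7 bits stay inside the int16 range (the worst-case magnitude is 2 * 127 * 128 = 32512), while full 8-bit values can be clipped.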

yufenglee · 2024-04-17T04:20:50Z

Since it won't be an issue for bit widths below 8, it should be fine. We mainly use blockwise quantization for bit widths below 8.

luoyu-intel · 2024-05-07T01:28:03Z

According to this comment, this issue should have been fixed: #209 (comment)

luoyu-intel self-assigned this Mar 14, 2024

luoyu-intel mentioned this issue Mar 15, 2024: disable MHA_AVX2 #173 (Merged)

luoyu-intel added the enhancement label Apr 2, 2024
