Question about measuring peak floating-point performance #15

Open
jeezrick opened this issue Sep 2, 2022 · 2 comments

Comments

@jeezrick

jeezrick commented Sep 2, 2022

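[image]
The chip I am running on reports itself as NVIDIA, ARMv8 Processor rev 0 (v8l).
The Zhihu article says that, when measuring peak floating-point performance, the number of FMA instructions to lay out = FMA issue width * FMA instruction latency. I could not find a manual for the chip above, but I did look at the A57 manual, which lists the following:
[image]
The FMA latency is 10 and the throughput is 2. I am not sure whether this throughput means the chip can issue two FMA instructions at the same time (is it the chip that does the issuing?). I tested with both 10 FMA instructions (OP_FLOATS = 80) and 20 FMA instructions (OP_FLOATS = 160): 10 instructions gave 16.095492 GFLOPS and 20 gave 18.759214 GFLOPS. What is the reason for this?
I have two guesses:
1. 10 FMA instructions really is not the number needed to measure this chip's peak floating-point performance.
2. Maybe the compiler automatically enabled multithreading? This seems more likely, because going from 4 to 10 instructions nearly doubled the performance, while going from 10 to 20 only increased it a little.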
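
The sizing rule quoted from the Zhihu article, and the two OP_FLOATS values, can be checked with a little arithmetic. The sketch below is only illustrative. It assumes each FMA is a 128-bit NEON instruction working on 4 float32 lanes and that one fused multiply-add counts as 2 floating-point operations, which is consistent with OP_FLOATS = 80 for 10 instructions. The function names `fma_needed` and `op_floats` are hypothetical and are not taken from the project.

```python
# Assumption: 4 float32 lanes per FMA instruction, 2 FLOPs (mul + add) per lane.
LANES = 4
FLOPS_PER_LANE = 2


def fma_needed(issue_width: int, latency: int) -> int:
    """Independent FMAs needed to cover the pipeline: issue width * latency."""
    return issue_width * latency


def op_floats(n_fma: int, lanes: int = LANES, flops_per_lane: int = FLOPS_PER_LANE) -> int:
    """Floating-point operations performed by one pass over n_fma instructions."""
    return n_fma * lanes * flops_per_lane


if __name__ == "__main__":
    # A57 figures quoted in this issue: latency 10, throughput 2.
    print("FMAs needed:", fma_needed(issue_width=2, latency=10))
    print("OP_FLOATS for 10 and 20 FMAs:", op_floats(10), op_floats(20))
```

Under these assumptions the rule asks for 20 independent FMA instructions, so the 10-instruction kernel would be too short to hide the full latency.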

@tpoisonooo
Owner

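  1. A throughput of 2 means dual issue.
  2. With a latency of 10 and dual issue, shouldn't you place at least 10*2 + 2 instructions? With only 10, they are all issued within 5 cycles, and the first one only finishes 5 cycles after that, so there is a bubble in between.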
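
The bubble described above can be reproduced with a toy pipeline model. This is a minimal sketch, not a description of the real core. It assumes in-order issue, one independent accumulator register per FMA in the loop body, and a fixed latency before an accumulator can be reused. The function `simulate_ipc` is a hypothetical helper written for this illustration.

```python
def simulate_ipc(n_chains: int, latency: int, issue_width: int, cycles: int = 10_000) -> float:
    """Average FMAs issued per cycle for a loop of n_chains independent FMAs.

    Each FMA accumulates into its own register, so the FMA at position i of
    one loop iteration depends on the FMA at position i of the previous one.
    """
    ready_at = [0] * n_chains  # cycle at which each accumulator is usable again
    issued = 0
    nxt = 0  # next FMA in program order
    for cycle in range(cycles):
        for _ in range(issue_width):
            if ready_at[nxt] > cycle:
                break  # in-order stall: the accumulator is still in flight
            ready_at[nxt] = cycle + latency
            issued += 1
            nxt = (nxt + 1) % n_chains
    return issued / cycles


if __name__ == "__main__":
    # Latency 10 and dual issue are the A57 figures quoted in this issue.
    for n in (4, 10, 20, 22):
        ipc = simulate_ipc(n_chains=n, latency=10, issue_width=2)
        print(f"{n:2d} FMAs: {ipc:.2f} per cycle")
```

In this model, 10 FMAs reach only half of the dual-issue peak, while 20 or more keep both issue slots busy. The measured ratio, 16.095492 / 18.759214, is about 0.86 rather than 0.5, so the A57 latency and throughput figures may not describe this chip exactly.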

@tpoisonooo
Owner

AKA you need to keep the pipeline "full".
