Disclaimer: I don't know anything about gem5 internals, so this might be misdirected.
Currently the RVV permutation operations are modeled as `VectorMiscOp`, which doesn't really reflect reality.
Specifically, `vrgather.vv`, `vrgatherei16.vv`, and `vcompress.vm` are performance outliers, which isn't currently reflected in the model.
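For reference, here is a scalar sketch of what the two main offenders have to compute (semantics per the RVV 1.0 spec, shown for SEW=8; the function names and the byte-array mask layout are just for illustration):

```c
#include <stdint.h>
#include <stddef.h>

/* vrgather.vv vd, vs2, vs1: every destination element can read from any
 * position in the source register group, i.e. an all-to-all mapping. */
void vrgather_vv_e8(uint8_t *vd, const uint8_t *vs2, const uint8_t *vs1,
                    size_t vl, size_t vlmax)
{
    for (size_t i = 0; i < vl; i++)
        vd[i] = (vs1[i] < vlmax) ? vs2[vs1[i]] : 0;
}

/* vcompress.vm vd, vs2, vs1: pack the elements of vs2 whose mask bit is
 * set into the low elements of vd; each result position depends on all
 * earlier mask bits. */
void vcompress_vm_e8(uint8_t *vd, const uint8_t *vs2, const uint8_t *mask,
                     size_t vl)
{
    size_t j = 0;
    for (size_t i = 0; i < vl; i++)
        if ((mask[i / 8] >> (i % 8)) & 1)
            vd[j++] = vs2[i];
}
```

With LMUL > 1 the source of `vrgather.vv` spans a whole register group, so hardware has to be able to route any source element to any destination element across the group, which is why its cost tends to blow up with LMUL.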
Here are a few measurements; to summarize, these are some throughput numbers:
`vcompress.vm`:

|         | VLEN | e8m1 | e8m2 | e8m4 | e8m8  |
|---------|------|------|------|------|-------|
| c906    | 128  | 4    | 10   | 32   | 136   |
| c908    | 128  | 4    | 10   | 32   | 139.4 |
| c920    | 128  | 0.5  | 2.4  | 5.4  | 20.0  |
| bobcat* | 256  | 32   | 64   | 132  | 260   |
| x280*   | 512  | 65   | 129  | 257  | 513   |
`vrgather.vv`:

|         | VLEN | e8m1 | e8m2 | e8m4 | e8m8  |
|---------|------|------|------|------|-------|
| c906    | 128  | 4    | 16   | 64   | 256   |
| c908    | 128  | 4    | 16   | 64.9 | 261.1 |
| c920    | 128  | 0.5  | 2.4  | 8.0  | 32.0  |
| bobcat* | 256  | 68   | 132  | 260  | 516   |
| x280*   | 512  | 65   | 129  | 257  | 513   |
\*bobcat: note that it was explicitly stated that they didn't optimize the permutation instructions.
\*x280: the numbers are from llvm-mca, but I was told they match reality. There is also supposed to be a `vrgather` fast path for `vl<=256`. I think they didn't have much incentive to make this fast, as the x280 mostly targets AI.
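The exact methodology behind the hardware numbers above isn't shown here, but a throughput measurement of this kind can be sketched with the RVV C intrinsics roughly as follows. This is only an illustrative harness: it assumes a user-readable `rdcycle` counter, and the four independent dependency chains are there so the loop is bound by throughput rather than by the latency of a single instruction:

```c
/* compile with e.g. -O2 -march=rv64gcv */
#include <riscv_vector.h>
#include <stdint.h>
#include <stdio.h>

/* cycle CSR; user-mode access may need to be enabled by the kernel/firmware */
static inline uint64_t rdcycle(void)
{
    uint64_t c;
    __asm__ volatile("rdcycle %0" : "=r"(c));
    return c;
}

int main(void)
{
    size_t vl = __riscv_vsetvlmax_e8m1();
    vuint8m1_t idx = __riscv_vid_v_u8m1(vl);     /* identity permutation */
    vuint8m1_t d0 = idx, d1 = idx, d2 = idx, d3 = idx;
    enum { ITERS = 1 << 16 };

    uint64_t t0 = rdcycle();
    for (int i = 0; i < ITERS; i++) {
        /* four independent chains to hide the latency of each gather */
        d0 = __riscv_vrgather_vv_u8m1(d0, idx, vl);
        d1 = __riscv_vrgather_vv_u8m1(d1, idx, vl);
        d2 = __riscv_vrgather_vv_u8m1(d2, idx, vl);
        d3 = __riscv_vrgather_vv_u8m1(d3, idx, vl);
    }
    uint64_t t1 = rdcycle();

    /* fold the results so the loop can't be optimized away */
    d0 = __riscv_vxor_vv_u8m1(__riscv_vxor_vv_u8m1(d0, d1, vl),
                              __riscv_vxor_vv_u8m1(d2, d3, vl), vl);
    printf("cycles per vrgather.vv (e8m1): %.2f (checksum %u)\n",
           (double)(t1 - t0) / (4.0 * ITERS),
           (unsigned)__riscv_vmv_x_s_u8m1_u8(d0));
    return 0;
}
```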
I think the C920 results are the most representative of what to expect from future desktop CPUs.
Personally, I suspect we'll see `vrgather.vv` perform well for any SEW at LMUL=1, and then grow exponentially per element with higher LMUL even in the best case, as an all-to-all mapping is quite expensive to scale.
`vcompress.vm` should scale better than `vrgather.vv`, since the work is subdividable, and I think we might see a range of implementations, from growth similar to `vrgather.vv` to almost linear growth with LMUL.
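To make that concrete, here is a purely illustrative cost curve, not a gem5 model: the base cost and the exact blend for `vcompress.vm` are assumptions, chosen so that the curves roughly track the C920 column above (quadratic growth in LMUL for `vrgather.vv`, something between linear and quadratic for `vcompress.vm`):

```c
#include <stdio.h>

/* Hypothetical per-instruction cost in cycles as a function of LMUL. */
static double vrgather_vv_cost(int lmul)
{
    const double base = 0.5;         /* assumed cost at LMUL=1 */
    return base * lmul * lmul;       /* all-to-all mapping: ~LMUL^2 */
}

static double vcompress_vm_cost(int lmul)
{
    const double base = 0.5;
    /* subdividable work: somewhere between linear and quadratic */
    return 0.5 * (base * lmul * lmul) + 0.5 * (base * lmul);
}

int main(void)
{
    for (int lmul = 1; lmul <= 8; lmul *= 2)
        printf("LMUL=%d: vrgather.vv ~%.1f, vcompress.vm ~%.1f cycles\n",
               lmul, vrgather_vv_cost(lmul), vcompress_vm_cost(lmul));
    return 0;
}
```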