Replies: 2 comments 2 replies
-
I don't know how HIP defines their
|
Beta Was this translation helpful? Give feedback.
-
The compiler seems to determine that the loads have no effect and it's free to reuse the same registers for all the assembly instructions, which makes sense. If I do some arithmetic operations with the variables, the registers allocated will be different. Thus, my next attempt is loading multiple values into multiple registers using multiple instructions a single statement of inline assembly. The compiler is now aware that the result should go to different registers. Unfortunately, the generated assembly is still incorrect.
The generated assembly is:
The first load instruction clobbers registers |
Beta Was this translation helpful? Give feedback.
-
Motivation
I'm doing some micro-benchmarks on AMD GPUs to understand its performance characteristics in order to improve kernel performance. I suspect different register allocation and instruction scheduling outcomes by the compiler may change the kernel's performance characteristics - the compiler attempts to interleave loads and computes, and it also attempts to conserve registers by loading new values as soon as a arithmetic instruction finishes. In some cases, I found there can be a notable performance difference, I suspect the reason is that it changes the number of simultaneous memory requests issued, causing a reduction of memory bandwidth - even the
load()
member function ofsycl::float4
is not immune from this kind of compiler optimization.Thus, I decided to use inline assembly when targeting AMD HIP to have better control of the micro-benchmarks.
Problem
The following kernel attempts to use the AMD GCN instruction
global_load_dwordx4
to load 4 floats from memory to AMD HIP'sfloat4
variable.Unfortunately the generated code is incorrect:
The 4 generated instructions load the data into the same VGPRs. It seems that the compiler has no idea about the nature of
tmp11, tmp12, tmp13, tmp14
.What is the correct inline assembly syntax for loading float4 from memory to local variables?
Beta Was this translation helpful? Give feedback.
All reactions