How do I Load Float4 from Memory to Register using Inline AMD GCN assembly? #1138

biergaizi · 2023-09-16T13:58:46Z


Motivation

I'm doing some micro-benchmarks on AMD GPUs to understand their performance characteristics in order to improve kernel performance. I suspect that different register allocation and instruction scheduling outcomes from the compiler may change a kernel's performance characteristics: the compiler attempts to interleave loads and computation, and it also attempts to conserve registers by loading new values as soon as an arithmetic instruction finishes. In some cases I found a notable performance difference. I suspect the reason is that this changes the number of simultaneous memory requests issued, which reduces memory bandwidth. Even the load() member function of sycl::float4 is not immune to this kind of compiler optimization.

Thus, I decided to use inline assembly when targeting AMD HIP to have better control of the micro-benchmarks.

Problem

The following kernel attempts to use the AMD GCN instruction global_load_dwordx4 to load 4 floats at a time from memory into variables of AMD HIP's float4 type.

void kernel_read_strided(
        float* __restrict a,
        sycl::nd_item<1> item,
        const sycl::local_accessor<sycl::float4, 1>& lds
)
{
        uint32_t base = item.get_global_id()[0] * FLOAT_PER_WORKITEM;

        __hipsycl_if_target_hip(
                float* a_ptr = &a[base];
                float4 tmp11, tmp12, tmp13, tmp14;

                asm volatile(
                        "global_load_dwordx4 %0, %1, off\n\t"
                        : "=v" (tmp11)
                        : "v" (a_ptr)
                );
                asm volatile(
                        "global_load_dwordx4 %0, %1, off, offset:16\n\t"
                        : "=v" (tmp12)
                        : "v" (a_ptr)
                );
                asm volatile(
                        "global_load_dwordx4 %0, %1, off, offset:32\n\t"
                        : "=v" (tmp13)
                        : "v" (a_ptr)
                );
                asm volatile(
                        "global_load_dwordx4 %0, %1, off, offset:48\n\t"
                        "s_waitcnt vmcnt(0)"
                        : "=v" (tmp14)
                        : "v" (a_ptr)
                );
        );
}

Unfortunately the generated code is incorrect:

global_load_dwordx4 v[0:3], v[4:5], off
global_load_dwordx4 v[0:3], v[4:5], off, offset:16
global_load_dwordx4 v[0:3], v[4:5], off, offset:32
global_load_dwordx4 v[0:3], v[4:5], off, offset:48

The 4 generated instructions load the data into the same VGPRs. It seems that the compiler has no idea about the nature of tmp11, tmp12, tmp13, tmp14.

What is the correct inline assembly syntax for loading float4 from memory to local variables?

illuhad (Maintainer) · 2023-09-16T20:04:56Z

I don't know how HIP defines their float4, but sycl::float4 is well-known to be an overcomplicated mess as defined by the specification. I wouldn't be surprised if the compiler does not understand the semantics here. Maybe try directly with the following:

typedef __attribute__((__ext_vector_type__(4))) float my_float4;

2 replies

biergaizi Sep 17, 2023
Author

The original problem remains: with this typedef, the generated code is unchanged.

biergaizi Sep 17, 2023
Author

Upon further inspection, it seems to be a register allocation and optimization problem. The compiler determines that the loads have no effect, so it's free to reuse the same registers for all the instructions, which makes sense. If I load data into multiple registers using multiple instructions in a single inline assembly statement, the compiler is aware that the results should go to different registers. Unfortunately, the generated assembly is still incorrect: the source address register is overwritten in the middle of loading...
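The liveness point above can be sketched as follows: if every loaded value feeds a result that is actually stored, all four outputs are live at the same time, so the compiler has a reason to keep them in different registers. This is only an illustration, not code from the thread. `vec4f`, `consume_4x4`, and `sink` are hypothetical names, and `vector_size(16)` is used instead of `ext_vector_type(4)` so that the snippet also builds with GCC. Whether the register allocation really changes still has to be checked in the generated assembly.

```cpp
#include <cassert>

// Four packed floats via the GCC/Clang vector extension (assumed stand-in
// for the float4 types discussed in this thread).
typedef float vec4f __attribute__((vector_size(16)));

// Hypothetical helper: folds four loaded vectors into one scalar and stores
// it, so every lane of every vector contributes to an observable result.
inline void consume_4x4(const vec4f& a, const vec4f& b,
                        const vec4f& c, const vec4f& d, float* sink)
{
    assert(sink != nullptr);
    vec4f s = (a + b) + (c + d);        // element-wise vector adds
    *sink = s[0] + s[1] + s[2] + s[3];  // horizontal sum, then a real store
}
```

In a micro-benchmark the extra adds and the single store are a small, fixed overhead, but they do become part of what is measured.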

biergaizi (Author) · 2023-09-17T05:51:02Z

The compiler seems to determine that the loads have no effect and it's free to reuse the same registers for all the assembly instructions, which makes sense. If I do some arithmetic operations with the variables, the registers allocated will be different.

Thus, my next attempt is to load multiple values into multiple registers using multiple instructions in a single inline assembly statement. The compiler is now aware that the results should go to different registers. Unfortunately, the generated assembly is still incorrect.

typedef __attribute__((__ext_vector_type__(4))) float clang_float4;

void kernel_read_strided(
        float* __restrict a,
        sycl::nd_item<1> item,
        const sycl::local_accessor<sycl::float4, 1>& lds
)
{
        uint32_t base = item.get_global_id()[0] * FLOAT_PER_WORKITEM;

        __hipsycl_if_target_hip(
                float* a_ptr = &a[base];
                clang_float4 tmp11, tmp12, tmp13, tmp14;

                asm volatile(
                        "global_load_dwordx4 %0,  %4, off\n\t"
                        "global_load_dwordx4 %1,  %4, off offset:16\n\t"
                        "global_load_dwordx4 %2,  %4, off offset:32\n\t"
                        "global_load_dwordx4 %3,  %4, off offset:48\n\t"
                        "s_waitcnt vmcnt(0)"
                        : "=v" (tmp11), "=v" (tmp12), "=v" (tmp13), "=v" (tmp14)
                        : "v" (a_ptr)
                );
        );
}

The generated assembly is:

        global_load_dwordx4 v[0:3],  v[0:1], off
        global_load_dwordx4 v[4:7],  v[0:1], off offset:16
        global_load_dwordx4 v[8:11],  v[0:1], off offset:32
        global_load_dwordx4 v[12:15],  v[0:1], off offset:48

The first load instruction writes to v[0:3], which overlaps the address registers v[0:1], so the subsequent loads cannot be relied on to read from the intended address.
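One remedy that may apply to this kind of overlap is the early-clobber modifier `&` of GCC-style extended asm. Writing the outputs as `"=&v"` tells the compiler that an output may be written before all inputs have been read, so it must not share a register with any input. The sketch below was not run on GCN hardware for this write-up, so the device branch is an assumption to be checked against the generated assembly. `vec4f` and `load_4x4` are hypothetical names, `vector_size(16)` stands in for `ext_vector_type(4)` so that GCC accepts it too, and the host fallback exists only so the snippet compiles off-device.

```cpp
#include <cassert>
#include <cstring>

// Four packed floats via the GCC/Clang vector extension.
typedef float vec4f __attribute__((vector_size(16)));

// Hypothetical helper: loads 16 consecutive floats from src into out[0..3].
inline void load_4x4(const float* src, vec4f out[4])
{
    assert(src != nullptr);
#if defined(__AMDGCN__)
    // Device path (assumption: compiled by clang for an AMD GCN target).
    vec4f t0, t1, t2, t3;
    asm volatile(
        "global_load_dwordx4 %0, %4, off\n\t"
        "global_load_dwordx4 %1, %4, off offset:16\n\t"
        "global_load_dwordx4 %2, %4, off offset:32\n\t"
        "global_load_dwordx4 %3, %4, off offset:48\n\t"
        "s_waitcnt vmcnt(0)"
        // "&" marks each output as early-clobber: it may not overlap %4.
        : "=&v"(t0), "=&v"(t1), "=&v"(t2), "=&v"(t3)
        : "v"(src)
        // The asm reads memory that the compiler cannot see.
        : "memory");
    out[0] = t0; out[1] = t1; out[2] = t2; out[3] = t3;
#else
    // Host fallback: same result, no inline assembly.
    std::memcpy(out, src, 4 * sizeof(vec4f));
#endif
}
```

Keeping all four loads and the `s_waitcnt` in one asm statement also means the compiler never sees a point where an output looks ready while its load is still in flight.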
