
rocblas thread safety and level3 initialization overhead #1405

Closed
jakub-homola opened this issue Mar 13, 2024 · 5 comments

@jakub-homola

jakub-homola commented Mar 13, 2024

Is rocblas thread safe? (or maybe rather "thread-aware"?)

Assume I have a different rocblas_handle and a different stream for each thread, different data, but the same GPU. I did not find anything about this in the documentation.

I am having a strange issue with initialization: when I call a level-3 rocblas function (syrk in my use case) for the first time from an OpenMP parallel region, the initialization seems to occur in each thread and the rocblas calls appear to be serialized, significantly slowing down the program. I use rocm-5.4.3 and don't have the ability to try anything newer.

I could elaborate more on the issue, but first I wanted to ask whether it is even valid to call rocblas functions from different host threads concurrently.
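For reference, the usage pattern being asked about can be sketched like this (a hypothetical minimal example, not taken from the attached reproducer; error checking is omitted and sizes are illustrative):

```cpp
#include <hip/hip_runtime.h>
#include <rocblas/rocblas.h>
#include <omp.h>

// Sketch: each host thread owns its own rocblas_handle bound to its own
// hipStream_t, with its own data, all on the same GPU.
int main() {
    const int n = 1024, k = 512;
    #pragma omp parallel
    {
        hipStream_t stream;
        hipStreamCreate(&stream);
        rocblas_handle handle;
        rocblas_create_handle(&handle);
        rocblas_set_stream(handle, stream);

        double *A, *C;
        hipMalloc(&A, sizeof(double) * n * k);
        hipMalloc(&C, sizeof(double) * n * n);
        const double alpha = 1.0, beta = 0.0;

        // Each thread submits its own syrk on its own stream and data.
        rocblas_dsyrk(handle, rocblas_fill_lower, rocblas_operation_none,
                      n, k, &alpha, A, n, &beta, C, n);

        hipStreamSynchronize(stream);
        hipFree(A); hipFree(C);
        rocblas_destroy_handle(handle);
        hipStreamDestroy(stream);
    }
    return 0;
}
```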

@NaveenElumalaiAMD NaveenElumalaiAMD self-assigned this Mar 14, 2024
@NaveenElumalaiAMD
Contributor

Hello @jakub-homola,
Thank you for reporting this issue. Regarding:

> the initialization seems to occur in each thread

I suppose you are talking about the call to rocblas_initialize(). rocblas_initialize() does not need to be called in each thread; it should be called once, outside the OpenMP parallel region.
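A minimal sketch of that pattern (hedged: error checking and the per-thread rocBLAS work are omitted):

```cpp
#include <rocblas/rocblas.h>
#include <omp.h>

int main() {
    // Pay the full initialization cost once, up front, on a single thread,
    // before any worker thread touches rocBLAS.
    rocblas_initialize();

    #pragma omp parallel
    {
        // ... per-thread handle creation and rocBLAS calls go here;
        // no per-thread initialization cost should be observed.
    }
    return 0;
}
```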

Even after this change if you are still facing the performance drop, feel free to post the example code here and I will take a look at it.

Let me know if you have further questions.
Naveen

@jakub-homola
Author

I don't explicitly call rocblas_initialize() in a parallel region. Here is the issue:

Now that I have gone through it thoroughly, it might just be a problem with the installation I am using. I cannot be sure, as I don't have access to any other AMD GPU machine with the same ROCm version. But I will report it anyway; I would appreciate it if you could try to replicate it on 5.4.3 too, to see whether it is just a problem with my installation.

Describe the bug

If the first rocblas_syrk calls occur from several threads concurrently, the time taken by those calls rises unexpectedly.

To Reproduce

Precise version of rocBLAS installed: the rocBLAS that comes with rocm-5.4.3; in rocblas-version.h I see version 2.46.

Steps to reproduce the behavior:

Use the source.hip.cpp.txt file (remove the .txt extension). There, in the loop on lines 61-82, I create several streams and rocblas handles, and allocate and initialize matrices. I also query the workspace buffer size needed by the syrk function, then allocate the memory and assign it to rocBLAS, to ensure asynchronous submission of the compute stage. In the next loop, starting on line 87, I submit all the rocblas_dsyrk kernels and measure the time the submission took. Finally, I free the memory and destroy the handles and streams. All of this is repeated 3 times.
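The workspace query and assignment step described above might be done with rocBLAS's device-memory size-query API, roughly like this (a fragment, assuming `handle`, `dA`, `dC`, `n`, `k`, `alpha`, and `beta` are set up as in the loop; error checking omitted):

```cpp
// Query how much workspace rocblas_dsyrk needs, allocate it ourselves,
// and hand it to the handle so later calls never allocate synchronously.
size_t ws_size = 0;
rocblas_start_device_memory_size_query(handle);
rocblas_dsyrk(handle, rocblas_fill_lower, rocblas_operation_none,
              n, k, &alpha, dA, n, &beta, dC, n);   // size is counted, nothing executes
rocblas_stop_device_memory_size_query(handle, &ws_size);

void* ws = nullptr;
hipMalloc(&ws, ws_size);
rocblas_set_workspace(handle, ws, ws_size);  // later dsyrk calls submit asynchronously
```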

Compile using `hipcc -g -O2 -fopenmp --offload-arch=gfx90a:sramecc+:xnack- source.hip.cpp -o program.x -lrocblas` and run.

(1) This is the sequential program, which works as expected; see output_seq.txt. The first ever call to (the compute stage of) rocblas_dsyrk() takes a longer time (around 400 ms) due to some initialization; the rest of the calls are very quick, as they just submit the work to the GPU. The first repetition takes around 800 ms due to all the initialization; the remaining ones take only around 70 ms.

(2) Now the (probably) buggy part. Uncomment the #pragma omp parallel on line 86 to make the loop parallel. The output I get is output_par_08.txt. There you can see that, in the first repetition, the rocblas_dsyrk calls take a significantly longer time, which increases the total time of that repetition to over 2.2 s. From the timings, it looks as if the 8 threads that ran the first syrk concurrently each performed some initialization that got serialized, since the syrk calls finish with approximately uniform gaps. The remaining two repetitions run in ~70 ms as expected, since everything is already initialized. Running the loop with 16 threads instead of 8 makes things even worse: output_par_16.txt

(3) I can work around this problem by calling the syrk from only one thread at a time, by introducing the #pragma omp critical on line 93. The output then looks like output_par_08_critical.txt. The first thread that encounters the syrk does the initialization; the other threads must wait for that first syrk submission to finish, so the first batch of syrks takes ~400 ms, but I am okay with that. The repetition times are approximately equal to those of the original sequential version.
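The workaround can be sketched as follows (hypothetical variable names, not the attached source; the critical section serializes the submissions so rocBLAS's internal initialization is never entered from two threads at once):

```cpp
// Workaround (3): wrap each syrk submission in a critical section so only
// one thread at a time can trigger rocBLAS's internal initialization.
#pragma omp parallel for
for (int i = 0; i < num_handles; i++) {
    #pragma omp critical
    {
        rocblas_dsyrk(handles[i], rocblas_fill_lower, rocblas_operation_none,
                      n, k, &alpha, dA[i], n, &beta, dC[i], n);
    }
}
```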

(4) If I remove the critical region and keep the loop parallel (as in (2)), but call rocblas_initialize() at the start of the program, the issue also goes away and everything works as expected: output_par_init.txt. This again suggests there is some issue with initialization being performed in parallel. My problem with this workaround is that rocblas_initialize() takes far too long to initialize everything in rocBLAS, so it is not worth it for me.

This is just a demonstration; my real program is much more complicated, and being able to submit the syrk kernels in parallel allows me to do more CPU-GPU overlap.

Expected behavior

Similar to (3) with the critical region, but without me having to use the critical region myself, as the locking should happen somewhere inside rocBLAS.

Log-files

Running (2) with `ROCBLAS_LAYER=7`: log.txt

Environment

LUMI supercomputer, accelerated compute node. AMD EPYC 7A53 64-Core Processor, AMD MI250X.

I am using rocm-5.4.3, manually installed by a colleague. With the officially supported rocm-5.2.3 ... I just found out that it works just fine, and the initialization times are tiny. Okay, so this might just be a problem with the installation, but I have no idea how to find out. Maybe the update introduced the issue. Could you please try to reproduce this?

With rocm/5.4.3 I see rocblas 2.46
With rocm/5.2.3 I see rocblas 2.44

environment.txt
lspci: command not found

Additional context

As I am now experimenting with other, still not officially supported ROCm versions, the core of the issue seems to persist even with rocm/5.7.1, although the timings are different now.

@rkamd
Contributor

rkamd commented May 15, 2024

@jakub-homola ,
Thread safety is implemented using locks, hence the high initialization time and the sequential behavior you are noticing on the first run. Initialization times have decreased in the latest versions of ROCm, but you would still see the same behavior, as the other threads must wait for the first thread to complete the initialization.

Summary of initialization times:

| Threads | ROCm version | Initialization time |
|---------|--------------|---------------------|
| 8       | 5.4.3        | 2220 ms             |
| 8       | 6.1.1        | 838 ms              |
| 16      | 5.4.3        | 3725 ms             |
| 16      | 6.1.1        | 1142 ms             |

output_par_8.txt
output_par_16.txt

If the above initialization time seems reasonable for your use case, you could upgrade the ROCm version to 6.1.1.

@jakub-homola
Author

@rkamd
Thanks for the reply.

I understand that the thread safety is implemented using locks, that seems reasonable.

The issue was that if I used locks myself instead of relying on the locks inside rocBLAS (that is the difference between (2) and (3)), the runtimes were much more reasonable. Relying on rocBLAS's internal locks showed weird behavior: it seemed as if the initialization happened again in each thread, sequentially, instead of only once.

Looking at the outputs you gathered, they look much more reasonable now with the new ROCm versions.

I would like to update to 6.1.1; unfortunately, that is not up to me but up to the LUMI admins, who are apparently extremely conservative with updates.

I will test it when the new version becomes available to me.

Anyway, according to your outputs, my main issue seems to be solved, so I will close this. Will reopen if necessary.

Thanks.

@cgmb
Contributor

cgmb commented May 16, 2024

I'm not a rocBLAS/Tensile developer, however, I believe it is possible to make dramatic improvements to rocBLAS/Tensile initialization times, especially for cases like the one you've described. The scope of work required, however, is quite significant. If the initialization performance in ROCm 6.1 is still a pain point, please don't be afraid to say so. User feedback is important for prioritizing optimizations.
