Overlap creation of jacobian matrix with GPU data transfers #5256

Open · wants to merge 3 commits into master

Conversation

razvnane (Contributor)

This PR reduces memory-operation time by overlapping the creation of the Jacobian matrix used in the block-Jacobi ILU (which includes a host memory copy) with the transfer of the matrix data to the GPU. For NORNE this gives a 17% reduction in total transfer/copy time, while for the bigmodel the reduction is around 50%, measured on a system with an AMD EPYC 7763 64-Core processor and an AMD MI210 GPU / NVIDIA A100 GPU.
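To illustrate the overlap described above, here is a minimal sketch with hypothetical helper names (buildBlockJacobiCopy, transferSystemToGpu) standing in for the actual OPM code:

#include <thread>
#include <vector>

// Hypothetical stand-in for the host-side matrix type.
using HostMatrix = std::vector<double>;

// Placeholder work functions: the real PR copies matrix blocks into the
// block-Jacobi matrix and pushes the system to the GPU via the bridge.
void buildBlockJacobiCopy(const HostMatrix& src, HostMatrix& blockJac) { blockJac = src; }
void transferSystemToGpu(const HostMatrix& /*m*/) { /* cudaMemcpy / hipMemcpy would go here */ }

void assembleAndTransfer(const HostMatrix& matrix, HostMatrix& blockJac)
{
    // Build the block-Jacobi matrix on a separate host thread ...
    std::thread copyThread([&] { buildBlockJacobiCopy(matrix, blockJac); });

    // ... while the main thread streams the full system matrix to the GPU.
    transferSystemToGpu(matrix);

    // Join before anything reads blockJac (e.g. the GPU ILU0 setup).
    copyThread.join();
}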

Comment on lines 115 to 119
// Overlap: build the block-Jacobi copy of the matrix on a separate host thread
// while the main thread hands the system off to the GPU bridge below.
copyThread = std::make_shared<std::thread>([&](){this->copyMatToBlockJac(matrix, *blockJacobiForGPUILU0_);});

// Const_cast needed since the CUDA stuff overwrites values for better matrix condition..
bridge_->solve_system(&matrix, blockJacobiForGPUILU0_.get(),
                      numJacobiBlocks_, rhs, *wellContribs, result);


It is slightly confusing to see blockJacobiForGPUILU0_ being used and written to by solve_system while the copyThread is writing to it. I think it is still correct, even with the replaceZeroDiagonal() in solve_system, because either the copy thread or the main thread will write to a given index in it, but never both... This required some effort to spot, so maybe add a comment explaining why this works, to make it more readable.

razvnane (Contributor, Author)


I added a comment; hopefully it is clearer now. Thanks for pointing this out.
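For context, the kind of comment being asked for would essentially restate the reviewer's reasoning above; illustrative wording only, not the exact comment added in the PR:

// Running copyMatToBlockJac() on copyThread concurrently with solve_system()
// is safe here: any given entry of blockJacobiForGPUILU0_ is written either by
// the copy thread or by the main thread (via replaceZeroDiagonal()), never by
// both, so the two writers touch disjoint indices.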

@multitalentloes

OPM-simulators mainly uses OpenMP for multithreading computations and memory transfers. For this reason it might be nice to use opm_get_max_threads() > 1 to check whether or not we want the multithreaded parallelization of memory transfers proposed in this PR. Even though we do not use OpenMP here, it would be nice to either consistently use the multithreaded option when the user requests it through OpenMP, or consistently use single-threaded code.

@@ -43,6 +43,10 @@

#include <opm/grid/polyhedralgrid.hh>

#include <thread>

std::shared_ptr<std::thread> copyThread;
Member


Isn't there a way to do this without the global variable, e.g. by passing it around to the method where it is joined?

razvnane (Contributor, Author)


Yes, that is possible, but it would mean passing it down via the solve_system method, i.e. modifying the solver interface by adding something that is not really related to the linear-system data, which I like even less than the global variable. But if this is a strong objection, I will modify the code to pass it around, unless there are other options I am not aware of.
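For reference, the alternative being discussed, i.e. threading the handle through the solver call, would look roughly like the following (a simplified, hypothetical sketch, not the real OPM interface):

#include <memory>
#include <thread>

// Hypothetical, simplified stand-ins for the real solver types.
struct LinearSystem { /* matrix, block-Jacobi copy, rhs, ... */ };
struct SolveResult { bool converged = false; };

// The solver interface would have to accept the thread handle purely so it can
// be joined before the block-Jacobi data is read, even though the handle has
// nothing to do with the linear-system data itself.
SolveResult solve_system(LinearSystem& sys, std::shared_ptr<std::thread> copyThread)
{
    if (copyThread && copyThread->joinable()) {
        copyThread->join();   // ensure the host-side copy has finished
    }
    // ... GPU solve using sys would go here ...
    (void)sys;
    return SolveResult{true};
}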

@razvnane
Contributor Author

razvnane commented Apr 3, 2024

> OPM-simulators mainly uses OpenMP for multithreading computations and memory transfers. For this reason it might be nice to use opm_get_max_threads() > 1 to check whether or not we want the multithreaded parallelization of memory transfers proposed in this PR. Even though we do not use OpenMP here, it would be nice to either consistently use the multithreaded option when the user requests it through OpenMP, or consistently use single-threaded code.

OK, I added pragmas so that the multithreaded copy is enabled only when OpenMP is found.
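A rough sketch of gating the copy thread on OpenMP availability at compile time (assuming the standard _OPENMP macro; the actual PR may use a project-specific macro or pragma instead):

#include <memory>
#include <thread>

std::shared_ptr<std::thread> copyThread;

// Spawn the copy thread only in OpenMP-enabled builds; otherwise do the copy inline.
template <class Matrix, class CopyFn>
void startBlockJacobiCopy(const Matrix& matrix, Matrix& blockJac, CopyFn copyMatToBlockJac)
{
#ifdef _OPENMP
    copyThread = std::make_shared<std::thread>(
        [&matrix, &blockJac, copyMatToBlockJac]() { copyMatToBlockJac(matrix, blockJac); });
#else
    copyMatToBlockJac(matrix, blockJac);   // serial fallback when OpenMP is not available
#endif
}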

@bska
Member

bska commented Apr 7, 2024

jenkins build this please

@multitalentloes

The change with the OpenMP preprocessor stuff is almost what I had in mind. It is still possible that the user has OpenMP available but has explicitly run flow with --threads-per-process=1; is it reasonable in that case to create this extra copy thread? I personally think it is best to keep the serial implementation in that case, and only do this optimization if the user asks for more than one thread, since the user is then intending to use multithreading. So basically, wrap the threaded code in an if statement checking that the number of available threads is > 1.

@razvnane
Contributor Author

> The change with the OpenMP preprocessor stuff is almost what I had in mind. It is still possible that the user has OpenMP available but has explicitly run flow with --threads-per-process=1; is it reasonable in that case to create this extra copy thread? I personally think it is best to keep the serial implementation in that case, and only do this optimization if the user asks for more than one thread, since the user is then intending to use multithreading. So basically, wrap the threaded code in an if statement checking that the number of available threads is > 1.

Done.
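The behaviour agreed on in this exchange amounts to a runtime check along these lines (a sketch; omp_get_max_threads() is the standard OpenMP call, and the PR may use an OPM wrapper for it):

#include <memory>
#include <thread>
#ifdef _OPENMP
#include <omp.h>
#endif

// Only overlap the host-side copy with the GPU transfer when the user actually
// requested more than one thread (i.e. not when running with --threads-per-process=1).
template <class Matrix, class CopyFn>
std::shared_ptr<std::thread> maybeStartCopyThread(const Matrix& matrix, Matrix& blockJac, CopyFn copyFn)
{
    bool useExtraThread = false;
#ifdef _OPENMP
    useExtraThread = omp_get_max_threads() > 1;
#endif
    if (useExtraThread) {
        return std::make_shared<std::thread>(
            [&matrix, &blockJac, copyFn]() { copyFn(matrix, blockJac); });
    }
    copyFn(matrix, blockJac);   // serial path: do the copy inline on the main thread
    return nullptr;
}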

@multitalentloes

This looks good to me.

@bska
Member

bska commented Apr 15, 2024

jenkins build this please
