Overlap creation of jacobian matrix with GPU data transfers #5256

Open · wants to merge 3 commits into master

Conversation

razvnane (Contributor)

This PR reduces memory-operation time by overlapping the creation of the Jacobian matrix used in the block-Jacobi ILU (which includes a host memory copy) with the transfer of the matrix data to the GPU. For NORNE this gives a 17% reduction in total transfer/copy time, while for the bigmodel the reduction is around 50%, measured on a system with an AMD EPYC 7763 64-Core processor and an AMD MI210 GPU / NVIDIA A100 GPU.
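To illustrate the overlap described above, here is a minimal sketch with hypothetical helper names (buildBlockJacobiCopy, transferSystemToGpu) standing in for the actual OPM code:

#include <thread>
#include <vector>

// Hypothetical stand-in for the host-side matrix type.
using HostMatrix = std::vector<double>;

// Placeholder work functions: the real PR copies matrix blocks into the
// block-Jacobi matrix and pushes the system to the GPU via the bridge.
void buildBlockJacobiCopy(const HostMatrix& src, HostMatrix& blockJac) { blockJac = src; }
void transferSystemToGpu(const HostMatrix& /*m*/) { /* cudaMemcpy / hipMemcpy would go here */ }

void assembleAndTransfer(const HostMatrix& matrix, HostMatrix& blockJac)
{
    // Build the block-Jacobi matrix on a separate host thread ...
    std::thread copyThread([&] { buildBlockJacobiCopy(matrix, blockJac); });

    // ... while the main thread streams the full system matrix to the GPU.
    transferSystemToGpu(matrix);

    // Join before anything reads blockJac (e.g. the GPU ILU0 setup).
    copyThread.join();
}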

Comment on lines 115 to 119
// Overlap: build the block-Jacobi copy of the matrix on a separate host thread
// while the main thread hands the system off to the GPU bridge below.
copyThread = std::make_shared<std::thread>([&](){this->copyMatToBlockJac(matrix, *blockJacobiForGPUILU0_);});

// Const_cast needed since the CUDA stuff overwrites values for better matrix condition..
bridge_->solve_system(&matrix, blockJacobiForGPUILU0_.get(),
                      numJacobiBlocks_, rhs, *wellContribs, result);


It is slightly confusing to see blockJacobiForGPUILU0_ being used and written to by solve_system while the copyThread is writing to it. I think it is still correct, even with the replaceZeroDiagonal() in solve_system, because either the copy thread or the main thread will write to a given index in it, but never both... This required some effort to spot, so maybe add a comment explaining why this works, to make it more readable.

razvnane (Contributor, Author)


I added a comment; hopefully it is clearer now. Thanks for pointing this out.
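For context, the kind of comment being asked for would essentially restate the reviewer's reasoning above; illustrative wording only, not the exact comment added in the PR:

// Running copyMatToBlockJac() on copyThread concurrently with solve_system()
// is safe here: any given entry of blockJacobiForGPUILU0_ is written either by
// the copy thread or by the main thread (via replaceZeroDiagonal()), never by
// both, so the two writers touch disjoint indices.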

@multitalentloes

OPM-simulators mainly uses OpenMP for multithreading computations and memory transfers. For this reason it might be nice to use opm_get_max_threads() > 1 to check whether or not we want the multithreaded parallelization of memory transfers proposed in this PR. Even though we do not use OpenMP here, it would be nice to either consistently use the multithreaded option when the user requests it through OpenMP, or consistently use single-threaded code.

@@ -43,6 +43,10 @@

#include <opm/grid/polyhedralgrid.hh>

#include <thread>

std::shared_ptr<std::thread> copyThread;
Member


Isn't there a way to do this without the global variable, e.g. by passing it around to the method where it is joined?

razvnane (Contributor, Author)


Yes, that is possible, but it would mean passing it down via the solve_system method, i.e. modifying the solver interface by adding something that is not really related to the linear-system data, which I like even less than the global variable. But if this is a strong objection, I will modify the code to pass it around, unless there are other options I am not aware of.
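For reference, the alternative being discussed, i.e. threading the handle through the solver call, would look roughly like the following (a simplified, hypothetical sketch, not the real OPM interface):

#include <memory>
#include <thread>

// Hypothetical, simplified stand-ins for the real solver types.
struct LinearSystem { /* matrix, block-Jacobi copy, rhs, ... */ };
struct SolveResult { bool converged = false; };

// The solver interface would have to accept the thread handle purely so it can
// be joined before the block-Jacobi data is read, even though the handle has
// nothing to do with the linear-system data itself.
SolveResult solve_system(LinearSystem& sys, std::shared_ptr<std::thread> copyThread)
{
    if (copyThread && copyThread->joinable()) {
        copyThread->join();   // ensure the host-side copy has finished
    }
    // ... GPU solve using sys would go here ...
    (void)sys;
    return SolveResult{true};
}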

@razvnane
Contributor Author

razvnane commented Apr 3, 2024

> OPM-simulators mainly uses OpenMP for multithreading computations and memory transfers. For this reason it might be nice to use opm_get_max_threads() > 1 to check whether or not we want the multithreaded parallelization of memory transfers proposed in this PR. Even though we do not use OpenMP here, it would be nice to either consistently use the multithreaded option when the user requests it through OpenMP, or consistently use single-threaded code.

OK, I added pragmas so that the multithreaded copy is enabled only when OpenMP is found.
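A rough sketch of gating the copy thread on OpenMP availability at compile time (assuming the standard _OPENMP macro; the actual PR may use a project-specific macro or pragma instead):

#include <memory>
#include <thread>

std::shared_ptr<std::thread> copyThread;

// Spawn the copy thread only in OpenMP-enabled builds; otherwise do the copy inline.
template <class Matrix, class CopyFn>
void startBlockJacobiCopy(const Matrix& matrix, Matrix& blockJac, CopyFn copyMatToBlockJac)
{
#ifdef _OPENMP
    copyThread = std::make_shared<std::thread>(
        [&matrix, &blockJac, copyMatToBlockJac]() { copyMatToBlockJac(matrix, blockJac); });
#else
    copyMatToBlockJac(matrix, blockJac);   // serial fallback when OpenMP is not available
#endif
}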

@bska
Member

bska commented Apr 7, 2024

jenkins build this please

@multitalentloes

The change with the OpenMP preprocessor stuff is almost what I had in mind. It is still possible that the user has OpenMP available but has explicitly run flow with --threads-per-process=1; is it reasonable in that case to create this extra copy thread? I personally think it is best to keep the serial implementation in that case, and only do this optimization if the user asks for more than one thread, since the user is then intending to use multithreading. So basically, wrap the threaded code in an if statement checking that the number of available threads is > 1.

@razvnane
Contributor Author

> The change with the OpenMP preprocessor stuff is almost what I had in mind. It is still possible that the user has OpenMP available but has explicitly run flow with --threads-per-process=1; is it reasonable in that case to create this extra copy thread? I personally think it is best to keep the serial implementation in that case, and only do this optimization if the user asks for more than one thread, since the user is then intending to use multithreading. So basically, wrap the threaded code in an if statement checking that the number of available threads is > 1.

Done.
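The behaviour agreed on in this exchange amounts to a runtime check along these lines (a sketch; omp_get_max_threads() is the standard OpenMP call, and the PR may use an OPM wrapper for it):

#include <memory>
#include <thread>
#ifdef _OPENMP
#include <omp.h>
#endif

// Only overlap the host-side copy with the GPU transfer when the user actually
// requested more than one thread (i.e. not when running with --threads-per-process=1).
template <class Matrix, class CopyFn>
std::shared_ptr<std::thread> maybeStartCopyThread(const Matrix& matrix, Matrix& blockJac, CopyFn copyFn)
{
    bool useExtraThread = false;
#ifdef _OPENMP
    useExtraThread = omp_get_max_threads() > 1;
#endif
    if (useExtraThread) {
        return std::make_shared<std::thread>(
            [&matrix, &blockJac, copyFn]() { copyFn(matrix, blockJac); });
    }
    copyFn(matrix, blockJac);   // serial path: do the copy inline on the main thread
    return nullptr;
}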

@multitalentloes

This looks good to me.

@bska
Member

bska commented Apr 15, 2024

jenkins build this please
