Porting assembly to the GPU #696

Open

Tongdongq opened this issue Jun 16, 2022 · 3 comments

@Tongdongq

We might start trying to accelerate (parts of) the assembly with a GPU at some point.

My initial plan was to:

1. Analyze the callgraph.
2. Create a big 'DataStorage' struct and pass it down the assembly (see the sketch after this list).
3. Make sure only data from 'DataStorage' is read/written. This means only the struct needs to be copied to the GPU. A big part of this is probably the intensive quantities cache.
4. Create simple CPU functions.
5. Create GPU kernels from simple CPU functions.
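
A minimal sketch of what such a 'DataStorage' struct could look like (all names here are hypothetical, CUDA assumed): a flat, trivially copyable struct of raw device pointers, so the kernels only ever touch data that was moved to the GPU in one step, and the struct itself can be passed to kernels by value.

```cpp
#include <cuda_runtime.h>

// Hypothetical sketch, not existing OPM code: a flat struct of raw device
// pointers that can be passed by value to CUDA kernels.
struct DataStorage {
    int     numCells;
    double* pressure;        // one entry per cell
    double* saturation;      // one entry per cell (per phase, in reality)
    double* intensiveCache;  // flattened intensive quantities cache
};

// Allocate the device-side arrays once; the struct itself is small and is
// passed to kernels by value.
inline DataStorage allocateOnDevice(int numCells)
{
    DataStorage d{};
    d.numCells = numCells;
    cudaMalloc(&d.pressure, numCells * sizeof(double));
    cudaMalloc(&d.saturation, numCells * sizeof(double));
    cudaMalloc(&d.intensiveCache, numCells * sizeof(double));
    return d;
}

__global__ void exampleKernel(DataStorage d)
{
    const int cell = blockIdx.x * blockDim.x + threadIdx.x;
    if (cell < d.numCells) {
        // read/write only through the struct, never through host-side objects
        d.intensiveCache[cell] = d.pressure[cell] * d.saturation[cell];
    }
}
```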

However, we should avoid having two different codebases with the same functionality, since changes to one would need to be made to the other.
Maybe code generation could help us here.
CUDA does support templates, but OpenCL does not.
The Evaluation template in particular is used a lot.
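
To make the template point concrete: with CUDA, one kernel body can be instantiated for plain scalars and for an AD type alike, roughly like this (a sketch only; MiniEval is a made-up stand-in for the real Evaluation class). With OpenCL C, each instantiation would have to be generated as separate source text, which is where code generation would come in.

```cpp
#include <cuda_runtime.h>

// Made-up, minimal forward-mode AD type standing in for the real Evaluation.
struct MiniEval {
    double value;
    double deriv;
    __host__ __device__ MiniEval operator*(const MiniEval& o) const
    {
        return { value * o.value, value * o.deriv + deriv * o.value };
    }
};

// One kernel body, usable for plain scalars and AD types alike.
template <class Scalar>
__global__ void squareKernel(const Scalar* in, Scalar* out, int n)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * in[i];  // identical source for double and MiniEval
}

// Host side, both instantiations come from the same source:
//   squareKernel<double><<<blocks, threads>>>(dIn, dOut, n);
//   squareKernel<MiniEval><<<blocks, threads>>>(dEvalIn, dEvalOut, n);
```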

I did some profiling on an older version (Jan 2021) and found this:

```
fvbaseadlocallinearizer.hh:linearize()   12.05
    updateStencil()                       5.02
    updateAllIntensiveQuantities()        1.79
    updatePVWeights()                     (is empty)
    updateAllExtensiveQuantities()        1.65
    localResidual_.eval()                 2.57
    updateLocalLinearization_()           0.65
```

The numbers are timings in seconds, for a small sample of NORNE.
Some of these functions are combined into updateAll().
The local linearizer is called for every element, for every assembly.
Each OpenMP thread has one element context, which 'shifts' to the next element, requiring resizing of internal variables. Maybe updating the stencil is not needed on the GPU, if a context could be kept for every element (see the sketch below).
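
To illustrate what 'a context could be kept for every element' might mean (purely hypothetical layout, not existing OPM code): the stencil data could be precomputed once into flat per-element arrays that stay resident on the GPU, so nothing is resized or shifted during assembly.

```cpp
#include <cuda_runtime.h>

// Hypothetical layout: all stencils precomputed once into flat CSR-like
// arrays that stay resident on the GPU.
struct AllStencils {
    int  numElements;
    int* offsets;        // offsets[e]..offsets[e+1] index into neighborCells
    int* neighborCells;  // concatenated neighbor lists for all elements
};

__global__ void assembleKernel(AllStencils s)
{
    const int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= s.numElements)
        return;
    // Each thread owns element e; its "context" is just a fixed slice of the
    // flat arrays, so nothing needs to be resized or shifted.
    for (int k = s.offsets[e]; k < s.offsets[e + 1]; ++k) {
        const int neighbor = s.neighborCells[k];
        // ... evaluate fluxes between element e and neighbor ...
        (void)neighbor;
    }
}
```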

It probably is not worthwhile to port only some of these functions to the GPU; it should be all of them.
There are also many optional code paths to consider, like the extra modules, or options like enableDissolvedGas() and enableVaporizedOil().

@atgeirr

atgeirr (Member) commented Jun 16, 2022

The most complex part to resolve will be the well assembly, which includes solving the well equations in isolation. Maybe it is a good idea to do the reservoir assembly on the GPU first, as that is simpler in structure but more time-consuming.

For doing that, I think your idea applies. It looks like a very large refactoring, with a complete change to the way assembly is done. What about starting from the other end, by evaluating cell properties (i.e. intensive quantities cache) on the GPU? That is an important subsystem, and there are no dependencies between cells, so it can parallelize perfectly.
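
To show why the intensive quantities evaluation parallelizes so well: the update is a pure map over cells, one thread per cell, with no inter-cell communication. A sketch (the property formulas and names here are invented for illustration; the real intensive quantities involve much more state):

```cpp
#include <cuda_runtime.h>

// Invented property formula, for illustration only.
__device__ double viscosityOf(double p)
{
    return 1.0e-3 * (1.0 + 1.0e-9 * p);
}

// One thread per cell; no dependencies between cells of any kind.
__global__ void updateIntensiveQuantities(const double* pressure,
                                          double* density,
                                          double* viscosity,
                                          int numCells)
{
    const int cell = blockIdx.x * blockDim.x + threadIdx.x;
    if (cell < numCells) {
        // each cell depends only on its own primary variables
        density[cell]   = 800.0 * (1.0 + 1.0e-9 * pressure[cell]);
        viscosity[cell] = viscosityOf(pressure[cell]);
    }
}
```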

@Tongdongq (Author)

Is the reservoir assembly the whole grid and resulting large matrix?
Is the well assembly done in opm/simulators/wells/MultisegmentWellEval.cpp?
I see initMatrixAndVectors() (looks like setting up sparsity patterns), initPrimaryVariablesEvaluation(), and updatePrimaryVariables() that are probably used.

Where is that cache located/computed? Also is it named 'cache' because it stores a lot of data that is used often, or because it takes CPU cache hierarchy into account?

atgeirr (Member) commented Jun 17, 2022

> Is the reservoir assembly the whole grid and resulting large matrix?

Yes.

> Is the well assembly done in opm/simulators/wells/MultisegmentWellEval.cpp? I see initMatrixAndVectors() (looks like setting up sparsity patterns), initPrimaryVariablesEvaluation(), and updatePrimaryVariables() that are probably used.

For a multisegment well, these and many other functions are used. Basically all of the WellInterface*, StandardWell*, and MultisegmentWell* classes are involved in well assembly (on an individual well basis), managed at a higher level (collecting the contributions of all wells) by the BlackoilWellModel class.

> Where is that cache located/computed?

The main function call triggering recalculation of all the values is invalidateAndUpdateIntensiveQuantities(). The function is in opm-models/opm/models/discretization/common/fvbasediscretization.hh, and the calls are in BlackoilModelEbos::updateSolution() and in EclProblem::beginTimeStep(). Their execution is somewhat involved, and we are currently looking at refactoring it for higher performance.

> Also is it named 'cache' because it stores a lot of data that is used often, or because it takes CPU cache hierarchy into account?

The first. It does not take CPU cache considerations into account.
