Porting assembly to the GPU #696

Open

Tongdongq opened this issue Jun 16, 2022 · 3 comments

@Tongdongq

We might start trying to accelerate (parts of) the assembly with a GPU at some point.

My initial plan was to:

1. Analyze the callgraph.
2. Create a big 'DataStorage' struct and pass it down the assembly (see the sketch after this list).
3. Make sure only data from 'DataStorage' is read/written. This means only the struct needs to be copied to the GPU. A big part of this is probably the intensive quantities cache.
4. Create simple CPU functions.
5. Create GPU kernels from simple CPU functions.
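
A minimal sketch of what such a 'DataStorage' struct could look like (all names here are hypothetical, CUDA assumed): a flat, trivially copyable struct of raw device pointers, so the kernels only ever touch data that was moved to the GPU in one step, and the struct itself can be passed to kernels by value.

```cpp
#include <cuda_runtime.h>

// Hypothetical sketch, not existing OPM code: a flat struct of raw device
// pointers that can be passed by value to CUDA kernels.
struct DataStorage {
    int     numCells;
    double* pressure;        // one entry per cell
    double* saturation;      // one entry per cell (per phase, in reality)
    double* intensiveCache;  // flattened intensive quantities cache
};

// Allocate the device-side arrays once; the struct itself is small and is
// passed to kernels by value.
inline DataStorage allocateOnDevice(int numCells)
{
    DataStorage d{};
    d.numCells = numCells;
    cudaMalloc(&d.pressure, numCells * sizeof(double));
    cudaMalloc(&d.saturation, numCells * sizeof(double));
    cudaMalloc(&d.intensiveCache, numCells * sizeof(double));
    return d;
}

__global__ void exampleKernel(DataStorage d)
{
    const int cell = blockIdx.x * blockDim.x + threadIdx.x;
    if (cell < d.numCells) {
        // read/write only through the struct, never through host-side objects
        d.intensiveCache[cell] = d.pressure[cell] * d.saturation[cell];
    }
}
```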

However, we should avoid having two different codebases with the same functionality, since changes to one would need to be made to the other.
Maybe code generation could help us here.
CUDA does support templates, but OpenCL does not.
The Evaluation template in particular is used a lot.
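
To make the template point concrete: with CUDA, one kernel body can be instantiated for plain scalars and for an AD type alike, roughly like this (a sketch only; MiniEval is a made-up stand-in for the real Evaluation class). With OpenCL C, each instantiation would have to be generated as separate source text, which is where code generation would come in.

```cpp
#include <cuda_runtime.h>

// Made-up, minimal forward-mode AD type standing in for the real Evaluation.
struct MiniEval {
    double value;
    double deriv;
    __host__ __device__ MiniEval operator*(const MiniEval& o) const
    {
        return { value * o.value, value * o.deriv + deriv * o.value };
    }
};

// One kernel body, usable for plain scalars and AD types alike.
template <class Scalar>
__global__ void squareKernel(const Scalar* in, Scalar* out, int n)
{
    const int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * in[i];  // identical source for double and MiniEval
}

// Host side, both instantiations come from the same source:
//   squareKernel<double><<<blocks, threads>>>(dIn, dOut, n);
//   squareKernel<MiniEval><<<blocks, threads>>>(dEvalIn, dEvalOut, n);
```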

I did some profiling on an older version (Jan 2021) and found this:

```
fvbaseadlocallinearizer.hh:linearize()   12.05
    updateStencil()                       5.02
    updateAllIntensiveQuantities()        1.79
    updatePVWeights()                     (is empty)
    updateAllExtensiveQuantities()        1.65
    localResidual_.eval()                 2.57
    updateLocalLinearization_()           0.65
```

The numbers are timings in seconds, for a small sample of NORNE.
Some of these functions are combined into updateAll().
The local linearizer is called for every element, for every assembly.
Each OpenMP thread has one element context, which 'shifts' to the next element, requiring resizing of internal variables. Maybe updating the stencil is not needed on the GPU, if a context could be kept for every element (see the sketch below).
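
To illustrate what 'a context could be kept for every element' might mean (purely hypothetical layout, not existing OPM code): the stencil data could be precomputed once into flat per-element arrays that stay resident on the GPU, so nothing is resized or shifted during assembly.

```cpp
#include <cuda_runtime.h>

// Hypothetical layout: all stencils precomputed once into flat CSR-like
// arrays that stay resident on the GPU.
struct AllStencils {
    int  numElements;
    int* offsets;        // offsets[e]..offsets[e+1] index into neighborCells
    int* neighborCells;  // concatenated neighbor lists for all elements
};

__global__ void assembleKernel(AllStencils s)
{
    const int e = blockIdx.x * blockDim.x + threadIdx.x;
    if (e >= s.numElements)
        return;
    // Each thread owns element e; its "context" is just a fixed slice of the
    // flat arrays, so nothing needs to be resized or shifted.
    for (int k = s.offsets[e]; k < s.offsets[e + 1]; ++k) {
        const int neighbor = s.neighborCells[k];
        // ... evaluate fluxes between element e and neighbor ...
        (void)neighbor;
    }
}
```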

It probably is not worthwhile to port only some of these functions to the GPU; it should be all of them.
There are also many optional code paths to consider, like the extra modules, or options like enableDissolvedGas() and enableVaporizedOil().

@atgeirr

atgeirr (Member) commented Jun 16, 2022

The most complex part to resolve will be the well assembly, which includes solving the well equations in isolation. Maybe it is a good idea to do the reservoir assembly on the GPU first, as that is simpler in structure but more time-consuming.

For doing that, I think your idea applies. It looks like a very large refactoring, with a complete change to the way assembly is done. What about starting from the other end, by evaluating cell properties (i.e. intensive quantities cache) on the GPU? That is an important subsystem, and there are no dependencies between cells, so it can parallelize perfectly.
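
To show why the intensive quantities evaluation parallelizes so well: the update is a pure map over cells, one thread per cell, with no inter-cell communication. A sketch (the property formulas and names here are invented for illustration; the real intensive quantities involve much more state):

```cpp
#include <cuda_runtime.h>

// Invented property formula, for illustration only.
__device__ double viscosityOf(double p)
{
    return 1.0e-3 * (1.0 + 1.0e-9 * p);
}

// One thread per cell; no dependencies between cells of any kind.
__global__ void updateIntensiveQuantities(const double* pressure,
                                          double* density,
                                          double* viscosity,
                                          int numCells)
{
    const int cell = blockIdx.x * blockDim.x + threadIdx.x;
    if (cell < numCells) {
        // each cell depends only on its own primary variables
        density[cell]   = 800.0 * (1.0 + 1.0e-9 * pressure[cell]);
        viscosity[cell] = viscosityOf(pressure[cell]);
    }
}
```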

@Tongdongq (Author)

Is the reservoir assembly the whole grid and resulting large matrix?
Is the well assembly done in opm/simulators/wells/MultisegmentWellEval.cpp?
I see initMatrixAndVectors() (looks like setting up sparsity patterns), initPrimaryVariablesEvaluation(), and updatePrimaryVariables() that are probably used.

Where is that cache located/computed? Also is it named 'cache' because it stores a lot of data that is used often, or because it takes CPU cache hierarchy into account?

atgeirr (Member) commented Jun 17, 2022

> Is the reservoir assembly the whole grid and resulting large matrix?

Yes.

> Is the well assembly done in opm/simulators/wells/MultisegmentWellEval.cpp? I see initMatrixAndVectors() (looks like setting up sparsity patterns), initPrimaryVariablesEvaluation(), and updatePrimaryVariables() that are probably used.

For a multisegment well, these and many other functions are used. Basically all of the WellInterface*, StandardWell*, and MultisegmentWell* classes are involved in well assembly (on an individual well basis), managed at a higher level (collecting the contributions of all wells) by the BlackoilWellModel class.

> Where is that cache located/computed?

The main function call triggering recalculation of all the values is invalidateAndUpdateIntensiveQuantities(). The function is in opm-models/opm/models/discretization/common/fvbasediscretization.hh, and the calls are in BlackoilModelEbos::updateSolution() and in EclProblem::beginTimeStep(). Their execution is somewhat involved, and we are currently looking at refactoring it for higher performance.

> Also is it named 'cache' because it stores a lot of data that is used often, or because it takes CPU cache hierarchy into account?

The first. It does not take CPU cache considerations into account.
