Skip to content

Commit

Permalink
Disable Managed Memory for The_Arena by default. (#3438)
Browse files Browse the repository at this point in the history
It used to be that The_Arena was managed for CUDA and SYCL, but not for
HIP. The users can still turn it on with
`amrex.the_arena_is_managed=1`. They can also use The_Managed_Arena
explicitly.
  • Loading branch information
WeiqunZhang committed Jul 25, 2023
1 parent ed1b3a1 commit 5dfb040
Show file tree
Hide file tree
Showing 2 changed files with 36 additions and 40 deletions.
68 changes: 34 additions & 34 deletions Docs/sphinx_documentation/source/GPU.rst
Expand Up @@ -74,13 +74,14 @@ detailed throughout the rest of this chapter:
portability while making the code as understandable as possible to
science-focused code teams.

- AMReX utilizes GPU managed memory to automatically handle memory
- AMReX can utilize GPU managed memory to automatically handle memory
movement for mesh and particle data. Simple data structures, such
as :cpp:`IntVect`\s can be passed by value and complex data structures, such as
:cpp:`FArrayBox`\es, have specialized AMReX classes to handle the
data movement for the user. Tests have shown CUDA managed memory
to be efficient and reliable, especially when applications remove
any unnecessary data accesses.
any unnecessary data accesses. However, managed memory is not used by
:cpp:`FArrayBox` and :cpp:`MultiFab` by default.

- Application teams should strive to keep mesh and particle data structures
on the GPU for as long as possible, minimizing movement back to the CPU.
Expand Down Expand Up @@ -505,10 +506,10 @@ to two functions:
void free (void* p);

:cpp:`The_Arena()` is used for memory allocation of data in
:cpp:`BaseFab`. By default, it allocates managed memory. This can be changed with
a boolean runtime parameter ``amrex.the_arena_is_managed``.
Therefore the data in a :cpp:`MultiFab` is placed in
managed memory by default and is accessible from both CPU host and GPU device.
:cpp:`BaseFab`. By default, it allocates device memory. This can be changed with
a boolean runtime parameter ``amrex.the_arena_is_managed=1``.
When managed memory is enabled, the data in a :cpp:`MultiFab` is placed in
device memory by default and is accessible from both CPU host and GPU device.
This allows application codes to develop their GPU capability
gradually. The behavior of :cpp:`The_Managed_Arena()` likewise depends on the
``amrex.the_arena_is_managed`` parameter. If ``amrex.the_arena_is_managed=0``,
Expand Down Expand Up @@ -693,13 +694,13 @@ allocations and deallocations when (for example) resizing vectors.

\end{center}

These classes behave identically to an
These classes behave almost identically to an
:cpp:`amrex::Vector`, (see :ref:`sec:basics:vecandarr`), except that they
can only hold "plain-old-data" objects (e.g. Reals, integers, amrex Particles,
etc... ). If you want a resizable vector that doesn't use a memory Arena,
simply use :cpp:`amrex::Vector`.

Note that, even if the data in the vector is managed and available on GPUs,
Note that, even if the data in the vector is managed and available on GPUs,
the member functions of e.g. :cpp:`Gpu::ManagedVector` are not.
To use the data on the GPU, it is necessary to pass the underlying data pointer
in to the GPU kernels. The managed data pointer can be accessed using the :cpp:`data()`
Expand Down Expand Up @@ -828,8 +829,8 @@ include :cpp:`array`, :cpp:`dataPtr`, :cpp:`box`, :cpp:`nComp`, and

All :cpp:`BaseFab<T>` objects in :cpp:`FabArray<FAB>` are allocated in
CPU memory, including :cpp:`IArrayBox` and :cpp:`FArrayBox`, which are
derived from :cpp:`BaseFab`, although the array data contained are
allocated in managed memory. We cannot pass a :cpp:`BaseFab` object by
derived from :cpp:`BaseFab`, and the array data contained are
allocated in either device or managed memory. We cannot pass a :cpp:`BaseFab` object by
value because they do not have copy constructor. However, we can make
an :cpp:`Array4` using member function :cpp:`BaseFab::array()`, and pass it
by value to GPU kernels. In GPU device code, we can use :cpp:`Array4`
Expand Down Expand Up @@ -1113,7 +1114,7 @@ An example of a 1D function launch is given here:
}

Instead of passing an :cpp:`Array4`, :cpp:`FArrayBox::dataPtr()` is called to obtain a
CUDA managed pointer to the :cpp:`FArrayBox` data. This is an alternative way to access
pointer to the :cpp:`FArrayBox` data. This is an alternative way to access
the :cpp:`FArrayBox` data on the GPU. Instead of passing a :cpp:`Box` to define the loop
bounds, a :cpp:`long` or :cpp:`int` number of elements is passed to bound the single
:cpp:`for` loop. This construct can be used to work on any contiguous set of memory by
Expand Down Expand Up @@ -1201,7 +1202,7 @@ a Fortran function is given here:
The function ``plusone_acc`` is a CPU host function. The
:cpp:`FArrayBox` reference
from :cpp:`operator[]` is a reference to a :cpp:`FArrayBox` in host
memory with data that has been placed in managed CUDA memory.
memory with data that has been placed in GPU memory.
``BL_TO_FORTRAN_BOX`` and ``BL_TO_FORTRAN_ANYD`` behave identically
to implementations used on the CPU. These macros return the
individual components of the AMReX C++ objects to allow passing to
Expand All @@ -1213,7 +1214,7 @@ The corresponding OpenACC labelled loop in ``plusone_acc`` is:

::

!dat = pointer to fab's managed data
!dat = pointer to fab's GPU data

!$acc kernels deviceptr(dat)
do k = lo(3), hi(3)
Expand All @@ -1226,7 +1227,7 @@ The corresponding OpenACC labelled loop in ``plusone_acc`` is:
!$acc end kernels

Since the data pointer passed to ``plusone_acc`` points to
unified memory, OpenACC can be told the data is available on the
device memory, OpenACC can be told the data is available on the
device using the ``deviceptr`` construct. For further details
about OpenACC programming, consult the OpenACC user's guide.

Expand All @@ -1238,7 +1239,7 @@ OpenMP labelled version of this loop is:

::

!dat = pointer to fab's managed data
!dat = pointer to fab's GPU data

!$omp target teams distribute parallel do collapse(3) schedule(static,1) is_device_ptr(dat)
do k = lo(3), hi(3)
Expand Down Expand Up @@ -1297,7 +1298,7 @@ not work as intended. For example,
class MyClass {
public:
Box bx;
int m; // Unmanaged integer created on the host.
int m; // integer created on the host.
void f () {
amrex::launch(bx,
[=] AMREX_GPU_DEVICE (Box const& tbx)
Expand Down Expand Up @@ -1548,8 +1549,8 @@ Particle Support
.. _sec:gpu:particle:

As with ``MultiFab``, particle data stored in AMReX ``ParticleContainer`` classes are
stored in unified memory when AMReX is compiled with ``USE_CUDA=TRUE``. This means that the :cpp:`dataPtr` associated with particles
is managed and can be passed into GPU kernels. These kernels can be launched with a variety of approaches,
stored in GPU memory when AMReX is compiled with ``USE_CUDA=TRUE``. This means that the :cpp:`dataPtr` associated with particles
can be passed into GPU kernels. These kernels can be launched with a variety of approaches,
including Cuda C / Fortran and OpenACC. An example Fortran particle subroutine offloaded via OpenACC might
look like the following:

Expand Down Expand Up @@ -1583,7 +1584,7 @@ look like the following:
end subroutine push_position_boris

Note the use of the :fortran:`!$acc parallel deviceptr` clause to specify which data has been placed
in managed memory. This instructs OpenACC to treat those variables as if they already live on
in device memory. This instructs OpenACC to treat those variables as if they already live on
the device, bypassing the usual copies. For complete examples of a particle code that has been ported
to GPUs using Cuda, OpenACC, and OpenMP, please see the tutorial `Electromagnetic PIC`_.

Expand Down Expand Up @@ -1704,18 +1705,17 @@ Inputs Parameters
The following inputs parameters control the behavior of amrex when running on GPUs. They should be prefaced
by "amrex" in your :cpp:`inputs` file.

+----------------------------+-----------------------------------------------------------------------+-------------+------------------+
| | Description | Type | Default |
+============================+=======================================================================+=============+==================+
| use_gpu_aware_mpi | Whether to use GPU memory for communication buffers during MPI calls. | Bool | False |
| | If true, the buffers will use device memory. If false, they will use | | |
| | pinned memory. In practice, we find it is usually not worth it to use | | |
| | GPU aware MPI. | | |
+----------------------------+-----------------------------------------------------------------------+-------------+------------------+
| abort_on_out_of_gpu_memory | If the size of free memory on the GPU is less than the size of a | Bool | False |
| | requested allocation, AMReX will call AMReX::Abort() with an error | | |
| | describing how much free memory there is and what was requested. | | |
+----------------------------+-----------------------------------------------------------------------+-------------+------------------+
| the_arena_is_managed | Whether :cpp:`The_Arena()` allocates managed memory. | Bool | True (CUDA/SYCL) |
| | | | False (HIP) |
+----------------------------+-----------------------------------------------------------------------+-------------+------------------+
+----------------------------+-----------------------------------------------------------------------+-------------+----------+
| | Description | Type | Default |
+============================+=======================================================================+=============+==========+
| use_gpu_aware_mpi | Whether to use GPU memory for communication buffers during MPI calls. | Bool | False |
| | If true, the buffers will use device memory. If false, they will use | | |
| | pinned memory. In practice, we find it is usually not worth it to use | | |
| | GPU aware MPI. | | |
+----------------------------+-----------------------------------------------------------------------+-------------+----------+
| abort_on_out_of_gpu_memory | If the size of free memory on the GPU is less than the size of a | Bool | False |
| | requested allocation, AMReX will call AMReX::Abort() with an error | | |
| | describing how much free memory there is and what was requested. | | |
+----------------------------+-----------------------------------------------------------------------+-------------+----------+
| the_arena_is_managed | Whether :cpp:`The_Arena()` allocates managed memory. | Bool | False |
+----------------------------+-----------------------------------------------------------------------+-------------+----------+
8 changes: 2 additions & 6 deletions Src/Base/AMReX_Arena.cpp
Expand Up @@ -47,11 +47,7 @@ namespace {
Long the_pinned_arena_release_threshold = std::numeric_limits<Long>::max();
Long the_comms_arena_release_threshold = std::numeric_limits<Long>::max();
Long the_async_arena_release_threshold = std::numeric_limits<Long>::max();
#ifdef AMREX_USE_HIP
bool the_arena_is_managed = false; // xxxxx HIP FIX HERE
#else
bool the_arena_is_managed = true;
#endif
bool the_arena_is_managed = false;
bool abort_on_out_of_gpu_memory = false;
}

Expand Down Expand Up @@ -660,4 +656,4 @@ The_Comms_Arena ()
}
}

}
}

0 comments on commit 5dfb040

Please sign in to comment.