The device_matrix library

device_matrix is a lightweight, transparent, object-oriented and templated C++ library that encapsulates CUDA memory objects (i.e., tensors) and defines common operations on them.

Requirements & installation

To build the library and manage dependencies, we use CMake (version 3.5 or higher). In addition, we rely on the following libraries:

  • CUDA (version 8 or higher preferred), and
  • glog (version 0.3.4 or higher).

The cnmem library is used for memory management. The tests are implemented using the googletest and googlemock frameworks. CMake will fetch and compile these libraries automatically as part of the build pipeline. Finally, you need a CUDA-compatible GPU in order to perform any computations.

To install device_matrix, the following instructions should get you started.

git clone https://github.com/cvangysel/device_matrix
cd device_matrix
mkdir build
cd build
cmake ..
make
make test
make install

Please refer to the CMake documentation for advanced options.

Examples

The following examples can also be found in the examples sub-directory of this repository; they are compiled as part of the build process.

Matrix multiplication

#include <device_matrix/device_matrix.h>

#include <glog/logging.h>
#include <memory>

using namespace cuda;

int main(int argc, char* argv[]) {
    google::InitGoogleLogging(argv[0]);

    const cudaStream_t stream = 0; // default CUDA stream.

    std::unique_ptr<device_matrix<float32>> a(
        device_matrix<float32>::create(
            stream,
            {1.0, 2.0, 3.0, 4.0, 5.0, 6.0},
            2 /* num_rows */, 3 /* num_columns */));

    std::unique_ptr<device_matrix<float32>> b(
        device_matrix<float32>::create(
            stream,
            {7.0, 8.0, 9.0, 10.0, 11.0, 12.0},
            3 /* num_rows */, 2 /* num_columns */));

    device_matrix<float32> c(
        2 /* num_rows */, 2 /* num_columns */, stream);

    // Computes c = a * b; CUBLAS_OP_N requests no transposition.
    matrix_mult(stream,
                *a, CUBLAS_OP_N,
                *b, CUBLAS_OP_N,
                &c);

    // The computation is asynchronous; wait for it to finish.
    cudaDeviceSynchronize();

    print_matrix(c);
}
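
Here matrix_mult computes the product c = a · b; CUBLAS_OP_N is the standard cuBLAS operation code indicating that the corresponding operand is not transposed. Since the multiplication executes asynchronously on the stream, cudaDeviceSynchronize() is required before print_matrix reads back the result.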

Custom CUDA kernels

#include <device_matrix/device_matrix.h>

#include <glog/logging.h>
#include <memory>

using namespace cuda;

template <typename FloatT>
__global__
void inverse_kernel(FloatT* const input) {
    // Each thread negates a single matrix component in place.
    const size_t offset = threadIdx.y * blockDim.x + threadIdx.x;
    input[offset] = -input[offset];
}

int main(int argc, char* argv[]) {
    google::InitGoogleLogging(argv[0]);

    const cudaStream_t stream = 0; // default CUDA stream.

    std::unique_ptr<device_matrix<float32>> a(
        device_matrix<float32>::create(
            stream,
            {1.0, 2.0, 3.0, 4.0, 5.0, 6.0},
            2 /* num_rows */, 3 /* num_columns */));

    LAUNCH_KERNEL(
        inverse_kernel
            <<<1, /* a single block */
               dim3(a->getRows(), a->getCols()), /* one thread per component */
               0,
               stream>>>(
            a->getData()));

    cudaDeviceSynchronize();

    print_matrix(*a);
}
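
The kernel is launched as a single thread block with one thread per matrix component (dim3(a->getRows(), a->getCols())), and each thread negates one element of the raw device buffer exposed by getData(). The LAUNCH_KERNEL macro is provided by the library; presumably it wraps the launch with error checking, but in any case the kernel itself is entirely user-defined.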

Design principles

device_matrix was explicitly designed to be inflexible with regard to variable passing and assignment, as the lifetime of a device_matrix instance directly corresponds to the lifetime of the CUDA memory region it has allocated. That means that CUDA memory remains allocated as long as its underlying device_matrix exists, and that device_matrix instances can only be passed around as pointers or references. This gives the programmer total control over CUDA memory allocation, as it avoids garbage collection (e.g., Torch) or reference counting (e.g., shared_ptr) and allows for optimized CUDA memory usage. The library uses cnmem for its memory management in order to avoid the performance issues caused by recurrent re-allocation of memory blocks of a particular size.
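
The following is a minimal sketch of this ownership model. It only assumes the constructor, accessors, and LAUNCH_KERNEL macro shown in the examples above; negate_in_place is a hypothetical helper written for this illustration.

#include <device_matrix/device_matrix.h>

#include <glog/logging.h>

using namespace cuda;

template <typename FloatT>
__global__
void negate_kernel(FloatT* const input) {
    const size_t offset = threadIdx.y * blockDim.x + threadIdx.x;
    input[offset] = -input[offset];
}

// Hypothetical helper: the matrix is passed by pointer, so nothing is
// copied and ownership stays with the caller.
void negate_in_place(const cudaStream_t stream,
                     device_matrix<float32>* const m) {
    LAUNCH_KERNEL(
        negate_kernel
            <<<1, dim3(m->getRows(), m->getCols()), 0, stream>>>(
            m->getData()));
}

int main(int argc, char* argv[]) {
    google::InitGoogleLogging(argv[0]);

    const cudaStream_t stream = 0; // default CUDA stream.

    {
        // Allocates a 2-by-3 region of device memory.
        device_matrix<float32> m(
            2 /* num_rows */, 3 /* num_columns */, stream);

        negate_in_place(stream, &m);

        cudaDeviceSynchronize();
    } // m goes out of scope here; its device memory is released immediately.
}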

To avoid the implicit allocation of on-device memory, any operation that results in a new allocation must make this explicit. Most operations that return a new result therefore reuse one of their inputs as destination memory space (in the process, the original input values are overwritten!). Consequently, C++ operators that imply value modification were deliberately omitted.
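
Both examples above follow this convention: in the matrix multiplication example, the caller allocates the destination matrix c up front, and in the custom-kernel example, inverse_kernel overwrites its input in place rather than allocating a result matrix.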

The underlying CUDA memory space can easily be accessed by the library user (e.g., via getData(), as in the custom-kernel example above). This allows the user to write arbitrary CUDA kernels that perform non-standard operations on CUDA objects in place.

License

device_matrix is licensed under the MIT license. CUDA is a trademark of NVIDIA; please note that CUDA is licensed separately.

If you modify device_matrix in any way, please link back to this repository.
