Sparse Boolean Linear Algebra Libraries Benchmark

Performance evaluation of sparse linear algebra libraries for CPU and GPU targets. This evaluation measures the execution time of the most common and popular sparse matrix operations, x (multiplication) and + (element-wise addition), over the Boolean semiring.

The purpose of the evaluation is to show the performance gain of Boolean-optimized operations over their general-purpose counterparts. The comparison is not entirely fair, but dedicated Boolean-optimized libraries for the GPU have not been introduced yet.

Note: some of the tested libraries do not provide extra optimizations for Boolean-valued operations. In these cases, the operations are imitated with the provided general-purpose primitives, where non-zero (floating-point type) values are interpreted as true.
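
The following sketch illustrates what "imitated" means here, using SciPy on the CPU purely as an example; the benchmarked GPU libraries expose analogous float-valued primitives:

import numpy as np
from scipy.sparse import csr_matrix

# Boolean matrices stored as float CSR, the way general-purpose
# libraries expect them: any non-zero value encodes "true".
A = csr_matrix(np.array([[1.0, 0.0],
                         [1.0, 1.0]]))
B = csr_matrix(np.array([[0.0, 1.0],
                         [1.0, 0.0]]))

C = A @ B  # numeric sparse matrix product (x)
D = A + B  # numeric element-wise addition (+)

# The numeric results may hold arbitrary non-zero values, so the
# Boolean result is recovered by re-interpreting every stored
# entry as "true" (here normalized back to 1.0).
C.data[:] = 1.0
D.data[:] = 1.0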

Tested libraries and frameworks

Library name   Target platform   Technology    Operation x   Operation +
cuBool         GPU               Nvidia Cuda   yes           yes
clBool         GPU               OpenCL        yes           yes
CUSP           GPU               Nvidia Cuda   yes           yes
cuSPARSE       GPU               Nvidia Cuda   yes           yes
clSPARSE       GPU               OpenCL        yes           no
SuiteSparse    CPU               CPU           yes           yes

Getting started

Requirements

  • GCC Compiler
  • CMake 3.15 (or greater)
  • Cuda Toolkit
  • OpenCL Dev-Package
  • SuiteSparse

Get source code

Get the source code and go into the project directory.

$ git clone https://github.com/EgorOrachyov/SpBench
$ cd SpBench
$ git submodule update --init --recursive

Download data

The dataset for the benchmarks is available on Google Drive by the access link. The dataset is a collection of several square sparse matrices in the .mtx format. These matrices represent various network graphs, for which benchmarking the x and + operations makes sense.
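
For reference, the coordinate .mtx (MatrixMarket) format looks as follows; this is a hand-made 4 x 4 example with 5 non-zeros, not a matrix from the dataset (the pattern variant is shown; real-valued variants carry an extra value column after the 1-based row and column indices):

%%MatrixMarket matrix coordinate pattern general
4 4 5
1 2
2 1
2 3
3 4
4 4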

The dataset is stored as two .zip archives, dataset0.zip and dataset1.zip. Download these archives and place them in the dataset folder in the repository root. This folder must be created manually.

After that, the paths to the archives must be SpBench/dataset/dataset0.zip and SpBench/dataset/dataset1.zip respectively. These archives will be automatically extracted into the build directory during the build process.

Build

Create a build directory and go into it. The example below demonstrates how to build the benchmarks in release mode.

$ mkdir build-release
$ cd build-release

Configure the build and run the actual compilation process.

$ cmake .. -DCMAKE_BUILD_TYPE=Release
$ cmake --build . --target all -j `nproc`

After compilation, the benchmark executables will be stored in the build directory. The naming follows the pattern {tool-name}_{operation name}.

Generate Data

For benchmarking the operation R = A + A^2, we need to generate the A^2 matrix for each matrix entry in the dataset. The A^2 matrices are generated by a cuBool-based application. In order to run the generation, execute the following command inside the build directory:

$ ./cubool_prepare_data data/config_gen_m2.txt

This will generate the required A^2 matrices and store them in the same data folders as the original A matrices, with the extension suffix .mtx2 instead of .mtx. These matrices are automatically loaded inside the benchmarks by the MatrixLoader2 class, which appends 2 to the file name of the loaded target matrix A and loads the corresponding A^2 matrix.
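
The naming convention itself is simple; the sketch below illustrates it in Python (MatrixLoader2 is part of this repository's C++ code, so the function name here is purely illustrative):

def a_squared_path(path):
    # MatrixLoader2 derives the A^2 file name by appending '2'
    # to the .mtx extension of the target matrix A.
    assert path.endswith(".mtx")
    return path + "2"  # e.g. data/wing.mtx -> data/wing.mtx2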

Benchmark execution

In order to run the benchmarks for all tested targets, execute the following commands inside the build directory:

$ bash run_all.sh
$ bash summarize.sh

This will output a Summary.txt file with the benchmark stats.

Memory profiling

In order to get the peak GPU memory usage, we first need to record the GPU memory usage with a 1 ms time step. To do this, run the following command in a separate terminal:

$ nvidia-smi --query-gpu=timestamp,pstate,memory.total,memory.used --format=csv -lms 1 -f Profiling.csv

This will run the nvidia-smi built-in memory profiler, which samples the memory usage with a 1 ms time step and saves this information into the Profiling.csv table.
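
The rows of Profiling.csv follow the standard nvidia-smi CSV layout and look roughly like this (values are illustrative):

timestamp, pstate, memory.total [MiB], memory.used [MiB]
2021/03/01 12:00:00.123, P2, 8119 MiB, 1024 MiB
2021/03/01 12:00:00.124, P2, 8119 MiB, 1096 MiB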

In the main terminal, execute the following command to run the actual library benchmarks (since we want to obtain their peak memory usage):

$ bash run_profiling.sh

When the last library finishes its benchmark, go to the separate terminal and explicitly terminate the memory capturing process with the Ctrl + C combination.

After that, two files will be located in your build directory: Profiling.csv with the memory usage and Profiling-Time.txt with the time stamps (begin and end of each experiment). In order to extract the actual peak memory usage, run the Python script scripts/extract_mem_profile.py: copy it into the build directory and run it.
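
Conceptually, the script scans the samples that fall into each experiment's begin/end window and takes the maximum; a minimal sketch of that idea follows (the exact layout of Profiling-Time.txt is defined by the benchmarks, so treat the parsing details below as assumptions, not as the actual script):

import csv
from datetime import datetime

FMT = "%Y/%m/%d %H:%M:%S.%f"  # nvidia-smi timestamp format

def load_samples(path="Profiling.csv"):
    # Return (timestamp, used MiB) pairs from the profiler table.
    samples = []
    with open(path) as f:
        for row in csv.DictReader(f, skipinitialspace=True):
            t = datetime.strptime(row["timestamp"], FMT)
            used = int(row["memory.used [MiB]"].split()[0])
            samples.append((t, used))
    return samples

def peak_usage(samples, begin, end):
    # Peak memory.used inside one experiment's time window,
    # where begin/end come from Profiling-Time.txt.
    return max(used for t, used in samples if begin <= t <= end)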

When the script finishes its job, it will output a Profiling-Stats.txt file with the peak memory usage for each library benchmark and for each matrix entry in the dataset.

Dataset

The matrix data is selected from the SuiteSparse Matrix Collection (formerly the University of Florida Sparse Matrix Collection).

M  Matrix name               # Rows     Nnz M      Nnz/row  Max Nnz/row  Nnz M^2     Nnz M + M^2
0  DIMACS10/wing             62,032     243,088    3.9      4            714,200     917,178
1  DIMACS10/luxembourg_osm   114,599    239,332    2.0      6            393,261     632,185
2  SNAP/amazon0312           400,727    3,200,440  7.9      10           14,390,544  14,968,909
3  LAW/amazon-2008           735,323    5,158,388  7.0      10           25,366,745  26,402,678
4  SNAP/web-Google           916,428    5,105,039  5.5      456          29,710,164  30,811,855
5  SNAP/roadNet-PA           1,090,920  3,083,796  2.8      9            7,238,920   9,931,528
6  SNAP/roadNet-TX           1,393,383  3,843,320  2.7      12           8,903,897   12,264,987
7  DIMACS10/belgium_osm      1,441,295  3,099,940  2.1      10           5,323,073   8,408,599
8  SNAP/roadNet-CA           1,971,281  5,533,214  2.8      12           12,908,450  17,743,342
9  DIMACS10/netherlands_osm  2,216,688  4,882,476  2.2      7            8,755,758   13,626,132
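
The statistics in this table can be reproduced directly from the .mtx files; for instance, the following SciPy sketch computes the same columns for one matrix (slow for the larger graphs, but straightforward):

import scipy.io

def matrix_stats(path):
    M = scipy.io.mmread(path).tocsr()
    M.data[:] = 1.0                       # Boolean: non-zero means true
    nnz_per_row = M.indptr[1:] - M.indptr[:-1]
    M2 = M @ M                            # A^2 over the Boolean structure
    return {
        "# Rows": M.shape[0],
        "Nnz M": M.nnz,
        "Nnz/row": round(M.nnz / M.shape[0], 1),
        "Max Nnz/row": int(nnz_per_row.max()),
        "Nnz M^2": M2.nnz,
        "Nnz M + M^2": (M + M2).nnz,
    }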

Results

For the performance evaluation, a PC with Ubuntu 20.04 was used: an Intel Core i7-6700 3.40 GHz CPU, 64 GB DDR4 RAM, and a GeForce GTX 1070 GPU with 8 GB VRAM. Only the execution time of the operations themselves was measured. The data were assumed to be already loaded into VRAM or RAM, respectively, in the format required by the tested framework.
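
In sketch form, each benchmark therefore times only the operation call itself, along the lines of the loop below (illustrative only; the actual benchmarks implement this in C++ against their respective libraries, the warm-up and iteration counts are assumptions, and GPU runs synchronize the device before stopping the timer):

import time

def measure(op, warmups=2, iters=10):
    # Inputs are already prepared in the format the library expects;
    # only the x or + call itself is timed.
    for _ in range(warmups):   # warm-up: caches, allocators, JIT
        op()
    total = 0.0
    for _ in range(iters):
        t0 = time.perf_counter()
        op()
        total += time.perf_counter() - t0
    return total / iters * 1000.0  # mean time in milliseconds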

Matrix-matrix multiplication evaluation results.
Time in milliseconds, Memory in megabytes.

      cuBool       clBool       CUSP          cuSPARSE       clSPARSE      SuiteSparse
M     Time   Mem   Time   Mem   Time    Mem   Time     Mem   Time    Mem   Time    Mem
0      1.9    93    1.9    89     5.2   125     20.1   155     4.9   105     7.9    22
1      2.4    91    2.0    89     3.7   111      1.7   151     6.9    97     3.1   169
2     23.2   165   55.5   163   108.5   897    412.8   301    52.2   459   257.6   283
3     33.3   225   82.1   221   172.0  1409    184.8   407    77.4   701   369.5   319
4     41.8   241  127.6   239   246.2  1717   4761.3   439   207.5  1085   673.3   318
5     18.1   157   14.2   153    42.1   481     37.5   247    56.6   283    66.6   294
6     22.6   167   16.9   165    53.1   581     46.7   271    70.4   329    80.7   328
7     23.2   151   16.9   159    32.9   397     26.7   235    68.2   259    56.9   302
8     32.0   199   23.4   211    74.4   771     65.8   325    98.2   433   114.5   344
9     35.3   191   24.9   189    51.0   585     51.4   291   102.8   361    90.9   311

Element-wise matrix-matrix addition evaluation results.
Time in milliseconds, Memory in megabytes.

      cuBool       clBool       CUSP         cuSPARSE     clSPARSE     SuiteSparse
M     Time   Mem   Time   Mem   Time   Mem   Time   Mem   Time   Mem   Time   Mem
0      1.1    95    1.9   105    1.4   105    2.4   163     -     -     2.3   176
1      1.7    95    1.6   109    1.0    97    0.8   151     -     -     1.6   174
2     11.4   221   23.8   543   16.2   455   24.3   405     -     -    37.2   297
3     17.5   323   35.4   877   29.5   723   27.2   595     -     -    64.8   319
4     24.8   355   43.1   989   31.9   815   89.0   659     -     -    77.5   318
5     16.9   189   12.5   359   11.2   329   11.6   317     -     -    36.6   287
6     19.6   209   15.4   429   14.5   385   16.9   357     -     -    45.3   319
7     19.5   179   10.5   321   10.2   303   10.5   297     -     -    28.5   302
8     30.5   259   22.4   579   19.4   513   20.2   447     -     -    65.2   331
9     30.1   233   18.6   457   14.8   423   18.3   385     -     -    50.2   311

Known issues

  • The clSPARSE-based matrix-matrix multiplication benchmark crashes in release mode with the error: terminate called after throwing an instance of 'std::bad_cast'; what(): std::bad_cast

License

This project is licensed under the MIT license. The full license text can be found in the license file. For detailed license information about the dependencies, refer to the official web pages of the third-party projects.