JohT/convolution-benchmarks
Benchmark Convolution Implementations

This repository takes different C++ implementations of the convolution algorithm and provides fully automated benchmarks visualized as charts. It might also be used as a starting point or template for other C++ benchmark projects.

🚀 Features

📖 Related Blog Articles

📈 Results

BenchmarkCharts.md contains the benchmark results visualized as bar charts.

⚒️ Tools

  • Needs CMake (e.g. installed with the Visual Studio Installer) to build the project.
  • Needs node-canvas dependencies to create SVG vector graphics files.
  • Needs Node.js to build the JavaScript-based charts.
  • Recommends Ninja as "a small build system with a focus on speed".
  • Uses cpm as "setup-free CMake dependency management" for C++.
  • Uses Catch2 as unit test and benchmark framework for C++.
  • Uses vega-lite to visualize the benchmark results as a bar chart.
  • Uses GitHub Actions to fully automate build, benchmarks, and visualization.
  • Uses Renovate to update dependencies automatically.

⚡️ Commands

⚙️ Compiler Options for vectorization reports

The following compiler options are used to get detailed vectorization reports. They are defined in CompilerOptions.cmake.
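The shape of such a file can be sketched as follows. This is an illustrative fragment only; the actual CompilerOptions.cmake in the repository may be structured differently:

```cmake
# Illustrative sketch — attach per-compiler vectorization-report options.
if(CMAKE_CXX_COMPILER_ID MATCHES "Clang")
    add_compile_options(-Rpass=loop-vectorize -Rpass-missed=loop-vectorize -Rpass-analysis=loop-vectorize)
elseif(CMAKE_CXX_COMPILER_ID STREQUAL "GNU")
    add_compile_options(-fopt-info-vec-optimized -fopt-info-vec-missed)
elseif(CMAKE_CXX_COMPILER_ID STREQUAL "MSVC")
    add_compile_options(/Qvec-report:2 /arch:AVX)
endif()
```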

Clang

These compile options are used as described in Auto-Vectorization in LLVM and Clang command line argument reference:

  • -Rpass=loop-vectorize identifies loops that were successfully vectorized.
  • -Rpass-missed=loop-vectorize identifies loops that failed vectorization and indicates if vectorization was specified.
  • -Rpass-analysis=loop-vectorize identifies the statements that caused vectorization to fail. If in addition -fsave-optimization-record is provided, multiple causes of vectorization failure may be listed (this behavior might change in the future).
  • -fsave-optimization-record generates a YAML optimization record file.
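As a minimal illustration, a loop like the following (with an assumed file name of convolution.cpp) can be compiled with these flags to see the remarks Clang emits:

```cpp
// convolution.cpp — minimal example to exercise the vectorization remarks.
// Illustrative compile command:
//   clang++ -O2 -std=c++20 -Rpass=loop-vectorize -Rpass-missed=loop-vectorize -c convolution.cpp
#include <cstddef>

void scale(float* out, const float* in, std::size_t n, float factor)
{
    // Independent iterations with no loop-carried dependency; clang
    // typically reports this loop as vectorized via -Rpass=loop-vectorize.
    for (std::size_t i = 0; i < n; ++i)
    {
        out[i] = in[i] * factor;
    }
}
```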

GCC

These compile options are used with GCC as described in GCC Developer Options:

  • -fopt-info-vec-optimized prints information when an optimization from the vectorization passes succeeds.
  • -fopt-info-vec-missed prints information about missed optimization opportunities from vectorization passes.

MSVC

These compile options are used with MSVC as described in Auto-Vectorizer Reporting Level and Auto-Parallelization and Auto-Vectorization:

  • /Qvec-report:2 outputs an informational message for loops that are vectorized and for loops that are not vectorized, together with a reason code.

  • /arch:AVX specifies the architecture for code generation (here AVX).

🔭 Online Tools

Godbolt Compiler Explorer

Godbolt Compiler Explorer is a great tool for analyzing small blocks of code and getting insight into what the compiler does. It is ideal for trying things out. The following two links show a simple yet fast convolution implementation that works well with compiler optimization and vectorization.

Quick C++ Benchmark

Quick C++ Benchmark is another great tool that compares the performance of two code blocks with each other.

Copy & paste the following code snippet into Quick C++ Benchmark. It compares iterating over the input values in the outer loop (kernel in the inner loop) with iterating over the kernel values in the outer loop, both using a "direct form transposed" convolution implementation.

Code snippet to compare iterating over the kernel values in the inner vs. the outer loop
#include <cstddef>
#include <random>
#include <span>
#include <vector>
// Note: Quick C++ Benchmark provides the Google Benchmark header automatically.

template<typename T>
std::vector<T> randomNumbers(int n, T min, T max)
{
    std::vector<T> vectorWithRandomNumbers(n, T());
    std::random_device randomDevice;
    std::mt19937 gen(randomDevice());
    std::uniform_real_distribution<T> distribution(min, max);
    for (T &value : vectorWithRandomNumbers)
    {
        value = distribution(gen);
    }
    return vectorWithRandomNumbers;
}

template<typename ValueType>
void convolutionKernelInner(const std::span<const ValueType> &input, const std::span<const ValueType> &kernel, const std::span<ValueType> &output)
{
    const auto inputLength = input.size();
    const auto kernelLength = kernel.size();

    for (std::size_t inputIndex = 0; inputIndex < inputLength; ++inputIndex)
    {
        const auto inputValue = input[inputIndex];
        for (std::size_t kernelIndex = 0; kernelIndex < kernelLength; ++kernelIndex)
        {
            output[inputIndex + kernelIndex] += inputValue * kernel[kernelIndex];
        }
    }
}

template<typename ValueType>
void convolutionKernelOuter(const std::span<const ValueType> &input, const std::span<const ValueType> &kernel, const std::span<ValueType> &output)
{
    const auto inputLength = input.size();
    const auto kernelLength = kernel.size();

    for (std::size_t kernelIndex = 0; kernelIndex < kernelLength; ++kernelIndex)
    {
        const auto kernelValue = kernel[kernelIndex];
        for (std::size_t inputIndex = 0; inputIndex < inputLength; ++inputIndex)
        {
            output[inputIndex + kernelIndex] += input[inputIndex] * kernelValue;
        }
    }
}

static void kernelInner(benchmark::State& state) {
  // Code before the loop is not measured
  const auto & input = randomNumbers(16384, -1.0F, 1.0F);
  const auto & kernel = randomNumbers(16, 0.0F, 1.0F);

  const auto convolutionLength = input.size() + kernel.size() - 1;
  auto output = std::vector<float>(convolutionLength, 0.0F);
    
  // Code inside this loop is measured repeatedly
  for (auto _ : state) {
    convolutionKernelInner(std::span(input), std::span(kernel), std::span(output));
    // Make sure the variable is not optimized away by compiler
    benchmark::DoNotOptimize(output);
  }
}
// Register the function as a benchmark
BENCHMARK(kernelInner);

static void kernelOuter(benchmark::State& state) {
  // Code before the loop is not measured
  const auto & input = randomNumbers(16384, -1.0F, 1.0F);
  const auto & kernel = randomNumbers(16, 0.0F, 1.0F);

  const auto convolutionLength = input.size() + kernel.size() - 1;
  auto output = std::vector<float>(convolutionLength, 0.0F);
    
  // Code inside this loop is measured repeatedly
  for (auto _ : state) {
    convolutionKernelOuter(std::span(input), std::span(kernel), std::span(output));
    // Make sure the variable is not optimized away by compiler
    benchmark::DoNotOptimize(output);
  }
}
// Register the function as a benchmark
BENCHMARK(kernelOuter);

📜 History

This repository was initially intended to explore Single Instruction Multiple Data (SIMD) in C++. Since convolution is such an essential part of filtering in digital signal processing and a central part of convolutional neural networks (CNN), it seemed obvious to try that first.

A complex topic like this benefits from an "experiment, then evaluate" approach to get started. Catch2 is used to assure that the implementations are correct, and also to benchmark their performance. The compiler options for vectorization reports give insight into what the compiler does. Online tools are used to experiment exploratively with implementations. Finally, to get visual feedback, vega-lite is used to create bar charts of the benchmark results.

🔎 References
