Fast Matrix Multiplication on CPU

Different matrix multiplication implementations and benchmarks

A few months ago, I had the pleasure of tuning into the Modular AI 2023 product release keynote. It was a riveting experience, filled with anticipation and excitement, especially when the fast Matrix Multiplication section was presented by Jeremy Howard and Chris Lattner. My interest was so piqued that I immediately applied for the product preview, not just once, but several times. I'm still patiently awaiting approval.

During the interim, curiosity got the better of me, especially when it came to the fast matrix multiplication. I embarked on a journey to understand the feature better, and decided to take it a step further - I implemented a few versions of it myself and performed some benchmarks. This hands-on approach gave me a deeper insight into how fast matrix multiplication is implemented.

Within this repository, you'll find two types of matrix multiplication:

  • The first is the dot product of two matrices, both sized 1024 * 1024. I implemented this type using C++11 standard functions, C++11 with the Eigen library, C++11 with the Armadillo library, C++11 with SIMD AVX instructions, and Swift with Apple's Accelerate framework. The Swift version with the Accelerate framework amazingly accelerates the matrix multiplication by 532,550.639 times relative to the Python baseline, using the same measurement standard shown in the Modular AI product release keynote. (A minimal sketch of the baseline kernel and the GFLOP/s metric follows below.)
  • The second type is the element-wise multiplication, also known as the Hadamard product, of two matrices of the same size (1024 * 1024).

The element-wise multiplication method, I must note, is particularly prevalent in Deep Learning models like Transformers.
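
For orientation, here is a minimal sketch (not the repository's exact code) of the baseline triple-loop dot-product kernel and of how the GFLOP/s figures in the table below are derived, counting 2·N³ floating-point operations per multiplication; the actual benchmarking code lives in the per-method directories listed under each entry.

```cpp
// Minimal sketch: naive O(N^3) matrix multiplication plus the GFLOP/s metric
// used as the common yardstick across all implementations.
#include <chrono>
#include <cstdio>
#include <vector>

int main() {
    const int N = 1024;
    std::vector<float> A(N * N, 1.0f), B(N * N, 1.0f), C(N * N, 0.0f);

    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < N; ++i)
        for (int k = 0; k < N; ++k)          // i-k-j order is cache-friendlier than i-j-k
            for (int j = 0; j < N; ++j)
                C[i * N + j] += A[i * N + k] * B[k * N + j];
    auto end = std::chrono::high_resolution_clock::now();

    double seconds = std::chrono::duration<double>(end - start).count();
    double gflops = 2.0 * N * N * N / seconds / 1e9;  // 2*N^3 flops (one mul + one add each)
    std::printf("%.3f GFLOP/s\n", gflops);
    return 0;
}
```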

Dot product implementations

Acceleration Results:

Method              GFLOP/s        Acceleration (times)
Python 3.9          0.002536116    1x
C++11 standard      2.31342        912x
C++11 SIMD AVX      28.2316        11,131.825x
C++11 eigen         33.8274        13,338.27x
Python Numpy        87.856         34,709.7773x
C++11 armadillo     107.782        42,498.95x
Swift Acceleration  1350.607       532,550.639x

Explanation:

  • Python 3.9

    code in /dot_product/python_matmul; the first method there is taken directly from the Modular AI product release keynote (https://youtu.be/-3Kf2ZZU-dg) and serves as the benchmark baseline.

  • C++11 standard

    code in /dot_product/standard_c++

  • C++11 SIMD AVX

    code in /dot_product/simd_avx

    SIMD stands for Single Instruction, Multiple Data. SIMD allows one single operation to be applied to a set of data simultaneously, which is extremely beneficial for tasks such as matrix multiplication, where the same operation (multiplication and addition) is applied over and over.

    For SIMD operations in C++, one can use the data types and intrinsic functions provided by the compiler. These differ based on your hardware and the specific vector instruction set it supports, such as SSE, AVX, or AVX-512 on Intel and AMD CPUs, or NEON on ARM CPUs. A minimal AVX intrinsics sketch appears after this list.

  • C++11 eigen

    code in /dot_product/eigen

  • Numpy

    code in /dot_product/python_matmul, the second method.

  • C++11 Armadillo (42,498.95x acceleration)

    code in /dot_product/armadillo

    Here's a corresponding C++ version, written using the Armadillo library for linear algebra and other mathematical operations. Armadillo is well known for its high performance and efficient memory usage.

    This program creates a matrix class which uses Armadillo's fmat (floating point matrix) to store its data. matmul_native performs matrix multiplication using three nested loops, and benchmark_matmul_native measures the speed of this operation in GFLOPs (billions of floating-point operations per second).

    Please note that this is a simplified version and might not include all of the functionality implied by the Python code. Also, Armadillo has built-in high-performance matrix multiplication that could be used instead of the native triple-loop method.

  • Swift Acceleration (532,550.639x acceleration)

    code in /dot_product/swift

    In this Swift code, cblas_sgemm is a function from the Accelerate Framework that performs general matrix multiplication. The parameters specify how the matrices are stored in memory and the dimensions of the matrices.

    The cblas_sgemm function is part of Apple's Accelerate Framework, which is a collection of powerful and efficient numerical computing libraries. This function in particular is an implementation of general matrix multiply (GEMM), one of the fundamental operations of numerical linear algebra.

    The Accelerate Framework's version of this function is highly optimized for Apple hardware, using low-level programming and hardware-specific optimizations to achieve impressive speed. This can include the use of SIMD instructions, efficient cache usage, and other advanced techniques.

    Because of these optimizations, and the fact that the function is implemented in low-level code, it can be much faster than a similar function written in Swift (or other high-level languages) without such optimizations. This is why numerical computing libraries like Accelerate (or others like Intel's MKL, OpenBLAS, etc.) are widely used in scientific computing, data science, machine learning, and other fields that require heavy numerical computations.
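
Since cblas_sgemm follows the standard CBLAS interface, a rough C++ sketch of the equivalent call (not the repository's Swift code) looks like the following; the parameters carry the same meaning as in the Swift call described above.

```cpp
// Rough C++ equivalent of the Swift Accelerate call: C = alpha*A*B + beta*C.
// On macOS this function is provided by the Accelerate framework
// (#include <Accelerate/Accelerate.h>); elsewhere, OpenBLAS or another CBLAS
// implementation provides <cblas.h>.
#include <cblas.h>
#include <vector>

int main() {
    const int N = 1024;
    std::vector<float> A(N * N, 1.0f), B(N * N, 1.0f), C(N * N, 0.0f);

    // Row-major single-precision GEMM: M = N = K = 1024, alpha = 1, beta = 0.
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                N, N, N,
                1.0f, A.data(), N,
                      B.data(), N,
                0.0f, C.data(), N);
    return 0;
}
```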

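For the SIMD AVX entry above, a minimal illustration of the idea (again, not the repository's exact kernel) is an inner loop that multiplies and accumulates eight floats per instruction with AVX intrinsics:

```cpp
// Minimal AVX sketch: the innermost loop processes eight floats at a time.
// Assumes N is a multiple of 8 and that C is zero-initialized.
#include <immintrin.h>

void matmul_avx(const float* A, const float* B, float* C, int N) {
    for (int i = 0; i < N; ++i) {
        for (int k = 0; k < N; ++k) {
            __m256 a = _mm256_set1_ps(A[i * N + k]);       // broadcast A[i][k]
            for (int j = 0; j < N; j += 8) {
                __m256 b = _mm256_loadu_ps(&B[k * N + j]); // 8 floats of row k of B
                __m256 c = _mm256_loadu_ps(&C[i * N + j]);
                c = _mm256_add_ps(c, _mm256_mul_ps(a, b)); // C[i][j..j+7] += a * b
                _mm256_storeu_ps(&C[i * N + j], c);
            }
        }
    }
}
```

Compiling with -mavx (or -march=native) enables these intrinsics; FMA instructions and blocking for cache would push the numbers further.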
Strassen Algorithm

Added a C++ implementation of the Strassen algorithm (https://en.wikipedia.org/wiki/Strassen_algorithm), a classic fast multiplication method for square matrices. The current results show that the Strassen implementation is actually slower than the standard matrix multiplication in C++. I am skeptical about the practical performance of the Strassen algorithm even with a modern C++ library, and will continue to dig into this direction in the future.

The Strassen algorithm is indeed quite slow in most cases. After some analysis, the likely reasons are the following: Strassen's algorithm underperforms on modern architectures mainly due to poor cache utilization, the overhead of recursive calls, inefficiency for small matrices, and an increased number of addition and subtraction operations. It is also harder to implement and parallelize efficiently than classical matrix multiplication. Therefore, despite its theoretically lower computational complexity, it often falls short in practice on current hardware. Details are in: https://github.com/elphinkuo/fast_matrix_multiplication/blob/main/Strassen_Algorithm/README.md
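
To make the structure concrete, here is a minimal one-level Strassen sketch (not the repository's implementation), written here with Eigen purely for brevity. It splits each matrix into four blocks and replaces the eight block multiplications of the classical 2 x 2 block method with seven, at the cost of extra block additions and temporaries, which is exactly where the overhead discussed above comes from.

```cpp
// Illustrative one-level Strassen step using Eigen.
#include <Eigen/Dense>
using Mat = Eigen::MatrixXf;

// Multiplies two N x N matrices (N even) with a single level of Strassen recursion.
Mat strassen_one_level(const Mat& A, const Mat& B) {
    const int h = static_cast<int>(A.rows()) / 2;
    // Split both operands into 2 x 2 blocks.
    Mat A11 = A.topLeftCorner(h, h),    A12 = A.topRightCorner(h, h);
    Mat A21 = A.bottomLeftCorner(h, h), A22 = A.bottomRightCorner(h, h);
    Mat B11 = B.topLeftCorner(h, h),    B12 = B.topRightCorner(h, h);
    Mat B21 = B.bottomLeftCorner(h, h), B22 = B.bottomRightCorner(h, h);

    // Seven block products instead of eight; the extra additions are the price.
    Mat M1 = (A11 + A22) * (B11 + B22);
    Mat M2 = (A21 + A22) * B11;
    Mat M3 = A11 * (B12 - B22);
    Mat M4 = A22 * (B21 - B11);
    Mat M5 = (A11 + A12) * B22;
    Mat M6 = (A21 - A11) * (B11 + B12);
    Mat M7 = (A12 - A22) * (B21 + B22);

    Mat C(A.rows(), A.rows());
    C.topLeftCorner(h, h)     = M1 + M4 - M5 + M7;
    C.topRightCorner(h, h)    = M3 + M5;
    C.bottomLeftCorner(h, h)  = M2 + M4;
    C.bottomRightCorner(h, h) = M1 - M2 + M3 + M6;
    return C;
}
```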

Hadamard product implementations