Performance Comparison to Intel MKL (MATLAB) #124

Open
RoyiAvital opened this issue May 5, 2020 · 3 comments

@RoyiAvital

Seeing the BLASFEO benchmarks and how it beats Intel MKL on small matrices made me want to create a MATLAB MEX wrapper for it to speed up small-matrix calculations.

The logic was: since BLASFEO beats Intel MKL in tests with no overhead, with MEX I'd beat MATLAB by a lot, since MATLAB only adds overhead on top of MKL and doesn't use MKL_DIRECT_CALL.
Every reason to be optimistic.

I implemented a MEX wrapper around blasfeo_dgemm() and validated it against MATLAB (the error is negligible).
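A validation of this kind can be sketched in plain C. This is a minimal sketch under stated assumptions, not the wrapper from this issue: `ref_dgemm` and `max_abs_diff` are hypothetical helper names, the matrices are assumed to be column-major (the layout MATLAB uses), and the reference product is a naive triple loop rather than a call into BLASFEO or MKL.

```c
#include <assert.h>
#include <stddef.h>

/* Naive reference: D = alpha * A * B + beta * C, all column-major.
 * A is m x k (leading dimension lda), B is k x n (ldb),
 * C and D are m x n (ldc, ldd). Hypothetical helper for validation
 * only; it is not tuned for speed. */
void ref_dgemm(int m, int n, int k, double alpha,
               const double *A, int lda,
               const double *B, int ldb,
               double beta, const double *C, int ldc,
               double *D, int ldd)
{
    assert(m >= 0 && n >= 0 && k >= 0);
    for (int j = 0; j < n; j++) {
        for (int i = 0; i < m; i++) {
            double acc = 0.0;
            for (int p = 0; p < k; p++)
                acc += A[i + (size_t)p * lda] * B[p + (size_t)j * ldb];
            D[i + (size_t)j * ldd] =
                alpha * acc + beta * C[i + (size_t)j * ldc];
        }
    }
}

/* Largest absolute element-wise difference between two m x n
 * column-major matrices X and Y. */
double max_abs_diff(int m, int n,
                    const double *X, int ldx,
                    const double *Y, int ldy)
{
    assert(m >= 0 && n >= 0);
    double worst = 0.0;
    for (int j = 0; j < n; j++) {
        for (int i = 0; i < m; i++) {
            double d = X[i + (size_t)j * ldx] - Y[i + (size_t)j * ldy];
            if (d < 0.0)
                d = -d;
            if (d > worst)
                worst = d;
        }
    }
    return worst;
}
```

The idea is to fill D once with the reference and once with the routine under test, then compare the two with `max_abs_diff`; a value close to machine precision times the matrix size would count as "almost nothing".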

Then I did a run time analysis:

Figure0001

Now, the BLASFEO MEX works in place (namely, it receives a preallocated matrix to write the result into), while MATLAB has to use its regular API (it allocates the output and has overhead on the input).
Yet MATLAB is still much faster than BLASFEO compiled with the AVX2 code path.

MATLAB does use multithreading (I don't know the size threshold, but I can see it in the CPU utilization graph). But even for very small matrices (sizes 2:10) MATLAB beats BLASFEO.
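For matrices this small a single call is too short to time reliably, so one common approach is to time a batch of back-to-back calls and keep the best batch. The following is a minimal sketch of that idea, not the code used for the plot above: `kernel_fn`, `best_time_per_call`, `gemm2_ctx` and `gemm2_kernel` are hypothetical names, and the 2 x 2 kernel only stands in for the routine under test.

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical callback type: one call of the routine under test. */
typedef void (*kernel_fn)(void *ctx);

/* Best (minimum) average time per call, in seconds, over `repeats`
 * batches of `calls_per_batch` back-to-back calls each.
 * clock() is standard C but reports processor time on most platforms,
 * so this suits a single-threaded kernel; a multi-threaded library
 * would need a wall-clock timer instead. */
double best_time_per_call(kernel_fn fn, void *ctx,
                          int repeats, int calls_per_batch)
{
    assert(fn != NULL && repeats > 0 && calls_per_batch > 0);
    double best = -1.0;
    for (int r = 0; r < repeats; r++) {
        clock_t t0 = clock();
        for (int c = 0; c < calls_per_batch; c++)
            fn(ctx);
        clock_t t1 = clock();
        double per_call =
            (double)(t1 - t0) / CLOCKS_PER_SEC / calls_per_batch;
        if (best < 0.0 || per_call < best)
            best = per_call;
    }
    return best;
}

/* Stand-in workload: an in-place 2 x 2 column-major product D = A * B
 * written into a preallocated D, plus a call counter. */
struct gemm2_ctx {
    double A[4], B[4], D[4];
    long calls;
};

void gemm2_kernel(void *ctx)
{
    struct gemm2_ctx *g = ctx;
    for (int j = 0; j < 2; j++)
        for (int i = 0; i < 2; i++)
            g->D[i + 2 * j] = g->A[i] * g->B[2 * j]
                            + g->A[i + 2] * g->B[1 + 2 * j];
    g->calls++;
}
```

Calling `best_time_per_call(gemm2_kernel, &ctx, repeats, calls_per_batch)` with a large `calls_per_batch` amortizes the timer resolution; taking the minimum over batches reduces the influence of one-off disturbances such as cold caches.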

I think that in order to validate results we need to use the Multi Threaded version of MKL in benchmarks.

The MATLAB analysis file is attached: RunTimeAnalysis.zip.

@RoyiAvital
Author

RoyiAvital commented May 7, 2020

@giaf, any thoughts on this?
Could you run your benchmarks on Linux with the multi-threaded version of MKL?

@giaf
Owner

giaf commented May 13, 2020

Hi @RoyiAvital
It's difficult for me to judge without playing around with it myself. There are a lot of factors affecting performance, and MEX also introduces some overhead; it doesn't come for free.
Could you also upload the source code for the MEX wrapper?

On top of that, MKL is not standing still: they have a large team and, according to the MKL release notes, they have been improving their small-scale performance too.
Here, meanwhile, manpower is very limited and divided among multiple projects :p
I think the benchmarks on blasfeo.syscop.de have not been updated since the BLASFEO BLAS API paper was initially submitted, so I'll try to find some time to repeat them.

As for a comparison with multi-threaded MKL, it's not an apples-to-apples comparison, as MKL's performance would scale with the number of available cores (and therefore with the CPU model), while BLASFEO's would not.
I eventually want to add multi-threading capabilities to BLASFEO too; at that point such a comparison would make sense, in my opinion.

But as of now, BLASFEO is designed with the aim of providing fast single-threaded routines for matrices that fit in cache.
There are multiple applications where this is needed, for example in the PLASMA project.
In particular (for applications in our group), it provides the linear algebra framework for implementing embedded optimization algorithms in optimal control software such as HPIPM and acados.
This is the main aim of the development.

@RoyiAvital
Author

RoyiAvital commented May 13, 2020

@giaf,
I know the idea is to have single-threaded performance.
Yet I think comparing against multi-threaded MKL will give a break point showing where one should use BLASFEO and where MKL.
My suggestion to add those benchmarks wasn't meant to show BLASFEO in a negative light. On the contrary, it will probably show how close you get even with a single thread.
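Given timings of both libraries over the same increasing list of matrix sizes, such a break point could be located with something like the sketch below. This is only an illustration: `find_break_point` is a hypothetical helper, and the two timing arrays are assumed to come from separate benchmark runs that are not shown here.

```c
#include <assert.h>
#include <stddef.h>

/* t_single[i] and t_multi[i] are run times of the single-threaded and
 * the multi-threaded library for the i-th matrix size, with sizes in
 * increasing order. Returns the smallest index from which the
 * multi-threaded library is strictly faster for every remaining size,
 * or `count` if it is not faster at the largest size. */
int find_break_point(const double *t_single, const double *t_multi,
                     int count)
{
    assert(count >= 0);
    assert(count == 0 || (t_single != NULL && t_multi != NULL));
    int i = count;
    /* Walk back from the largest size while multi-threaded still wins. */
    while (i > 0 && t_multi[i - 1] < t_single[i - 1])
        i--;
    return i;
}
```

Scanning from the largest size backwards means a single noisy win for the multi-threaded library at a tiny size is not mistaken for the break point.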

By the way, MEX does have overhead. But remember that MATLAB also calls MKL through its general-purpose DLL, which has at least the same overhead (probably more). MATLAB also uses MKL's CNR (Conditional Numerical Reproducibility) mode, which has lower performance (but gives reproducible results).

This is the MEX file (change the extension to .c):

BLASFEODGEMM.txt

Note that it is not fully working, and it does no input validation so that pure performance can be compared (it works for square matrices).
