Tracker issue for BLIS support in NumPy #7372
I see that BLIS has been included to
The description at https://www.cs.utexas.edu/%7Eflame/web/libFLAME.html may need fixing, but from that it's not clear to me that libFLAME actually implements the LAPACK API.
@rgommers libflame does indeed provide netlib LAPACK APIs, and implementations for everything that libflame does not provide natively.
FWIW, BLIS has made significant strides since this issue was opened. For example, BLIS now implements runtime configuration, and its configure-time configuration has been reimplemented in terms of the runtime infrastructure. BLIS now offers integrated BLAS test drivers in addition to its more comprehensive testsuite. Library self-initialization is also in place, as is monolithic header generation (a single blis.h instead of 500+ development headers), which makes management of the installed product easier. It also follows a more standardized library naming convention for its static and shared library builds, includes an
@fgvanzee thanks. In that case I'm +1 to add support for it in
I'm really short on time at the moment, so cannot work on this probably till September at least. If someone else wants to tackle this, it should be relatively straightforward (along the same lines as gh-7294). Happy to help troubleshoot / review.
That sounds like major progress. How far would you say are you from being a viable alternative to OpenBLAS for numpy/scipy (performance & stability wise)? (Note that we still don't have a scipy-openblas on Windows for
If you want an MSVC ABI compatible BLAS and LAPACK, there still aren't any easy, entirely open source options. Though nowadays with clang-cl and flang existing, the problem isn't compiler availability like it used to be; now it's build system flexibility and trying to use combinations that library authors have never evaluated or supported before. Ref flame/blis#57 (comment)
@rgommers I would say that BLIS is presently a quite viable alternative to OpenBLAS. It is viable enough that AMD has abandoned ACML and fully embraced BLIS as the foundation of their new open-source math library solution. (We have corporate sponsors, and have been sponsored by the National Science Foundation for many years in the past.) Performance-wise, the exact characterization will depend on the operation you're looking at, the floating-point datatype, the hardware, and the problem size range you're interested in. However, generally speaking, BLIS typically meets or exceeds OpenBLAS's level-3 performance for all but the smallest problem sizes (less than 300 or so). It also employs a more flexible level-3 parallelization strategy than OpenBLAS is capable of (due to their monolithic assembly kernel design). Stability-wise, I would like to think that BLIS is quite stable. We try to be very responsive to bug reports, and thanks to a surge of interest from the community over the last five or so months we've been able to identify and fix a lot of issues (mostly build system related). This has smoothed out the user experience for end-users as well as package managers. Also, keep in mind that BLIS has provided (from day zero) a superset of BLAS-like functionality, and done so via two novel APIs separate and apart from the BLAS compatibility layer:
This not only supports legacy users who already have software that needs BLAS linkage, but provides a great toolbox for those interested in building custom dense linear algebra solutions from scratch--people who may not feel any particular affinity towards the BLAS interface. Born out of a frustration with the various shortcomings in both existing BLAS implementations as well as the BLAS API itself, BLIS has been my labor of love since 2012. It's not going anywhere, and will only get better. :)
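To make the "custom dense linear algebra from scratch" point concrete: BLIS's native interfaces accept an independent row stride and column stride for each matrix, so row-major, column-major, and general strided layouts are all handled uniformly. The following is a pure-Python illustration of that rs/cs addressing scheme over a flat buffer; it is a reference sketch, not BLIS code, and the function name is mine:

```python
# Pure-Python sketch (not BLIS itself) of the rs/cs element-stride
# addressing used by BLIS's native APIs: element (i, j) of a matrix
# stored in flat buffer `buf` lives at buf[i*rs + j*cs].

def gemm_strided(m, n, k, alpha, a, rs_a, cs_a, b, rs_b, cs_b, beta, c, rs_c, cs_c):
    """C := beta*C + alpha*A*B, with A, B, C as flat lists plus strides."""
    for i in range(m):
        for j in range(n):
            acc = 0.0
            for p in range(k):
                acc += a[i*rs_a + p*cs_a] * b[p*rs_b + j*cs_b]
            c[i*rs_c + j*cs_c] = beta * c[i*rs_c + j*cs_c] + alpha * acc

# A 2x2 example: row-major A (rs=2, cs=1) times column-major B (rs=1, cs=2).
A = [1.0, 2.0,
     3.0, 4.0]          # [[1, 2], [3, 4]] in row-major order
B = [5.0, 6.0,
     7.0, 8.0]          # [[5, 7], [6, 8]] in column-major order
C = [0.0] * 4
gemm_strided(2, 2, 2, 1.0, A, 2, 1, B, 1, 2, 0.0, C, 2, 1)
# C (row-major) is now [17.0, 23.0, 39.0, 53.0], i.e. [[17, 23], [39, 53]]
```

Because the strides are explicit parameters rather than baked-in layout assumptions, the same routine also works on transposed views or on submatrices of a larger buffer, which is the flexibility the BLAS interface lacks.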
@rgommers Re: libflame: Thanks for your interest. Just be aware that libflame could use some TLC; it's not in quite as good of shape as BLIS. (We don't have the time/resources to support it as we would like, and almost 100% of our attention over the last six years has been focused on getting BLIS to a place where it could become a viable and competitive alternative to OpenBLAS et al.) At some point, once BLIS matures and our research avenues have been exhausted, we will likely turn our attention back to libflame/LAPACK-level functionality (Cholesky, LU, QR factorizations, for example). This may take the form of incrementally adding those implementations to BLIS, or it may involve an entirely new project to eventually replace libflame. If it is the latter, it will be designed to take advantage of lower-level APIs in BLIS, thus avoiding some function call and memory copy overhead that is currently unavoidable via the BLAS. This is just one of many topics we look forward to investigating. |
I've run the benchmark from this article with NumPy 1.15 and BLIS 0.3.2 on an Intel Skylake without multithreading (I had a hardware instruction error with HT):
Intel MKL 2018.3:
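For anyone wanting to reproduce measurements like these, a simple matrix-multiply timing harness can look like the following. This is a hedged sketch, not the linked article's actual script, and `bench_gemm` is my own name:

```python
import time

def bench_gemm(matmul, n, reps=3):
    """Time matmul(a, b) on two n x n operands and return the best
    GFLOP/s over `reps` runs (an n x n product costs 2*n**3 flops)."""
    a = [[1.0] * n for _ in range(n)]
    b = [[1.0] * n for _ in range(n)]
    best = float("inf")
    for _ in range(reps):
        t0 = time.perf_counter()
        matmul(a, b)
        best = min(best, time.perf_counter() - t0)
    return 2.0 * n**3 / best / 1e9

# With NumPy one would pass something like
#   lambda a, b: numpy.dot(numpy.asarray(a), numpy.asarray(b))
# and sweep n over the problem sizes of interest.
```

Taking the best of several repetitions reduces the influence of warm-up effects (thread pool spin-up, cache population) on the reported rate.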
@homocomputeris Sorry, I've never heard the "dot" verb used to describe an operation on two matrices before. Is that a matrix multiplication?
@fgvanzee What is the status of Windows support in BLIS these days? I remember getting it to build on Windows used to be mostly unsupported...
@fgvanzee and yeah,
@njsmith The status of "native" Windows support is mostly unchanged. We still lack the expertise and interest in making such support happen, unfortunately. However, since Windows 10 was released, it seems there is an "Ubuntu for Windows" or bash compatibility environment of some sort available. That is probably a much more promising avenue to achieve Windows support. (But again, nobody in our group develops on or uses Windows, so we haven't even looked into that option, either.)
Ok one last post for now... @homocomputeris for a benchmark like this it really helps to show some well known library too, like OpenBLAS, because otherwise we have no idea how fast your hardware is. @fgvanzee Speaking of the native strided support, what restrictions on the strides do you have these days? Do they have to be aligned, positive, non-negative, exact multiples of the data size, ...? (As you may remember, numpy arrays allow for totally arbitrary strides measured in bytes.)
@fgvanzee "bash for Windows" is effectively equivalent to running a Linux VM on Windows – a particularly fast and seamless VM, but it's not a native environment. So the good news is that you already support bash for Windows :-), but the bad news is that it's not a substitute for native Windows support.
@njsmith My results are more or less the same as in the article.
I want to note that I have no idea how to compile BLIS so that it uses everything my CPU offers, multithreading included, while MKL works more or less out of the box.
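For what it's worth, a source build that auto-detects the microarchitecture and enables multithreading looks roughly like this. This is a sketch based on BLIS's documented configure options; exact flag spellings may vary between versions, so verify against the build documentation of the release you are using:

```shell
# Hedged sketch: let BLIS pick the subconfiguration for the local CPU
# ("auto") and enable OpenMP-based multithreading at configure time.
./configure --enable-threading=openmp auto
make -j
make install

# At run time, the number of threads can then be set via environment
# variables, e.g.:
export BLIS_NUM_THREADS=2
```

Without a threading model selected at configure time, the library runs single-threaded regardless of any environment variables set later.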
@njsmith Thanks for that update. I agree that nothing beats native OS support. I also agree that we need to see the benchmark run with other libraries for us to properly interpret @homocomputeris's timings.
@njsmith Aligned? No. Positive? I think we lifted that constraint, but it's not been thoroughly tested. Exact multiples of the datatype? Yes, still. I'm bringing @devinamatthews into the discussion. Months ago I told him about your request for byte strides, and he had some good points/questions at the time that I can't quite remember. Devin, can you recall your concerns about this, and if so, articulate them to Nathaniel? Thanks.
@homocomputeris Would you mind rerunning the benchmark with a different value for
@homocomputeris And did you say that the BLIS results are single-threaded, while the MKL results are multi-threaded?
FWIW, I looked into building BLIS on Windows some time ago. The primary pain point at the moment is the build system. It might be possible to get mingw's make to use clang to produce an MSVC compatible binary. I never got that running with the time I was able to spend on it, but it seems possible. Within the actual source code, the situation isn't too bad. Recently they even transitioned to using macros for their assembly kernels, so that's one more barrier to Windows support eliminated. See flame/blis#220 (comment) and flame/blis#224. It seems like the source files themselves are a few more macros/ifdefs away from building on MSVC, but that's my perspective as an outsider. I also have no idea how to get the existing BLIS makefiles to work with MSVC.
@insertinterestingnamehere Thanks for chiming in, Ian. You're right that the re-macroized assembly kernels are one step closer to being MSVC-friendly. However, as you point out, our build system was definitely not designed with Windows support in mind. Furthermore, does MSVC support C99 yet? If not, that's another hurdle. (BLIS requires C99.)
Well, I gave the example above only to show that BLIS is comparable to others, that's why I haven't included anything more specific. But as you ask 😃
Intel MKL 2018.3 limited to 2 threads (that is, my physical CPU cores):
BLIS 0.3.2 compiled with
So, it seems that BLIS definitely should be supported by NumPy, at least on Unix/POSIX/whatever-like systems, as I imagine the Windows use case as 'don't touch it if it works'.

@fgvanzee Why is a power of 2 bad for BLIS? It's quite common for collocation methods if one wants the fastest FFT.
For numpy et al., it would be sufficient to manage the building in mingw/MSYS2 --- that's what we do currently with openblas on Windows (although this is sort of a hack in itself). It will limit the use to "traditional" APIs that don't involve passing CRT resources across, but that's fine for BLAS/LAPACK.
@fgvanzee good point about C99. WRT the C99 preprocessor, surprisingly, even the MSVC 2017 preprocessor isn't fully caught up there. Supposedly they are currently fixing that (https://docs.microsoft.com/en-us/cpp/visual-cpp-language-conformance#note_D).
@fgvanzee @njsmith Here is what we would need to do to support arbitrary byte strides:
This is just for the input and output matrices. If
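To illustrate the mismatch under discussion: NumPy measures strides in bytes and allows arbitrary values, while BLIS expects strides in whole elements. Any NumPy-to-BLIS shim would have to convert between the two and reject arrays whose byte strides are not element multiples. A minimal sketch (the function name is mine):

```python
def to_element_strides(byte_strides, itemsize):
    """Convert NumPy-style byte strides to BLIS-style element strides,
    failing when a byte stride is not a multiple of the item size."""
    elem = []
    for s in byte_strides:
        q, r = divmod(s, itemsize)
        if r != 0:
            raise ValueError("stride %d is not a multiple of itemsize %d"
                             % (s, itemsize))
        elem.append(q)
    return tuple(elem)

# A C-contiguous float64 array with 8 columns has byte strides (64, 8):
to_element_strides((64, 8), 8)   # -> (8, 1), i.e. (rs, cs)
```

Note that `divmod` also handles negative strides (reversed views) cleanly; the genuinely unrepresentable cases are the misaligned ones, which a shim would have to handle by copying into a temporary buffer.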
Good to know. If that is the case, then BLIS is that much closer to being ready to use by numpy. (I would just like to make these in-progress modifications first, which also happen to fix a previously unnoticed race condition.)
Probably better not to depend on the environment variables being rechecked on every call, because

In the longer run, I think it would make sense to start exposing some beyond-bare-BLAS APIs in numpy. One of the things that makes BLIS attractive in the first place is exactly that it provides features that other BLAS libraries don't, like the ability to multiply strided matrices, and there's work afoot to extend the BLAS APIs in a number of ways. We wouldn't want to hard-code library-specific details in numpy's API (e.g., we wouldn't want
One thing I haven't seen mentioned is the index precision. Most system-supplied libraries seem to use 32-bit integers, which is a limitation for some applications these days. At some point it would be good if all the indexes were 64 bits, which probably requires that we supply the library. I don't know what we are currently doing in regard to index size. @matthew-brett Are we still compiling with 32-bit integers?
@charris The integer size in BLIS is configurable at configure-time: 32 or 64 bits. Furthermore, you can configure the integer size used in the BLAS API independently from the internal integer size.
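A hedged sketch of what such a configure invocation might look like, with flag names as I understand them from BLIS's build documentation (verify against the version you are building):

```shell
# Hedged sketch of the two independent integer-size knobs mentioned
# above: internal (typed-API) integers vs. BLAS-interface integers.
./configure -i 64 -b 32 auto   # 64-bit internal ints, 32-bit BLAS ints
```

Decoupling the two sizes lets a single build serve legacy 32-bit-integer BLAS callers while still indexing large matrices internally with 64-bit integers.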
@njsmith I've fixed the race condition I mentioned previously and also implemented the thread-safe, per-call multithreading API. Please point your friend towards fa08e5e (or any descendant of that commit). The multithreading documentation has been updated as well and walks the reader through his choices, with basic examples given. The commit is on the

EDIT: Links updated to reflect minor fix commit.
As a possible addition to supported types, what about
@charris We are in the early stages of considering support for
@charris According to https://en.wikipedia.org/wiki/Long_double,
From a performance perspective, I don't see any upside to float80 (i.e. x87 long double) because there isn't a SIMD version. If one can write a SIMD version of

The float128 implementation in software is at least an order of magnitude slower than float64 in hardware. It would be prudent to write a new implementation of float128 that skips all the FPE handling and is amenable to SIMD. The implementation in libquadmath, while correct, isn't worth the attention of a high-performance BLAS implementation like BLIS.
Yep, it's a problem. I don't think extended precision is worth the effort, and the need for quad precision is spotty; double is good for most things, but when you need it, you need it. I'm not worried about performance: the need isn't for speed but for precision. Note that we just extended support to an ARM64 with quad-precision long double, a software implementation of course, but I expect hardware to follow at some point and it might be nice to have something tested and ready to go.
The BLAS G2 proposal has some consideration of double-double and "reproducible" computations in BLAS. (Reproducible here means deterministic across implementations, but IIUC also involves using higher-precision intermediate values.)
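For context on the double-double idea: it represents one value as an unevaluated sum of two float64s, and the basic building block is an error-free transformation such as Knuth's two-sum, which recovers the rounding error of a floating-point addition exactly. A minimal sketch:

```python
def two_sum(a, b):
    """Knuth's error-free addition: returns (s, e) with s = fl(a + b)
    and a + b == s + e exactly (barring overflow)."""
    s = a + b
    a_virtual = s - b          # the portion of s contributed by a
    b_virtual = s - a_virtual  # the portion of s contributed by b
    e = (a - a_virtual) + (b - b_virtual)
    return s, e

s, e = two_sum(1.0, 1e-16)
# 1e-16 is below half an ulp of 1.0, so it vanishes from s = 1.0,
# but two_sum recovers it exactly in e.
```

Accumulating the `e` terms in a second float64 is what gives double-double its roughly 32 significant decimal digits, at the cost of several flops per logical operation.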
Excited to see this moving forward! For the record, I'm the one @njsmith was referring to who's been interested in controlling the threading from software. My workloads are embarrassingly parallel at prediction time, and my matrix multiplications are relatively small, so I'd rather parallelise larger units of work.

I did some work about a year ago on packaging Blis for PyPI, and adding Cython bindings: https://github.com/explosion/cython-blis

I found Blis quite easy to package as a C extension like this. The main stumbling block for me was Windows support. From memory, it was C99 issues, but I might be remembering wrongly. The Cython interface I've added might be of interest. In particular, I'm using Cython's fused types so that there's a single

I would be very happy to maintain a stand-alone Blis package, keep the wheels built, maintain a nice Cython interface, etc. I think it would be very nice to have it as a separate package, rather than something integrated within numpy. We could then expose more of Blis's API, without being limited by what other BLAS libraries support.
@honnibal Sorry for the delay in responding on this thread, Matthew. Thanks for your message. We're always happy to see others get excited about BLIS. Of course, we would be happy to advise whenever needed if you decided to further integrate into the python ecosystem (an application, a library, module, etc.). As for Windows support, please check out the clang/appveyor support for the Windows ABI that @isuruf recently added. Last I heard from him, it was working as expected, but we don't do any development on Windows here at UT so I can't keep tabs on this myself. (Though Isuru pointed out to me once that I could sign up for appveyor in a manner similar to Travis CI.) Also, please let me know if you have any questions about the per-call threading usage. (I've updated our Multithreading documentation to cover this topic.)
As of BLIS 0.5.2, we have a Performance document that showcases single-threaded and multithreaded performance of BLIS and other implementations of BLAS for a representative set of datatypes and level-3 operations on a variety of many-core architectures, including Marvell ThunderX2, Intel Skylake-X, Intel Haswell, and AMD Epyc. So if the numpy community is wondering how BLIS stacks up against the other leading BLAS solutions, I invite you to take a quick peek!
Very interesting, thanks @fgvanzee. I had to look up Epyc - it seems that it is a brand name based on the Zen (possibly updated to Zen+ at some point?) architecture. Perhaps better to rename it to Zen? For our user base Ryzen/Threadripper are the more interesting brands; they may recognize Zen but probably not Epyc.
Epyc is the name of the AMD server line. It is the successor to the AMD Opteron products of the past. There is, unfortunately, no unique way for BLIS to label its architectural targets, because the code depends on the vector ISA (e.g. AVX2), the CPU core microarchitecture (e.g. Ice Lake) and the SOC/platform integration (e.g. Intel Xeon Platinum processor). BLIS uses microarchitecture code names in some cases (e.g. Dunnington) but that isn't better for everyone. @fgvanzee You might consider adding the aliases that correspond to the GCC march/mtune/mcpu names...
@rgommers The subconfiguration within BLIS that covers Ryzen and Epyc is actually already named

As for whether Ryzen/Threadripper or Epyc are more interesting brands (even to numpy users), I'll say this: if I could only benchmark one AMD Zen system, it would be the highest-end Epyc, because: (a) it uses a similar microarchitecture to that of Ryzen; (b) it gives me the maximum 64 physical cores (and, as a bonus, those cores are arranged in a somewhat novel, NUMA-like configuration); which (c) places maximal stress on BLIS and the other implementations. And that is basically what we did here. Now, thankfully, there is no rule saying I can only benchmark one Zen system. :) However, there are other hurdles, particularly with regards to gaining access in the first place. I don't have access to any Ryzen/Threadripper systems at the moment. If/when I do gain access, I'll be happy to repeat the experiments and publish the results accordingly.

Jeff points out some of the naming pitfalls we face. Generally, we name our subconfigurations and kernel sets in terms of microarchitecture, but there is more nuance yet. For example, we use our

@jeffhammond Thanks for your suggestion. I've considered adding those aliases in the past. However, I'm not convinced it's worth it. It will add significant clutter to the configuration registry, and the people who will be looking at it in the first place likely already know about our naming scheme for subconfigurations and kernel sets, and thus won't be confused by the absence of certain microarchitectural revision names in that file (or in the
Yes, naming is always complicated :) Thanks for the answers @fgvanzee and @jeffhammond
The discussion got a bit carried away; what are the remaining issues that need to be resolved to officially support BLIS in numpy? Naively, I tried to run the numpy tests with BLIS from conda-forge (cf #14180 (comment)) and for me, on Linux, all tests passed (but maybe I missed something). I also tried to run the scipy test suite in the same env, and there are a number of failures in
Note that BLIS from conda-forge uses Reference LAPACK (netlib) as the LAPACK implementation, which uses BLIS as the BLAS implementation, and not
About the BLIS option on conda-forge, am I right that it's single-threaded out of the box (unlike the OpenBLAS option)?
Probably better to move conda-forge discussions to conda-forge 🙂
We have had BLIS support in
There's some discussion in this issue about control of threading - we decided not to deal with that in

I don't think there's anything left to do here. BLIS status is good. So I think we can close this issue and declare victory here.
Thanks everyone for the long and interesting conversation.
Here's a general tracker issue for discussion of BLIS support in NumPy. So far we've had some discussions of this in two nominally unrelated threads:
So this is a new issue to consolidate future discussions like this :-)
Some currently outstanding issues to highlight:
CC: @tkelman @matthew-brett @fgvanzee