
Tracker issue for BLIS support in NumPy #7372

Closed
njsmith opened this issue Mar 2, 2016 · 98 comments

@njsmith
Member

njsmith commented Mar 2, 2016

Here's a general tracker issue for discussion of BLIS support in NumPy. So far we've had some discussions of this in two nominally unrelated threads:

So this is a new issue to consolidate future discussions like this :-)

Some currently outstanding issues to highlight:

CC: @tkelman @matthew-brett @fgvanzee

@homocomputeris

I see that BLIS has been included in site.cfg.
Can libFLAME also be supported?
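For reference, pointing numpy at BLIS is a matter of a site.cfg entry roughly like the following (paths are illustrative only; the exact section and key names are the ones in numpy's site.cfg.example):

[blis]
libraries = blis
library_dirs = /usr/local/lib
include_dirs = /usr/local/include/blis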

@rgommers
Member

rgommers commented Jul 2, 2018

The description at https://www.cs.utexas.edu/%7Eflame/web/libFLAME.html may need fixing, but from that it's not clear to me that libFLAME actually implements the LAPACK API.

@fgvanzee

fgvanzee commented Jul 2, 2018

@rgommers libflame does indeed provide the netlib LAPACK APIs, plus implementations for everything that libflame does not provide natively.

@fgvanzee

fgvanzee commented Jul 2, 2018

FWIW, BLIS has made significant strides since this issue was opened. For example, BLIS now implements runtime configuration, and its configure-time configuration has been reimplemented in terms of the runtime infrastructure. BLIS now offers integrated BLAS test drivers in addition to its more comprehensive testsuite. Library self-initialization is also in place, as is monolithic header generation (single blis.h instead of 500+ development headers), which makes management of the installed product easier. It also follows a more standardized library naming convention for its static and shared library builds, and includes an soname. Finally, its build system is a lot smarter vis-a-vis checking for compiler and assembler compatibility. And that's just what I can think of off the top of my head.

@rgommers
Member

rgommers commented Jul 2, 2018

@fgvanzee thanks. In that case I'm +1 to add support for it in numpy.distutils.

I'm really short on time at the moment, so cannot work on this probably till September at least. If someone else wants to tackle this, it should be relatively straightforward (along the same lines as gh-7294). Happy to help troubleshoot / review.

@rgommers
Member

rgommers commented Jul 2, 2018

That sounds like major progress. How far would you say you are from being a viable alternative to OpenBLAS for numpy/scipy (performance & stability wise)?

(Note that we still don't have a scipy-openblas on Windows for conda due to the Fortran mess: https://github.com/conda-forge/scipy-feedstock/blob/master/recipe/meta.yaml#L14)

@tkelman
Contributor

tkelman commented Jul 2, 2018

If you want a MSVC ABI compatible BLAS and LAPACK, there still aren't any easy, entirely open source options. Though nowadays with clang-cl and flang existing, the problem isn't compiler availability like it used to be, now it's build system flexibility and trying to use combinations that library authors have never evaluated or supported before. Ref flame/blis#57 (comment)

@fgvanzee

fgvanzee commented Jul 2, 2018

@rgommers I would say that BLIS is presently a quite viable alternative to OpenBLAS. It is viable enough that AMD has abandoned ACML and fully embraced BLIS as the foundation of their new open-source math library solution. (We have corporate sponsors, and have been sponsored by the National Science Foundation for many years in the past.)

Performance-wise, the exact characterization will depend on the operation you're looking at, the floating-point datatype, the hardware, and the problem size range you're interested in. However, generally speaking, BLIS typically meets or exceeds OpenBLAS's level-3 performance for all but the smallest problem sizes (less than 300 or so). It also employs a more flexible level-3 parallelization strategy than OpenBLAS is capable of (due to their monolithic assembly kernel design).

Stability-wise, I would like to think that BLIS is quite stable. We try to be very responsive to bug reports, and thanks to a surge of interest from the community over the last five or so months we've been able to identify and fix a lot of issues (mostly build system related). This has smoothed out the user experience for end-users as well as package managers.

Also, keep in mind that BLIS has provided (from day zero) a superset of BLAS-like functionality, and done so via two novel APIs separate and apart from the BLAS compatibility layer:

  • an explicitly typed BLAS-like API
  • an implicitly typed object-based API

This not only supports legacy users who already have software that needs BLAS linkage, but provides a great toolbox for those interested in building custom dense linear algebra solutions from scratch--people who may not feel any particular affinity towards the BLAS interface.

Born out of a frustration with the various shortcomings in both existing BLAS implementations as well as the BLAS API itself, BLIS has been my labor of love since 2012. It's not going anywhere, and will only get better. :)

@rgommers
Member

rgommers commented Jul 2, 2018

Ah, thanks @tkelman. Cygwin :( :(

It would be interesting to hear some experiences from people using numpy compiled against BLIS on Linux/macOS though.

Thanks for the context @fgvanzee. Will be interesting for us to add libFLAME support and try on the SciPy benchmark suite.

@fgvanzee

fgvanzee commented Jul 2, 2018

@rgommers Re: libflame: Thanks for your interest. Just be aware that libflame could use some TLC; it's not in quite as good shape as BLIS. (We don't have the time/resources to support it as we would like, and almost 100% of our attention over the last six years has been focused on getting BLIS to a place where it could become a viable and competitive alternative to OpenBLAS et al.)

At some point, once BLIS matures and our research avenues have been exhausted, we will likely turn our attention back to libflame/LAPACK-level functionality (Cholesky, LU, QR factorizations, for example). This may take the form of incrementally adding those implementations to BLIS, or it may involve an entirely new project to eventually replace libflame. If it is the latter, it will be designed to take advantage of lower-level APIs in BLIS, thus avoiding some function call and memory copy overhead that is currently unavoidable via the BLAS. This is just one of many topics we look forward to investigating.

@homocomputeris

homocomputeris commented Jul 2, 2018

I've run the benchmark from this article with NumPy 1.15 and BLIS 0.3.2 on an Intel Skylake without multithreading (I had a hardware instruction error with HT):

Dotted two 4096x4096 matrices in 4.29 s.
Dotted two vectors of length 524288 in 0.39 ms.
SVD of a 2048x1024 matrix in 13.60 s.
Cholesky decomposition of a 2048x2048 matrix in 2.21 s.
Eigendecomposition of a 2048x2048 matrix in 67.65 s.

Intel MKL 2018.3:

Dotted two 4096x4096 matrices in 2.09 s.
Dotted two vectors of length 524288 in 0.23 ms.
SVD of a 2048x1024 matrix in 1.11 s.
Cholesky decomposition of a 2048x2048 matrix in 0.19 s.
Eigendecomposition of a 2048x2048 matrix in 7.83 s.
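For reference, the numbers above come from a script roughly along these lines (a sketch of the article's benchmark, not its exact code; size = 4096 here):

import numpy as np
from time import time

n = 4096                                  # size used in the article (a power of two)
A = np.random.random((n, n))
B = np.random.random((n, n))
v = np.random.random(n * 128)             # length 524288

t = time(); np.dot(A, B)
print("Dotted two %dx%d matrices in %0.2f s." % (n, n, time() - t))

t = time(); np.dot(v, v)
print("Dotted two vectors of length %d in %0.2f ms." % (v.size, 1e3 * (time() - t)))

C = np.random.random((n // 2, n // 4))
t = time(); np.linalg.svd(C, full_matrices=False)
print("SVD of a %dx%d matrix in %0.2f s." % (n // 2, n // 4, time() - t))

D = np.dot(A[:n // 2, :n // 2], A[:n // 2, :n // 2].T) + n * np.eye(n // 2)
t = time(); np.linalg.cholesky(D)
print("Cholesky decomposition of a %dx%d matrix in %0.2f s." % (n // 2, n // 2, time() - t))

t = time(); np.linalg.eig(np.random.random((n // 2, n // 2)))
print("Eigendecomposition of a %dx%d matrix in %0.2f s." % (n // 2, n // 2, time() - t))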

@fgvanzee

fgvanzee commented Jul 2, 2018

Dotted two 4096x4096 matrices in 4.29 s.

@homocomputeris Sorry, I've never heard the "dot" verb used to describe an operation on two matrices before. Is that a matrix multiplication?

@njsmith
Member Author

njsmith commented Jul 2, 2018

@fgvanzee What is the status of Windows support in BLIS these days? I remember getting it to build on Windows used to be mostly unsupported...

@njsmith
Member Author

njsmith commented Jul 2, 2018

@fgvanzee and yeah, numpy.dot is the traditional way of calling GEMM in Python. (Sort of an odd name, but it's because it handles vector-vector, vector-matrix, matrix-matrix all in the same API.)
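For concreteness, a minimal sketch of how the one function covers all three cases (the 2-D/2-D case is the one that ends up in the BLAS gemm routine):

import numpy as np

x = np.random.random(3)
A = np.random.random((3, 3))
B = np.random.random((3, 3))

np.dot(x, x)    # vector-vector: inner product (BLAS dot)
np.dot(A, x)    # matrix-vector: BLAS gemv
np.dot(A, B)    # matrix-matrix: BLAS gemm
A @ B           # since Python 3.5, the @ operator maps to np.matmul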

@fgvanzee

fgvanzee commented Jul 2, 2018

@njsmith The status of "native" Windows support is mostly unchanged. We still lack the expertise and interest in making such support happen, unfortunately. However, since Windows 10 was released, it seems there is an "Ubuntu for Windows" or bash compatibility environment of some sort available. That is probably a much more promising avenue to achieve Windows support. (But again, nobody in our group develops on or uses Windows, so we haven't even looked into that option, either.)

@njsmith
Member Author

njsmith commented Jul 2, 2018

Ok one last post for now...

@homocomputeris for a benchmark like this it really helps to show some well known library too, like OpenBLAS, because otherwise we have no idea how fast your hardware is.

@fgvanzee Speaking of the native strided support, what restrictions on the strides do you have these days? Do they have to be aligned, positive, non-negative, exact multiples of the data size, ...? (As you may remember, numpy arrays allow for totally arbitrary strides measured in bytes.)
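For concreteness, a small sketch of what "totally arbitrary strides measured in bytes" means on the numpy side:

import numpy as np

a = np.arange(24, dtype=np.float64).reshape(4, 6)
print(a.strides)             # (48, 8): bytes to step one row / one column
print(a.T.strides)           # (8, 48): transposing just swaps the strides
print(a[::-2, ::3].strides)  # (-96, 24): slicing gives negative and non-unit strides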

@njsmith
Member Author

njsmith commented Jul 2, 2018

@fgvanzee "bash for Windows" is effectively equivalent to running a Linux VM on Windows – a particularly fast and seamless VM, but it's not a native environment. So the good news is that you already support bash for Windows :-), but the bad news is that it's not a substitute for native Windows support.

@homocomputeris

homocomputeris commented Jul 2, 2018

@njsmith My results are more or less the same as in the article.
Latest MKL, for example:

Dotted two 4096x4096 matrices in 2.09 s.
Dotted two vectors of length 524288 in 0.23 ms.
SVD of a 2048x1024 matrix in 1.11 s.
Cholesky decomposition of a 2048x2048 matrix in 0.19 s.
Eigendecomposition of a 2048x2048 matrix in 7.83 s.

I want to note that I have no idea how to compile BLIS so that it uses everything my CPU can offer, including multithreading, while MKL has things working more or less out of the box.

@fgvanzee

fgvanzee commented Jul 2, 2018

@njsmith Thanks for that update. I agree that nothing beats native OS support. I also agree that we need to see the benchmark run with other libraries for us to properly interpret @homocomputeris's timings.

Speaking of the native strided support, what restrictions on the strides do you have these days? Do they have to be aligned, positive, non-negative, exact multiples of the data size, ...? (As you may remember, numpy arrays allow for totally arbitrary strides measured in bytes.)

@njsmith Aligned? No. Positive? I think we lifted that constraint, but it's not been thoroughly tested. Exact multiples of the datatype? Yes, still.

I'm bringing @devinamatthews into the discussion. Months ago I told him about your request for byte strides, and he had some good points/questions at the time that I can't quite remember. Devin, can you recall your concerns about this, and if so, articulate them to Nathaniel? Thanks.

@fgvanzee

fgvanzee commented Jul 2, 2018

@homocomputeris Would you mind rerunning the benchmark with a different value for size? I wonder if the value the author used (4096) being a power of two is a particularly bad use case for BLIS, and not particularly realistic for most applications anyway. I suggest trying 4000 (or 3000 or 2000) instead.

@njsmith
Member Author

njsmith commented Jul 2, 2018

@homocomputeris And did you say that the BLIS results are single-threaded, while the MKL results are multi-threaded?

@insertinterestingnamehere
Contributor

FWIW, I looked into building BLIS on Windows some time ago. The primary pain point at the moment is the build system. It might be possible to get mingw's make to use clang to produce an MSVC-compatible binary. I never got that running in the time I was able to spend on it, but it seems possible.

Within the actual source code, the situation isn't too bad. Recently they even transitioned to using macros for their assembly kernels, so that's one more barrier to Windows support eliminated. See flame/blis#220 (comment) and flame/blis#224. It seems like the source files themselves are a few more macros/ifdefs away from building on MSVC, but that's my perspective as an outsider. I also have no idea how to get the existing BLIS makefiles to work with MSVC.

@fgvanzee

fgvanzee commented Jul 2, 2018

@insertinterestingnamehere Thanks for chiming in, Ian. You're right that the re-macroized assembly kernels are one step closer to being MSVC-friendly. However, as you point out, our build system was definitely not designed with Windows support in mind. Furthermore, does MSVC support C99 yet? If not, that's another hurdle. (BLIS requires C99.)

@homocomputeris

homocomputeris commented Jul 2, 2018

Well, I gave the example above only to show that BLIS is comparable to others, that's why I haven't included anything more specific.

But as you ask 😃

  • Intel Core i5-6260U Processor with latest BIOS and whatever patches for Spectre/Meltdown
  • Linux 4.17.3-1-ARCH
  • everything is compiled with gcc 8.1.1 20180531
  • NumPy 1.15.0rc1
  • I've chosen a prime for matrix dimensions

Intel MKL 2018.3 limited to 2 threads (that is, my physical CPU cores):

Dotted two 3851x3851 matrices in 1.62 s.
Dotted two vectors of length 492928 in 0.18 ms.
SVD of a 1925x962 matrix in 0.54 s.
Cholesky decomposition of a 1925x1925 matrix in 0.10 s.
Eigendecomposition of a 1925x1925 matrix in 4.38 s.

BLIS 0.3.2 compiled with
CFLAGS+=" -fPIC" ./configure --enable-cblas --enable-threading=openmp --enable-shared x86_64

Dotted two 3851x3851 matrices in 3.82 s.
Dotted two vectors of length 492928 in 0.39 ms.
SVD of a 1925x962 matrix in 12.82 s.
Cholesky decomposition of a 1925x1925 matrix in 2.02 s.
Eigendecomposition of a 1925x1925 matrix in 67.80 s.

So it seems that BLIS definitely should be supported by NumPy, at least on Unix/POSIX-like systems; I imagine the Windows use case is 'don't touch it if it works'.
The only thing I don't know is the connection between MKL/BLIS and LAPACK/libFLAME. Intel claims they have many things optimized besides BLAS, like LAPACK, FFT etc.

@fgvanzee Why is a power of 2 bad for BLIS? It's quite common for collocation methods if one wants the fastest FFT.

@pv
Member

pv commented Jul 2, 2018

For numpy et al., it would be sufficient to manage the building in mingw/MSYS2 --- that's what we do currently with openblas on Windows (although this is sort of a hack in itself). It will limit the use to "traditional" APIs that don't involve passing CRT resources across, but that's fine for BLAS/LAPACK.

@insertinterestingnamehere
Contributor

@fgvanzee good point about C99. WRT the C99 preprocessor, surprisingly, even the MSVC 2017 preprocessor isn't fully caught up there. Supposedly they are currently fixing that (https://docs.microsoft.com/en-us/cpp/visual-cpp-language-conformance#note_D).

@devinamatthews

@fgvanzee @njsmith Here is what we would need to do to support arbitrary byte strides:

  1. Modify the interface. The most expedient thing that occurs to me is to add something like a stride_units flag in the object interface.
  2. Refactor all of the internals to use only byte strides. This may not be a bad idea in any case.
  3. When packing, check for data type alignment and if not, use the generic packing kernel.
    a) The generic packing kernel would also have to be updated to use memcpy. If we can finagle it to use a literal size parameter then it shouldn't suck horribly.
  4. When C is unaligned, also use a virtual microkernel that accesses C using memcpy.

This is just for the input and output matrices. If alpha and beta can be arbitrary pointers then there are more issues. Note that on x86 you can read/write unaligned data just fine, but other architectures (esp. ARM) would be a problem. The compiler can also introduce additional alignment problems when auto-vectorizing.
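For reference, the kind of numpy array that would exercise items 3 and 4 above is easy to construct (a sketch; the field view below has byte strides that are not a multiple of the element size, and its data is unaligned):

import numpy as np

# An unpadded record of an int32 followed by a float64: itemsize is 12 bytes.
rec = np.zeros(4, dtype=[('tag', 'i4'), ('val', 'f8')])
v = rec['val']              # float64 view starting 4 bytes into each record
print(v.strides)            # (12,): not a multiple of the 8-byte element size
print(v.ctypes.data % 8)    # typically 4: the float64 data is not 8-byte aligned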

@fgvanzee

fgvanzee commented Jul 2, 2018

@homocomputeris:

  1. I didn't mean to imply that powers of two never arise "out in the wild," only that they are waaaay overrepresented in benchmarks, likely because we computer-oriented humans like to count in powers of two. :)
  2. Those benchmark results are really similar. I would love it if the answer to this next question is "no", but is it possible that you accidentally ran both benchmarks with MKL (or BLIS) linked?
  3. I completely agree that powers of two arise in FFT-related applications. I used to work in signal processing, so I understand. :)
  4. My concern with BLIS not doing well with powers of two is actually a concern not unique to BLIS. However, it may be that the phenomenon we're observing is more pronounced with BLIS, and therefore a net "penalty" for BLIS relative to a ridiculously optimized solution such as MKL. The concern is as follows: when matrices are of dimension that is a power of two, it is likely that their "leading dimension" is also a power of two. (The leading dimension corresponds to the column stride when the matrix is column-stored, or the row stride when row-stored.) Let's assume for a moment row storage. When the leading dimension is a power of two, the cache line in which element (i,j) resides lives in the same associativity set as the cache line in which elements (i+1,j), (i+2,j), (i+3,j) etc. live--that is, the same elements of subsequent rows. This means that when the gemm operation updates, say, a double-precision real 6x8 microtile of C, those 6 rows all map to the same associativity set in the L1 cache, and inevitably some of these get evicted before being reused. These so-called conflict misses will show up in our performance graphs as occasional spikes down in performance. As far as I know, there is no easy way around this performance hit. We already pack/copy matrices A and B, so this doesn't affect them as much, but we can't pack matrix C to some more favorable leading dimension without taking a huge memory copy hit. (The cure would be worse than the ailment.) Now, maybe MKL has a way of mitigating this, maybe switching to a differently-shaped microkernel that minimizes the number of conflict misses. Or maybe they don't, but I know that BLIS doesn't try to do anything to mitigate this. Hopefully that answers your question. (A small arithmetic sketch of this set mapping follows this list.)
  5. You're right that MKL is more than just BLAS+LAPACK functionality. However, keep in mind that MKL is a commercially held, closed-source solution. While it is available for non-commercial purposes "for free," there's no guarantee that Intel won't make MKL unavailable to the public in the future, or start charging for it again. Plus, it's not really that great for us computer scientists who want to understand the implementation, or tweak or modify the implementation, or to build our research upon well-understood building blocks. That said, if all you want to do is solve your problem and move on with your day, and you're okay expressing it via BLAS, it's great. :)
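A small arithmetic sketch of the set mapping described in point 4 (assuming, for illustration, a 32 KiB, 8-way L1 data cache with 64-byte lines, i.e. 64 sets, and 8-byte double-precision elements):

line, ways, size = 64, 8, 32 * 1024
sets = size // (line * ways)              # 64 sets in this hypothetical L1

for ld in (4096, 4000):                   # leading dimension, in elements
    row_stride = ld * 8                   # bytes between the same column of adjacent rows
    print(ld, [(i * row_stride // line) % sets for i in range(6)])

# ld = 4096 -> [0, 0, 0, 0, 0, 0]      all six rows of the microtile map to one set
# ld = 4000 -> [0, 52, 40, 28, 16, 4]  the rows spread across different sets

With a power-of-two leading dimension every row of the microtile lands in the same associativity set, so together with the packed panels of A and B streaming through the cache, some of those lines are evicted before they can be reused.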

@fgvanzee

I think for numpy users having a way to change the current value of the global (process-level) parallelism level is enough. I personally don't mind if it's achieved via changing the current value of the env variable as long as this change is taken into account for subsequent BLAS-3 calls.

Good to know. If that is the case, then BLIS is that much closer to being ready to use by numpy. (I would just like to make these in-progress modifications first, which also happen to fix a previously unnoticed race condition.)

@njsmith
Member Author

njsmith commented Jul 14, 2018

Probably better not to depend on the environment variables being rechecked on every call, because getenv isn't super fast, so it might make sense to remove it later. (Also, I don't think it's even guaranteed to be thread-safe?) But bli_thread_set_num_threads API calls should be fine, since even if BLIS stops calling getenv all the time then the API calls can be adjusted to keep working regardless.

In the longer run, I think it would make sense to start exposing some beyond-bare-BLAS APIs in numpy. One of the things that makes BLIS attractive in the first place is exactly that it provides features that other BLAS libraries don't, like the ability to multiply strided matrices, and there's work afoot to extend the BLAS APIs in a number of ways.

We wouldn't want to hard-code library-specific details in numpy's API (e.g., we wouldn't want np.matmul to start taking arguments corresponding to BLIS's JC, IC, JR, and IR parameters), but it might well make sense to provide a generic "how many threads for this call" argument that only works on backends that provide that functionality.

@charris
Member

charris commented Jul 14, 2018

One thing I haven't seen mentioned is the index precision. Most system-supplied libraries seem to use 32-bit integers, which is a limitation for some applications these days. At some point it would be good if all the indexes were 64 bits, which probably requires that we supply the library. I don't know what we are currently doing in regard to index size. @matthew-brett Are we still compiling with 32-bit integers?

@fgvanzee

@charris The integer size in BLIS is configurable at configure-time: 32 or 64 bits. Furthermore, you can configure the integer size used in the BLAS API independently from the internal integer size.

@fgvanzee

fgvanzee commented Jul 17, 2018

Actually that reminds me: I was talking to someone last week who knew that in his library he wanted to invoke gemm single-threaded, because he was managing threading at a higher level, and he was frustrated that with standard BLAS libraries the only way to control this is via global settings that are pretty rude to call from inside a random library.

@njsmith I've fixed the race condition I mentioned previously and also implemented the thread-safe, per-call multithreading API. Please point your friend towards fa08e5e (or any descendant of that commit). The multithreading documentation has been updated as well and walks the reader through his choices, with basic examples given. The commit is on the dev branch for now, but I expect to merge it to master soon. (I've already put the code through most of its paces.)

EDIT: Links updated to reflect minor fix commit.

@charris
Member

charris commented Jul 18, 2018

As a possible addition to supported types, what about long double? I note some architectures are starting to support quad precision (still in software) and I expect that at some point extended precision will be replaced with quad precision on Intel. I don't think this is immediately pressing, but I think that after all these years things are beginning to go that way.

@fgvanzee

@charris We are in the early stages of considering support for bfloat16 and/or float16 in particular because of their machine learning / AI applications, but we are also aware of demand for double double and quad-precision. We would need to lay some groundwork for it to be feasible for the entire framework, but it is definitely on our medium- to long-term radar.

@jeffhammond

@charris According to https://en.wikipedia.org/wiki/Long_double, long double can mean a variety of things:

  • 80-bit type implemented in x87, using 12B or 16B storage
  • double precision with MSVC
  • double-double precision
  • quadruple precision

Because the meaning is ambiguous and depends not just on the hardware but on the compiler used, it's an utter disaster for libraries, because the ABI isn't well-defined.

From a performance perspective, I don't see any upside to float80 (i.e. x87 long double) because there isn't a SIMD version. If one can write a SIMD version of double double in BLIS, that should perform better.

The float128 implementation in software is at least an order-of-magnitude slower than float64 in hardware. It would be prudent to write a new implementation of float128 that skips all the FPE handling and is amenable to SIMD. The implementation in libquadmath, while correct, isn't worth the attention of a high-performance BLAS implementation like BLIS.

@charris
Member

charris commented Jul 18, 2018

Yep, it's a problem. I don't think extended precision is worth the effort, and the need for quad precision is spotty; double is good for most things, but when you need it, you need it. I'm not worried about performance; the need isn't for speed but for precision. Note that we just extended support to an ARM64 platform with quad-precision long double, a software implementation of course, but I expect hardware to follow at some point and it might be nice to have something tested and ready to go.

@njsmith
Member Author

njsmith commented Jul 18, 2018

The BLAS G2 proposal has some consideration of double-double and "reproducible" computations in BLAS. (Reproducible here means deterministic across implementations, but IIUC also involves using higher-precision intermediate values.)

@honnibal

honnibal commented Jul 18, 2018

Excited to see this moving forward!

For the record, I'm the one @njsmith was referring to who's been interested in controlling the threading from software. My workloads are embarrassingly parallel at prediction time, and my matrix multiplications are relatively small. So I'd rather parallelise larger units of work.

I did some work about a year ago on packaging Blis for PyPi, and adding Cython bindings: https://github.com/explosion/cython-blis

I found Blis quite easy to package as a C extension like this. The main stumbling block for me was Windows support. From memory, it was C99 issues, but I might be remembering wrongly.

The Cython interface I've added might be of interest. In particular, I'm using Cython's fused types so that there's a single nogil function that can be called with either a memory-view or a raw pointer, for both the float and double types. Adding more branches for more types is no problem either. Fused types are basically templates: they allow compile-time conditional execution, for zero overhead.

I would be very happy to maintain a stand-alone Blis package, keep the wheels built, maintain a nice Cython interface, etc. I think it would be very nice to have it as a separate package, rather than something integrated within numpy. We could then expose more of Blis's API, without being limited by what other BLAS libraries support.

@fgvanzee

@honnibal Sorry for the delay in responding on this thread, Matthew.

Thanks for your message. We're always happy to see others get excited about BLIS. Of course, we would be happy to advise whenever is needed if you decided to further integrate into the python ecosystem (an application, a library, module, etc.).

As for Windows support, please check out the clang/appveyor support for the Windows ABI that @isuruf recently added. Last I heard from him, it was working as expected, but we don't do any development on Windows here at UT so I can't keep tabs on this myself. (Though Isuru pointed out to me once that I could sign up for appveyor in a manner similar to Travis CI.)

Also, please let me know if you have any questions about the per-call threading usage. (I've updated our Multithreading documentation to cover this topic.)

@fgvanzee

fgvanzee commented Apr 1, 2019

As of BLIS 0.5.2, we have a Performance document that showcases single-threaded and multithreaded performance of BLIS and other implementations of BLAS for a representative set of datatypes and level-3 operations on a variety of many-core architectures, including Marvell ThunderX2, Intel Skylake-X, Intel Haswell, and AMD Epyc.

So if the numpy community is wondering how BLIS stacks up against the other leading BLAS solutions, I invite you to take a quick peek!

@rgommers
Member

rgommers commented Apr 2, 2019

Very interesting, thanks @fgvanzee.

I had to look up Epyc - it seems that it is a brand name based on the Zen (possibly updated to Zen+ at some point?) architecture. Perhaps better to rename it to Zen? For our user base Ryzen/Threadripper are the more interesting brands; they may recognize Zen but probably not Epyc.

@jeffhammond

Epyc is the name of the AMD server line. It is the successor to the AMD Opteron products of the past.

There is, unfortunately, no unique way for BLIS to label its architectural targets, because the code depends on the vector ISA (e.g. AVX2), the CPU core microarchitecture (e.g. Ice Lake) and the SOC/platform integration (e.g. Intel Xeon Platinum processor). BLIS uses microarchitecture code names in some cases (e.g. Dunnington) but that isn't better for everyone.

@fgvanzee You might consider adding the aliases that correspond to the GCC march/mtune/mcpu names...

@fgvanzee

fgvanzee commented Apr 2, 2019

@rgommers The subconfiguration within BLIS that covers Ryzen and Epyc is actually already named zen, as it captures both products.

As for whether Ryzen/Threadripper or Epyc are more interesting brands (even to numpy users), I'll say this: if I could only benchmark one AMD Zen system, it would be the highest-end Epyc, because: (a) it uses a similar microarchitecture to that of Ryzen; (b) it gives me the maximum 64 physical cores (and, as a bonus, those cores are arranged in a somewhat novel, NUMA-like configuration); and (c) it places maximal stress on BLIS and the other implementations. And that is basically what we did here.

Now, thankfully, there is no rule saying I can only benchmark one Zen system. :) However, there are other hurdles, particularly with regards to gaining access in the first place. I don't have access to any Ryzen/Threadripper systems at the moment. If/when I do gain access, I'll be happy to repeat the experiments and publish the results accordingly.

Jeff points out some of the naming pitfalls we face. Generally, we name our subconfigurations and kernel sets in terms of microarchitecture, but there is more nuance yet. For example, we use our haswell subconfiguration on Haswell, Broadwell, Skylake, Kaby Lake, and Coffee Lake. That's because they all basically share the same vector ISA, which is pretty much all the BLIS kernel code cares about. But that is an implementation detail that almost no users need to be concerned with. If you use ./configure auto, you will almost always get the best subconfiguration and kernel set for your system, whether they are named zen or haswell or whatever. For now, you still need to take a more hands-on approach when it comes to optimally choosing your threading scheme, and that's where the SoC/platform integration that Jeff mentions comes in.

@jeffhammond Thanks for your suggestion. I've considered adding those aliases in the past. However, I'm not convinced it's worth it. It will add significant clutter to the configuration registry, and the people who will be looking at it in the first place likely already know about our naming scheme for subconfigurations and kernel sets, and thus won't be confused by the absence of certain microarchitectural revision names in that file (or in the config directory). Now, if BLIS required manual identification of subconfiguration, via ./configure haswell for example, then I think the scales definitely tip in favor of your proposal. But ./configure auto works quite well, so I don't see the need at this time. (If you like, you can open an issue on this topic so we can start a wider discussion among community members. I'm always open to changing my mind if there is sufficient demand.)

@rgommers
Member

rgommers commented Apr 3, 2019

Yes, naming is always complicated :) Thanks for the answers @fgvanzee and @jeffhammond

@homocomputeris

homocomputeris commented May 16, 2019

#13132 and #13158 are related

@rth
Contributor

rth commented Aug 30, 2019

The discussion got a bit carried away; what are the remaining issues that need to be resolved to officially support BLIS in numpy?

Naively, I tried to run numpy tests with BLIS from conda-forge (cf #14180 (comment) ) and for me, on Linux, all tests passed (but maybe I missed something).

I also tried to run the scipy test suite in the same env, and there are a number of failures in scipy.linalg (cf. scipy/scipy#10744) in case someone has comments on that.

@isuruf
Contributor

isuruf commented Aug 30, 2019

Note that BLIS from conda-forge uses the reference (Netlib) LAPACK as the LAPACK implementation, built on top of BLIS as the BLAS implementation, rather than libflame.

@echuber2

echuber2 commented Oct 2, 2020

About the BLIS option on conda-forge, am I right that it's single-threaded out of the box (unlike the OpenBLAS option)?

@jakirkham
Contributor

Probably better to move conda-forge discussions to conda-forge 🙂

@rgommers
Member

rgommers commented Jan 6, 2022

We have had BLIS support in numpy.distutils for several years now. Conda-forge has the option to use BLIS at runtime, and we have thorough testing results for NumPy and SciPy that show the full test suite passes with BLIS on multiple platforms:

There's some discussion in this issue about control of threading - we decided not to deal with that in numpy directly, that's what threadpoolctl is for.
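For example, with threadpoolctl the BLAS thread count can be limited for just a block of code (a minimal sketch; it applies to whichever supported BLAS numpy is linked against):

import numpy as np
from threadpoolctl import threadpool_limits, threadpool_info

print(threadpool_info())        # reports the loaded BLAS and its current thread count

A = np.random.random((2000, 2000))
with threadpool_limits(limits=1, user_api="blas"):
    np.dot(A, A)                # runs single-threaded inside this block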

I don't think there's anything left to do here. BLIS status is good. So I think we can close this issue and declare victory here.

libflame is another story of course - there's some info on that in this issue, but I think that deserves its own tracking issue if we want to consider it. https://github.com/flame/libflame seems to have no new commits since May 2019, so it's unclear whether or not we need to actively look at it. It should be possible to build against libflame with a site.cfg right now, and auto-detection can always be added later (it's not hard).

Thanks everyone for the long and interesting conversation.

@rgommers rgommers closed this as completed Jan 6, 2022