Skip to content

Releases: NVIDIA/cutlass

CUTLASS 3.5.0

12 Apr 01:40
7d49e6c
Compare
Choose a tag to compare
  • Implicit GEMM Convolutions targeting Hopper SM90A via WGMMA + TMA im2col.
    • Native implementation in CUTLASS 3.x using CuTe, mirroring the same design hierarchy as that of GEMMs.
    • Support for 1D, 2D, and 3D convolutions in a rank-agnostic fashion.
    • Support for Fprop, Dgrad, and Wgrad algorithms.
    • CUTLASS profiler support for 2D and 3D convolutions implemented via the 3.x API.
    • NOTE: this is a beta release. Further updates to CUTLASS will include major performance improvements, feature enablement, and possible breaking changes to the API until 3.7 release. Your feedback is welcome on the design!
  • Support for Ada (SM89) FP8 tensor cores via the 2.x API. Requires CUDA 12.4 or newer.
  • Ampere gather/scatter convolution example in CuTe and CUTLASS 3.x.
    • Showcasing how custom kernels can be written and optimized using CUTLASS 3.x and CuTe and the general strategy for implementing convolutions as specializations of GETTs.
    • Implementation of a coarse grained sparse gather/scatter kernel achieving peak performance on Ampere class tensor cores.
  • 32x and 16x tile sizes are added to CUTLASS 2.x to improve the performance of narrow-tall and wide-short matrices.
  • Updates to CuTe documentation for cute::Tensor<>, MMA atoms, and an overhauled CuTe GEMM tutorial series.
  • Extensions to CuTe to support L2 prefetching and TMA store+reductions.
  • Remove C++11 requirement on a few CUTLASS 2.x API header files. All CUTLASS files now require C++17.
  • Fixes to greatly reduce build warnings.
  • Updates and bugfixes from the community (thanks!)

CUTLASS 3.4.1

15 Feb 21:03
bbe579a
Compare
Choose a tag to compare

CUTLASS 3.4.0

16 Jan 22:39
751eb9a
Compare
Choose a tag to compare
  • Improved Mixed-input Hopper GEMMs supporting {16-bit, 8-bit} x {8-bit, 4-bit} input types with fast numerical converters and group scaling factors tuned for optimal performance on Hopper H100.
  • Beta release of Pointer-Array Batched GEMMs utilizing TMA and Hopper H100 tensor cores now available. (Requires CUDA 12.3 or above)
  • Beta release of Group-GEMM - commonly used in optimization of Mixture-Of-Expert models, is now available on Hopper GPUs taking advantage of TMA and Hopper H100 tensor cores. (Requires CUDA 12.3 or above)
  • Ampere Sparse GEMM supports Epilogue Visitor Tree (EVT) now.
  • Impovements to NamedBarriers including details of ReservedNamedBarriers used within the CUTLASS library.
  • Improved CuTe documentation including improved clarity and depth of Quickstart, CuTe Layout, and CuTe Layout Algebra. Associated code comments, post-conditions, and details in CuTe Core Unit Tests also improved.

CUTLASS 3.3.0

06 Dec 01:55
a75b4ac
Compare
Choose a tag to compare
  • New Mixed-input Hopper GEMMs support covering 16-bit x 8-bit input types with optimal performance.
  • New Mixed-input Ampere GEMMs with support for canonical layouts (TN). The implementation supports upcast on operandB {fp16, bf16} x {s8, u8} and upcast on operandA {s8, u8} x {fp16, bf16}. They also include fast numeric conversion recipes and warp level shuffles to achieve optimal performance.
  • New Copy Async based Hopper GEMMs - which support lower than 16B aligned input tensors (across s8/fp8/fp16/bf16/tf32 types) with optimal performance. As a part of this, new kernel schedules, and Copy Ops SM80_CP_ASYNC_CACHE_* were also added.
  • EVT Support for RELU with Aux bitmap tensor store (used in dRELU). See SM90 EVT fusions for details.
  • Various subbyte enhancements like tagged device ptrs, support for vectorized copy, various operators to treat subbyte iterators as pointers, and full-fledged CuTe Tensor support.
  • Support for Clang as a host compiler.
  • Support for void-C kernels and SM80 mixed-input GEMMs in the CUTLASS Python interface

CUTLASS 3.2.2

26 Oct 18:17
Compare
Choose a tag to compare

Bug fix for illegal memory access issue hit by Flash Attention tests in PyTorch. See #1138 for details.

CUTLASS 3.2.1

26 Sep 21:47
5cd735c
Compare
Choose a tag to compare
  • Python support SM90 Epilogue Visitor Tree (EVT) on top of the C++ support released in 3.2.0.
  • SM80 EVT support in C++ and Python.
  • Other SM90 epilogue improvements.
  • Splitting CUTLASS library into smaller units based on operation, arch and datatypes. See #1105 for details.
  • Making tools/library/scripts packageable - tools/library/scripts is now moving to python/cutlass_library. See the Python README for details.
  • SM90 TF32 kernel improvements for all layouts.
  • SM90 rasterization direction support in the CUTLASS profiler.
  • Improvement for CUTLASS profiler build times.
  • Remove Python-C++ bindings.

CUTLASS 3.2

28 Aug 00:50
3a8f57a
Compare
Choose a tag to compare
  • New warp-specialized persistent FP8 GEMM kernel kernel schedules and mainloops targeting Hopper architecture that achieve great performance with TMA, WGMMA, and threadblock clusters. An example showcasing Hopper warp-specialized FP8 GEMMs.
  • New Epilogue Visitor Tree (EVT) support for Hopper TMA epilogues. EVTs allows for user-defined customized epilogue fusion patterns without having to write a new epilogue.
  • Stream-K feature for Hopper. Note that this is only a functional implementation of stream-K, and should not be used for performance comparison. Optimizations are expected in a future release.
  • Improved CTA rasterization and support for CTA swizzling for Hopper kernels using the Tile Scheduler.
  • Improved performance for warp-specialized TensorFloat-32 (TF32) GEMM kernels targeting Hopper TMA.
  • Hopper GEMM+Permute, an example of fusing tensor reordering (permutation) with GEMM mainloop or epilogue.
  • New CUTLASS 2D Convolution Python interface. New example here.
  • Support for Windows (MSVC) builds.

CUTLASS 3.1

24 May 20:10
6f47420
Compare
Choose a tag to compare
  • New CUTLASS Python interface that aims to provide an ease-of-use interface for instantiating, emitting, compiling, and running CUTLASS kernels via Python. More details here and new examples.
  • New efficient epilogues using TMA for Hopper.
  • Support for fused epilogues, such Bias, ReLU and GELU, using the new efficient epilogues.
  • New warp-specialized TensorFloat-32 (TF32) GEMM kernels targeting Hopper TMA.
  • New warp-specialized persistent cooperative kernel design that allows for larger tile sizes and improves performance on Hopper.
  • An example showcasing GEMM-Like Tensor-Tensor Contraction (GETT) capability on Hopper.
  • Epilogue builders. Similar to mainloop builders (see example 49), epilogue builders aim to generate the best-possible epilogue while exposing incremental opt-ins for greater customization.
  • Profiler support for overriding kernel and epilogue builder auto schedules for 3.x API kernels, allowing specific policies to be run in the CUTLASS profiler.
  • Performance optimizations for the warp-specialized persistent ping-pong kernel.
  • Changes to the GEMM API 3.x, involving the host-facing arguments and the underlying Params structs.
  • FMHA Backward Pass from Meta xFormers.
  • Streamk GEMM with Broadcast enables epilogue broadcast with StreamK GEMM.
  • Batched B2B GEMM now can run multiple Back-to-Back GEMM with the same problem size in parallel.
  • Batched Strided GEMV support both row major and column major input matrix.
  • Permute + GEMM fusion can fuse Permute with following GEMM now. Before, we only support fusing GEMM with Permute in the epilogue.
  • Row Broadcast can be fused in the epilogue.
  • The GitHub branch is renamed from master to main in this release.
  • Optimal performance using CUDA 12.1
  • Updates and bugfixes from the community (thanks!)

CUTLASS 3.0

10 Mar 04:19
c4f6b8c
Compare
Choose a tag to compare

3.0.0 (2023-01-23)

CUTLASS 2.11

20 Jan 21:35
66d9cdd
Compare
Choose a tag to compare

2.11.0 (2022-11-19)

  • Stream-K, which is a new general way to do split-K. It can not only improve performance, but can also significantly reduce the number of tile sizes that need to be profiled to find the best one.

  • Fused multi-head attention Kernel. It has two variants: one uses batched GEMM for the fixed sequence length, and the other one uses group GEMM for the variable sequence length. Both versions just need one kernel.

  • Dual GEMM, which can fuse A x B and A x C into one kernel. Two GEMMs has no producer-consumer dependency.

  • Hopper improves double precision matrix multiplication by 2x compared to Ampere at iso-clocks. It is supported since CUDA 11.8.

  • BLAS3 functions with Hoppers new double precision matrix multiplication instructions.

  • ELL Block Sparse GEMM, which uses an ELL matrix to describe the sparsity of A matrix. B and output matrices are still dense. The block size can be arbitary.

  • Optimized Group Conv for SingleGroup mode, which requires that the output channel per group is a multiple of Threadblock tile N.

  • Optimized DepthWise Conv. Two new modes are added

    • kOptimized - use direct conv to compute instead of implicit GEMM.
      • The restrictions are: 1) input ,output channel and group number should be multiple of (128 / sizeof(input element)). 2) The input filter size should be the same as the template parameter configuration.
    • kFixedStrideDilation - which puts stride and dilation into templates to further improve the performance. In this mode, kernel persistents some inputs into register to squeeze more performance, so large filter/stride/dilation is not recommanded.
      • The restrictions are: 1) input, output channel and group number should be multiple of (128 / sizeof(input element)). 2) input filter size, stride, dilation should same as the template parameter configuration.
  • Scripts to fuse multiple back-to-back GEMM. Its implementation was discussed in a GTC'22 Spring talk.

  • FP8 data type definition and conversion routines.

  • Updates and bugfixes from the community (thanks!). Big shout out to Meta's xFormers.

  • Deprecation announcement: CUTLASS plans to deprecate the following:

    • Maxwell and Pascal GPU architectures
    • Ubuntu 16.04
    • CUDA 10.2