Releases: NVIDIA/cutlass
CUTLASS 3.5.0
- Implicit GEMM Convolutions targeting Hopper SM90A via WGMMA + TMA im2col.
- Native implementation in CUTLASS 3.x using CuTe, mirroring the same design hierarchy as that of GEMMs.
- Support for 1D, 2D, and 3D convolutions in a rank-agnostic fashion.
- Support for Fprop, Dgrad, and Wgrad algorithms.
- CUTLASS profiler support for 2D and 3D convolutions implemented via the 3.x API.
- NOTE: this is a beta release. Further updates to CUTLASS will include major performance improvements, feature enablement, and possible breaking changes to the API until 3.7 release. Your feedback is welcome on the design!
- Support for Ada (SM89) FP8 tensor cores via the 2.x API. Requires CUDA 12.4 or newer.
- Ampere gather/scatter convolution example in CuTe and CUTLASS 3.x.
- Showcasing how custom kernels can be written and optimized using CUTLASS 3.x and CuTe and the general strategy for implementing convolutions as specializations of GETTs.
- Implementation of a coarse grained sparse gather/scatter kernel achieving peak performance on Ampere class tensor cores.
- 32x and 16x tile sizes are added to CUTLASS 2.x to improve the performance of narrow-tall and wide-short matrices.
- Updates to CuTe documentation for `cute::Tensor<>`, MMA atoms, and an overhauled CuTe GEMM tutorial series.
- Extensions to CuTe to support L2 prefetching and TMA store+reductions.
- Remove C++11 requirement on a few CUTLASS 2.x API header files. All CUTLASS files now require C++17.
- Fixes to greatly reduce build warnings.
- Updates and bugfixes from the community (thanks!)
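The im2col view behind the implicit GEMM convolutions above can be sketched as pure index arithmetic. The following is an illustrative, simplified standalone snippet (not CUTLASS code; all names are hypothetical): a 2D NHWC forward convolution is treated as a GEMM whose row indexes an output pixel (n, p, q) and whose column indexes a filter tap (r, s, c), and the mapping computes which input element that (row, column) pair reads.

```cpp
#include <cassert>

// Hypothetical, simplified sketch of the im2col index mapping used by
// implicit-GEMM forward convolution (fprop). Assumes NHWC layout, square
// stride/padding, and no dilation.
struct Conv2dShape {
  int N, H, W, C;   // input tensor extents
  int R, S;         // filter extents
  int P, Q;         // output extents
  int stride, pad;
};

// Returns the linear NHWC input offset read by GEMM element (m, k),
// or -1 when the filter tap falls into zero-padding.
long im2col_offset(const Conv2dShape& cs, int m, int k) {
  // Decompose GEMM row m = (n, p, q) and column k = (r, s, c).
  int q = m % cs.Q, p = (m / cs.Q) % cs.P, n = m / (cs.Q * cs.P);
  int c = k % cs.C, s = (k / cs.C) % cs.S, r = k / (cs.C * cs.S);
  int h = p * cs.stride - cs.pad + r;
  int w = q * cs.stride - cs.pad + s;
  if (h < 0 || h >= cs.H || w < 0 || w >= cs.W) return -1;  // padding region
  return ((long(n) * cs.H + h) * cs.W + w) * cs.C + c;
}
```

In the real kernels this mapping is computed by TMA im2col hardware descriptors rather than scalar arithmetic, but the logical GEMM view is the same.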
CUTLASS 3.4.1
- Statically available CUTLASS Version macros that allow for handling API changes between CUTLASS releases on the users' side.
- Improvements for Hopper Group-GEMMs and Pointer-Array Batched GEMMs.
- Updates and bugfixes from the community (thanks!).
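The version macros mentioned above can be used to guard code against API changes between releases. A minimal sketch, assuming the `CUTLASS_MAJOR` / `CUTLASS_MINOR` / `CUTLASS_PATCH` macros from `cutlass/version.h`; the stand-in definitions and the `CUTLASS_VERSION_CODE` helper below are illustrative additions, not part of the library:

```cpp
// Stand-in definitions so the snippet is self-contained; in real code these
// come from cutlass/version.h.
#ifndef CUTLASS_MAJOR
#define CUTLASS_MAJOR 3
#define CUTLASS_MINOR 4
#define CUTLASS_PATCH 1
#endif

// Hypothetical helper: fold the three components into one comparable integer.
#define CUTLASS_VERSION_CODE(maj, min, pat) ((maj) * 10000 + (min) * 100 + (pat))

constexpr int kCutlassVersion =
    CUTLASS_VERSION_CODE(CUTLASS_MAJOR, CUTLASS_MINOR, CUTLASS_PATCH);

#if CUTLASS_VERSION_CODE(CUTLASS_MAJOR, CUTLASS_MINOR, CUTLASS_PATCH) >= \
    CUTLASS_VERSION_CODE(3, 4, 0)
// ... code paths that rely on 3.4+ APIs go here ...
#endif
```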
CUTLASS 3.4.0
- Improved Mixed-input Hopper GEMMs supporting {16-bit, 8-bit} x {8-bit, 4-bit} input types with fast numerical converters and group scaling factors tuned for optimal performance on Hopper H100.
- Beta release of Pointer-Array Batched GEMMs utilizing TMA and Hopper H100 tensor cores now available. (Requires CUDA 12.3 or above)
- Beta release of Group-GEMM, commonly used in optimizing Mixture-of-Experts models, now available on Hopper GPUs taking advantage of TMA and Hopper H100 tensor cores. (Requires CUDA 12.3 or above)
- Ampere Sparse GEMM supports Epilogue Visitor Tree (EVT) now.
- Improvements to NamedBarriers, including details of ReservedNamedBarriers used within the CUTLASS library.
- Improved CuTe documentation including improved clarity and depth of Quickstart, CuTe Layout, and CuTe Layout Algebra. Associated code comments, post-conditions, and details in CuTe Core Unit Tests also improved.
CUTLASS 3.3.0
- New Mixed-input Hopper GEMMs support covering 16-bit x 8-bit input types with optimal performance.
- New Mixed-input Ampere GEMMs with support for canonical layouts (TN). The implementation supports upcast on operand B {fp16, bf16} x {s8, u8} and upcast on operand A {s8, u8} x {fp16, bf16}. They also include fast numeric conversion recipes and warp-level shuffles to achieve optimal performance.
- New Copy Async based Hopper GEMMs, which support input tensors with less than 16B alignment (across s8/fp8/fp16/bf16/tf32 types) with optimal performance. As a part of this, new kernel schedules and Copy Ops SM80_CP_ASYNC_CACHE_* were also added.
- EVT Support for RELU with Aux bitmap tensor store (used in dRELU). See SM90 EVT fusions for details.
- Various subbyte enhancements like tagged device ptrs, support for vectorized copy, various operators to treat subbyte iterators as pointers, and full-fledged CuTe Tensor support.
- Support for Clang as a host compiler.
- Support for void-C kernels and SM80 mixed-input GEMMs in the CUTLASS Python interface.
CUTLASS 3.2.2
Bug fix for an illegal memory access issue hit by Flash Attention tests in PyTorch. See #1138 for details.
CUTLASS 3.2.1
- Python support for SM90 Epilogue Visitor Tree (EVT) on top of the C++ support released in 3.2.0.
- SM80 EVT support in C++ and Python.
- Other SM90 epilogue improvements.
- Splitting the CUTLASS library into smaller units based on operation, architecture, and data types. See #1105 for details.
- Making tools/library/scripts packageable - tools/library/scripts has moved to python/cutlass_library. See the Python README for details.
- SM90 TF32 kernel improvements for all layouts.
- SM90 rasterization direction support in the CUTLASS profiler.
- Improvement for CUTLASS profiler build times.
- Remove Python-C++ bindings.
CUTLASS 3.2
- New warp-specialized persistent FP8 GEMM kernel schedules and mainloops targeting Hopper architecture that achieve great performance with TMA, WGMMA, and threadblock clusters. An example showcasing Hopper warp-specialized FP8 GEMMs.
- New Epilogue Visitor Tree (EVT) support for Hopper TMA epilogues. EVTs allow for user-defined customized epilogue fusion patterns without having to write a new epilogue.
- Stream-K feature for Hopper. Note that this is only a functional implementation of stream-K, and should not be used for performance comparison. Optimizations are expected in a future release.
- Improved CTA rasterization and support for CTA swizzling for Hopper kernels using the Tile Scheduler.
- Improved performance for warp-specialized TensorFloat-32 (TF32) GEMM kernels targeting Hopper TMA.
- Hopper GEMM+Permute, an example of fusing tensor reordering (permutation) with GEMM mainloop or epilogue.
- New CUTLASS 2D Convolution Python interface. New example here.
- Support for Windows (MSVC) builds.
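The Epilogue Visitor Tree idea introduced above can be sketched as compile-time functor composition. This is a toy standalone analogue (all type names are hypothetical, not the CUTLASS EVT API): each epilogue step is a small node type, and fusion patterns are built by nesting nodes rather than writing a new epilogue.

```cpp
#include <algorithm>

// Toy sketch of the visitor-tree idea: nodes transform the accumulator value,
// and trees of nodes are composed as nested template types.
struct Accum {
  float operator()(float acc) const { return acc; }
};

template <class Child>
struct Bias {
  Child child;
  float bias;
  float operator()(float acc) const { return child(acc) + bias; }
};

template <class Child>
struct ReLU {
  Child child;
  float operator()(float acc) const { return std::max(0.0f, child(acc)); }
};

// Composed fusion pattern: D = ReLU(acc + bias), with no new epilogue written.
using BiasReLU = ReLU<Bias<Accum>>;
```

The real EVT nodes operate on fragments of the accumulator tile and may load or store auxiliary tensors, but the compile-time tree composition is the core concept.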
CUTLASS 3.1
- New CUTLASS Python interface that aims to provide an easy-to-use interface for instantiating, emitting, compiling, and running CUTLASS kernels via Python. More details here and new examples.
- New efficient epilogues using TMA for Hopper.
- Support for fused epilogues, such as Bias, ReLU, and GELU, using the new efficient epilogues.
- New warp-specialized TensorFloat-32 (TF32) GEMM kernels targeting Hopper TMA.
- New warp-specialized persistent cooperative kernel design that allows for larger tile sizes and improves performance on Hopper.
- An example showcasing GEMM-Like Tensor-Tensor Contraction (GETT) capability on Hopper.
- Epilogue builders. Similar to mainloop builders (see example 49), epilogue builders aim to generate the best-possible epilogue while exposing incremental opt-ins for greater customization.
- Profiler support for overriding kernel and epilogue builder auto schedules for 3.x API kernels, allowing specific policies to be run in the CUTLASS profiler.
- Performance optimizations for the warp-specialized persistent ping-pong kernel.
- Changes to the 3.x GEMM API, involving the host-facing arguments and the underlying `Params` structs.
- FMHA Backward Pass from Meta xFormers.
- Stream-K GEMM with Broadcast enables epilogue broadcast with Stream-K GEMM.
- Batched B2B GEMM can now run multiple Back-to-Back GEMMs with the same problem size in parallel.
- Batched Strided GEMV supports both row-major and column-major input matrices.
- Permute + GEMM fusion can now fuse Permute with a following GEMM. Previously, only fusing GEMM with Permute in the epilogue was supported.
- Row Broadcast can be fused in the epilogue.
- The GitHub branch is renamed from `master` to `main` in this release.
- Optimal performance using CUDA 12.1.
- Updates and bugfixes from the community (thanks!)
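The GETT capability mentioned above rests on a simple observation that can be shown in scalar code. This is an illustrative standalone sketch (not CUTLASS code; the function is hypothetical): a tensor contraction is just a GEMM whose M, N, and K dimensions are each a *group* of tensor modes, so folding the grouped modes into single indices reduces the contraction to an ordinary matrix multiply.

```cpp
#include <vector>

// Sketch of GETT mode folding: C[m0, m1, n] = sum_k A[m0, m1, k] * B[k, n]
// becomes a plain GEMM with M = m0 * m1. All tensors are dense row-major.
std::vector<float> gett(const std::vector<float>& A,  // shape [m0, m1, k]
                        const std::vector<float>& B,  // shape [k, n]
                        int m0, int m1, int k, int n) {
  int M = m0 * m1;  // the two row modes of A fold into one GEMM M dimension
  std::vector<float> C(M * n, 0.0f);
  for (int m = 0; m < M; ++m)
    for (int j = 0; j < n; ++j)
      for (int kk = 0; kk < k; ++kk)
        C[m * n + j] += A[m * k + kk] * B[kk * n + j];
  return C;
}
```

CuTe's hierarchical layouts let the Hopper kernels express this folding without materializing anything; the scalar loop above only shows the index algebra.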
CUTLASS 3.0
3.0.0 (2023-01-23)
- CuTe, a new core library and backend for CUTLASS 3.0 that defines a single Layout vocabulary type and an associated algebra of layouts for a much more expressive and composable abstraction for tensors, sets of parallel agents, and operations by said agents on tensors.
- A new conceptual operation hierarchy that replaces the architecture-centric hierarchy of CUTLASS 2.x and documentation for CUTLASS 3.0's GEMM API changes.
- Strict API backwards compatibility that exposes both 2.x and 3.x API kernels through the same `device::GemmUniversalAdapter` and `kernel::GemmUniversal` types, allowing users to include both APIs in the same translation units. More information can be found in the 3.x backwards compatibility section.
- Updates to Functionality which directs users on which kernels are supported via CUTLASS-2 and CUTLASS-3.
- Updates to Compatibility Section regarding supported compilers, operating systems, CUDA Toolkits, Hardware Architectures and Target Architecture.
- New warp-specialized GEMM kernel schedules and mainloops targeting Hopper architecture that achieve great performance with TMA, WGMMA, and threadblock clusters.
- Extensions to CUTLASS profiler to support threadblock cluster shapes in library and profiler tile configurations.
- CUTLASS library integration for 3.x API kernels built through the new `CollectiveBuilder` API, enabling the CUTLASS profiler.
- Support for Hopper GEMMs through the new 3.0 API with CuTe-based exposure of the Hopper Tensor Memory Accelerator and WGMMA Tensor Core features.
- Set of examples that demonstrate the usage of the new 3.0 API to easily build GEMM kernels targeting Hopper: examples 48, 49, and 50.
CUTLASS 2.11
2.11.0 (2022-11-19)
- Stream-K, which is a new general way to do split-K. It can not only improve performance, but can also significantly reduce the number of tile sizes that need to be profiled to find the best one.
- Fused multi-head attention kernel. It has two variants: one uses batched GEMM for fixed sequence lengths, and the other uses grouped GEMM for variable sequence lengths. Both versions need only one kernel.
- Dual GEMM, which can fuse A x B and A x C into one kernel. The two GEMMs have no producer-consumer dependency.
- Hopper improves double-precision matrix multiplication by 2x compared to Ampere at iso-clocks. It is supported since CUDA 11.8.
- BLAS3 functions with Hopper's new double-precision matrix multiplication instructions.
- ELL Block Sparse GEMM, which uses an ELL matrix to describe the sparsity of the A matrix. B and the output matrix are still dense. The block size can be arbitrary.
- Optimized Group Conv for SingleGroup mode, which requires that the output channels per group be a multiple of the threadblock tile N.
- Optimized Depthwise Conv. Two new modes are added:
  - kOptimized - uses direct convolution to compute instead of implicit GEMM.
    - The restrictions are: 1) input channels, output channels, and group count must be multiples of (128 / sizeof(input element)); 2) the input filter size must match the template parameter configuration.
  - kFixedStrideDilation - puts stride and dilation into template parameters to further improve performance. In this mode, the kernel persists some inputs in registers to squeeze out more performance, so large filters/strides/dilations are not recommended.
    - The restrictions are: 1) input channels, output channels, and group count must be multiples of (128 / sizeof(input element)); 2) the input filter size, stride, and dilation must match the template parameter configuration.
- Scripts to fuse multiple back-to-back GEMMs. Its implementation was discussed in a GTC'22 Spring talk.
- Updates and bugfixes from the community (thanks!). Big shout out to Meta's xFormers.
- Deprecation announcement: CUTLASS plans to deprecate the following:
- Maxwell and Pascal GPU architectures
- Ubuntu 16.04
- CUDA 10.2
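The stream-K partitioning introduced in this release can be sketched with plain index arithmetic. This is an illustrative standalone snippet (not the CUTLASS implementation; names are hypothetical): instead of assigning whole output tiles to CTAs, the flat sequence of mainloop iterations across *all* tiles is split evenly, so a CTA may finish a tile another CTA started and the partial accumulators are then reduced.

```cpp
// Sketch of stream-K work assignment: each CTA receives a contiguous,
// near-equal slice of the total mainloop iterations, independent of tile
// boundaries.
struct WorkRange { int begin_iter, end_iter; };  // half-open, in mainloop iters

WorkRange stream_k_range(int cta, int num_ctas, int num_tiles, int iters_per_tile) {
  int total = num_tiles * iters_per_tile;
  int base  = total / num_ctas;
  int rem   = total % num_ctas;                  // first `rem` CTAs take one extra
  int begin = cta * base + (cta < rem ? cta : rem);
  int len   = base + (cta < rem ? 1 : 0);
  return {begin, begin + len};
}
```

Because every CTA gets within one iteration of the same amount of work regardless of how many tiles the problem has, far fewer tile shapes need profiling, which is the benefit called out above.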