
Releases: proger/accelerated-scan

0.2.0 — faster training!

20 May 10:34
db7145f

@unixpickle has fused the sequence reversal required by the backward pass into the kernel and vectorized the loads and stores; training is 30–40 percent faster on an RTX 3090.


0.1.2 — reverse reference scan

31 Jan 17:18
b7e4770

This release adds a reverse=True flag to accelerated_scan.ref.scan.

Full Changelog: 0.1.1...0.1.2
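The reference scan in accelerated_scan.ref operates on PyTorch tensors; the pure-Python sketch below is only an illustration of the semantics the reverse=True flag adds (the function name and scalar types here are illustrative, not the package's API): running the recurrence x[t] = gate[t] * x[t-1] + token[t] right-to-left, which is the form the backward pass needs.

```python
def ref_scan(gates, tokens, reverse=False):
    """Sequential reference for x[t] = gates[t] * x[t-1] + tokens[t].

    With reverse=True the recurrence runs right-to-left:
    x[t] = gates[t] * x[t+1] + tokens[t].
    """
    n = len(tokens)
    order = range(n - 1, -1, -1) if reverse else range(n)
    out = [0.0] * n
    x = 0.0  # initial state outside the sequence is zero
    for t in order:
        x = gates[t] * x + tokens[t]
        out[t] = x
    return out
```

Note that the reverse scan is not simply the forward output reversed: each position accumulates contributions from the positions to its right instead of to its left.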

0.1.1 — 16-bit support

11 Jan 15:24
ad0dbfd

This release adds support for float16 and bfloat16 by templating the warp kernel. Below is a plot of the maximum absolute error between the reference implementation and the kernel:

(plot: max abs error, reference implementation vs. kernel)

0.1

10 Jan 10:50

This package implements the fastest first-order parallel associative scan on the GPU, for both the forward and backward passes.

The scan efficiently solves first-order recurrences of the form x[t] = gate[t] * x[t-1] + token[t], common in state space models and linear RNNs.
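This recurrence admits a parallel scan because two consecutive update steps compose into a single step of the same form: applying (g1, t1) and then (g2, t2) maps x to g2*(g1*x + t1) + t2 = (g1*g2)*x + (g2*t1 + t2). A minimal pure-Python sketch of this associative operator follows (written sequentially for clarity; a parallel implementation applies the same operator in a tree):

```python
def combine(left, right):
    # Composition of two update steps x -> g*x + t:
    # (g1, t1) then (g2, t2) gives x -> (g1*g2)*x + (g2*t1 + t2).
    g1, t1 = left
    g2, t2 = right
    return (g1 * g2, g2 * t1 + t2)

def inclusive_scan(gates, tokens):
    # Naive inclusive scan using the associative operator.
    out = []
    acc = (1.0, 0.0)  # identity element: x -> 1*x + 0
    for g, t in zip(gates, tokens):
        acc = combine(acc, (g, t))
        out.append(acc[1])  # starting from x = 0, the state equals the offset
    return out
```

Because `combine` is associative, any bracketing of the steps yields the same result, which is what lets the GPU evaluate the recurrence in parallel rather than strictly left to right.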

The accelerated_scan.warp C++ CUDA kernel uses a chunked processing algorithm that leverages the fastest communication primitive available at each level of the GPU hierarchy: warp shuffles within warps of 32 threads, and shared memory (SRAM) between warps within a thread block. Each sequence of each channel is confined to a single thread block.

The derivation of the chunked scan extends the tree-level Blelloch algorithm to the block level.
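A pure-Python sketch of the chunked idea, assuming the usual three phases (per-chunk reduction, scan over chunk summaries, per-chunk re-scan seeded with the exclusive prefix); the chunk size and function names here are illustrative, not the kernel's actual parameters:

```python
def chunked_scan(gates, tokens, chunk=4):
    # Phase 1: reduce each chunk to one (gate, token) summary.
    # Phase 2: scan the summaries (in the kernel, across warps via SRAM).
    # Phase 3: re-scan each chunk seeded with its exclusive prefix.
    def combine(a, b):
        g1, t1 = a
        g2, t2 = b
        return (g1 * g2, g2 * t1 + t2)

    n = len(tokens)
    summaries = []
    for s in range(0, n, chunk):
        acc = (1.0, 0.0)  # identity: x -> 1*x + 0
        for i in range(s, min(s + chunk, n)):
            acc = combine(acc, (gates[i], tokens[i]))
        summaries.append(acc)

    out = [0.0] * n
    prefix = (1.0, 0.0)  # exclusive prefix over all preceding chunks
    for c, s in enumerate(range(0, n, chunk)):
        acc = prefix
        for i in range(s, min(s + chunk, n)):
            acc = combine(acc, (gates[i], tokens[i]))
            out[i] = acc[1]
        prefix = combine(prefix, summaries[c])
    return out
```

Associativity of the combine operator guarantees that seeding each chunk with the scanned summary of all earlier chunks reproduces the fully sequential result, while letting the per-chunk work proceed in parallel.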

A similar implementation is available in accelerated_scan.triton, using Triton's tl.associative_scan primitive. It requires Triton 2.2 for its enable_fp_fusion flag.
