
Fast Multihead Attention

This implementation has two main features:

  • A C++ implementation to avoid the CPU overhead that PyTorch incurs at smaller batch sizes.
  • The removal of all copies and transposes found in standard implementations of Multihead Attention.
|                                            | Python Version | C++ Version |
| :----------------------------------------- | :------------: | :---------: |
| Layer Norm and Residual Add Variant        |       X        |      X      |
| Includes Linear Biases                     |                |      X      |
| Reduces CPU Overheads                      |                |      X      |
| Fuses masking with Softmax                 |                |      X      |
| Removes Transposes and Copies              |       X        |      X      |
| Includes Self and Encoder/Decoder Variants |       X        |      X      |

How to Instantiate

SelfMultiheadAttn(hidden_dim, heads, dropout=prob, bias=bool, include_norm_add=bool, impl='fast')
EncdecMultiheadAttn(hidden_dim, heads, dropout=prob, bias=bool, include_norm_add=bool, impl='fast')

impl has two options:

  • fast uses the C++ version
  • default uses the Python version
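A minimal usage sketch follows. It assumes a forward signature that mirrors torch.nn.MultiheadAttention (query, key, value, key_padding_mask, need_weights, attn_mask), a sequence-first (seq_len, batch, hidden_dim) input layout, and FP16 tensors on a CUDA device; check the module source for the exact interface.

```python
import torch
from apex.contrib.multihead_attn import SelfMultiheadAttn, EncdecMultiheadAttn

hidden_dim, heads = 1024, 16
seq_len, batch = 64, 80

# Self-attention layer using the fused C++ implementation.
self_attn = SelfMultiheadAttn(hidden_dim, heads, dropout=0.1, bias=False,
                              include_norm_add=False, impl='fast').cuda().half()

# Inputs are assumed to be sequence-first: (seq_len, batch, hidden_dim).
inputs = torch.randn(seq_len, batch, hidden_dim, dtype=torch.float16, device='cuda')

# Self-attention: query, key, and value are the same tensor.
# The return value is assumed to mirror nn.MultiheadAttention: (output, attn_weights).
outputs, _ = self_attn(inputs, inputs, inputs,
                       key_padding_mask=None, need_weights=False, attn_mask=None)

# Encoder/decoder attention: queries come from the decoder,
# keys/values from the encoder output.
encdec_attn = EncdecMultiheadAttn(hidden_dim, heads, dropout=0.1, bias=False,
                                  include_norm_add=False, impl='fast').cuda().half()
memory = torch.randn(seq_len, batch, hidden_dim, dtype=torch.float16, device='cuda')
outputs, _ = encdec_attn(inputs, memory, memory,
                         key_padding_mask=None, need_weights=False, attn_mask=None)
```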

Instructions to build on Linux

$ git clone https://github.com/NVIDIA/apex
$ cd apex
$ pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" --global-option="--fast_multihead_attn" ./
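As a quick, illustrative post-build check (not part of the official instructions), the sketch below assumes the Python-level modules are importable from apex.contrib.multihead_attn and that constructing a layer with impl='fast' only succeeds when the compiled extension is available.

```python
# Illustrative sanity check after building with --fast_multihead_attn.
# Assumption: SelfMultiheadAttn is exposed by apex.contrib.multihead_attn
# and requires the compiled extension when impl='fast'.
from apex.contrib.multihead_attn import SelfMultiheadAttn

attn = SelfMultiheadAttn(1024, 16, dropout=0.1, bias=False,
                         include_norm_add=False, impl='fast')
print('Constructed', type(attn).__name__, "with impl='fast'")
```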

Try Performance Tests Yourself!

The perf test script, perf_test_multihead_attn.py, is found in contrib/examples/multihead_attn:

cd contrib/examples/multihead_attn

Fast Multihead Attention

python perf_test_multihead_attn.py --ref

Fast Multihead Attention with C++ Implementation

python perf_test_multihead_attn.py

Compare with torch.nn.MultiheadAttention

python perf_test_multihead_attn.py --native

Test your own range!

python perf_test_multihead_attn.py --seq-length 64 --num-seqs-start 10 --num-seqs-stop 120 --num-seqs-inc 5
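To get a rough sense of what the script measures, the sketch below times repeated forward and backward passes through a small stack of attention layers with CUDA events. The layer count, tensor shapes, iteration count, and the tuple-style forward return are illustrative assumptions, not the script's exact settings.

```python
import torch
from apex.contrib.multihead_attn import SelfMultiheadAttn

# Illustrative timing loop; shapes and layer count are assumptions,
# not the exact settings used by perf_test_multihead_attn.py.
seq_len, batch, hidden_dim, heads, num_layers = 64, 80, 1024, 16, 6

layers = [SelfMultiheadAttn(hidden_dim, heads, dropout=0.1, bias=False,
                            include_norm_add=False, impl='fast').cuda().half()
          for _ in range(num_layers)]

inputs = torch.randn(seq_len, batch, hidden_dim, dtype=torch.float16,
                     device='cuda', requires_grad=True)

start = torch.cuda.Event(enable_timing=True)
stop = torch.cuda.Event(enable_timing=True)

torch.cuda.synchronize()
start.record()
for _ in range(10):
    x = inputs
    for layer in layers:
        # Forward return assumed to mirror nn.MultiheadAttention: (output, attn_weights).
        x, _ = layer(x, x, x, key_padding_mask=None,
                     need_weights=False, attn_mask=None)
    x.sum().backward()
stop.record()
torch.cuda.synchronize()
print(f'elapsed: {start.elapsed_time(stop):.1f} ms '
      f'over 10 iterations of {num_layers} layers')
```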

Performance Comparisons

  • Performance was measured with 64-token sequence lengths on an NVIDIA Titan V card.
  • Time is measured across multiple layers to simulate an in-model scenario.

Plots: Multihead Attention Forward and Multihead Attention Backward timing comparisons.