afiaka87 edited this page Apr 12, 2021 · 5 revisions

Training with attention

By default DALLE will use full attention for all layers, but you can specify the attention type per layer as follows.

  • full — full attention
  • axial_row — axial attention, along the rows of the image feature map
  • axial_col — axial attention, along the columns of the image feature map
  • conv_like — convolution-like attention, for the image feature map
dalle = DALLE(
    # ...
    attn_types = ('full', 'axial_row', 'axial_col', 'conv_like')  # cycles between these four types of attention
)
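The axial variants can be pictured as running attention along one spatial axis at a time. As a rough sketch (an illustration of the idea, not DALLE-pytorch's actual implementation), attending along rows or columns amounts to folding the other axis into the batch dimension so each row (or column) is treated as an independent sequence:

```python
import torch

# toy image feature map: (batch, height, width, dim)
b, h, w, d = 1, 4, 4, 8
x = torch.randn(b, h, w, d)

# axial_row: each of the b * h rows becomes its own sequence of length w,
# so attention runs along the width axis only
rows = x.reshape(b * h, w, d)

# axial_col: swap the spatial axes first, so each of the b * w columns
# becomes a sequence of length h and attention runs along the height axis
cols = x.permute(0, 2, 1, 3).reshape(b * w, h, d)

print(rows.shape, cols.shape)
```

Because each axial pass attends over only one axis, its cost scales with the axis length rather than the full number of image tokens, which is where the VRAM/runtime savings come from.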

Each type is an attempt to replicate the scant details OpenAI has published about its attention scheme.
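Since the tuple is cycled across the transformer's depth, the per-layer assignment for, say, a depth-6 model can be sketched with itertools (this only illustrates the cycling behavior; layer assignment inside DALLE-pytorch is handled for you):

```python
from itertools import cycle, islice

attn_types = ('full', 'axial_row', 'axial_col', 'conv_like')
depth = 6  # hypothetical transformer depth

# cycle through attn_types until every layer has a type
per_layer = list(islice(cycle(attn_types), depth))
print(per_layer)
# → ['full', 'axial_row', 'axial_col', 'conv_like', 'full', 'axial_row']
```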

What to use:

When in doubt - and if you don't need the VRAM/runtime savings, train with:

attn_types = ('full',)  # note the trailing comma: ('full') is just the string 'full', not a tuple

Sparse Attention - Requires CUDA 10.1 and a V100 GPU (for now):

If you meet these requirements, sparse attention is worth installing.

Install DeepSpeed first, then pass 'sparse' as an attention type:

dalle = DALLE(
    # ...
    attn_types = ('full', 'sparse')  # cycles between full and sparse attention
)