Fairseq(-py) is a sequence modeling toolkit that allows researchers and developers to train custom models for translation, summarization, language modeling, and other text generation tasks.

This clone of fairseq supports Knowledge Distillation, Recurrent Stacking, LoRA, RoPE, YaRN, and ALiBi for the Transformer model and the translation task. You can add the following flags to `fairseq-train`/`fairseq-interactive`/`fairseq-generate` to use them:
| Name and Citation | Description | Flags to Activate | Source |
|---|---|---|---|
| Knowledge Distillation (Hinton et al., Kim & Rush, Wang et al., Gumma et al.) | Transfers soft information from a pretrained teacher model to a smaller student model | `--teacher-checkpoint-path $teacher_ckpt --task translation_with_kd --criterion label_smoothed_cross_entropy_with_kd --kd-args '{"strategy": "word_level"}'` | Selective Distillation |
| Recurrent Stacking (Dabre & Fujita) | Extreme parameter sharing technique in which all layers in the encoder/decoder are shared | `--encoder-recurrent-stacking $encoder_recurrent_stacking --decoder-recurrent-stacking $decoder_recurrent_stacking` | - |
| Low-Rank Adaptation (LoRA) (Hu et al.) | Efficient model adaptation technique that modifies a small number of model parameters while freezing the rest | `--lora-args '{"r": 8, "alpha": 16, "dropout": 0.05, "bias": "none", "target_modules": "k_proj,v_proj"}' --use-native-attention --load-checkpoint-liberally` | LoRA Implementation |
| Rotary Positional Embedding (RoPE) (Su et al.) | Encodes absolute position with a rotation matrix and incorporates explicit relative position dependency in the self-attention formulation | `--rope-args '{"max_position_embeddings": 2048, "base": 10000, "type": "vanilla"}' --use-native-attention --no-token-positional-embeddings` | RoPE Implementation |
| Yet another RoPE extensioN method (YaRN) (Peng et al.) | Compute-efficient method to extend the context window of models | `--yarn-args '{"max_position_embeddings": 2048, "base": 10000, "type": "vanilla", "original_max_position_embeddings": 256, "extrapolation_factor": 1, "attn_factor": 1, "beta_fast": 32, "beta_slow": 1}' --use-native-attention --no-token-positional-embeddings` | YaRN Implementation |
| Attention with Linear Biases (ALiBi) (Press et al.) | Simple and efficient position method that biases query-key attention scores with a penalty proportional to their distance | `--alibi-args '{"alibi_asymmetrical": "false"}' --no-token-positional-embeddings --load-checkpoint-liberally` | ALiBi Implementation |
| Factorized Embedding Parameterization (Lan et al.) | Parameterizes large embeddings by adding an intermediate bottleneck layer | `--encoder-factorized-embed-dim $encoder_fac_embed_dim --decoder-factorized-embed-dim $decoder_fac_embed_dim --factorized-embed-activation-fn $fac_embed_activation_fn` | - |
| Penultimate Linear Transformation Activation | Adds an activation to the penultimate linear transformation before the final projection onto the vocabulary | `--decoder-output-activation-fn $decoder_out_activation_fn` | - |
| Sanity Validation Steps | Runs a full pass over the validation set at the beginning of training | `--run-sanity-validation-steps` | - |
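For example, here is a minimal sketch of a training command that enables RoPE from the table above. The data directory, architecture, and optimization flags are illustrative placeholders, not settings prescribed by this repo:

```bash
# Illustrative sketch: $DATA_BIN points to a dataset binarized with fairseq-preprocess.
fairseq-train $DATA_BIN \
    --task translation \
    --arch transformer \
    --optimizer adam --lr 5e-4 --lr-scheduler inverse_sqrt --warmup-updates 4000 \
    --criterion label_smoothed_cross_entropy --label-smoothing 0.1 \
    --max-tokens 4096 \
    --rope-args '{"max_position_embeddings": 2048, "base": 10000, "type": "vanilla"}' \
    --use-native-attention --no-token-positional-embeddings
```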
- PyTorch version >= 2.1.1
- Python version >= 3.8
- For training new models, you'll also need an NVIDIA GPU and NCCL
- To install fairseq and develop locally:

```bash
git clone https://github.com/VarunGumma/fairseq
cd fairseq
pip install -e ./
```

- For faster training, install NVIDIA's apex library:

```bash
git clone https://github.com/NVIDIA/apex
cd apex
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" \
    --global-option="--deprecated_fused_adam" --global-option="--xentropy" \
    --global-option="--fast_multihead_attn" ./
```

- For large datasets, install PyArrow:

```bash
pip install pyarrow
```

- If you use Docker, make sure to increase the shared memory size, either with `--ipc=host` or `--shm-size` as command-line options to `nvidia-docker run`.
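As a rough example of where those options go (the image name is a placeholder, not something shipped with this repo):

```bash
# Illustrative only: replace $IMAGE with a CUDA-enabled image that has fairseq installed.
nvidia-docker run --ipc=host -it $IMAGE bash
```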
fairseq(-py) is MIT-licensed.
The license applies to the pre-trained models as well.
Please cite as:
```bibtex
@misc{gumma2024fairseq,
  author = {Varun Gumma},
  title = {fairseq: A Fast, Extensible Toolkit for Sequence Modeling},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/VarunGumma/fairseq}},
}
```

```bibtex
@inproceedings{ott2019fairseq,
  title = {fairseq: A Fast, Extensible Toolkit for Sequence Modeling},
  author = {Myle Ott and Sergey Edunov and Alexei Baevski and Angela Fan and Sam Gross and Nathan Ng and David Grangier and Michael Auli},
  booktitle = {Proceedings of NAACL-HLT 2019: Demonstrations},
  year = {2019},
}
```
I will try my best to keep this repo synced with the upstream fairseq repository. This clone is under active development and may occasionally have broken features, so feel free to raise issues or open pull requests to fix bugs or introduce new features.