Releases: huggingface/transformers

v4.38.2

01 Mar 03:24

Fix backward compatibility issues with Llama and Gemma:

This patch mostly ensures that performance is not affected by the new RoPE paradigm. The RoPE computation is fixed to always run in float32, and the causal_mask dtype is now bool so that it takes less RAM.
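To see why position-dependent computations such as RoPE are sensitive to precision, the sketch below rounds a few position indices to half and single precision using only the standard library. It is an illustration of the numeric effect, not the transformers code; the helper names and the chosen positions are assumptions.

```python
import struct


def to_half(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 half-precision (float16) value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def to_single(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 single-precision (float32) value."""
    return struct.unpack("f", struct.pack("f", x))[0]


# Illustrative positions past 2048, where float16 can no longer
# represent every integer exactly.
positions = [4096.0, 4097.0, 4099.0]

half_positions = [to_half(p) for p in positions]
single_positions = [to_single(p) for p in positions]

print(half_positions)    # neighbouring positions collapse onto the same value
print(single_positions)  # every position stays distinct
```

In half precision two of the three positions become indistinguishable, so the rotary angles derived from them would be identical; in single precision they remain distinct.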

YOLOS had a regression, and the Llama and T5 tokenizers emitted a spurious warning.

  • FIX [Gemma] Fix bad rebase with transformers main (#29170)
  • Improve _update_causal_mask performance (#29210)
  • [T5 and Llama Tokenizer] remove warning (#29346)
  • [Llama ROPE] Fix torch export but also slow downs in forward (#29198)
  • RoPE loses precision for Llama / Gemma + Gemma logits.float() (#29285)
  • Patch YOLOS and others (#29353)
  • Use torch.bool instead of torch.int64 for non-persistant causal mask buffer (#29241)
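The saving from switching the causal mask buffer from int64 to bool can be estimated with simple arithmetic: a bool element takes 1 byte, an int64 element takes 8. The sketch below assumes a square mask and an illustrative size of 4096 positions; the function name and the size are assumptions, not values taken from the library.

```python
def causal_mask_bytes(max_positions: int, bytes_per_element: int) -> int:
    """Memory used by a square (max_positions x max_positions) mask buffer."""
    return max_positions * max_positions * bytes_per_element


max_positions = 4096  # illustrative size, not a library default

int64_bytes = causal_mask_bytes(max_positions, 8)  # torch.int64: 8 bytes per element
bool_bytes = causal_mask_bytes(max_positions, 1)   # torch.bool: 1 byte per element

print(f"int64: {int64_bytes // 2**20} MiB, bool: {bool_bytes // 2**20} MiB")
```

For this size the buffer shrinks from 128 MiB to 16 MiB.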

v4.38.1

22 Feb 00:24

Fix eager attention in Gemma!

TLDR:

-        attn_output = attn_output.reshape(bsz, q_len, self.hidden_size)
+        attn_output = attn_output.view(bsz, q_len, -1)
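A likely reason the change matters is that the attention output carries num_heads * head_dim features per token, which does not have to equal hidden_size. The sketch below mimics how `-1` infers the last dimension from the element count. All dimension values are hypothetical and chosen only to make the two quantities differ.

```python
def infer_last_dim(num_elements: int, bsz: int, q_len: int) -> int:
    """Mimic `view(bsz, q_len, -1)`: infer the last dimension from the element count."""
    if num_elements % (bsz * q_len) != 0:
        raise ValueError("element count is not divisible by bsz * q_len")
    return num_elements // (bsz * q_len)


# Hypothetical dimensions where num_heads * head_dim != hidden_size.
bsz, q_len = 1, 8
num_heads, head_dim = 16, 256
hidden_size = 3072

num_elements = bsz * q_len * num_heads * head_dim

# The inferred last dimension follows the data actually present.
last_dim = infer_last_dim(num_elements, bsz, q_len)

# A hard-coded hidden_size only works if the element counts happen to match.
fits_hidden_size = num_elements == bsz * q_len * hidden_size

print(last_dim, fits_hidden_size)
```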

v4.38: Gemma, Depth Anything, Stable LM; Static Cache, HF Quantizer, AQLM

21 Feb 13:40

New model additions

💎 Gemma 💎

Gemma is a new open-source language model series from Google AI that comes in 2B and 7B variants. The release includes pre-trained and instruction fine-tuned versions, which you can use via AutoModelForCausalLM, GemmaForCausalLM, or the pipeline interface!

Read more about it in the Gemma release blogpost: https://hf.co/blog/gemma

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")
model = AutoModelForCausalLM.from_pretrained("google/gemma-2b", device_map="auto", torch_dtype=torch.float16)

input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids)

You can use the model with Flash Attention, SDPA, static cache, and the quantization API for further optimization!

  • Flash Attention 2
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b", device_map="auto", torch_dtype=torch.float16, attn_implementation="flash_attention_2"
)

input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids)
  • bitsandbytes-4bit
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b", device_map="auto", load_in_4bit=True
)

input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids)
  • Static Cache
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("google/gemma-2b")

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-2b", device_map="auto"
)

model.generation_config.cache_implementation = "static"

input_text = "Write me a poem about Machine Learning."
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

outputs = model.generate(**input_ids)

Depth Anything Model

The Depth Anything model was proposed in Depth Anything: Unleashing the Power of Large-Scale Unlabeled Data by Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, Hengshuang Zhao. Depth Anything is based on the DPT architecture, trained on ~62 million images, obtaining state-of-the-art results for both relative and absolute depth estimation.

Stable LM

StableLM 3B 4E1T was proposed in StableLM 3B 4E1T: Technical Report by Stability AI and is the first model in a series of multi-epoch pre-trained language models.

StableLM 3B 4E1T is a decoder-only base language model pre-trained on 1 trillion tokens of diverse English and code datasets for four epochs. The model architecture is transformer-based with partial Rotary Position Embeddings, SwiGLU activation, LayerNorm, etc.

The team also provides StableLM Zephyr 3B, an instruction fine-tuned version of the model that can be used for chat-based applications.

⚡️ Static cache was introduced in the following PRs ⚡️

Static past key value cache allows LlamaForCausalLM's forward pass to be compiled using torch.compile!
This means that (CUDA) graphs can be used for inference, which speeds up the decoding step by 4x!
A forward pass of Llama2 7B takes around 10.5 ms to run with this on an A100, equivalent to TGI performance! ⚡️

⚠️ Support for generate is not included yet. This feature is experimental and subject to changes in subsequent releases.

from transformers import AutoTokenizer, AutoModelForCausalLM, StaticCache
import torch
import os

# compilation triggers multiprocessing
os.environ["TOKENIZERS_PARALLELISM"] = "true"

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",
    device_map="auto",
    torch_dtype=torch.float16
)

# set up the static cache in advance of using the model
model._setup_cache(StaticCache, max_batch_size=1, max_cache_len=128)

# trigger compilation!
compiled_model = torch.compile(model, mode="reduce-overhead", fullgraph=True)

# run the model as usual
input_text = "A few facts about the universe: "
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda").input_ids
model_outputs = compiled_model(input_ids)
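The idea behind a static cache can be sketched without any framework: the key/value buffers are allocated once at their maximum size and written in place, so their shape never changes between decoding steps, which is what makes the forward pass friendly to compilation. The class below is a toy illustration with hypothetical names, not the library's StaticCache.

```python
class ToyStaticCache:
    """Toy key/value cache with fixed-size, pre-allocated buffers."""

    def __init__(self, max_cache_len: int, head_dim: int):
        # Allocate everything up front; the buffers are never resized.
        self.keys = [[0.0] * head_dim for _ in range(max_cache_len)]
        self.values = [[0.0] * head_dim for _ in range(max_cache_len)]
        self.seen_tokens = 0

    def update(self, key, value):
        """Write one token's key/value into the next free slot, in place."""
        if self.seen_tokens >= len(self.keys):
            raise ValueError("cache is full: max_cache_len exceeded")
        self.keys[self.seen_tokens][:] = key
        self.values[self.seen_tokens][:] = value
        self.seen_tokens += 1
        return self.keys, self.values


cache = ToyStaticCache(max_cache_len=4, head_dim=2)
cache.update([1.0, 2.0], [3.0, 4.0])
cache.update([5.0, 6.0], [7.0, 8.0])

# The buffer length stays at max_cache_len regardless of how many tokens were seen.
print(len(cache.keys), cache.seen_tokens)
```

A dynamic cache would instead grow by one entry per step, changing tensor shapes on every call.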

Quantization

🧼 HF Quantizer 🧼

HfQuantizer makes it easy for quantization method researchers and developers to add inference and / or quantization support in 🤗 transformers. If you are interested in adding the support for new methods, please refer to this documentation page: https://huggingface.co/docs/transformers/main/en/hf_quantizer

⚡️AQLM ⚡️

AQLM is a new quantization method that enables 2-bit precision without performance degradation. Check out this demo about how to run Mixtral in 2-bit on a free-tier Google Colab instance: https://huggingface.co/posts/ybelkada/434200761252287

🧼 Moving canonical repositories 🧼

The canonical repositories on the Hugging Face Hub (models that did not have an organization, like bert-base-cased) have been moved under organizations.

You can find the entire list of models moved here: https://huggingface.co/collections/julien-c/canonical-models-65ae66e29d5b422218567567

Redirection has been set up so that your code continues working even if you continue calling the previous paths. We, however, still encourage you to update your code to use the new links so that it is entirely future proof.

Flax Improvements 🚀

The Mistral model was added to the library in Flax.

TensorFlow Improvements 🚀

With Keras 3 becoming the standard version of Keras in TensorFlow 2.16, we've made some internal changes to maintain compatibility. We now have full compatibility with TF 2.16 as long as the tf-keras compatibility package is installed. We've also taken the opportunity to do some cleanup - in particular, the objects like BatchEncoding that are returned by our tokenizers and processors can now be directly passed to Keras methods like model.fit(), which should simplify a lot of code and eliminate a long-standing source of annoyances.

Pre-Trained backbone weights 🚀

Enables loading pretrained backbones in a new model, where all other weights are randomly initialized. Note: validation checks are still in place when creating a config, so passing in use_pretrained_backbone=True will raise an error. You can override this by setting config.use_pretrained_backbone = True after creating a config. However, it is not yet guaranteed to be fully backwards compatible.

from transformers import MaskFormerConfig, MaskFormerModel

config = MaskFormerConfig(
    use_pretrained_backbone=False,
    backbone="microsoft/resnet-18"
)
config.use_pretrained_backbone = True
# Both models have resnet-18 backbone weights and all other weights randomly
# initialized
model_1 = MaskFormerModel(config)
model_2 = MaskFormerModel(config)

Introduces a helper function, load_backbone, that loads a backbone either from a backbone's own model config (e.g. ResNetConfig) or from a model config that contains backbone information. This enables cleaner modeling files and cross-loading between timm and transformers backbones.

from transformers import ResNetConfig, MaskFormerConfig
from transformers.utils.backbone_utils import load_backbone

# Resnet defines the backbone model to load
config = ResNetConfig()
backbone = load_backbone(config)

# Maskformer config defines a model which uses a resnet backbone
config = MaskFormerConfig(use_timm_backbone=True, backbone="resnet18")
backbone = load_backbone(config)

config = MaskFormerConfig(backbone_config=ResNetConfig())
backbone = load_backbone(config)
  • [Backbone] Use `load_backbone...

Patch release v4.37.2

29 Jan 16:11

Selection of fixes

  • Protecting the imports for SigLIP's tokenizer if sentencepiece isn't installed
  • Fix permissions issue on Windows machines when using Trainer in a multi-node setup
  • Allow disabling safe serialization when using Trainer. Needed for Neuron SDK
  • Fix error when loading processor from cache
  • torch < 1.13 compatible torch.load

Commits

  • [Siglip] protect from imports if sentencepiece not installed (#28737)
  • Fix weights_only (#28725)
  • Enable safetensors conversion from PyTorch to other frameworks without the torch requirement (#27599)
  • Don't fail when LocalEntryNotFoundError during processor_config.json loading (#28709)
  • Use save_safetensor to disable safe serialization for XLA (#28669)
  • Fix windows err with checkpoint race conditions (#28637)
  • [SigLIP] Only import tokenizer if sentencepiece available (#28636)

Patch release: v4.37.1

24 Jan 16:15

A patch release to resolve import errors from removed custom types in generation utils

  • Add back in generation types #28681

v4.37 Qwen2, Phi-2, SigLIP, ViP-LLaVA, FastSpeech2Conformer, 4-bit serialization, Whisper longform generation

22 Jan 11:20

Model releases

Qwen2

Qwen2 is the new series of large language models from the Qwen team. It follows the earlier Qwen series, which included Qwen-72B, Qwen-1.8B, Qwen-VL, Qwen-Audio, etc.

Qwen2 is a language model series including decoder language models of different model sizes. For each size, we release the base language model and the aligned chat model. It is based on the Transformer architecture with SwiGLU activation, attention QKV bias, group query attention, a mixture of sliding window attention and full attention, etc. Additionally, we have an improved tokenizer adapted to multiple natural languages and code.

Phi-2

Phi-2 is a transformer language model trained by Microsoft with exceptionally strong performance for its small size of 2.7 billion parameters. It was previously available as a custom code model, but has now been fully integrated into transformers.

SigLIP

The SigLIP model was proposed in Sigmoid Loss for Language Image Pre-Training by Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, Lucas Beyer. SigLIP proposes to replace the loss function used in CLIP by a simple pairwise sigmoid loss. This results in better performance in terms of zero-shot classification accuracy on ImageNet.
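The pairwise sigmoid loss can be sketched in plain Python: every image-text pair is treated as an independent binary classification, with matching pairs labelled +1 and all other pairs labelled -1. This is an illustrative reading of the loss described in the paper, not the transformers implementation; the function names and the default temperature and bias values are assumptions.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def pairwise_sigmoid_loss(similarities, temperature=10.0, bias=-10.0):
    """Sigmoid loss over an n x n matrix of image-text similarities.

    similarities[i][j] is the similarity between image i and text j;
    the diagonal holds the matching pairs.
    """
    n = len(similarities)
    total = 0.0
    for i in range(n):
        for j in range(n):
            label = 1.0 if i == j else -1.0
            logit = temperature * similarities[i][j] + bias
            total += log_sigmoid(label * logit)
    return -total / n


# Matching pairs are similar, non-matching pairs are not.
aligned = [[1.0, 0.0], [0.0, 1.0]]
# The opposite: each image is most similar to the wrong text.
misaligned = [[0.0, 1.0], [1.0, 0.0]]

aligned_loss = pairwise_sigmoid_loss(aligned)
misaligned_loss = pairwise_sigmoid_loss(misaligned)
print(aligned_loss, misaligned_loss)
```

Unlike the softmax-based contrastive loss in CLIP, no normalization across the batch is needed, since each pair contributes its own term.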

ViP-LLaVA

The VipLlava model was proposed in Making Large Multimodal Models Understand Arbitrary Visual Prompts by Mu Cai, Haotian Liu, Siva Karthik Mustikovela, Gregory P. Meyer, Yuning Chai, Dennis Park, Yong Jae Lee.

VipLlava enhances the training protocol of Llava by marking images and interacting with the model using natural cues like a “red bounding box” or “pointed arrow” during training.

FastSpeech2Conformer

The FastSpeech2Conformer model was proposed in the paper Recent Developments On Espnet Toolkit Boosted By Conformer by Pengcheng Guo, Florian Boyer, Xuankai Chang, Tomoki Hayashi, Yosuke Higuchi, Hirofumi Inaguma, Naoyuki Kamo, Chenda Li, Daniel Garcia-Romero, Jiatong Shi, Jing Shi, Shinji Watanabe, Kun Wei, Wangyou Zhang, and Yuekai Zhang.

FastSpeech 2 is a non-autoregressive model for text-to-speech (TTS) synthesis that builds upon FastSpeech, showing improvements in training speed, inference speed, and voice quality. It consists of a variance adapter; duration, energy, and pitch predictors; and waveform and mel-spectrogram decoders.

Wav2Vec2-BERT

The Wav2Vec2-BERT model was proposed in Seamless: Multilingual Expressive and Streaming Speech Translation by the Seamless Communication team from Meta AI.

This model was pre-trained on 4.5M hours of unlabeled audio data covering more than 143 languages. It requires finetuning to be used for downstream tasks such as Automatic Speech Recognition (ASR), or Audio Classification.

4-bit serialization

Enables saving and loading transformers models in 4-bit formats: you can now push bitsandbytes 4-bit weights to the Hugging Face Hub. To save 4-bit models and push them to the Hub, install the latest bitsandbytes package from PyPI (pip install -U bitsandbytes), load your model in 4-bit precision, and call save_pretrained / push_to_hub. An example:

from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "facebook/opt-125m"
model = AutoModelForCausalLM.from_pretrained(model_id, load_in_4bit=True)

model.push_to_hub("ybelkada/opt-125m-bnb-4bit")

4D Attention mask

Enables passing 4D attention masks to models that support them. This is useful for reducing the memory footprint of certain generation tasks.
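As an illustration of what a 4D mask can express that a 2D padding mask cannot, the sketch below builds a mask of shape [batch, heads, query_len, kv_len] for several sequences packed into a single row, so that each token attends only to earlier tokens of its own sequence. It uses nested lists and a True-means-attend convention; both the helper name and that convention are assumptions, and real models expect a tensor in whatever convention they document.

```python
def packed_causal_mask(sequence_lengths):
    """Build a [1, 1, total, total] boolean mask for sequences packed in one row.

    True marks key positions that a query position may attend to.
    """
    # Segment id of every packed position, e.g. [2, 3] -> [0, 0, 1, 1, 1].
    segment_ids = []
    for segment, length in enumerate(sequence_lengths):
        segment_ids.extend([segment] * length)

    total = len(segment_ids)
    mask_2d = [
        [segment_ids[q] == segment_ids[k] and k <= q for k in range(total)]
        for q in range(total)
    ]
    # Add the leading batch and head dimensions.
    return [[mask_2d]]


mask = packed_causal_mask([2, 3])
for row in mask[0][0]:
    print(["x" if allowed else "." for allowed in row])
```

Packing two prompts this way lets them share one forward pass without attending to each other.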

Improved quantization support

Ability to customise which modules are quantized and which are not.

  • [Awq] Enable the possibility to skip quantization for some target modules by @younesbelkada in #27950
  • add modules_in_block_to_quantize arg in GPTQconfig by @SunMarc in #27956

Added fused modules support

SDPA Support for LLaVa, Mixtral, Mistral

Whisper: Batched state-of-the-art long-form transcription

All decoding strategies (temperature fallback, compression/log-prob/no-speech threshold, ...) of OpenAI's long-form transcription (see: https://github.com/openai/whisper or section 4.5 in paper) have been added. Unlike https://github.com/openai/whisper, Transformers long-form transcription is fully compatible with pure FP16 and batching!

For more information see: #27658.
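The temperature fallback strategy can be sketched as a loop that retries decoding at increasing temperatures until the output looks neither too repetitive nor too unlikely. The sketch below is a simplified illustration, not the Transformers or OpenAI code: decode_fn is a hypothetical callable returning a text and its average log-probability, and the threshold defaults are assumed values.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Ratio of raw to zlib-compressed size; repetitive text compresses very well."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


def decode_with_fallback(
    decode_fn,
    temperatures=(0.0, 0.2, 0.4, 0.6, 0.8, 1.0),
    compression_ratio_threshold=2.4,  # assumed value
    logprob_threshold=-1.0,           # assumed value
):
    """Retry at higher temperatures while the output fails the quality checks."""
    result = None
    for temperature in temperatures:
        text, avg_logprob = decode_fn(temperature)
        result = (text, avg_logprob, temperature)
        too_repetitive = compression_ratio(text) > compression_ratio_threshold
        too_unlikely = avg_logprob < logprob_threshold
        if not (too_repetitive or too_unlikely):
            break  # this output is accepted
    return result


def fake_decode(temperature):
    """Hypothetical decoder: greedy decoding gets stuck in a loop, sampling does not."""
    if temperature == 0.0:
        return "la " * 100, -0.2
    return "The weather is nice today.", -0.3


text, avg_logprob, used_temperature = decode_with_fallback(fake_decode)
print(used_temperature, text)
```

Here the greedy output is rejected for being too compressible, and the first sampled output is kept.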

Generation: assisted generation upgrades, speculative decoding, and ngram speculation

Assisted generation was reworked to accept arbitrary sources of candidate sequences. This enabled us to smoothly integrate ngram speculation, and opens the door for new candidate generation methods. Additionally, we've added the speculative decoding strategy on top of assisted generation: when you call assisted generation with an assistant model and do_sample=True, you'll benefit from the faster speculative decoding sampling 🏎️💨

  • Generate: assisted_decoding now accepts arbitrary candidate generators by @gante in #27751
  • Generate: assisted decoding now uses generate for the assistant by @gante in #28031
  • Generate: speculative decoding by @gante in #27979
  • Generate: fix speculative decoding by @gante in #28166
  • Adding Prompt lookup decoding by @apoorvumang in #27775
  • Fix _speculative_sampling implementation by @ofirzaf in #28508
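The n-gram idea behind prompt lookup can be sketched in a few lines: look for an earlier occurrence of the most recent n-gram in the token sequence and propose the tokens that followed it as candidates, which the main model then verifies. This is a minimal sketch under that reading, not the candidate generator shipped in the library; the function name and parameters are hypothetical.

```python
def prompt_lookup_candidates(tokens, ngram_size=2, num_candidates=3):
    """Propose continuation tokens by matching the trailing n-gram earlier in `tokens`."""
    if len(tokens) <= ngram_size:
        return []
    tail = tokens[-ngram_size:]
    # Scan from the most recent earlier position back to the start,
    # skipping the trailing n-gram itself.
    for start in range(len(tokens) - ngram_size - 1, -1, -1):
        if tokens[start:start + ngram_size] == tail:
            end = start + ngram_size
            follow = tokens[end:end + num_candidates]
            if follow:
                return follow
    return []


# The sequence ends with [5, 6], which also appears at the start,
# so the tokens that followed it there become the candidates.
candidates = prompt_lookup_candidates([5, 6, 7, 8, 9, 5, 6])
no_match = prompt_lookup_candidates([1, 2, 3, 4])
print(candidates, no_match)
```

Because candidates are only proposals, a wrong guess costs some wasted verification but does not change the generated text.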

torch.load pickle protection

Adding pickle protection via weights_only=True in the torch.load calls.
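The risk being addressed is that unpickling can resolve and call arbitrary globals. The standard library documents a way to restrict which globals an unpickler may load, and the sketch below uses it to illustrate the general idea of an allow-list. It is an analogy for the protection, not how torch.load implements weights_only=True; the class name and the allow-list are assumptions.

```python
import builtins
import io
import os
import pickle

# Only these harmless builtins may be resolved while unpickling.
SAFE_BUILTINS = {"dict", "list", "tuple", "set", "frozenset"}


class RestrictedUnpickler(pickle.Unpickler):
    def find_class(self, module, name):
        if module == "builtins" and name in SAFE_BUILTINS:
            return getattr(builtins, name)
        # Refuse everything else, including functions such as os.system.
        raise pickle.UnpicklingError(f"global '{module}.{name}' is forbidden")


def restricted_loads(data: bytes):
    """Like pickle.loads, but with the allow-list above."""
    return RestrictedUnpickler(io.BytesIO(data)).load()


# Plain containers of numbers load normally.
weights = restricted_loads(pickle.dumps({"layer.weight": [0.1, 0.2]}))

# A payload that references an arbitrary callable is rejected.
try:
    restricted_loads(pickle.dumps(os.system))
    blocked = False
except pickle.UnpicklingError:
    blocked = True

print(weights, blocked)
```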

Build methods for TensorFlow Models

Unlike PyTorch, TensorFlow models build their weights "lazily" after model initialization, using the shape of their inputs to figure out what their weight shapes should be. We previously needed a full forward pass through TF models to ensure that all layers received an input they could use to build their weights, but with this change we now have proper build() methods that can correctly infer shapes and build model weights. This avoids a whole range of potential issues, as well as significantly accelerating model load times.

Remove support for torch 1.10

The last version to support PyTorch 1.10 was 4.36.x. As it has been more than 2 years, and we're looking forward to using features available in PyTorch 1.11 and up, we do not support PyTorch 1.10 for v4.37 (i.e. we don't run the tests against torch 1.10).

Model tagging

You can now add custom tags to your model before pushing it to the Hub! This lets you filter models that carry that tag on the Hub with a simple URL filter. For example, to filter models that have the trl tag, you can search: https://huggingface.co/models?other=trl&sort=created

  • [core/ FEAT] Add the possibility to push custom tags using PreTrainedModel itself by @younesbelkada in #28405 - e.g.
from transformers import AutoModelForCausalLM

model_name = "HuggingFaceM4/tiny-random-LlamaForCausalLM"
model = AutoModelForCausalLM.from_pretrained(model_name)

model.add_model_tags(["tag-test"])
model.push_to_hub("llama-tagged")

Bugfixes and improvements


Patch release: v4.36.2

18 Dec 18:44

Patch release to resolve some critical issues relating to the recent cache refactor, the flash attention refactor, and training in multi-GPU and multi-node settings:

  • Resolve training bug with PEFT + GC #28031
  • Resolve cache issue when going beyond context window for Mistral/Mixtral FA2 #28037
  • Re-enable passing config to from_pretrained with FA #28043
  • Fix resuming from checkpoint when using FSDP with FULL_STATE_DICT #27891
  • Resolve bug when saving a checkpoint in the multi-node setting #28078

Patch release: v4.36.1

14 Dec 06:57

A patch release, mostly for critical torch issues:

  • Fix SDPA correctness following torch==2.1.2 regression #27973
  • [Tokenizer Serialization] Fix the broken serialisation #27099
  • Fix bug with rotating checkpoints #28009
  • Hot-fix-mixstral-loss (#27948)

🔥

v4.36: Mixtral, Llava/BakLlava, SeamlessM4T v2, AMD ROCm, F.sdpa wide-spread support

11 Dec 12:12

New model additions

Mixtral

Mixtral is the new open-source model from Mistral AI, announced in the blog post Mixtral of Experts. According to the benchmark results shared in the release blog post, the model has capabilities comparable to ChatGPT.

The architecture is a sparse Mixture of Experts with a Top-2 routing strategy, similar to the NllbMoe architecture in transformers. You can use it through the AutoModelForCausalLM interface:

>>> import torch
>>> from transformers import AutoModelForCausalLM, AutoTokenizer

>>> # device_map="auto" places the weights on the available device(s)
>>> model = AutoModelForCausalLM.from_pretrained("mistralai/Mixtral-8x7B", torch_dtype=torch.float16, device_map="auto")
>>> tokenizer = AutoTokenizer.from_pretrained("mistralai/Mixtral-8x7B")

>>> prompt = "My favourite condiment is"

>>> # move the inputs to the device that holds the model's first weights
>>> model_inputs = tokenizer([prompt], return_tensors="pt").to(model.device)

>>> generated_ids = model.generate(**model_inputs, max_new_tokens=100, do_sample=True)
>>> tokenizer.batch_decode(generated_ids)[0]

The model is compatible with existing optimisation tools such as Flash Attention 2, bitsandbytes, and the PEFT library. The checkpoints are released under the mistralai organisation on the Hugging Face Hub.
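The Top-2 routing strategy can be sketched for a single token: a router scores every expert, the two highest-scoring experts are selected, and their weights are renormalized to sum to one before their outputs are mixed. This is a generic illustration of top-2 gating, not the Mixtral modeling code; the function name and the example logits are hypothetical.

```python
import math


def top2_route(router_logits):
    """Return [(expert_index, weight), (expert_index, weight)] for one token."""
    # Softmax over all experts (shifted by the max for numerical stability).
    peak = max(router_logits)
    exps = [math.exp(logit - peak) for logit in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the two most probable experts and renormalize their weights.
    top2 = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:2]
    norm = sum(probs[i] for i in top2)
    return [(i, probs[i] / norm) for i in top2]


# Hypothetical router scores for four experts.
routing = top2_route([0.1, 2.0, -1.0, 1.0])
print(routing)
```

Only the selected experts run for that token, which is what makes the layer sparse.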

Llava / BakLlava

Llava is an open-source chatbot trained by fine-tuning LLaMA/Vicuna on GPT-generated multimodal instruction-following data. It is an auto-regressive language model based on the transformer architecture. In other words, it is a multimodal version of LLMs fine-tuned for chat / instructions.

The Llava model was proposed in Improved Baselines with Visual Instruction Tuning by Haotian Liu, Chunyuan Li, Yuheng Li and Yong Jae Lee.

The integration also includes BakLlava, which is a Llava model trained with a Mistral backbone.

The model is compatible with the "image-to-text" pipeline:

from transformers import pipeline
from PIL import Image
import requests

model_id = "llava-hf/llava-1.5-7b-hf"
pipe = pipeline("image-to-text", model=model_id)
url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/ai2d-demo.jpg"

image = Image.open(requests.get(url, stream=True).raw)
prompt = "USER: <image>\nWhat does the label 15 represent? (1) lava (2) core (3) tunnel (4) ash cloud\nASSISTANT:"

outputs = pipe(image, prompt=prompt, generate_kwargs={"max_new_tokens": 200})
print(outputs)

You can find all Llava weights under the llava-hf organisation on the Hub.

SeamlessM4T v2

SeamlessM4T-v2 is a collection of models designed to provide high quality translation, allowing people from different linguistic communities to communicate effortlessly through speech and text. It is an improvement on the previous version and was proposed in Seamless: Multilingual Expressive and Streaming Speech Translation by the Seamless Communication team from Meta AI.

For more details on the differences between v1 and v2, refer to section Difference with SeamlessM4T-v1.

SeamlessM4T enables multiple tasks without relying on separate models:

  • Speech-to-speech translation (S2ST)
  • Speech-to-text translation (S2TT)
  • Text-to-speech translation (T2ST)
  • Text-to-text translation (T2TT)
  • Automatic speech recognition (ASR)

PatchTST

The PatchTST model was proposed in A Time Series is Worth 64 Words: Long-term Forecasting with Transformers by Yuqi Nie, Nam H. Nguyen, Phanwadee Sinthong and Jayant Kalagnanam.

At a high level, the model vectorizes time series into patches of a given size and encodes the resulting sequence of vectors via a Transformer that then outputs the prediction length forecast via an appropriate head. The model is illustrated in the following figure:

[Figure: PatchTST model architecture]
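The patching step described above can be sketched for a single univariate series: a window of fixed length slides over the series with a given stride, and each window becomes one input vector for the Transformer. The function name and the example values are hypothetical, and details such as padding or per-channel handling are left out.

```python
def patchify(series, patch_length, stride):
    """Split a 1D series into (possibly overlapping) patches of fixed length."""
    last_start = len(series) - patch_length
    return [
        series[start:start + patch_length]
        for start in range(0, last_start + 1, stride)
    ]


series = list(range(10))
patches = patchify(series, patch_length=4, stride=2)
for patch in patches:
    print(patch)
```

With these values, 10 time steps become 4 patches, so the Transformer processes a sequence of 4 vectors instead of 10 scalars.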

PatchTSMixer

The PatchTSMixer model was proposed in TSMixer: Lightweight MLP-Mixer Model for Multivariate Time Series Forecasting by Vijay Ekambaram, Arindam Jati, Nam Nguyen, Phanwadee Sinthong and Jayant Kalagnanam.

PatchTSMixer is a lightweight time-series modeling approach based on the MLP-Mixer architecture. In this HuggingFace implementation, we provide PatchTSMixer’s capabilities to effortlessly facilitate lightweight mixing across patches, channels, and hidden features for effective multivariate time-series modeling. It also supports various attention mechanisms starting from simple gated attention to more complex self-attention blocks that can be customized accordingly. The model can be pretrained and subsequently used for various downstream tasks such as forecasting, classification and regression.

CLVP

The CLVP (Contrastive Language-Voice Pretrained Transformer) model was proposed in Better speech synthesis through scaling by James Betker.

Phi-1/1.5

The Phi-1 model was proposed in Textbooks Are All You Need by Suriya Gunasekar, Yi Zhang, Jyoti Aneja, Caio César Teodoro Mendes, Allie Del Giorno, Sivakanth Gopi, Mojan Javaheripi, Piero Kauffmann, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Harkirat Singh Behl, Xin Wang, Sébastien Bubeck, Ronen Eldan, Adam Tauman Kalai, Yin Tat Lee and Yuanzhi Li.

The Phi-1.5 model was proposed in Textbooks Are All You Need II: phi-1.5 technical report by Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar and Yin Tat Lee.

TVP

The text-visual prompting (TVP) framework was proposed in the paper Text-Visual Prompting for Efficient 2D Temporal Video Grounding by Yimeng Zhang, Xin Chen, Jinghan Jia, Sijia Liu, Ke Ding.

This research addresses temporal video grounding (TVG), which is the process of pinpointing the start and end times of specific events in a long video, as described by a text sentence. Text-visual prompting (TVP) is proposed to enhance TVG. TVP involves integrating specially designed patterns, known as ‘prompts’, into both the visual (image-based) and textual (word-based) input components of a TVG model. These prompts provide additional spatial-temporal context, improving the model’s ability to accurately determine event timings in the video. The approach employs 2D visual inputs in place of 3D ones. Although 3D inputs offer more spatial-temporal detail, they are also more time-consuming to process. The use of 2D inputs with the prompting method aims to provide similar levels of context and accuracy more efficiently.

DINOv2 depth estimation

Depth estimation is added to the DINOv2 implementation.

ROCm support for AMD GPUs

AMD's ROCm GPU architecture is now supported across the board and fully tested in our CI with MI210/MI250 GPUs. We further enable specific hardware acceleration for ROCm in Transformers, such as Flash Attention 2, GPTQ quantization and DeepSpeed.

PyTorch scaled_dot_product_attention native support

PyTorch's torch.nn.functional.scaled_dot_product_attention operator is now supported in the most-used Transformers models and is used by default with torch>=2.1.1, allowing dispatch to the memory-efficient attention and Flash Attention backend implementations with no package other than torch required. This should significantly speed up attention computation on hardware that supports these fast paths.

While Transformers automatically handles the dispatch to use SDPA when available, it is possible to force the usage of a given attention implementation ("eager" being the manual implementation, where each operation is implemented [step by step](https://github.com/huggingface/transformers/blob/9f18cc6df0b7e0d50f78b9e9fc...


Patch release: v4.35.2

15 Nov 16:39

A patch release was made for the following commit:

  • [tokenizers] update tokenizers version pin #27494

to fix all the issues with versioning regarding tokenizers and huggingface_hub