31 Jan 19:58

LysandreJik

d426b58

Patch v2.4.1: FlauBERT for AutoModel and AutoTokenizer

Patched an issue where FlauBERT couldn't be loaded with AutoModel and AutoTokenizer classes.

31 Jan 14:55

LysandreJik

v2.4.0

6664ea9

FlauBERT, MMBT, UmBERTo, Dutch model, improved documentation, training from scratch, clean Python code

FlauBERT, MMBT, UmBERTo

MMBT was added to the list of available models as the first multi-modal model in the library. It accepts a transformer model as well as a computer vision model in order to classify images and text. The MMBT model is from Supervised Multimodal Bitransformers for Classifying Images and Text by Douwe Kiela, Suvrat Bhooshan, Hamed Firooz, Davide Testuggine (https://github.com/facebookresearch/mmbt/). Added by @suvrat96.
A new Dutch BERT model was added under the wietsedv/bert-base-dutch-cased identifier. Added by @wietsedv.
UmBERTo, a RoBERTa-based language model trained on large Italian corpora, was added.
A new French model, FlauBERT, based on XLM, was added. The FlauBERT model is from FlauBERT: Unsupervised Language Model Pre-training for French (https://github.com/getalp/Flaubert). Four checkpoints are added: small, base uncased, base cased and large.

New TF architectures (@jplu)

TensorFlow XLM-RoBERTa was added (@jplu)
TensorFlow CamemBERT was added (@jplu)

Python best practices (@aaugustin)

Greatly improved the quality of the source code by leveraging black, isort and flake8. A test was added, check_code_quality, which checks that the contributions respect the contribution guidelines related to those tools.
Similarly, optional imports are better handled and raise more precise errors.
Cleaned up several requirements files, updated the contribution guidelines and rely on setup.py for the necessary dev dependencies.
You can clean up your code for a PR with the following commands (more details in CONTRIBUTING.md):

make style
make quality

Documentation (@LysandreJik)

The documentation was made consistent and clearer guidelines were defined. This work is part of an ongoing effort to make transformers accessible to a larger audience. A glossary was added, with definitions for the most frequently used inputs.

Furthermore, each model's documentation page now gives tips on using that model.

The code samples are now tested on a weekly basis alongside other slow tests.

Improved repository structure (@aaugustin)

The source code was moved from ./transformers to ./src/transformers. Since it changes the location of the source code, contributors must update their local development environment by uninstalling and re-installing the library.

Python 2 is not supported anymore (@aaugustin )

Version 2.3.0 was the last version to support Python 2. As we begin the year 2020, official Python 2 support has been dropped.

Parallel testing (@aaugustin)

Tests can now be run in parallel.

Sampling sequence generator (@rlouf, @thomwolf )

An abstract method, generate, was added to PreTrainedModel and is implemented in all models trained with causal language modeling (CLM). It offers an API for text generation:

with/without a prompt
with/without beam search
with/without greedy decoding/sampling
with any combination of top-k, top-p and repetition penalties
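
To make the top-k and top-p options more concrete, here is a minimal, library-free sketch of how such filtering can work on a single decoding step. It is an illustration only, not the code behind generate; the function names (softmax, filter_top_k_top_p, sample_next_token) are hypothetical.

```python
import math
import random


def softmax(logits):
    """Convert raw scores into probabilities (numerically stable)."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def filter_top_k_top_p(probs, top_k=0, top_p=1.0):
    """Keep the top-k tokens and/or the smallest nucleus whose mass reaches top_p.

    Everything outside the kept set gets probability 0; the rest is renormalized.
    """
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if top_k > 0:
        order = order[:top_k]
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    total = sum(probs[i] for i in kept)
    return [probs[i] / total if i in kept else 0.0 for i in range(len(probs))]


def sample_next_token(logits, top_k=0, top_p=1.0, rng=random):
    """Sample one token id from the filtered distribution."""
    probs = filter_top_k_top_p(softmax(logits), top_k=top_k, top_p=top_p)
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]
```

In this sketch, top_k=1 reduces sampling to greedy decoding, since only the most likely token keeps a non-zero probability.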

Resuming training when interrupted (@bkkaggle )

Previously, when training was stopped, only the model weights and configuration were saved. The scripts now also save the global step, the current epoch, and the steps trained in the current epoch. When training is resumed, these values are used to pick up where it left off.

This applies to the following scripts: run_glue, run_squad, run_ner, run_xnli.
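
As a rough sketch of the bookkeeping involved, the snippet below derives the epoch and the position within that epoch from a saved global step, then skips the batches that were already processed. The helper names and the assumption that one global step equals one optimizer update are illustrative, not taken from the scripts.

```python
def resume_position(global_step, batches_per_epoch, gradient_accumulation_steps=1):
    """Return (epochs_trained, steps_trained_in_current_epoch).

    Assumes global_step counts optimizer updates and that each update
    consumes gradient_accumulation_steps batches.
    """
    updates_per_epoch = batches_per_epoch // gradient_accumulation_steps
    epochs_trained = global_step // updates_per_epoch
    steps_in_epoch = global_step % updates_per_epoch
    return epochs_trained, steps_in_epoch


def batches_to_run(num_epochs, batches_per_epoch, global_step):
    """Yield the (epoch, batch_index) pairs that still need processing."""
    epochs_trained, steps_done = resume_position(global_step, batches_per_epoch)
    for epoch in range(epochs_trained, num_epochs):
        for batch_index in range(batches_per_epoch):
            if epoch == epochs_trained and batch_index < steps_done:
                continue  # already processed before the interruption
            yield epoch, batch_index
```

For example, with 100 batches per epoch and a saved global step of 250, training would resume in the third epoch (index 2) at batch 50.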

CLI (@julien-c , @mfuntowicz )

Model upload

The CLI now has better documentation.
Files can now be removed.

Pipelines

Expose the number of underlying FastAPI workers
Async forward methods
Fixed the environment variables so that they don't fight each other anymore (USE_TF, USE_TORCH)

Training from scratch (@julien-c )

The run_lm_finetuning.py script now handles training from scratch.

Changes in the configuration (@julien-c )

The configuration files now contain the architecture they refer to. It is no longer necessary to include the architecture in the file name. This should ease the naming of community models.
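
A small sketch of what reading the architecture from a configuration could look like. The key name architectures and the values shown are assumptions made for illustration, and architecture_from_config is a hypothetical helper.

```python
import json

# Hypothetical configuration content; key names and values are illustrative.
config_text = """
{
  "architectures": ["BertForMaskedLM"],
  "hidden_size": 768,
  "num_attention_heads": 12
}
"""


def architecture_from_config(text):
    """Return the first architecture named in a JSON config, or None if absent."""
    config = json.loads(text)
    names = config.get("architectures") or []
    return names[0] if names else None


print(architecture_from_config(config_text))
```

With the architecture stored inside the file, a model directory can be named freely without losing the information needed to pick the right class.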

New Auto models (@thomwolf )

A new type of AutoModel was added: AutoModelForPreTraining. This model returns the base model that was used during the pre-training. For most models it is the base model alongside a language modeling head, whereas for others it is a different model, e.g. BertForPreTraining for BERT.

HANS dataset (@ns-moosavi)

The HANS dataset was added to the examples. It allows for testing a model with adversarial evaluation of natural language.

[BREAKING CHANGES]

Ignored indices in PyTorch loss computing (@LysandreJik)

When using PyTorch, certain values can be ignored when computing the loss. In order for the loss function to understand which indices must be ignored, those have to be set to a certain value. Most of our models required those indices to be set to -1. We decided to set this value to -100 instead as it is PyTorch's default value. This removes the discrepancy between user-implemented losses and the losses integrated in the models.
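
The effect of the ignored index can be sketched without PyTorch. The pure-Python cross-entropy below skips every position labeled -100, and migrate_labels shows one way to convert label lists that still use -1. Both functions are hypothetical helpers written for this illustration.

```python
import math

IGNORE_INDEX = -100  # PyTorch's default ignore_index for its cross-entropy loss


def cross_entropy(logits, labels, ignore_index=IGNORE_INDEX):
    """Mean negative log-likelihood over positions whose label is not ignored.

    Returns 0.0 when every position is ignored (a choice made for this sketch).
    """
    total, counted = 0.0, 0
    for scores, label in zip(logits, labels):
        if label == ignore_index:
            continue
        peak = max(scores)
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in scores))
        total += log_norm - scores[label]
        counted += 1
    return total / counted if counted else 0.0


def migrate_labels(labels, old_ignore=-1, new_ignore=IGNORE_INDEX):
    """Rewrite labels that used the old ignore value (-1) to the new one (-100)."""
    return [new_ignore if label == old_ignore else label for label in labels]


logits = [[2.0, 0.0], [0.0, 2.0], [5.0, -5.0]]
old_labels = [0, 1, -1]  # third position was ignored under the old convention
loss = cross_entropy(logits, migrate_labels(old_labels))
```

Label tensors prepared for earlier versions need the same kind of conversion, otherwise -1 is treated as a real class index.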

Further help from @r0mainK.

Community additions/bug-fixes/improvements

Can now save and load PreTrainedEncoderDecoder objects (@TheEdoardo93)
RoBERTa now bears more similarity to the FairSeq implementation (@DomHudson, @thomwolf)
Examples now better reflect the defaults of the encoding methods (@enzoampil)
TFXLNet now has a correct input mask (@thomwolf)
run_squad was fixed to allow better training for XLNet (@importpandas )
tokenization performance improvement (3-8x) (@mandubian)
RoBERTa was added to the run_squad script (@erenup)
Fixed the special and added tokens tokenization (@vitaliyradchenko)
Fixed an issue with language generation for XLM when using a batch size greater than 1 (@patrickvonplaten)
Fixed an issue with the generate method which did not correctly handle the repetition penalty (@patrickvonplaten)
Completed the documentation for repeating_words_penalty_for_language_generation (@patrickvonplaten)
run_generation now leverages cached past input for models that have access to it (@patrickvonplaten)
Finally managed to patch a rarely occurring bug with DistilBERT, eventually named DistilHeisenBug or HeisenDistilBug (@LysandreJik, with the help of @julien-c and @thomwolf).
Fixed an import error in run_tf_ner (@karajan1001).
Feature conversion for GLUE now has improved logging messages (@simonepri)
Patched an issue with GPUs and run_generation (@alberduris)
Added support for ALBERT and XLMRoBERTa to run_glue
Fixed an issue with the DistilBERT tokenizer not loading correct configurations (@LysandreJik)
Updated the SQuAD for distillation script to leverage the new SQuAD API (@LysandreJik)
Fixed an issue with T5 related to its rp_bucket (@mschrimpf)
PPLM now supports repetition penalties (@IWillPull)
Modified the QA pipeline to consider all features for each example (@Perseus14)
Patched an issue with a file lock (@dimagalat @aaugustin)
The bias should be resized with the weights when resizing a vocabulary projection layer with a new vocabulary size (@LysandreJik)
Fixed misleading token type IDs for RoBERTa. RoBERTa does not use token type IDs, and this has been clarified in the documentation (@LysandreJik). Same for XLM-R (@maksym-del).
Fixed the prepare_for_model when tensorizing and returning token type IDs (@LysandreJik).
Fixed the XLNet model which wouldn't work with torch 1.4 (@julien-c)
Fetch all possible files remotely (@julien-c )
BERT's BasicTokenizer respects never_split parameters (@DeNeutoy)
Added a lower bound to the tqdm dependency (@brendan-ai2)
Fixed glue processors failing on tensorflow datasets (@neonbjb)
XLMRobertaTokenizer can now be serialized (@brandenchan)
A classifier dropout was added to ALBERT (@peteriz)
The ALBERT configurations for v2 models were fixed to be identical to those output by Google (@LysandreJik)

20 Dec 21:40

LysandreJik

v2.3.0

a436574

Downstream NLP task API (feature extraction, text classification, NER, QA), Command-Line Interface and Serving – models: T5 – community-added models: Japanese & Finnish BERT, PPLM, XLM-R

New class `Pipeline` (beta): easily run and use models on down-stream NLP tasks

We have added a new class called Pipeline to simply run and use models for several down-stream NLP tasks.

A Pipeline is just a tokenizer + model wrapped so they can take human-readable inputs and output human-readable results.

The Pipeline takes care of:
tokenizing input strings => converting them to tensors => running the model => post-processing the output

Currently, we have added the following pipelines with a default model for each:

feature extraction (can be used with any pretrained and finetuned models)
inputs: strings/list of strings – output: list of floats (last hidden-states of the model for each token)
sentiment classification (DistilBert model fine-tuned on SST-2)
inputs: strings/list of strings – output: list of dict with label/score of the top class
Named Entity Recognition (XLM-R fine-tuned on CoNLL2003 by the awesome @stefan-it)
inputs: strings/list of strings – output: list of dict with label/entities/position of the named-entities
Question Answering (Bert Large whole-word version fine-tuned on SQuAD 1.0)
inputs: dict of strings/list of dict of strings – output: list of dict with text/position of the answers

There are three ways to use pipelines:

in python:

from transformers import pipeline

# Test the default model for QA (BERT large fine-tuned on SQuAD 1.0)
nlp = pipeline('question-answering')
nlp(question="Where does Amy live ?", context="Amy lives in Amsterdam.")
# Output: {'answer': 'Amsterdam', 'score': 0.9657156007786263, 'start': 13, 'end': 21}

# Test a specific model for NER (XLM-R fine-tuned by @stefan-it on CoNLL03 English)
nlp = pipeline('ner', model='xlm-roberta-large-finetuned-conll03-english')
nlp("My name is Amy. I live in Paris.")
# Output: [{'word': 'Amy', 'score': 0.9999586939811707, 'entity': 'I-PER'},
#          {'word': 'Paris', 'score': 0.9999983310699463, 'entity': 'I-LOC'}]

in bash (using the command-line interface)

$ echo -e "Where does Amy live?\tAmy lives in Amsterdam" | transformers-cli run --task question-answering
{'score': 0.9657156007786263, 'start': 13, 'end': 22, 'answer': 'Amsterdam'}

as a REST API

transformers-cli serve --task question-answering

This new feature is currently in beta and will evolve in the coming weeks.

CLI tool to upload and share community models

Users can now create accounts on the huggingface.co website and then log in using the transformers CLI. Doing so allows users to upload their models to our S3 in their respective directories, so that other users may download those models and use them in their tasks.

Users may upload files or directories.

It's been tested by @stefan-it for a German BERT and by @singletongue for a Japanese BERT.

New model architectures: T5, Japanese BERT, PPLM, XLM-RoBERTa, Finnish BERT

T5 (Pytorch & TF) (from Google) released with the paper Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer, by Colin Raffel, Noam Shazeer, Adam Roberts, Katherine Lee, Sharan Narang, Michael Matena, Yanqi Zhou, Wei Li, Peter J. Liu.
Japanese BERT (Pytorch & TF) from CL-tohoku, implemented by @singletongue
PPLM (Pytorch) (from Uber AI) released with the paper Plug and Play Language Models: a Simple Approach to Controlled Text Generation by Sumanth Dathathri, Andrea Madotto, Janice Lan, Jane Hung, Eric Frank, Piero Molino, Jason Yosinski, Rosanne Liu.
XLM-RoBERTa (Pytorch & TF) (from FAIR, implemented by @stefan-it) released with the paper Unsupervised Cross-lingual Representation Learning at Scale by Alexis Conneau, Kartikay Khandelwal, Naman Goyal, Vishrav Chaudhary, Guillaume Wenzek, Francisco Guzmán, Edouard Grave, Myle Ott, Luke Zettlemoyer, Veselin Stoyanov
Finnish BERT (Pytorch & TF) (from TurkuNLP) released with the paper Multilingual is not enough: BERT for Finnish by Antti Virtanen, Jenna Kanerva, Rami Ilo, Jouni Luoma, Juhani Luotolahti, Tapio Salakoski, Filip Ginter, Sampo Pyysalo

Refactoring the SQuAD example

The run_squad script has been massively refactored. The reasons are the following:

it was made to work with only a few models (BERT, XLNet, XLM and DistilBERT), which had three different ways of encoding sequences. The script had to be individually modified in order to train different models, which would not scale as other models are added to the library.
the utilities did not rely on the QOL adjustments that were made to the encoding methods these past months.

It now leverages the full capacity of encode_plus, easing the addition of new models to the script. A new method squad_convert_examples_to_features encapsulates all of the tokenization.
This method can handle tensorflow_datasets as well as squad v1 json files and squad v2 json files.

ALBERT was added to the SQuAD script

BertAbs summarization

A contribution by @rlouf building on the encoder-decoder mechanism to do abstractive summarization.

Utilities to load the CNN/DailyMail dataset
BertAbs now usable as a traditional library model (using from_pretrained())
ROUGE evaluation

New Models

Additional architectures

@alexzubiaga added XLNetForTokenClassification and TFXLNetForTokenClassification

New model cards

Community additions/bug-fixes/improvements

Added mish activation function @digantamisra98
run_bertology.py was updated with correct imports and the ability to overwrite the cache
Training can be exited and relaunched safely, while keeping the epochs, global steps, scheduler steps and other variables in run_lm_finetuning.py @bkkaggle
Tests now run on cuda @aaugustin @julien-c
Cleaned up the pytorch to tf conversion script @thomwolf
Progress indicator improvements when downloading pre-trained models @leopd
from_pretrained() can now load from urls directly.
New tests to check that all files are accessible on HuggingFace's S3 @rlouf
Updated tf.shape and tensor.shape to all use shape_list @thomwolf
Valohai integration @thomwolf
Always use SequentialSampler in run_squad.py @ethanjperez
Stop using GPU when importing transformers @ondewo
Fixed the XLNet attention output @roskoN
Several QOL adjustments: removed dead code, deep cleaned tests and removed pytest dependency @aaugustin
Fixed an issue with the Camembert tokenization @thomwolf
Correctly create an encoder attention mask from the shape of the hidden states @rlouf
Fixed a non-deterministic behavior when encoding and decoding empty strings @pglock
Fixing tensor creation in encode_plus @LysandreJik
Remove usage of tf.mean which does not exist in TF2 @LysandreJik
A segmentation fault error was fixed (due to scipy 1.4.0) @LysandreJik
Start sunsetting support of Python 2
An example usage of Model2Model was added to the quickstart.

20 Dec 14:53

LysandreJik

v2.2.2

7bd11dd

Bug fixes

Patched error where the tokenizers would split the special tokens.

03 Dec 16:23

LysandreJik

v2.2.1

8101924

Bug fixes related to input shape in TensorFlow and tokenization messages

Input shapes

This patch fixes a bug related to the input shape in several models in TensorFlow.

Tokenization message

A tokenization message was printed too often and cluttered the output, hiding the relevant information. It was removed.

26 Nov 19:26

LysandreJik

v2.2.0

ae98d45

ALBERT, CamemBERT, DistilRoberta, GPT-2 XL, and Encoder-Decoder architectures

New model architectures: ALBERT, CamemBERT, GPT2-XL, DistilRoberta

Four new models have been added in v2.2.0

ALBERT (Pytorch & TF) (from Google Research and the Toyota Technological Institute at Chicago) released with the paper ALBERT: A Lite BERT for Self-supervised Learning of Language Representations, by Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, Radu Soricut.
CamemBERT (Pytorch) (from Facebook AI Research, INRIA, and La Sorbonne Université), as the first large-scale Transformer language model. Released alongside the paper CamemBERT: a Tasty French Language Model by Louis Martin, Benjamin Muller, Pedro Javier Ortiz Suarez, Yoann Dupont, Laurent Romary, Eric Villemonte de la Clergerie, Djame Seddah, and Benoît Sagot. It was added by @louismartin with the help of @julien-c.
DistilRoberta (Pytorch & TF) from @VictorSanh as the third distilled model after DistilBERT and DistilGPT-2.
GPT-2 XL (Pytorch & TF) as the last GPT-2 checkpoint released by OpenAI

Encoder-Decoder architectures

We welcome the possibility to create fully seq2seq models by incorporating Encoder-Decoder architectures using a PreTrainedEncoderDecoder class that can be initialized from pre-trained models. The base BERT class has been modified so that it may behave as a decoder.

Furthermore, a Model2Model class that simplifies the definition of an encoder-decoder when both encoder and decoder are based on the same model has been added. @rlouf

Benchmarks and performance improvements

Work by @tlkh and @LysandreJik aiming to benchmark the library models with different technologies: with TensorFlow and PyTorch, with mixed precision (AMP and FP-16) and with model tracing (TorchScript and XLA). A new section, benchmarks, was created in the documentation, pointing to Google Sheets with the results.

Breaking changes

Tokenizers now add special tokens by default. @LysandreJik

New model templates

Model templates to ease the addition of new models to the library have been added. @thomwolf

Inputs Embeddings

A new input has been added to all models' forward (for Pytorch) and call (for TensorFlow) methods. These inputs_embeds are a direct embedded representation. This is useful as it gives more control over how to convert input_ids indices into associated vectors than the model's internal embedding lookup matrix. @julien-c
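
The idea can be sketched with a toy model that accepts either token ids or precomputed embeddings. The embedding table, the toy_forward function and the stand-in computation are all hypothetical; only the names input_ids and inputs_embeds come from the text above.

```python
# Toy embedding table: one 2-dimensional vector per vocabulary id.
embedding_matrix = [
    [0.0, 0.0],    # id 0
    [1.0, 0.5],    # id 1
    [0.25, 0.75],  # id 2
]


def embed(input_ids, matrix):
    """Standard lookup: replace each id with its row in the matrix."""
    return [list(matrix[i]) for i in input_ids]


def toy_forward(input_ids=None, inputs_embeds=None, matrix=embedding_matrix):
    """Accept ids or ready-made embeddings, but not both."""
    if (input_ids is None) == (inputs_embeds is None):
        raise ValueError("Specify exactly one of input_ids or inputs_embeds")
    if inputs_embeds is None:
        inputs_embeds = embed(input_ids, matrix)
    # Stand-in for the rest of the network: sum each token vector.
    return [sum(vector) for vector in inputs_embeds]


# A vector that is not a row of the table: the average of ids 1 and 2.
blended = [[(a + b) / 2 for a, b in zip(embedding_matrix[1], embedding_matrix[2])]]
print(toy_forward(input_ids=[1, 2]))
print(toy_forward(inputs_embeds=blended))
```

Passing embeddings directly makes it possible to feed vectors that do not correspond to any single vocabulary entry, which a plain id lookup cannot express.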

Getters and setters for input and output embeddings

A new API for the input and output embeddings is available. These methods are model-independent and allow easy acquisition/modification of the models' embeddings. @thomwolf

Additional architectures

New model architectures are available, namely: DistilBertForTokenClassification, CamembertForTokenClassification @stefan-it

Community additions/bug-fixes/improvements

The Fairseq RoBERTa model conversion script has been patched. @louismartin
einsum now runs in FP-16 in the library's examples @slayton58
In-depth work on the squad script for XLNet to reproduce the original paper's results @hlums
Additional improvements on the run_squad script by @WilliamTambellini, @orena1
The run_generation script has seen several improvements by @leo-du
The RoBERTaTensorFlow model has been patched for several use-cases: TPU and keras.fit @LysandreJik
The documentation is now versioned, links are available on the github readme @LysandreJik
The run_ner script has seen several improvements @mmaybeno, @oneraghavan, @manansanghi
The run_tf_glue script now works for all GLUE tasks @LysandreJik
The run_lm_finetuning script now correctly evaluates perplexity on MLM tasks @altsoph
An issue related to the XLM TensorFlow implementation's training has been fixed @tlkh
run_bertology has been updated to be closer to the run_glue example @adrianbg
Fixed added special tokens in decoded sequences @LysandreJik
Several performance improvements have been made to the tokenizers @iedmrc
A memory leak has been identified and patched in the library's schedulers @rlouf
Correct warning when encoding a sequence too long while specifying a maximum length @LysandreJik
Resizing the token embeddings now works as expected in the run_lm_finetuning script @iedmrc
The difference in versions between Pypi/source in order to run the examples has been clarified @rlouf

11 Oct 14:50

LysandreJik

v2.1.1

3ddce1d

CTRL, DistilGPT-2, Pytorch TPU, tokenizer enhancements, guideline requirements

New model architectures: CTRL, DistilGPT-2

Two new models have been added since release 2.0.

CTRL (from Salesforce) released with the paper CTRL: A Conditional Transformer Language Model for Controllable Generation, by Nitish Shirish Keskar, Bryan McCann, Lav R. Varshney, Caiming Xiong, Richard Socher. This model has been added to the library by @keskarnitish with the help of @thomwolf.
DistilGPT-2 (from HuggingFace), as the second distilled model after DistilBERT in version 1.2.0. Released alongside the paper DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

Distillation

Several updates have been made to the distillation script, including the possibility to distill GPT-2 and to distill on the SQuAD task. By @VictorSanh.

Pytorch TPU support

The run_glue.py example script can now run on a Pytorch TPU.

Updates to example scripts

Several example scripts have been improved and refactored to use the full potential of the new tokenizer functions:

run_multiple_choice.py has been refactored to include encode_plus by @julien-c and @erenup
run_lm_finetuning.py has been improved with the help of @dennymarcels, @jinoobaek-qz and @LysandreJik
run_glue.py has been improved with the help of @brian41005

QOL enhancements on the tokenizer

Enhancements have been made to the tokenizers. Two new methods have been added: get_special_tokens_mask and truncate_sequences.

The former returns a mask indicating which tokens in a token list are special tokens and which come from the initial sequences. The latter truncates sequences according to a strategy.

Both of those methods are called by the encode_plus method, which itself is called by the encode method. The encode_plus now returns a larger dictionary which holds information about the special tokens, as well as the overflowing tokens.
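
To illustrate what a special tokens mask conveys, here is a simplified sketch using BERT-style [CLS] and [SEP] markers. The function names and signatures are hypothetical and much simpler than the tokenizer methods; the convention that 1 marks a special token is an assumption of this sketch.

```python
CLS, SEP = "[CLS]", "[SEP]"  # BERT-style markers, used here only as an example


def toy_build_inputs(tokens_a, tokens_b=None):
    """Wrap one sequence or a pair of sequences with special tokens."""
    if tokens_b is None:
        return [CLS] + tokens_a + [SEP]
    return [CLS] + tokens_a + [SEP] + tokens_b + [SEP]


def toy_special_tokens_mask(tokens_a, tokens_b=None):
    """Return 1 for each special token and 0 for each original token."""
    if tokens_b is None:
        return [1] + [0] * len(tokens_a) + [1]
    return [1] + [0] * len(tokens_a) + [1] + [0] * len(tokens_b) + [1]


tokens = toy_build_inputs(["hello", "world"], ["hi"])
mask = toy_special_tokens_mask(["hello", "world"], ["hi"])
# Recover only the tokens that came from the original sequences.
original_only = [t for t, m in zip(tokens, mask) if m == 0]
```

Such a mask is useful, for instance, to avoid masking special tokens when preparing masked language modeling inputs.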

Thanks to @julien-c, @thomwolf, and @LysandreJik for these additions.

New German BERT models

Support for new German BERT models (cased and uncased) from @stefan-it @dbmdz

Breaking changes

The two methods add_special_tokens_single_sequence and add_special_tokens_sequence_pair have been removed. They have been replaced by the single method build_inputs_with_special_tokens which has a more comprehensible name and manages both sequence singletons and pairs.
The boolean parameter truncate_first_sequence has been removed in tokenizers' encode and encode_plus methods, being replaced by a strategy in the form of a string: 'longest_first', 'only_second', 'only_first' or 'do_not_truncate' are accepted strategies.
When the encode or encode_plus methods are called with a specified max_length, the sequences will now always be truncated or throw an error if overflowing.
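
The four strategy names can be illustrated with a small stand-alone function. This is a sketch of the behavior the names suggest, not the tokenizer implementation; truncate_pair and its tie-breaking rule are assumptions.

```python
def truncate_pair(ids_a, ids_b, max_length, strategy="longest_first"):
    """Trim a pair of sequences so their combined length fits max_length."""
    ids_a, ids_b = list(ids_a), list(ids_b)
    excess = len(ids_a) + len(ids_b) - max_length
    if excess <= 0:
        return ids_a, ids_b
    if strategy == "longest_first":
        # Remove one token at a time from whichever sequence is longer
        # (ties remove from the second sequence in this sketch).
        for _ in range(excess):
            longest = ids_a if len(ids_a) > len(ids_b) else ids_b
            if longest:
                longest.pop()
    elif strategy == "only_first":
        if excess > len(ids_a):
            raise ValueError("First sequence is too short to absorb the truncation")
        ids_a = ids_a[: len(ids_a) - excess]
    elif strategy == "only_second":
        if excess > len(ids_b):
            raise ValueError("Second sequence is too short to absorb the truncation")
        ids_b = ids_b[: len(ids_b) - excess]
    elif strategy == "do_not_truncate":
        raise ValueError("Input is too long and truncation is disabled")
    else:
        raise ValueError(f"Unknown strategy: {strategy}")
    return ids_a, ids_b
```

A question-answering setup, for example, might prefer only_second so that the question is kept whole and only the context is shortened.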

Guidelines and requirements

New contributing guidelines have been added, alongside library development requirements by @rlouf, the newest member of the HuggingFace team.

Community additions/bug-fixes/improvements

GLUE Processors have been refactored to handle inputs for all tasks coming from the tensorflow_datasets. This work has been done by @agrinh and @philipp-eisen.
The padding_idx is now correctly initialized to 1 in randomly initialized RoBERTa models. @ikuyamada
The documentation CSS has been adapted to work on older browsers. @TimYagan
A note on the management of hidden states has been added to the README by @BramVanroy.
Integration of TF 2.0 models with other Keras modules @thomwolf
Past values can be opted-out @thomwolf

11 Oct 14:47

LysandreJik

v2.1.0

9c2e0a4

Superseded by v2.1.1

v2.1.0

Adds version 2.1.0 for PyPi

26 Sep 11:48

thomwolf

v2.0.0

1d646ba

v2.0.0 - TF 2.0/PyTorch interoperability, improved tokenizers, improved torchscript support

Name change: welcome 🤗 Transformers

Following the extension to TensorFlow 2.0, pytorch-transformers => transformers

Install with pip install transformers

Also, note that PyTorch is no longer in the requirements so don't forget to install TensorFlow 2.0 and/or PyTorch to be able to use (and load) the models.

TensorFlow 2.0 - PyTorch

All the PyTorch nn.Module classes now have their counterpart in TensorFlow 2.0 as tf.keras.Model classes. TensorFlow 2.0 classes have the same name as their PyTorch counterparts prefixed with TF.

The interoperability between TensorFlow and PyTorch is actually a lot deeper than what is usually meant when talking about libraries with multiple backends:

each model (not just the static computation graph) can be seamlessly moved from one framework to the other during the lifetime of the model for training/evaluation/usage (from_pretrained can load weights saved from models saved in one or the other framework),
an example is given in the quick-tour on TF 2.0 and PyTorch in the readme in which a model is trained using keras.fit before being opened in PyTorch for quick debugging/inspection.

Remaining unsupported operations in TF 2.0 (to be added later):

resizing input embeddings to add new tokens
pruning model heads

TPU support

Training on TPU using free TPUs provided in the TensorFlow Research Cloud (TFRC) program is possible but requires implementing a custom training loop (not possible with keras.fit at the moment).
We will add an example of such a custom training loop soon.

Improved tokenizers

Tokenizers have been improved to provide an extended encoding method, encode_plus, and additional arguments to encode. Please refer to the doc for detailed usage of the new options.

Breaking changes

Positional order of some model keywords inputs changed (better TorchScript support)

To be able to better use Torchscript both on CPU and GPUs (see #1010, #1204 and #1195) the specific order of some models keywords inputs (attention_mask, token_type_ids...) has been changed.

If you used to call the models with keyword names for keyword arguments, e.g. model(input_ids, attention_mask=attention_mask, token_type_ids=token_type_ids), this should not cause any breaking change.

If you used to call the models with positional inputs for keyword arguments, e.g. model(input_ids, attention_mask, token_type_ids), you should double-check the exact order of input arguments.
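
The difference between the two calling styles can be shown with two toy functions whose keyword arguments are declared in a different order. Both signatures are hypothetical stand-ins, not the actual model signatures.

```python
def forward_old(input_ids, token_type_ids=None, attention_mask=None):
    """Hypothetical signature before the reordering."""
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "token_type_ids": token_type_ids,
    }


def forward_new(input_ids, attention_mask=None, token_type_ids=None):
    """Hypothetical signature after the reordering."""
    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "token_type_ids": token_type_ids,
    }


ids, mask, types = [101, 7592, 102], [1, 1, 1], [0, 0, 0]

# Keyword calls bind by name, so they give the same result with both signatures.
keyword_old = forward_old(ids, attention_mask=mask, token_type_ids=types)
keyword_new = forward_new(ids, attention_mask=mask, token_type_ids=types)

# The same positional call binds differently once the order changes.
positional_old = forward_old(ids, types, mask)
positional_new = forward_new(ids, types, mask)  # mask and types are now swapped
```

No error is raised in the positional case, which is why the mismatch is easy to miss; the values simply end up bound to the wrong parameters.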

Dependency requirements have changed

PyTorch is no longer in the requirements so don't forget to install TensorFlow 2.0 and/or PyTorch to be able to use (and load) the models.

Renamed method

The method add_special_tokens_sentence_pair has been renamed to the more appropriate name add_special_tokens_sequence_pair.
The same holds true for the method add_special_tokens_single_sentence which has been changed to add_special_tokens_single_sequence.

Community additions/bug-fixes/improvements

new German model (@Timoeller)
new script for MultipleChoice training (SWAG, RocStories...) (@erenup)
better fp16 support (@ziliwang and @bryant1410)
fix evaluation in run_lm_finetuning (@SKRohit)
fix LM finetuning to prevent crashing on assert len(tokens_b)>=1 (@searchivarius)
Various doc and docstring fixes (@sshleifer, @Maxpa1n, @mattolson93, @T080)

04 Sep 12:18

LysandreJik

1.2.0

89fd345

DistilBERT, GPT-2 Large, XLM multilingual models, torch.hub, bug fixes

New model architecture: DistilBERT

HuggingFace's new transformer architecture, DistilBERT, is described in Smaller, faster, cheaper, lighter: Introducing DistilBERT, a distilled version of BERT by Victor Sanh, Lysandre Debut and Thomas Wolf.

This new model architecture comes with two pretrained checkpoints:

distilbert-base-uncased: the base DistilBert model
distilbert-base-uncased-distilled-squad: DistilBert model fine-tuned with distillation on SQuAD.

New GPT2 checkpoint: GPT-2 large (774M parameters)

The third OpenAI GPT-2 checkpoint is available in the library: 774M parameters, 36 layers, and 20 heads.

New XLM multilingual checkpoints: 17 & 100 languages

We have added two new XLM models in 17 and 100 languages which obtain better performance than multilingual BERT on the XNLI cross-lingual classification task.

Back on `torch.hub` with all the architectures

The Pytorch-Transformers torch.hub interface is based on Auto-Models, which are generic classes designed to be instantiated using from_pretrained() in a model architecture guessed from the pretrained checkpoint name (e.g. AutoModel.from_pretrained('bert-base-uncased') will instantiate a BertModel and load the 'bert-base-uncased' checkpoint in it). There are currently 4 classes of Auto-Models: AutoModel, AutoModelWithLMHead, AutoModelForSequenceClassification and AutoModelForQuestionAnswering.
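
Guessing an architecture from a checkpoint name can be sketched as an ordered substring match. The pattern table and guess_architecture below are an illustration under that assumption, not the Auto-Model source; the function returns class names as strings instead of instantiating anything.

```python
# Order matters: more specific patterns must come before the ones they contain
# (for example "distilbert" and "roberta" both contain "bert").
ARCHITECTURE_PATTERNS = [
    ("distilbert", "DistilBertModel"),
    ("roberta", "RobertaModel"),
    ("bert", "BertModel"),
    ("openai-gpt", "OpenAIGPTModel"),
    ("gpt2", "GPT2Model"),
    ("transfo-xl", "TransfoXLModel"),
    ("xlnet", "XLNetModel"),
    ("xlm", "XLMModel"),
]


def guess_architecture(checkpoint_name):
    """Return the name of the model class matching a checkpoint name."""
    for pattern, class_name in ARCHITECTURE_PATTERNS:
        if pattern in checkpoint_name:
            return class_name
    raise ValueError(f"Unrecognized checkpoint name: {checkpoint_name!r}")


print(guess_architecture("bert-base-uncased"))
print(guess_architecture("distilbert-base-uncased"))
```

A name-based scheme like this depends on checkpoint names containing the architecture, which is the constraint that configuration-based detection later removes.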

New dependency: `sacremoses`

Support for XLM is improved by carefully reproducing the original tokenization workflow (work by @shijie-wu in #1092). We now rely on sacremoses, a python port of Moses tokenizer, truecaser and normalizer by @alvations, for XLM word tokenization.

For a few languages (Thai, Japanese and Chinese), the XLM tokenizer requires additional dependencies. These additional dependencies are optional at the library level. Using the XLM tokenizer for these languages without the additional dependency will raise an error message with installation instructions. The additional optional dependencies are:

pythainlp: Thai tokenizer
kytea: Japanese tokenizer, wrapper of KyTea (needs external C++ compilation), used by the newly released XLM-17 & XLM-100
jieba: Chinese tokenizer *

* XLM used the Stanford Segmenter. However, the wrapper (nltk.tokenize.stanford_segmenter) is slow due to JVM overhead, and it will be deprecated. Jieba is a lot faster and pip-installable, but there is some mismatch with the Stanford Segmenter. A workaround could be an argument that allows users to segment the sentence themselves and bypass the segmenter. As a reference, I also include nltk.tokenize.stanford_segmenter in this PR.

Bug fixes and improvements to the library modules

Bertology script has seen major improvements (@tuvuumass )
Iterative tokenization is now faster and accepts arbitrary numbers of added tokens (@samvelyan)
Added RoBERTa to AutoModels and AutoTokenizers (@LysandreJik )
Added GPT-2 Large 774M model (@thomwolf )
Added language model fine-tuning with GPT/GPT-2 (CLM), BERT/RoBERTa (MLM) (@LysandreJik @thomwolf )
Multi-GPU training has been patched (@FeiWang96 )
Scripts are updated to reflect Pytorch 1.1.0 changes (scheduler, optimizer) (@Morizeyao, @adai183 )
Updated the in-depth BERT fine-tuning scripts to pytorch-transformers (@Morizeyao )
Models saved with pruned heads are now saved and reloaded correctly (implemented for GPT, GPT-2, BERT, RoBERTa, XLM) (@LysandreJik @thomwolf)
Add proxies and force_download options to from_pretrained() method to be able to use proxies and update cached models/tokenizers (@thomwolf)
Add a shortcut to each special token with _id properties (e.g. tokenizer.cls_token_id for the id in the vocabulary of tokenizer.cls_token) (@thomwolf)
Fix GPT2 and RoBERTa tokenizer so that sentences to be tokenized always begin with at least one space (see note by fairseq authors) (@thomwolf)
Fix and clean up byte-level BPE tests (@thomwolf)
Update the test classes for OpenAI GPT and GPT-2 so that these models are tested against common tests. (@LysandreJik )
Fix a warning raised when the decode method is called for a model with no sep_token like GPT-2 (@LysandreJik )
Updated the tokenizers saving method (@boy2000-007man)
SpaCy tokenizers have been updated in the tokenizers (@GuillemGSubies )
Stable EnvironmentErrors have been added to utility files (@abhishekraok )
Fixed distributed barrier hang (@VictorSanh )
Encoding functions now return the input tokens instead of throwing an error when not implemented in child class (@LysandreJik )
Change layer norm code to PyTorch's native layer norm (@dhpollack)
Improved tokenization for XLM for multilingual inputs (@shijie-wu)
Add language input and access to language to id conversion in XLM tokenizer (@thomwolf)
Add pretrained configuration properties for tokenizers with serialization logic (saving/reloading tokenizer configuration) (@thomwolf)
Added new AutoModels: AutoModelWithLMHead, AutoModelForSequenceClassification, AutoModelForQuestionAnswering (@LysandreJik)
Torch.hub is now based on AutoModels (@LysandreJik @thomwolf)
Fix Transformer-XL attention mask dtype to be bool (@CrafterKolyan)
Adding DistilBert model architecture and checkpoints (@VictorSanh @LysandreJik @thomwolf)
Fixes to DistilBert configuration and training script (@stefan-it)
Fix XLNet attention mask for fp16 (@ziliwang)
Documentation auto-deploy (@LysandreJik)
Fix to add a tuple of tokens (@epwalsh)
Update fp16 apex implementation in scripts (@anhnt170489)
Fix XLNet bias resizing when adding/removing tokens (@LysandreJik)
Fix tokenizer reloading in example scripts (@rabeehk)
Fix byte-level decoding error when using added tokens (@thomwolf @LysandreJik)
Fix epsilon value in RoBERTa pretrained checkpoints (@julien-c)

Releases: huggingface/transformers
