Richard Wang edited this page Sep 7, 2020 · 3 revisions

Use this page to discuss how to add HuggingFace support. Feel free to link to forum threads, issues, etc.

Issues/Recommendations

  1. Transform/ItemTransform support for dictionaries in addition to tensors (or list of tensors).
  2. Learner.summary support for dictionaries
  3. Support for Masked Language Modeling (MLM)
  4. Support for the various MLM denoising objectives
  5. Integration with huggingface nlp library (useful for any NLU task, not just transformers)

1. Transform and ItemTransform class dictionary support

Currently, fastai presupposes that a "thing" is represented by a single tensor or a list of tensors. In huggingface, however, a "thing" (a sequence of text) is represented by multiple tensors (e.g., input_ids, attention_mask, token_type_ids, etc.) encapsulated in a dictionary object. fastai has no problem returning such a dictionary from the encodes method of Transform or ItemTransform instances for modeling purposes, but it has problems dealing with dictionaries when it comes to Learner.summary and the various show methods like show_batch and show_results.

One attempt to solve this comes from the blurr library, but it requires a custom batch transform pretty much just for the purpose of working with the dictionary returned from its HF_TokenizerTransform object.

Possible Solutions:

  1. Use a convention whereby, if the type is a dictionary of tensors, the FIRST one is used to represent the thing. In HF that is input_ids, which is exactly what you want to use for showing.

  2. Implement something like an optional transforms_repr method that returns a single-tensor representation of your thing. If the method exists on your Transform, it is used by the show/summary methods to obtain a single tensor they can work with.

  3. Pass a "key" to the transform that defines which item in the dictionary should "represent" the thing for summary/showing purposes.
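Solutions (1) and (3) could share the same small helper. The sketch below is hypothetical (the function name `representative` is not from fastai or blurr): given a transform's output, it falls back to the first value when the output is a dict, which for HF tokenizers is input_ids. Plain lists stand in for tensors so the sketch runs anywhere.

```python
def representative(item, key=None):
    """Return the single tensor that should represent `item` for display.

    If `item` is a dict (as returned by HuggingFace tokenizers), use the
    entry named by `key` if given (solution 3), otherwise the FIRST entry
    (solution 1). Anything else is returned unchanged.
    """
    if isinstance(item, dict):
        return item[key] if key is not None else next(iter(item.values()))
    return item

# HF-style encoded input; the first key is input_ids.
encoded = {
    "input_ids": [101, 2023, 2003, 102],
    "attention_mask": [1, 1, 1, 1],
    "token_type_ids": [0, 0, 0, 0],
}
representative(encoded)  # -> [101, 2023, 2003, 102]
```

A show_batch implementation could call such a helper on each item before decoding, leaving plain-tensor transforms untouched.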


2. Learner.summary support for dictionaries

This should be a fairly straightforward change once the Transform and ItemTransform classes are updated (see above). What the blurr library does now is define its own blurr_summary methods, which essentially change only a single line to make summary work with a dictionary. See def blurr_summary(self:Learner) as well as the @patched method for nn.Module. The only real change needed is on line 66 ...

inp_sz = _print_shapes(apply(lambda x:x.shape, xb[0]['input_ids']), bs)

... but unfortunately, this currently requires completely overriding both of those methods.
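The essence of that one-line change could be isolated into a small dispatch point so the rest of summary stays shared. A minimal sketch (the helper name `batch_shape` and the FakeTensor stand-in are hypothetical, used so the example runs without torch):

```python
def batch_shape(x):
    """Return the shape Learner.summary should report for one model input.

    If the input is an HF-style dict, report the shape of its input_ids
    tensor (mirroring blurr's single-line change); otherwise use the
    input's own .shape.
    """
    if isinstance(x, dict):
        x = x["input_ids"]
    return tuple(x.shape)

class FakeTensor:
    """Stand-in with a .shape attribute so the sketch runs without torch."""
    def __init__(self, *shape):
        self.shape = shape

batch_shape({"input_ids": FakeTensor(64, 128),
             "attention_mask": FakeTensor(64, 128)})  # -> (64, 128)
batch_shape(FakeTensor(64, 72))                       # -> (64, 72)
```

If fastai's summary routed its shape extraction through a hook like this, libraries such as blurr would only need to override the one function rather than both methods.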


3. Support for Masked Language Modeling (MLM)

There is currently no built-in support for MLM in fastai; the only support is for the causal LM objective used by ULMFiT and transformer models like GPT-2. As most transformer models use an MLM pre-training objective, it would be nice to have support for it in fastai so that, where possible, folks can fine-tune those LMs in a fashion similar to what fastai offers with ULMFiT. This may be prohibitive for certain transformers whose models are simply too big.

You can also take a look at richarddwang/electra_pytorch, which implements the masking mechanism as a Callback with a before_batch method (dynamic masking, as described in the RoBERTa paper).
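The core of such a callback is just a per-batch masking function. This is a language-agnostic sketch of BERT/RoBERTa-style dynamic masking (not electra_pytorch's actual code): 15% of tokens are selected for prediction, and selected tokens become [MASK] 80% of the time, a random token 10%, and stay unchanged 10%. Plain lists stand in for tensors; MASK_ID and the -100 ignore index follow BERT and PyTorch conventions.

```python
import random

MASK_ID, IGNORE = 103, -100  # BERT's [MASK] id; PyTorch CrossEntropyLoss ignore_index

def mask_tokens(ids, vocab_size, mask_prob=0.15, rng=random):
    """Dynamically mask a sequence of token ids for MLM training.

    Because this is re-applied to every batch (e.g. in a Callback's
    before_batch), each epoch sees different masks, as in RoBERTa.
    Returns (masked_ids, labels) where labels are IGNORE everywhere
    except the positions selected for prediction.
    """
    masked, labels = list(ids), [IGNORE] * len(ids)
    for i, tok in enumerate(ids):
        if rng.random() < mask_prob:
            labels[i] = tok                      # predict the original token here
            r = rng.random()
            if r < 0.8:
                masked[i] = MASK_ID              # 80%: replace with [MASK]
            elif r < 0.9:
                masked[i] = rng.randrange(vocab_size)  # 10%: random token
            # remaining 10%: keep the token unchanged
    return masked, labels
```

Wrapped in a fastai Callback, before_batch would apply this to the batch's input_ids and set the masked ids and labels as the model's input and target.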


4. Support for the various MLM denoising objectives

The MLM objectives vary across the various transformers. The T5 and BART papers include nice descriptions of most of them. They include:

  1. Token Masking
  2. Token Deletion
  3. Token Infilling
  4. Sentence Permutation
  5. Document Rotation

See section 2.2 in the BART paper and section 3.14 in the T5 paper for visuals and detailed explanations for each.
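Several of these objectives are simple enough to sketch over plain lists of token ids. The functions below are hypothetical illustrations of two of them, following the BART paper's descriptions, with an explicit rng argument for reproducibility:

```python
import random

def token_deletion(ids, p=0.15, rng=random):
    """BART-style token deletion: drop tokens entirely. Unlike masking,
    positions are not marked, so the model must also decide where input
    was removed."""
    kept = [t for t in ids if rng.random() >= p]
    return kept or ids[:1]  # never return an empty sequence

def document_rotation(ids, rng=random):
    """BART-style document rotation: pick a token uniformly at random and
    rotate the document so it starts there; the model must identify the
    true start."""
    k = rng.randrange(len(ids))
    return ids[k:] + ids[:k]
```

Token masking is the MLM scheme sketched in section 3 above; sentence permutation and text infilling require sentence boundaries and span sampling (BART samples span lengths from a Poisson distribution), so they are omitted here for brevity.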


5. Integration with huggingface nlp library (useful for any NLU task, not just transformers)

Richard Wang created the hugdatafast library to let fastai users get DataLoaders from huggingface/nlp and use show_batch and show_results. There is still some room for improvement:

  • huggingface/nlp plans to increase its support for multi-modal datasets, but hugdatafast only supports text datasets.
  • hugdatafast currently doesn't cover the metrics part of huggingface/nlp.

Relevant Forum Threads

Relevant Projects