
add new Hugging Face tokenizers for text features #17

Open
jim-schwoebel opened this issue Jan 14, 2020 · 1 comment
Labels
enhancement New feature or request

Comments


jim-schwoebel commented Jan 14, 2020

https://github.com/huggingface/tokenizers

# Tokenizers provides ultra-fast implementations of most current tokenizers:
>>> from tokenizers import (ByteLevelBPETokenizer,
                            BPETokenizer,
                            SentencePieceBPETokenizer,
                            BertWordPieceTokenizer)
# Ultra-fast => they can encode 1GB of text in ~20sec on a standard server's CPU
# Tokenizers can be easily instantiated from standard files
>>> tokenizer = BertWordPieceTokenizer("bert-base-uncased-vocab.txt", lowercase=True)
Tokenizer(vocabulary_size=30522, model=BertWordPiece, add_special_tokens=True, unk_token=[UNK], 
          sep_token=[SEP], cls_token=[CLS], clean_text=True, handle_chinese_chars=True, 
          strip_accents=True, lowercase=True, wordpieces_prefix=##)

# Tokenizers provide exhaustive outputs: tokens, mapping to original string, attention/special token masks.
# They also handle a model's max input length as well as padding (to directly encode in padded batches)
>>> output = tokenizer.encode("Hello, y'all! How are you 😁 ?")
Encoding(num_tokens=13, attributes=[ids, type_ids, tokens, offsets, attention_mask, special_tokens_mask, overflowing, original_str, normalized_str])
>>> print(output.ids, output.tokens, output.offsets)
[101, 7592, 1010, 1061, 1005, 2035, 999, 2129, 2024, 2017, 100, 1029, 102]
['[CLS]', 'hello', ',', 'y', "'", 'all', '!', 'how', 'are', 'you', '[UNK]', '?', '[SEP]']
[(0, 0), (0, 5), (5, 6), (7, 8), (8, 9), (9, 12), (12, 13), (14, 17), (18, 21), (22, 25), (26, 27),
 (28, 29), (0, 0)]
# Here is an example using the offsets mapping to retrieve the string corresponding to the 10th token:
>>> output.original_str[output.offsets[10]]
'😁'
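
For the purposes of this issue, those outputs map naturally onto numeric text features. Below is a minimal sketch (not existing code in this repo) of wrapping the WordPiece tokenizer into a feature extractor; the function name, the fixed padding length, and the local vocab path are assumptions, and the enable_padding/enable_truncation argument names may differ slightly between tokenizers versions.

from tokenizers import BertWordPieceTokenizer

def extract_tokenizer_features(text, vocab_file="bert-base-uncased-vocab.txt", max_length=128):
    # hypothetical helper: the name, padding length, and vocab path are placeholders
    tokenizer = BertWordPieceTokenizer(vocab_file, lowercase=True)
    # truncate/pad to a fixed length so every sample yields equal-sized features
    tokenizer.enable_truncation(max_length=max_length)
    tokenizer.enable_padding(length=max_length)
    encoding = tokenizer.encode(text)
    # token ids and the attention mask double as fixed-length feature vectors
    return {
        "ids": encoding.ids,
        "attention_mask": encoding.attention_mask,
        "tokens": encoding.tokens,
    }

features = extract_tokenizer_features("Hello, y'all! How are you 😁 ?")
print(len(features["ids"]), features["tokens"][:5])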

jim-schwoebel commented Jan 14, 2020

- Train new vocabularies and tokenize, using today's most used tokenizers.
- Extremely fast (both training and tokenization), thanks to the Rust implementation; it takes less than 20 seconds to tokenize a GB of text on a server's CPU.
- Easy to use, but also extremely versatile.
- Designed for research and production.
- Normalization comes with alignment tracking, so it is always possible to recover the part of the original sentence that corresponds to a given token.
- Does all the pre-processing: truncate, pad, and add the special tokens your model needs (see the sketch below).
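
As a concrete illustration of the training and pre-processing points above, here is a minimal sketch (not from the issue) using the tokenizers API. The corpus path, vocabulary size, and special tokens are placeholders, and argument names may vary slightly across library versions.

from tokenizers import ByteLevelBPETokenizer

# train a new byte-level BPE vocabulary from plain-text files ("corpus.txt" is a placeholder path)
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["corpus.txt"],
    vocab_size=30000,
    min_frequency=2,
    special_tokens=["<s>", "<pad>", "</s>", "<unk>"],
)

# handle the model's max input length and padded batches inside the tokenizer
tokenizer.enable_truncation(max_length=512)
tokenizer.enable_padding(pad_id=1, pad_token="<pad>")  # "<pad>" was registered above with id 1

# each Encoding carries ids, tokens, offsets, and attention/special-token masks
batch = tokenizer.encode_batch(["Hello, y'all!", "How are you?"])
print(batch[0].ids, batch[0].tokens, batch[0].attention_mask)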

@jim-schwoebel changed the title from "add new tokenizers for text features" to "add new Hugging Face tokenizers for text features" on Aug 2, 2020
@jim-schwoebel added the enhancement (New feature or request) label on Aug 7, 2020