transformer example and texar-pytorch/bin/utils should be improved #180

gpengzhi · 2019-08-30T04:07:43Z

In texar-pytorch/bin/utils, the tokenization method is described as "sentencepiece encoding". This is inaccurate, because sentencepiece is a package, not a tokenization method. It implements two subword tokenization methods: byte-pair encoding (BPE) [Sennrich et al.] and the unigram language model [Kudo].

sentencepiece has no WordPiece model (WPM) pipeline, so the command we use

spm_train --input=train.src,train.tgt --vocab_size 32000 --model_prefix=wpm-codes

actually trains a unigram language model: no --model_type is given, and unigram is the default. The wpm-codes prefix is therefore misleading.

The transformer example needs to be updated accordingly.
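To make the distinction concrete, here is a minimal sketch of how the training command could name the algorithm explicitly instead of relying on the default. The helper build_spm_train_cmd and the prefixes unigram-codes and bpe-codes are hypothetical names used only for illustration; they are not part of texar-pytorch. The flags themselves (--input, --vocab_size, --model_prefix, --model_type) are spm_train options. The sketch only assembles and prints the command lines; it does not run spm_train.

```python
# Model types accepted by spm_train's --model_type flag.
# "unigram" is what spm_train uses when the flag is omitted.
MODEL_TYPES = ("unigram", "bpe", "char", "word")


def build_spm_train_cmd(inputs, vocab_size, model_prefix, model_type="unigram"):
    """Return an spm_train argument list with the model type spelled out.

    Hypothetical helper for illustration; not part of texar-pytorch.
    """
    if model_type not in MODEL_TYPES:
        raise ValueError(f"unknown model type: {model_type!r}")
    return [
        "spm_train",
        "--input=" + ",".join(inputs),
        f"--vocab_size={vocab_size}",
        f"--model_prefix={model_prefix}",
        f"--model_type={model_type}",
    ]


# Print one command per algorithm; the prefixes are assumed names that
# reflect the algorithm actually being trained.
for model_type, prefix in (("unigram", "unigram-codes"), ("bpe", "bpe-codes")):
    cmd = build_spm_train_cmd(["train.src", "train.tgt"], 32000, prefix, model_type)
    print(" ".join(cmd))
```

Naming the model prefix after the algorithm that was actually trained would avoid the confusion caused by wpm-codes.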


gpengzhi added the labels enhancement (New feature or request) and topic: examples (Issue about examples) Aug 30, 2019

gpengzhi self-assigned this Aug 31, 2019

gpengzhi mentioned this issue Sep 11, 2019

Add BPETokenizer #204

Open

gpengzhi assigned swapnull7 and unassigned gpengzhi Dec 6, 2019
