This repository has been archived by the owner on Jun 10, 2021. It is now read-only.

Error BPE tokenizer: tokenize.lua:87: attempt to concatenate upvalue 'line' #568

Open
i55code opened this issue May 27, 2019 · 3 comments

Comments

@i55code

i55code commented May 27, 2019

Hi, I am trying to tokenize a file with an existing BPE model, but the following command produces an error:
th tools/tokenize.lua -bpe_model $bpemodel < $DATA/$inputfile > $DATA/$inputfile.tok

Any advice on what to do would be appreciated. Thanks!

No BPE options read from model, falling back to cmd or default options
/install/bin/luajit: ../install/share/lua/5.1/threads/threads.lua:183: [thread 1 callback] tools/tokenize.lua:87: attempt to concatenate upvalue 'line' (a nil value)
stack traceback:
tools/tokenize.lua:87: in function <tools/tokenize.lua:64>
[C]: in function 'xpcall'
/install/share/lua/5.1/threads/threads.lua:234: in function 'callback'
/install/share/lua/5.1/threads/queue.lua:65: in function </install/share/lua/5.1/threads/queue.lua:41>
[C]: in function 'pcall'
//install/share/lua/5.1/threads/queue.lua:40: in function 'dojob'
[string " local Queue = require 'threads.queue'..."]:13: in main chunk
stack traceback:
[C]: in function 'error'
/install/share/lua/5.1/threads/threads.lua:183: in function 'dojob'
/install/share/lua/5.1/threads/threads.lua:264: in function 'synchronize'
tools/tokenize.lua:101: in main chunk
[C]: in function 'dofile'
/install/lib/luarocks/rocks/trepl/scm-1/bin/th:150: in main chunk
[C]: at 0x00405d50

@guillaumekln
Collaborator

guillaumekln commented May 27, 2019

Hi,

There is a typo in the code that reports the error. Can you try changing 'line' to 'aline' at tools/tokenize.lua:87 and see whether the resulting error message is more helpful?

@i55code
Author

i55code commented May 28, 2019

Hi,

Thanks! I changed 'line' to 'aline', and the error message is now a Unicode error. Lua does not have any support for Unicode (other than accepting any byte value in strings).

But my file contains languages such as Turkish, which use many non-ASCII characters.

What would you recommend if I want to apply BPE to it using an existing BPE model?

Thanks!
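
One thing worth ruling out first: the thread does not show the exact Unicode error, but if it comes from bytes that are not valid UTF-8 (an assumption, for example a Turkish file saved as ISO-8859-9), the offending lines can be located before tokenizing. Below is a minimal Python sketch using only the standard library. The function name find_invalid_utf8_lines is hypothetical and is not part of OpenNMT.

```python
def find_invalid_utf8_lines(path):
    """Return a list of (line_number, byte_offset) for lines that are not valid UTF-8.

    line_number is 1-based; byte_offset is the position of the first
    offending byte within that line.
    """
    bad = []
    # Read raw bytes so that no decoding happens implicitly.
    with open(path, "rb") as f:
        for lineno, raw in enumerate(f, start=1):
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError as err:
                bad.append((lineno, err.start))
    return bad
```

Calling find_invalid_utf8_lines on the input file returns an empty list when every line decodes cleanly. If it reports lines, converting the file to UTF-8 before tokenizing may be enough; if it reports nothing, the error has some other cause.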

@guillaumekln
Collaborator

Maybe try this tokenizer: https://github.com/OpenNMT/Tokenizer
