fastBPE step in embed.py fails if no tokenization #121

Open
bricksdont opened this issue Jan 27, 2020 · 0 comments · May be fixed by #122

If embed.py reads its input from STDIN and no tokenization is performed, the script fails:

cat input.de | python tools/laser/source/embed.py --encoder tools/laser/models/bilstm.93langs.2018-12-26.pt --bpe-codes tools/laser/models/93langs.fcodes --output embedded.de --verbose
 - Encoder: loading /net/cephfs/scratch/mathmu/laser-contra/tools/laser/models/bilstm.93langs.2018-12-26.pt
 - fast BPE: processing
Loading codes from /net/cephfs/scratch/mathmu/laser-contra/tools/laser/models/93langs.fvocab ...
fast: fastBPE/fastBPE.hpp:455: void fastBPE::readCodes(const char*, std::unordered_map<std::pair<std::__cxx11::basic_string<char>, std::__cxx11::basic_string<char> >, unsigned int, fastBPE::pair_hash>&, std::unordered_map<std::__cxx11::basic_string<char>, std::pair<std::__cxx11::basic_string<char>, std::__cxx11::basic_string<char> > >&): Assertion `splits.size() == 3' failed.
Aborted (core dumped)

This happens because the input is still sys.stdin before tokenization, so there is no input file name yet. The tokenization step can handle input from STDIN, but the fastBPE step needs a file name. With an empty one, it tries to execute the following command:

./fast applybpe [TWO SPACES HERE] /tmp/tmpgh9sgiy_/bpe \
    tools/laser/models/93langs.fcodes \
    tools/laser/models/93langs.fvocab

The general recipe, however, is:

./fast applybpe output input codes vocab

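The shell drops the empty argument, so the remaining arguments shift by one position and a file ends up in the codes slot that is not the codes file. Judging by the log above, it is 93langs.fvocab that gets read as the codes file, which makes this assertion fail.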
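
The argument shift can be reproduced without running fastBPE at all. The following is a minimal sketch, assuming the command is assembled as a single space-joined string and then split by a shell; `parse_applybpe` and `require_file_name` are hypothetical helpers written for this illustration, not functions from embed.py or fastBPE. The slot names are taken from the recipe above.

```python
import shlex

# Slot names follow the recipe quoted in the issue: output input codes vocab.
SLOTS = ["output", "input", "codes", "vocab"]


def parse_applybpe(command):
    """Split a command string the way a POSIX shell would and map the
    positional arguments after './fast applybpe' onto the slot names."""
    tokens = shlex.split(command)
    return dict(zip(SLOTS, tokens[2:]))


def require_file_name(name):
    """Hypothetical guard: reject an empty file name before building a command."""
    if not name:
        raise ValueError("expected a file name, got an empty string")
    return name


# Arguments as they appear in the failing command; the first one is empty.
args = [
    "",
    "/tmp/tmpgh9sgiy_/bpe",
    "tools/laser/models/93langs.fcodes",
    "tools/laser/models/93langs.fvocab",
]

# Assumption: the command is assembled as one string joined by spaces.
command = "./fast applybpe " + " ".join(args)
parsed = parse_applybpe(command)

print(command)  # contains two consecutive spaces where the empty argument was
print(parsed)   # the empty argument is gone and every later one has moved up a slot
```

In this sketch the `codes` slot receives 93langs.fvocab and the `vocab` slot is left unfilled, which matches the "Loading codes from ... 93langs.fvocab" line in the log. One possible way to avoid this, offered only as a suggestion, would be to check the file name with something like `require_file_name` before building the command, or to write STDIN to a temporary file first so that fastBPE always receives a real path.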

bricksdont linked a pull request on Jan 27, 2020 that will close this issue.