This repository has been archived by the owner on Oct 31, 2023. It is now read-only.

Why codes file is empty.? #101

Open
ykkhan opened this issue Sep 4, 2020 · 4 comments
Comments

ykkhan commented Sep 4, 2020

I am getting this error:

Applying BPE to valid and test files...
Loading vocabulary from /home/UnsupervisedMT/NMT/data/mono/vocab.en.1500 ...
Read 26726 words (93 unique) from vocabulary file.
Loading codes from /home/UnsupervisedMT/NMT/data/mono/bpe_codes ...
Read 0 codes from the codes file.
Loading vocabulary from /home/UnsupervisedMT/NMT/data/para/vs11.txt ...
Read 0 words (0 unique) from text file.
Applying BPE to /home/UnsupervisedMT/NMT/data/para/vs11.txt ...
Output memory map failed : 22.

Where am I making a mistake?

ykkhan changed the title from "Output memory map failed : 22" to "Why codes file is empty.?" on Sep 8, 2020

glample (Contributor) commented Sep 8, 2020

This kind of issue with fastBPE typically happens when you have too many BPE codes and too small a vocabulary.
What is your vocabulary size, and how many BPE codes are you trying to compute?
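
To answer those two questions, a quick count over the training text and the codes file can help. The following is only a minimal sketch, not part of fastBPE or this repository: the function names are hypothetical, and it assumes whitespace-separated tokens and a codes file with one merge per line.

```python
def corpus_stats(text: str) -> dict:
    """Summarize a corpus given as one string with one sentence per line."""
    lines = [line for line in text.splitlines() if line.strip()]
    # Assumption: tokens are separated by whitespace.
    tokens = [tok for line in lines for tok in line.split()]
    chars = {ch for tok in tokens for ch in tok}
    single = sum(1 for tok in tokens if len(tok) == 1)
    return {
        "sentences": len(lines),
        "tokens": len(tokens),
        "unique_tokens": len(set(tokens)),
        "unique_chars": len(chars),
        # A ratio close to 1.0 suggests the text is split into single characters.
        "single_char_ratio": single / len(tokens) if tokens else 0.0,
    }


def count_codes(codes_text: str) -> int:
    """Count non-empty lines in the content of a codes file.

    Assumption: the codes file stores one merge operation per line.
    """
    return sum(1 for line in codes_text.splitlines() if line.strip())


# Usage with hypothetical paths:
#   from pathlib import Path
#   print(corpus_stats(Path("train.en").read_text(encoding="utf-8")))
#   print(count_codes(Path("bpe_codes").read_text(encoding="utf-8")))
```

If `unique_tokens` is close to `unique_chars` and `single_char_ratio` is near 1.0, the input is probably being read as individual characters rather than words. That would be consistent with a vocabulary of only 93 unique entries.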

ykkhan (Author) commented Sep 8, 2020

OK, I get it.
Can you suggest how many codes I should compute for a dataset that has 610 sentences in each file (train, test, valid)? It is also not extracting the correct vocabulary: it extracts only letters instead of words, as shown in the image, and it computes a vocabulary size of just 93.

[Image: Screenshot from 2020-09-08 13-23-17]

For 610 sentences, I have tried different values from about 200 to 1500 for the number of BPE codes. Every value gives the same issue.
The dataset is very small, but for now my task is just to run this technique successfully. After that I will increase the dataset size.

glample (Contributor) commented Sep 10, 2020

610 sentences is very small. I would simply use word-level tokenization and not BPE in this case.
BPE is useful to reduce the vocabulary size and to avoid computing a softmax over hundreds of thousands of elements. But in your case the vocabulary will be very small, so you probably don't need BPE.
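
As a rough illustration of what a word-level vocabulary looks like, here is a minimal sketch. It is not the preprocessing used by this repository: the special tokens, function names, and `min_count` parameter are hypothetical, and it assumes the sentences are already tokenized with whitespace between words.

```python
from collections import Counter

# Hypothetical special tokens; a real pipeline may use different ones.
SPECIALS = ["<pad>", "<unk>"]


def build_word_vocab(sentences, min_count=1):
    """Map each word seen at least min_count times to an integer id."""
    counts = Counter(tok for sent in sentences for tok in sent.split())
    kept = [w for w, c in counts.items() if c >= min_count]
    # Most frequent words first; ties broken alphabetically for determinism.
    kept.sort(key=lambda w: (-counts[w], w))
    return {word: idx for idx, word in enumerate(SPECIALS + kept)}


def encode(sentence, vocab):
    """Convert a sentence to ids, using <unk> for out-of-vocabulary words."""
    unk = vocab["<unk>"]
    return [vocab.get(tok, unk) for tok in sentence.split()]


vocab = build_word_vocab(["the cat sat", "the dog sat"])
ids = encode("the bird sat", vocab)  # "bird" is unseen and maps to <unk>
```

With only a few hundred sentences, a vocabulary built this way stays small, so the softmax cost that BPE is meant to reduce is not a concern.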

ykkhan (Author) commented Sep 12, 2020

OK, thanks for replying. Can you guide me a little more? Which part of data.enfr.sh should I remove, and what should I insert in the code to do this?