This repository has been archived by the owner on Jul 26, 2019. It is now read-only.

What's the meaning of TextEncoder.BERT_SPECIAL_COUNT and TextEncoder.BERT_UNUSED_COUNT? #16

Open
ChiuHsin opened this issue Jan 14, 2019 · 4 comments


@ChiuHsin

When I use BERT-keras, I don't understand this part:

```python
class TextEncoder:
    PAD_OFFSET = 0
    MSK_OFFSET = 1
    BOS_OFFSET = 2
    DEL_OFFSET = 3  # delimiter
    EOS_OFFSET = 4
    SPECIAL_COUNT = 5
    NUM_SEGMENTS = 2
    BERT_UNUSED_COUNT = 99  # bert pretrained models
    BERT_SPECIAL_COUNT = 4  # they don't have DEL
```
Why would you set it up like this?
Also, BERT_UNUSED_COUNT = 99 and BERT_SPECIAL_COUNT = 4 are used in load_google_bert.

@Separius
Owner

Hi,
There are some special tokens in the vocabulary (for example, BOS stands for Beginning Of Sentence), and we can put them either at the beginning of the lookup table (the embedding) or at the end; I decided to put them at the beginning.
As for BERT_UNUSED_COUNT, you can check the vocab files of the pretrained BERT models.
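Roughly, the idea is this (a sketch for illustration only, not the repo's exact code; the helper name is made up): the *_OFFSET constants fix the relative order of the special tokens inside a reserved block of the embedding table, and each token's absolute row is just a base index plus its offset.

```python
PAD_OFFSET, MSK_OFFSET, BOS_OFFSET, DEL_OFFSET, EOS_OFFSET = 0, 1, 2, 3, 4
SPECIAL_COUNT = 5


def special_token_rows(vocab_size: int, specials_at_start: bool = True) -> dict:
    # base is 0 when the specials sit before the regular vocabulary,
    # or vocab_size when they are appended after it
    base = 0 if specials_at_start else vocab_size
    return {
        'pad': base + PAD_OFFSET,
        'msk': base + MSK_OFFSET,
        'bos': base + BOS_OFFSET,
        'del': base + DEL_OFFSET,
        'eos': base + EOS_OFFSET,
    }


print(special_token_rows(30000, specials_at_start=False))
# {'pad': 30000, 'msk': 30001, 'bos': 30002, 'del': 30003, 'eos': 30004}
```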

@Separius
Owner

Ah, you might be confused by how they are used, right?
Say you want to feed a sentence into your network: you add the BOS and EOS tokens around the sentence, and for that you need to know their locations in the embedding table.
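For example (a minimal made-up sketch: the vocab size and token ids are hypothetical, and here the specials are assumed to sit after the regular vocabulary, matching the vocab_size + OFFSET indexing used in load_google_bert):

```python
BOS_OFFSET, EOS_OFFSET = 2, 4      # from the TextEncoder constants above
vocab_size = 30000                 # hypothetical size of the regular vocabulary
bos_id = vocab_size + BOS_OFFSET   # row of the BOS embedding
eos_id = vocab_size + EOS_OFFSET   # row of the EOS embedding

sentence_ids = [1851, 204, 993]    # hypothetical ids from the tokenizer
model_input = [bos_id] + sentence_ids + [eos_id]
print(model_input)                 # [30002, 1851, 204, 993, 30004]
```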

@ChiuHsin
Author

I see, but when I load the Google model, load_google_bert computes vocab_size = vocab_size - TextEncoder.BERT_SPECIAL_COUNT - TextEncoder.BERT_UNUSED_COUNT, and then the indices don't match: when w_id == 2, the line weights[w_id][vocab_size + TextEncoder.EOS_OFFSET] = saved[3 + TextEncoder.BERT_UNUSED_COUNT] cannot load the weight.
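For reference, this is how I read the index arithmetic on the Google side (assuming the standard uncased vocab.txt ordering; it only shows where those constants point, not a fix):

```python
# Assumed layout of Google's vocab.txt in the pretrained models:
#   0       -> [PAD]
#   1..99   -> [unused0] .. [unused98]   (hence BERT_UNUSED_COUNT = 99)
#   100     -> [UNK]
#   101     -> [CLS]
#   102     -> [SEP]
#   103     -> [MASK]
BERT_UNUSED_COUNT = 99

cls_row = 2 + BERT_UNUSED_COUNT   # 101 -> [CLS], used as BOS
sep_row = 3 + BERT_UNUSED_COUNT   # 102 -> [SEP], used as EOS in the line quoted above
msk_row = 4 + BERT_UNUSED_COUNT   # 103 -> [MASK]
print(cls_row, sep_row, msk_row)  # 101 102 103
```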

@Separius
Owner

Separius commented Feb 2, 2019

@ChiuHsin I guess you are right, and it seems you were able to solve it (based on the other issue you posted).
Can you please send a pull request to correct this problem?
Thanks!

@Separius Separius closed this as completed Feb 2, 2019
@Separius Separius reopened this Feb 2, 2019