[Help Wanted] How to apply BPE when input contains word features? #534
Hello Howard, we simply never did that; this is a bug. However, what would you expect for the features of a split word? Just duplicate them, or change them to something else? |
@jsenellart Hello, I expect the word features not to be affected by the tokenizer. And yes, I think just duplicating them by concatenating them back onto each BPE-tokenized piece would work. |
What I want to achieve is what the paper "Linguistic Input Features Improve Neural Machine Translation" did: |
As in the example above, my output will look like the following (listed one token per line instead of space-separated, for readability):
The first field before "│" is the BPE-tokenized word, and the following fields are the word features, as usual. I hope the tokenizer can support this input and tokenize only the first field. |
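To make the expected behavior concrete, here is a minimal sketch (my own illustration, not the OpenNMT tokenizer): split only the surface form with BPE and copy the features onto every resulting piece. The function name and the `toy_bpe` stand-in are hypothetical.

```python
def attach_features_to_pieces(token_with_feats, bpe_split,
                              feature_delimiter='|'):
    """Split only the surface form; duplicate the features onto each piece.

    `bpe_split` is any callable mapping a word to a list of subword
    pieces (a stand-in for a real BPE model).
    """
    parts = token_with_feats.split(feature_delimiter)
    word, feats = parts[0], parts[1:]
    return [feature_delimiter.join([piece] + feats)
            for piece in bpe_split(word)]

# Toy BPE stand-in that splits "Howard" into "How■" + "ard".
toy_bpe = lambda w: ['How■', 'ard'] if w == 'Howard' else [w]

print(attach_features_to_pieces('Howard|FEAT3', toy_bpe))
# → ['How■|FEAT3', 'ard|FEAT3']
```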
I may have figured out a feasible, if slightly tricky, way to achieve what I want: |
Sorry, I did not explain: it cannot work that way, at least for training a model.
(I did not see your latest comment.) The best option in that case is to add a feature like what you propose:
I propose to introduce 3 modes (1, 2, 4) to cover these options. |
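The exact semantics of the 3 proposed modes are not spelled out in this excerpt; the following is only my hedged guess at what such feature-propagation modes could look like (the function name, the interpretation of modes 1/2/4, and the `placeholder` value are all my own assumptions, not the final OpenNMT design):

```python
def propagate_features(pieces, feats, mode,
                       feature_delimiter='|', placeholder='N/A'):
    """Hypothetical sketch of feature propagation across subword pieces.

    mode 1: duplicate the features on every piece
    mode 2: features on the first piece only, placeholder elsewhere
    mode 4: features on the last piece only, placeholder elsewhere
    """
    out = []
    for i, piece in enumerate(pieces):
        if mode == 1:
            chosen = feats
        elif mode == 2:
            chosen = feats if i == 0 else [placeholder] * len(feats)
        elif mode == 4:
            chosen = feats if i == len(pieces) - 1 else [placeholder] * len(feats)
        else:
            raise ValueError('unknown mode: %r' % mode)
        out.append(feature_delimiter.join([piece] + chosen))
    return out

print(propagate_features(['How■', 'ard'], ['FEAT3'], mode=1))
# → ['How■|FEAT3', 'ard|FEAT3']
print(propagate_features(['How■', 'ard'], ['FEAT3'], mode=2))
# → ['How■|FEAT3', 'ard|N/A']
```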
@jsenellart Thanks, that will be great! |
I wrote a feasible example Python script that meets my requirement:

```python
def align_features_for_subword_raw(raw_no_feat_subword, raw_feat,
                                   feature_delimiter='│',
                                   subword_delimiter='■',
                                   keep_word=False,
                                   add_subword_indicator_feature=False):
    # Skip the surface form in the feature columns unless we keep it.
    feature_start_pos = 0 if keep_word else 1
    word_cursor = 0
    subwords_cache = []
    subwords_with_features = []
    words_with_features = [word_with_features.split(feature_delimiter)
                           for word_with_features in raw_feat.split()]
    for word in raw_no_feat_subword.split():
        subword_with_features = [word] + words_with_features[word_cursor][feature_start_pos:]
        subwords = word.split(subword_delimiter)
        if len(subwords) < 2 and len(subwords_cache) == 0:
            # The token is a whole word, not a subword piece.
            if add_subword_indicator_feature:
                subword_with_features.append('O')
            word_cursor += 1
        elif len(subwords) < 2 and len(subwords_cache) > 0:
            # The token is the last subword piece of a word.
            if add_subword_indicator_feature:
                subword_with_features.append('E')
            del subwords_cache[:]
        elif len(subwords) >= 2 and len(subwords_cache) == 0:
            # The token is the first subword piece of a word.
            if add_subword_indicator_feature:
                subword_with_features.append('B')
            subwords_cache.append(word)
        elif len(subwords) >= 2 and len(subwords_cache) > 0:
            # The token is a middle subword piece of a word.
            if add_subword_indicator_feature:
                subword_with_features.append('I')
            subwords_cache.append(word)
        subwords_with_features.append(subword_with_features)
    # Detokenize back to a raw line.
    raw_feat_subword = ' '.join(feature_delimiter.join(subword_with_features)
                                for subword_with_features in subwords_with_features)
    return raw_feat_subword
```
```python
line_no_feat = "Unforturntly , almost older people can not use internet , in spite of benefit of internet ."
line_pos_dep = "Unforturntly│PROPN│advmod ,│PUNCT│punct almost│ADV│advmod older│ADJ│amod people│NOUN│nsubj can│VERB│aux not│ADV│neg use│VERB│ROOT internet│NOUN│dobj ,│PUNCT│punct in│ADP│prep spite│NOUN│pobj of│ADP│prep benefit│NOUN│pobj of│ADP│prep internet│NOUN│pobj .│PUNCT│punct"
line_no_feat_bpe = "Un■ for■ tur■ n■ tly , almost older people can not use internet , in spite of benefit of internet ."
```

Simply call:

```python
align_features_for_subword_raw(line_no_feat_bpe,
                               line_pos_dep,
                               keep_word=True,
                               add_subword_indicator_feature=True)
```

which returns:

'Un■│Unforturntly│PROPN│advmod│B for■│Unforturntly│PROPN│advmod│I tur■│Unforturntly│PROPN│advmod│I n■│Unforturntly│PROPN│advmod│I tly│Unforturntly│PROPN│advmod│E ,│Unforturntly│PROPN│advmod│O almost│,│PUNCT│punct│O older│almost│ADV│advmod│O people│older│ADJ│amod│O can│people│NOUN│nsubj│O not│can│VERB│aux│O use│not│ADV│neg│O internet│use│VERB│ROOT│O ,│internet│NOUN│dobj│O in│,│PUNCT│punct│O spite│in│ADP│prep│O of│spite│NOUN│pobj│O benefit│of│ADP│prep│O of│benefit│NOUN│pobj│O internet│of│ADP│prep│O .│internet│NOUN│pobj│O' |
Can you please explain what the (B, O, I, E) tags represent? Also, about the features: are you trying to include the feature of the previous word on the current word? And what happens if BPE is applied to the following remaining words instead of only the word "Unforturntly"? Thank you. |
I found that the current
tools/tokenize.lua
will tokenize word features that are concatenated with the token when using bpe_model
, resulting in unwanted tokenization results. For example, given the input: I|FEAT1 am|FEAT2 Howard|FEAT3
However, the current
tools/tokenize.lua
does not ignore the word features and treats "word + word features" as a single token. The result will be like: I ■ | ■ FEAT am ■ | ■ FEAT How ■ ard ■ | ■ FEAT
I am wondering how I should apply BPE when my input file contains word features?
I don't want the tokenization to affect the word features.
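One workaround is to strip the features before BPE and re-attach them to every resulting piece afterwards. A minimal sketch of that pipeline (the function name and the `toy_bpe` stand-in are my own assumptions; `bpe_split` stands in for a real BPE model, not the tools/tokenize.lua implementation):

```python
def bpe_with_features(line, bpe_split, feature_delimiter='|'):
    """Strip features, BPE-split each surface form, then re-attach the
    features to every resulting subword piece."""
    out = []
    for token in line.split():
        word, *feats = token.split(feature_delimiter)
        for piece in bpe_split(word):
            out.append(feature_delimiter.join([piece] + feats))
    return ' '.join(out)

# Toy BPE stand-in that splits "Howard" into "How■" + "ard".
toy_bpe = lambda w: ['How■', 'ard'] if w == 'Howard' else [w]

print(bpe_with_features('I|FEAT1 am|FEAT2 Howard|FEAT3', toy_bpe))
# → 'I|FEAT1 am|FEAT2 How■|FEAT3 ard|FEAT3'
```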