[Help Wanted] How to apply BPE when input contains word features? #534

howardyclo · 2018-04-08T06:10:27Z

I found that the current tools/tokenize.lua also tokenizes word features that are concatenated to a token when bpe_model is used, which produces unwanted tokenization results. For example:

Input: I | FEAT1 am | FEAT2 Howard | FEAT3
[Updated] The expected output should be one of the following (as mentioned by @jsenellart below):
- (new X feature) I|FEAT1 am|FEAT2 How|X ￭ard|FEAT3
- (duplicate FEAT3 feature) I|FEAT1 am|FEAT2 How|FEAT3 ￭ard|FEAT3
- (mix) I|FEAT1 am|FEAT2 How|XFEAT3 ￭ard|FEAT3

However, the current tools/tokenize.lua does not ignore the word features; it treats "word + word features" as a single token. The result looks like:

Unwanted output: I ￭ | ￭ FEAT am ￭ | ￭ FEAT How ￭ ard ￭ | ￭ FEAT

How do I apply BPE when my input file contains word features?
I don't want the tokenization to affect the word features.


jsenellart · 2018-04-08T06:24:09Z

Hello Howard, we simply never handled that case; this is a bug. However, what would you expect for the feature of a split word? Just duplicate it, or change it to something else?

howardyclo · 2018-04-08T06:55:21Z

@jsenellart Hello, I expect the word features not to be affected by the tokenizer. I think, yes, just duplicate them by concatenating them back onto each BPE-tokenized subword.

howardyclo · 2018-04-08T06:59:46Z

What I want to achieve is what the paper "Linguistic Input Features Improve Neural Machine Translation" did:

howardyclo · 2018-04-08T07:06:32Z

As in the example above, my output would look like this (I list one token per line instead of joining them with spaces, for readability):

Le ￭ | Leonidas | B | NNP | nsubj
oni ￭ | Leonidas | I | NNP | nsubj
das | Leonidas | E | NNP | nsubj
beg ￭ | beg | B | VBD | root
ged | beg | E | VBD | root
in | in | O | IN | prep
the | the | O | DT | det
arena | arena | O | NN | pobj
. | . | O | . | root

The first field before "|" is the BPE-tokenized word; the following fields are word features as usual. I hope the tokenizer can support this input and tokenize only the first field.
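A minimal sketch of how rows like the ones listed above could be built once a word has already been split into subwords. The helper names `bioe_tags` and `annotate_pieces` are hypothetical (they are not part of tools/tokenize.lua), and the subword pieces are supplied by hand rather than produced by a real BPE model:

```python
JOINER = "￭"  # joiner marker, as used elsewhere in this thread


def bioe_tags(n_pieces):
    """Return one tag per subword: O for an unsplit word, else B, I..., E."""
    if n_pieces == 1:
        return ["O"]
    return ["B"] + ["I"] * (n_pieces - 2) + ["E"]


def annotate_pieces(word, pieces, features):
    """Attach the original word, a B/I/E/O tag and the word's features to each piece."""
    rows = []
    last = len(pieces) - 1
    for i, (piece, tag) in enumerate(zip(pieces, bioe_tags(len(pieces)))):
        # Every piece except the last carries the joiner as a suffix.
        surface = piece + JOINER if i < last else piece
        rows.append([surface, word, tag] + list(features))
    return rows


# Hand-made segmentation, for illustration only.
rows = annotate_pieces("Leonidas", ["Le", "oni", "das"], ["NNP", "nsubj"])
for row in rows:
    print(" | ".join(row))
```

Only the first column changes from piece to piece; the word-level features are simply repeated on every row.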

howardyclo · 2018-04-08T07:18:31Z

I may have figured out a feasible, if slightly tricky, way to achieve what I want:

1. Learn BPE codes from the word-level corpus without word features, using tools/learn_bpe.lua.
2. Tokenize the corpus without word features using the BPE codes learned in step 1, with tools/tokenize.lua. joiner_annotate must be enabled so that the output can be detokenized. This gives a subword-level corpus without word features.
3. Build the subword-level vocabulary from the subword-level corpus without word features (from step 2), using tools/build_vocab.lua.
4. Build separate word feature vocabularies from the additional word-level corpus with word features, using tools/build_vocab.lua.
5. With the two separate vocabularies from steps 3 and 4, build the training data with preprocess.lua, setting {src,tgt}_vocab to the vocabulary from step 3 and features_vocabs_prefix to the feature vocabularies from step 4.
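The first two steps need a copy of the corpus with the word features stripped. A minimal sketch of that preparation, assuming the features are attached with the "￨" delimiter used by the script later in this thread; `split_words_and_features` is a hypothetical helper, not an OpenNMT tool:

```python
FEATURE_DELIMITER = "￨"  # assumed feature separator


def split_words_and_features(line, delimiter=FEATURE_DELIMITER):
    """Split an annotated line into a plain-word line and per-word feature lists."""
    words, features = [], []
    for token in line.split():
        word, *feats = token.split(delimiter)
        words.append(word)
        features.append(feats)
    return " ".join(words), features


plain, feats = split_words_and_features("I￨FEAT1 am￨FEAT2 Howard￨FEAT3")
print(plain)  # I am Howard
print(feats)  # [['FEAT1'], ['FEAT2'], ['FEAT3']]
```

The plain line can then be used for learning and applying BPE, while the feature lists are kept aside to be re-attached afterwards.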

jsenellart · 2018-04-08T07:22:48Z

Sorry, I did not explain: it cannot work that way, at least not to train a model.

I|FEAT1 am|FEAT2 How ￭ard|FEAT3
will not work (preprocess will fail) because How does not have a feature, so we need to give it one, like:

(new X feature) I|FEAT1 am|FEAT2 How|X ￭ard|FEAT3
(duplicate FEAT3 feature) I|FEAT1 am|FEAT2 How|FEAT3 ￭ard|FEAT3
(mix) I|FEAT1 am|FEAT2 How|XFEAT3 ￭ard|FEAT3

(I did not see your latest comment) - the best in that case is to add a feature like what you propose:

(new feature: B, O, I, E) I|O|FEAT1 am|O|FEAT2 How|B|XFEAT3 ￭ard|E|FEAT3

I propose to introduce 3 modes (1, 2, 4) to cover these options.
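The three options for the non-final pieces of a split word can be sketched as follows. This is only an illustration of the expected outputs listed above, not the tokenizer's implementation; the function name `expand`, the mode names and the placeholder "X" as a parameter are assumptions:

```python
JOINER = "￭"
DELIM = "|"  # feature separator as written in the examples above


def expand(pieces, feature, mode, placeholder="X"):
    """Return 'piece|feature' tokens for one word already split into pieces.

    The last piece always keeps the original feature; `mode` decides what
    the earlier pieces get: "new", "duplicate" or "mix".
    """
    out = []
    last = len(pieces) - 1
    for i, piece in enumerate(pieces):
        if i == last or mode == "duplicate":
            feat = feature
        elif mode == "new":
            feat = placeholder
        elif mode == "mix":
            feat = placeholder + feature
        else:
            raise ValueError("unknown mode: " + mode)
        # As in the examples above, the joiner is prefixed to non-initial pieces.
        surface = piece if i == 0 else JOINER + piece
        out.append(surface + DELIM + feat)
    return out


sentence = [(["I"], "FEAT1"), (["am"], "FEAT2"), (["How", "ard"], "FEAT3")]
for mode in ("new", "duplicate", "mix"):
    tokens = [t for pieces, feat in sentence for t in expand(pieces, feat, mode)]
    print(mode, "->", " ".join(tokens))
```

In every mode each output token carries exactly one feature, which is what preprocess needs.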

howardyclo · 2018-04-08T07:38:51Z

@jsenellart Thanks, that will be great!

howardyclo · 2018-04-09T06:06:13Z

I wrote an example Python script that meets my requirement:

def align_features_for_subword_raw(raw_no_feat_subword, raw_feat,
                                   feature_delimiter = '￨',
                                   subword_delimiter = '￭',
                                   keep_word=False,
                                   add_subword_indicator_feature=False):

    feature_start_pos = 0 if keep_word else 1
    word_curser = 0
    subwords_cache = []
    subwords_with_features = []
    words_with_features = [words_with_feature.split(feature_delimiter) for words_with_feature in raw_feat.split()]

    for word in raw_no_feat_subword.split():
        # Split words
        subword_with_features = [word] + words_with_features[word_curser][feature_start_pos:]
        subwords = word.split(subword_delimiter)
        
        # The word is not a subword.
        if len(subwords) < 2 and len(subwords_cache) == 0:
            if add_subword_indicator_feature: subword_with_features.append('O')
            word_curser += 1

        # The word is a last subword of a word.
        elif len(subwords) < 2 and len(subwords_cache) > 0:
            if add_subword_indicator_feature: subword_with_features.append('E')
            del subwords_cache[:]

        # The word is a beginning subword of a word.
        elif len(subwords) >= 2 and len(subwords_cache) == 0:
            if add_subword_indicator_feature: subword_with_features.append('B')
            subwords_cache.append(word)
        
        # The word is a middle subword of a word.
        elif len(subwords) >= 2 and len(subwords_cache) > 0:
            if add_subword_indicator_feature: subword_with_features.append('I')
            subwords_cache.append(word)
        
        subwords_with_features.append(subword_with_features)

    # Detokenize back to raw.
    raw_feat_subword = ' '.join([feature_delimiter.join(subword_with_features) for subword_with_features in subwords_with_features])
    return raw_feat_subword

line_no_feat = "Unforturntly , almost older people can not use internet , in spite of benefit of internet ."
line_pos_dep = "Unforturntly￨PROPN￨advmod ,￨PUNCT￨punct almost￨ADV￨advmod older￨ADJ￨amod people￨NOUN￨nsubj can￨VERB￨aux not￨ADV￨neg use￨VERB￨ROOT internet￨NOUN￨dobj ,￨PUNCT￨punct in￨ADP￨prep spite￨NOUN￨pobj of￨ADP￨prep benefit￨NOUN￨pobj of￨ADP￨prep internet￨NOUN￨pobj .￨PUNCT￨punct"
line_no_feat_bpe = "Un￭ for￭ tur￭ n￭ tly , almost older people can not use internet , in spite of benefit of internet ."

Simply call:

align_features_for_subword_raw(line_no_feat_bpe,
                               line_pos_dep,
                               keep_word=True,
                               add_subword_indicator_feature=True)

returns:

'Un￭￨Unforturntly￨PROPN￨advmod￨B for￭￨Unforturntly￨PROPN￨advmod￨I tur￭￨Unforturntly￨PROPN￨advmod￨I n￭￨Unforturntly￨PROPN￨advmod￨I tly￨Unforturntly￨PROPN￨advmod￨E ,￨Unforturntly￨PROPN￨advmod￨O almost￨,￨PUNCT￨punct￨O older￨almost￨ADV￨advmod￨O people￨older￨ADJ￨amod￨O can￨people￨NOUN￨nsubj￨O not￨can￨VERB￨aux￨O use￨not￨ADV￨neg￨O internet￨use￨VERB￨ROOT￨O ,￨internet￨NOUN￨dobj￨O in￨,￨PUNCT￨punct￨O spite￨in￨ADP￨prep￨O of￨spite￨NOUN￨pobj￨O benefit￨of￨ADP￨prep￨O of￨benefit￨NOUN￨pobj￨O internet￨of￨ADP￨prep￨O .￨internet￨NOUN￨pobj￨O'

LinuxBeginner · 2019-08-12T08:54:37Z

Can you please explain what the (B, O, I, E) tags represent?
Is B the beginning of a word (start of BPE), O a standalone word (no BPE), I an intermediate subword (of BPE), and E the end tag of BPE?

Also, why is the feature of:

"," (,￨Unforturntly￨PROPN￨advmod￨O) and not something like (,￨,￨XX￨XX￨O)
almost (almost￨,￨PUNCT￨punct￨O) and not something like (almost￨almost￨YY￨YY￨O)
as per rule 2 mentioned by jsenellart?

Are you trying to attach the features of the previous word to the current word?

What happens if BPE is applied to the remaining words as well, instead of only to the word "Unforturntly"?

Thank you.
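The shifted features in the output above appear to come from the script not advancing its word cursor after the last subword of a split word (the 'E' branch), so every later token picks up the previous word's features. Below is a minimal sketch of an alignment that advances the cursor after every word-final piece. It is not the original script: `align_features` and `add_indicator` are hypothetical names, and it assumes the joiner is a suffix on every non-final piece, as in the sample BPE line above.

```python
def align_features(bpe_line, feat_line, feature_delimiter="￨", joiner="￭",
                   keep_word=False, add_indicator=False):
    """Copy each word's features onto all of its subword pieces."""
    words = [tok.split(feature_delimiter) for tok in feat_line.split()]
    start = 0 if keep_word else 1  # field 0 is the original word
    cursor = 0       # index of the word the current piece belongs to
    inside = False   # True while in the middle of a split word
    out = []
    for piece in bpe_line.split():
        fields = [piece] + words[cursor][start:]
        continues = piece.endswith(joiner)
        if continues:
            tag = "I" if inside else "B"
        else:
            tag = "E" if inside else "O"
        if add_indicator:
            fields.append(tag)
        out.append(feature_delimiter.join(fields))
        inside = continues
        if not continues:
            cursor += 1  # move on after every word-final piece (E as well as O)
    return " ".join(out)


bpe = "Un￭ for￭ tur￭ n￭ tly , almost in￭ ter￭ net"
feat = "Unforturntly￨PROPN￨advmod ,￨PUNCT￨punct almost￨ADV￨advmod internet￨NOUN￨dobj"
result = align_features(bpe, feat, keep_word=True, add_indicator=True)
print(result)
```

Under these assumptions B marks the first piece of a split word, I a middle piece, E the last piece and O an unsplit word, and splitting more than one word in the line does not shift the features.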

jsenellart self-assigned this Apr 8, 2018

jsenellart added the enhancement label Apr 8, 2018
