This repository has been archived by the owner on Jun 10, 2021. It is now read-only.

[Help Wanted] How to apply BPE when input contains word features? #534

Open
howardyclo opened this issue Apr 8, 2018 · 9 comments

@howardyclo

howardyclo commented Apr 8, 2018

I found that the current tools/tokenize.lua also tokenizes the word features attached to each token when bpe_model is used, which produces unwanted tokenization results. For example:

  • Input: I | FEAT1 am | FEAT2 Howard | FEAT3
  • [Updated] The expected output should look like one of the following (as mentioned by @jsenellart below):
    • (new X feature) I|FEAT1 am|FEAT2 How|X ■ard|FEAT3
    • (duplicate FEAT3 feature) I|FEAT1 am|FEAT2 How|FEAT3 ■ard|FEAT3
    • (mix) I|FEAT1 am|FEAT2 How|XFEAT3 ■ard|FEAT3

However, the current tools/tokenize.lua does not ignore the word features; it treats "word + word features" as a single token. The result looks like:

  • Unwanted output: I ■ | ■ FEAT am ■ | ■ FEAT How ■ ard ■ | ■ FEAT

How do I apply BPE when my input file contains word features?
I don't want the tokenization to affect the word features.

@jsenellart
Contributor

Hello Howard, we simply never handled that case; this is a bug. However, what would you expect for the features of a split word? Just duplicate them, or change them to something else?

@howardyclo
Author

howardyclo commented Apr 8, 2018

@jsenellart Hello, I expect the word features not to be affected by the tokenizer. I think, yes, just duplicate them by concatenating them back onto the BPE-tokenized word.

@howardyclo
Author

What I want to achieve is what the paper "Linguistic Input Features Improve Neural Machine Translation" does:
img

@howardyclo
Author

howardyclo commented Apr 8, 2018

As in the example above, my output would look like this (listed one token per line instead of space-separated, for readability):

  • Le ■ | Leonidas | B | NNP | nsubj
  • oni ■ | Leonidas | I | NNP | nsubj
  • das | Leonidas | E | NNP | nsubj
  • beg ■ | beg | B | VBD | root
  • ged | beg | E | VBD | root
  • in | in | O | IN | prep
  • the | the | O | DT | det
  • arena | arena | O | NN | pobj
  • . | . | O | . | root

The first field before "|" is the BPE-tokenized word; the following fields are word features as usual. I hope the tokenizer can support this input and tokenize only the first field.

@howardyclo
Author

howardyclo commented Apr 8, 2018

I may have figured out a feasible, if slightly tricky, way to achieve what I want:

  1. Learn BPE codes from a word-level corpus without word features, using tools/learn_bpe.lua.
  2. Tokenize the corpus without word features with tools/tokenize.lua, using the BPE codes learned in step 1. joiner_annotate must be enabled so that the output can be detokenized. This gives a subword-level corpus without word features.
  3. Build the subword-level vocabulary from the corpus produced in step 2, using tools/build_vocab.lua.
  4. Build separate word feature vocabularies from an additional word-level corpus with word features, using tools/build_vocab.lua.
  5. With the two sets of vocabularies from steps 3 and 4, build the training data with preprocess.lua, setting {src,tgt}_vocab to the vocabulary from step 3 and features_vocabs_prefix to the feature vocabularies from step 4.
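
The workaround above starts from a corpus without word features. If only a feature-annotated corpus is available, the plain words can be separated from their features first. The following is a minimal sketch, not part of OpenNMT: the helper name split_features is hypothetical, and the '│' delimiter is an assumption that should be replaced by whatever delimiter the corpus actually uses.

```python
def split_features(line, feature_delimiter="│"):
    """Split a feature-annotated line into a plain sentence and per-token features.

    Hypothetical helper: the first field of each token is treated as the word,
    all remaining fields as its features.
    """
    words = []
    features = []
    for token in line.split():
        fields = token.split(feature_delimiter)
        words.append(fields[0])
        features.append(fields[1:])
    return " ".join(words), features


plain, feats = split_features("I│FEAT1 am│FEAT2 Howard│FEAT3")
print(plain)  # I am Howard
print(feats)  # [['FEAT1'], ['FEAT2'], ['FEAT3']]
```

The plain sentence can then go through steps 1 to 3, while the feature lists are kept aside to be re-attached to the subwords afterwards.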

@jsenellart
Contributor

jsenellart commented Apr 8, 2018

Sorry, I did not explain: it cannot work that way, at least not for training a model.

I|FEAT1 am|FEAT2 How ■ard|FEAT3
will not work (preprocess will fail) because How does not have a feature, so we need to add one, for example:

  1. (new X feature) I|FEAT1 am|FEAT2 How|X ■ard|FEAT3
  2. (duplicate FEAT3 feature) I|FEAT1 am|FEAT2 How|FEAT3 ■ard|FEAT3
  3. (mix) I|FEAT1 am|FEAT2 How|XFEAT3 ■ard|FEAT3

(I did not see your latest comment) - the best in that case is to add a feature like what you propose:

  4. (new feature: B, O, I, E) I|O|FEAT1 am|O|FEAT2 How|B|XFEAT3 ■ard|E|FEAT3

I propose to introduce 3 modes (1, 2, 4) to cover these options.
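
The three proposed modes could be sketched roughly as below. This is only an illustration of the idea, not the tokenizer's implementation: the function names and the mode argument are hypothetical, and mode 4 here simply duplicates the word's features after the B/I/E/O tag instead of using a mixed value such as XFEAT3.

```python
def propagate_features(pieces, features, mode):
    """Return one [piece, feature, ...] list per BPE piece of a single word.

    mode 1: non-final pieces get a placeholder feature "X"
    mode 2: every piece gets a copy of the word's features
    mode 4: like mode 2, with a leading B/I/E/O position feature
    """
    annotated = []
    for i, piece in enumerate(pieces):
        is_last = i == len(pieces) - 1
        if mode == 1:
            feats = list(features) if is_last else ["X"] * len(features)
        elif mode == 2:
            feats = list(features)
        elif mode == 4:
            if len(pieces) == 1:
                tag = "O"  # word was not split
            elif i == 0:
                tag = "B"  # first piece of a split word
            elif is_last:
                tag = "E"  # last piece of a split word
            else:
                tag = "I"  # piece in the middle
            feats = [tag] + list(features)
        else:
            raise ValueError("unsupported mode: %r" % mode)
        annotated.append([piece] + feats)
    return annotated


def annotate_sentence(words, mode, delimiter="|"):
    """words is a list of (pieces, features) pairs, one pair per original word."""
    tokens = []
    for pieces, features in words:
        for fields in propagate_features(pieces, features, mode):
            tokens.append(delimiter.join(fields))
    return " ".join(tokens)


sentence = [(["I"], ["FEAT1"]), (["am"], ["FEAT2"]), (["How", "■ard"], ["FEAT3"])]
for mode in (1, 2, 4):
    print(mode, annotate_sentence(sentence, mode))
# 1 I|FEAT1 am|FEAT2 How|X ■ard|FEAT3
# 2 I|FEAT1 am|FEAT2 How|FEAT3 ■ard|FEAT3
# 4 I|O|FEAT1 am|O|FEAT2 How|B|FEAT3 ■ard|E|FEAT3
```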

@jsenellart jsenellart self-assigned this Apr 8, 2018
@howardyclo
Author

@jsenellart Thanks, that will be great!

@howardyclo
Author

howardyclo commented Apr 9, 2018

I wrote an example Python script that addresses my need:

def align_features_for_subword_raw(raw_no_feat_subword, raw_feat,
                                   feature_delimiter = '│',
                                   subword_delimiter = '■',
                                   keep_word=False,
                                   add_subword_indicator_feature=False):

    feature_start_pos = 0 if keep_word else 1
    word_curser = 0
    subwords_cache = []
    subwords_with_features = []
    words_with_features = [words_with_feature.split(feature_delimiter) for words_with_feature in raw_feat.split()]

    for word in raw_no_feat_subword.split():
        # Split words
        subword_with_features = [word] + words_with_features[word_curser][feature_start_pos:]
        subwords = word.split(subword_delimiter)
        
        # The word is not a subword.
        if len(subwords) < 2 and len(subwords_cache) == 0:
            if add_subword_indicator_feature: subword_with_features.append('O')
            word_curser += 1

        # The word is a last subword of a word.
        elif len(subwords) < 2 and len(subwords_cache) > 0:
            if add_subword_indicator_feature: subword_with_features.append('E')
            del subwords_cache[:]

        # The word is a beginning subword of a word.
        elif len(subwords) >= 2 and len(subwords_cache) == 0:
            if add_subword_indicator_feature: subword_with_features.append('B')
            subwords_cache.append(word)
        
        # The word is a middle subword of a word.
        elif len(subwords) >= 2 and len(subwords_cache) > 0:
            if add_subword_indicator_feature: subword_with_features.append('I')
            subwords_cache.append(word)
        
        subwords_with_features.append(subword_with_features)

    # Detokenize back to raw.
    raw_feat_subword = ' '.join([feature_delimiter.join(subword_with_features) for subword_with_features in subwords_with_features])
    return raw_feat_subword

line_no_feat = "Unforturntly , almost older people can not use internet , in spite of benefit of internet ."
line_pos_dep = "Unforturntly│PROPN│advmod ,│PUNCT│punct almost│ADV│advmod older│ADJ│amod people│NOUN│nsubj can│VERB│aux not│ADV│neg use│VERB│ROOT internet│NOUN│dobj ,│PUNCT│punct in│ADP│prep spite│NOUN│pobj of│ADP│prep benefit│NOUN│pobj of│ADP│prep internet│NOUN│pobj .│PUNCT│punct"
line_no_feat_bpe = "Un■ for■ tur■ n■ tly , almost older people can not use internet , in spite of benefit of internet ."

Simply call:

align_features_for_subword_raw(line_no_feat_bpe,
                               line_pos_dep,
                               keep_word=True,
                               add_subword_indicator_feature=True)

returns:

'Un■│Unforturntly│PROPN│advmod│B for■│Unforturntly│PROPN│advmod│I tur■│Unforturntly│PROPN│advmod│I n■│Unforturntly│PROPN│advmod│I tly│Unforturntly│PROPN│advmod│E ,│Unforturntly│PROPN│advmod│O almost│,│PUNCT│punct│O older│almost│ADV│advmod│O people│older│ADJ│amod│O can│people│NOUN│nsubj│O not│can│VERB│aux│O use│not│ADV│neg│O internet│use│VERB│ROOT│O ,│internet│NOUN│dobj│O in│,│PUNCT│punct│O spite│in│ADP│prep│O of│spite│NOUN│pobj│O benefit│of│ADP│prep│O of│benefit│NOUN│pobj│O internet│of│ADP│prep│O .│internet│NOUN│pobj│O'

@LinuxBeginner

Can you please explain what the (B, O, I, E) tags represent?
Is B the beginning of a word (start of a BPE split), O a standalone word (no BPE), I an intermediate subword, and E the final subword?

Also, why are the features of:

"," (,│Unforturntly│PROPN│advmod│O) and not something like (,│,│XX│XX│O)
almost (almost│,│PUNCT│punct│O) and not something like (almost│almost│YY│YY│O)
as per rule 2 mentioned by jsenellart?

Are you trying to attach the features of the previous word to the current word?

What happens if BPE is applied to the remaining words as well, instead of only to the word "Unforturntly"?

Thank you.
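
One likely cause of the shifted features: in the script above, word_curser is advanced only in the branch that handles unsplit words (the 'O' case), not after the last subword of a split word (the 'E' case), so every token following a split word picks up the features of the previous word. Below is a minimal sketch of an alignment that advances the cursor whenever a word ends. It is not the original author's code: the function name is hypothetical, and it assumes that a trailing '■' appears only on non-final BPE pieces, as in the example line.

```python
def align_features(bpe_line, feat_line, feature_delimiter="│",
                   joiner="■", add_position_feature=True):
    """Copy each word's features onto all of its BPE pieces.

    Assumes a piece ending with `joiner` is continued by the next piece,
    and that feat_line has exactly one token per original word.
    """
    word_fields = [tok.split(feature_delimiter) for tok in feat_line.split()]
    out = []
    cursor = 0       # index of the current original word
    inside = False   # True while in the middle of a split word
    for piece in bpe_line.split():
        continues = piece.endswith(joiner)
        if continues:
            tag = "I" if inside else "B"
        else:
            tag = "E" if inside else "O"
        fields = [piece] + word_fields[cursor][1:]
        if add_position_feature:
            fields.append(tag)
        out.append(feature_delimiter.join(fields))
        inside = continues
        if not continues:
            cursor += 1  # the word is complete, whether tagged O or E
    return " ".join(out)


print(align_features("Un■ for■ tur■ n■ tly , al■ most",
                     "Unforturntly│PROPN│advmod ,│PUNCT│punct almost│ADV│advmod"))
```

With this cursor handling, "," receives PUNCT and punct, and a split anywhere else in the sentence (here "al■ most") is tagged B and E with the features of its own word.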
