Filtering words composed of more than 1 token #4

Open
mataney opened this issue Dec 9, 2019 · 5 comments

mataney commented Dec 9, 2019

Hi, thanks for the great work.

I see that you are filtering out words that are composed of more than one token:

    single_bow = list(filter(lambda x: len(x) <= 1, single_bow))

This filters out quite a few words (including all terms made up of more than one word).

Do you have any idea how to deal with this when we want to use these multi-token words?

Cheers.

dathath commented Dec 10, 2019

I think one option would be to compute the probability of multiple tokens being generated and use it the same way the single-token probability is used.

Let's say a word splits into two tokens s1, s2: instead of p(w|x) in equation 5, you could replace it with p(s1|x) * p(s2|s1, x), and I suspect everything else should work as is.

I haven't tested this; if you have any luck with it, let us know. Alternatively, I plan on testing it at some point soon and can get back (and will update the code appropriately).
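
A minimal sketch of this idea (not from the PPLM repo; the model choice, the helper name multi_token_log_prob, and the example strings are illustrative assumptions) that chains the per-token conditionals p(s1|x) * p(s2|s1, x) * ... to score a multi-token word with GPT-2:

    import torch
    import torch.nn.functional as F
    from transformers import GPT2LMHeadModel, GPT2Tokenizer

    def multi_token_log_prob(model, context_ids, word_ids):
        # log p(w | x) for a word split into sub-tokens s1, s2, ...:
        # log p(s1 | x) + log p(s2 | x, s1) + ...
        ids = context_ids.clone()
        total = 0.0
        for tok in word_ids:
            logits = model(ids)[0]                       # (1, seq_len, vocab)
            log_probs = F.log_softmax(logits[0, -1], dim=-1)
            total = total + log_probs[tok]
            # condition the next step on the sub-token we just scored
            ids = torch.cat([ids, torch.tensor([[tok]])], dim=1)
        return total

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
    model = GPT2LMHeadModel.from_pretrained("gpt2")
    model.eval()

    context_ids = tokenizer.encode("The scientist studied", return_tensors="pt")
    word_ids = tokenizer.encode(" astrophysics", add_special_tokens=False)

    with torch.no_grad():
        print(multi_token_log_prob(model, context_ids, word_ids).item())

The resulting log-probability could then be plugged in wherever equation 5 currently uses the single-token probability; whether the gradient-based update behaves well with the chained product is exactly what would need testing.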

@dathath dathath self-assigned this Dec 10, 2019

monkdou0 commented May 6, 2021

    bow_indices.append(
        [tokenizer.encode(word.strip(),
                          add_prefix_space=True,
                          add_special_tokens=False)
         for word in words])

I tried to run this code, and all words are composed of more than one token.
I think this is because of add_prefix_space=True.
Did I do something wrong?
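
A quick way to check (a sketch, not part of the repo; the word list is made up) is to compare how the GPT-2 tokenizer splits each bag-of-words entry with and without a leading space, which is effectively what add_prefix_space=True adds:

    from transformers import GPT2Tokenizer

    tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

    for word in ["science", "computer", "astrophysics"]:
        no_space = tokenizer.encode(word, add_special_tokens=False)
        with_space = tokenizer.encode(" " + word, add_special_tokens=False)
        # print the ids so you can see which entries split into
        # more than one sub-word id, with and without the prefix space
        print(f"{word!r}: {no_space} vs {with_space}")

If entries still come back as multiple ids even with the leading space, the length-1 filter quoted at the top of this issue will drop them.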

@vaibhavvarshney0

Hi,
any update on this?

@janleemark

Hi,
Is there any implementation for phrases (or words composed of more than one token)?

@yananchen1989

    bow_indices.append(
        [tokenizer.encode(word.strip(),
                          add_prefix_space=True,
                          add_special_tokens=False)
         for word in words])

I tried to run this code, and all words are composed of more than one token. I think this is because of add_prefix_space=True. Did I do something wrong?

@monkdou0

[screenshot of tokenizer output]
Setting add_prefix_space to True does not split a word into more token ids.
