
Create a function to compound tokens with skips #2104

Open
koheiw opened this issue Apr 13, 2021 · 4 comments

Comments

@koheiw
Collaborator

koheiw commented Apr 13, 2021

Following the discussion in #2102, I created the dev-skipgram2 branch to add skip to tokens_compound(). I managed to make it generate skipgrams, but removing the original tokens of the compounds turned out to be very difficult. I also discovered that we already have window.

require(quanteda)
#> Loading required package: quanteda
#> Package version: 3.0.0
#> Unicode version: 13.0
#> ICU version: 66.1
#> Parallel computing: 1 of 6 threads used.
#> See https://quanteda.io for tutorials and examples.
toks <- tokens("London is not a bad city")
tokens_compound(toks, phrase(c("not *")), skip = 0:5, join = FALSE)
#> 3 4 
#> 3 5 
#> 3 6 
#> 1 2 3 7 8 9 4 5 6 
#> id_comp 10, id_last 6
#> Ngram 3 4 
#> ID 7
#> Ngram 3 6 
#> ID 9
#> Ngram 3 5 
#> ID 8
#> Tokens consisting of 1 document.
#> text1 :
#> [1] "London"   "is"       "not"      "not_a"    "not_bad"  "not_city" "a"       
#> [8] "bad"      "city"
tokens_compound(toks, "not", window = 1, join = FALSE)
#> id_comp 8, id_last 6
#> Ngram 2 3 4 
#> ID 7
#> Tokens consisting of 1 document.
#> text1 :
#> [1] "London"   "is_not_a" "bad"      "city"

Created on 2021-04-13 by the reprex package (v1.0.0)

@koheiw
Collaborator Author

koheiw commented Apr 13, 2021

It would be more useful, and easier to implement, to add skip to window, which would look like:

tokens_compound(toks, "not", window = 2, skip = 1)
#> [1] "London_not"  "not_bad"  "a"  "city"

Here, all the unigrams are removed and skipgrams are created within the window.
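To make that concrete, here is a minimal base-R sketch of one reading of the proposed semantics (the window/skip signature above is hypothetical, not an existing tokens_compound() interface). It only reproduces the skip-compounds themselves; which stand-alone tokens should be kept is discussed below.

toks <- c("London", "is", "not", "a", "bad", "city")
i    <- which(toks == "not")      # position of the target token
skip <- 1
dist <- skip + 1                  # skipping 1 token means joining tokens 2 positions apart
compounds <- c(
  if (i - dist >= 1)            paste(toks[i - dist], toks[i], sep = "_"),
  if (i + dist <= length(toks)) paste(toks[i], toks[i + dist], sep = "_")
)
compounds
#> [1] "London_not" "not_bad"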

@kbenoit
Collaborator

kbenoit commented Apr 13, 2021

That seems to me like a very good way to implement it. Things to think about:

  • Compounding in its plain form converts the elements into a compound, removing the elements as stand-alone tokens. We would want this behaviour to continue with the new functionality. So in the first example above, "not" would no longer exist as a separate token. (In the second example, "not" is (correctly) no longer kept as a separate token.) However, if skip is vectorised and includes 0, we could then keep the original element.
  • We would want to keep tokens that are not compounded. So in the second example, "is" should be retained; the skip is only for the purpose of compounding.
  • For multiple skips, e.g. 1:2, it's more complex, since some elements are part of some compounds but not others. We would probably need a rule that any element used to form a compound is removed as a stand-alone token, or alternatively, that any element not used in every compound is kept as a stand-alone token.
  • If it's not much harder to implement, we could consider making window a two-element vector with before and after sizes. It's a use case that arises from time to time (e.g. https://stackoverflow.com/questions/67055841/fcm-function-of-quanteda-any-possibility-of-selecting-one-side-of-the-window), but so far we have only implemented it as a two-element vector in tokens_select() (see the sketch after this list).
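For reference, the asymmetric window already implemented in tokens_select() looks like this; the expected result in the comment is my reading of the documented before/after behaviour, not verified output.

library(quanteda)
toks <- tokens("London is not a bad city")
tokens_select(toks, "not", window = c(0, 1))
# expected to keep the match plus one token after it, i.e. "not" "a"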

@koheiw
Collaborator Author

koheiw commented Apr 13, 2021

I have no good idea how to remove the original tokens in the first example, so I will leave this branch for the moment.

@kbenoit
Collaborator

kbenoit commented Apr 13, 2021

Any rule: If a token is used in any compound, remove it (as it works now without a skip argument).

All rule: If a token is used in all compounds, remove it. Only applies if length(skip) > 1.

The first already works when skip is not used.
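A small base-R sketch of how the two rules differ, assuming (purely for illustration) that each compound is recorded as the set of token positions it uses; the positions correspond to "London", "not", and "bad" in the window = 2, skip = 1 example above.

positions_used <- list(c(1, 3), c(3, 5))   # e.g. "London_not" and "not_bad"
n_compounds <- length(positions_used)
counts <- table(unlist(positions_used))

# "Any" rule: drop a token used in at least one compound
drop_any <- as.integer(names(counts))

# "All" rule: drop a token only if it is used in every compound
drop_all <- as.integer(names(counts)[counts == n_compounds])

drop_any
#> [1] 1 3 5
drop_all
#> [1] 3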
