
Create a function to compound tokens with skips #2104

Open
koheiw opened this issue Apr 13, 2021 · 4 comments

Comments

@koheiw
Collaborator

koheiw commented Apr 13, 2021

Following the discussion in #2102, I created the dev-skipgram2 branch to add skip to tokens_compound(). I managed to make it generate skipgrams, but removing the original tokens of the compounds turned out to be very difficult. I also discovered that we already have window.

require(quanteda)
#> Loading required package: quanteda
#> Package version: 3.0.0
#> Unicode version: 13.0
#> ICU version: 66.1
#> Parallel computing: 1 of 6 threads used.
#> See https://quanteda.io for tutorials and examples.
toks <- tokens("London is not a bad city")
tokens_compound(toks, phrase(c("not *")), skip = 0:5, join = FALSE)
#> 3 4 
#> 3 5 
#> 3 6 
#> 1 2 3 7 8 9 4 5 6 
#> id_comp 10, id_last 6
#> Ngram 3 4 
#> ID 7
#> Ngram 3 6 
#> ID 9
#> Ngram 3 5 
#> ID 8
#> Tokens consisting of 1 document.
#> text1 :
#> [1] "London"   "is"       "not"      "not_a"    "not_bad"  "not_city" "a"       
#> [8] "bad"      "city"
tokens_compound(toks, "not", window = 1, join = FALSE)
#> id_comp 8, id_last 6
#> Ngram 2 3 4 
#> ID 7
#> Tokens consisting of 1 document.
#> text1 :
#> [1] "London"   "is_not_a" "bad"      "city"

Created on 2021-04-13 by the reprex package (v1.0.0)

@koheiw
Collaborator Author

koheiw commented Apr 13, 2021

It would be more useful, and easier to implement, to add skip to window, which would look like:

tokens_compound(toks, "not", window = 2, skip = 1)
#> [1] "London_not"  "not_bad"  "a"  "city"

Here, all the unigrams are removed and skipgrams are created within the window.
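To make that concrete, here is a minimal base-R sketch of one reading of the proposed semantics (the window/skip signature above is hypothetical, not an existing tokens_compound() interface). It only reproduces the skip-compounds themselves; which stand-alone tokens should be kept is discussed below.

toks <- c("London", "is", "not", "a", "bad", "city")
i    <- which(toks == "not")      # position of the target token
skip <- 1
dist <- skip + 1                  # skipping 1 token means joining tokens 2 positions apart
compounds <- c(
  if (i - dist >= 1)            paste(toks[i - dist], toks[i], sep = "_"),
  if (i + dist <= length(toks)) paste(toks[i], toks[i + dist], sep = "_")
)
compounds
#> [1] "London_not" "not_bad"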

@kbenoit
Collaborator

kbenoit commented Apr 13, 2021

That seems to me like a very good way to implement it. Things to think about:

  • Compounding in its plain form converts the elements into a compound, removing the elements as stand-alone tokens. We would want this behaviour to continue with the new functionality. So in the first example above, "not" would no longer exist as a separate token. (In the second example, "not" is (correctly) no longer kept as a separate token.) However, if skip is vectorised and includes 0, we could then keep the original element.
  • We would want to keep tokens that are not compounded. So in the second example, "is" should be retained; the skip is only for the purpose of compounding.
  • For multiple skips, e.g. 1:2, it's more complex, since some elements are part of some compounds but not others. We would probably need a rule that any element used to form a compound is removed as a stand-alone token, or alternatively, that any element not used in every compound is kept as a stand-alone token.
  • If it's not much harder to implement, we could consider making window a two-element vector with before and after sizes. It's a use case that arises from time to time (e.g. https://stackoverflow.com/questions/67055841/fcm-function-of-quanteda-any-possibility-of-selecting-one-side-of-the-window), but so far we have only implemented it as a two-element vector in tokens_select() (see the sketch after this list).
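For reference, the asymmetric window already implemented in tokens_select() looks like this; the expected result in the comment is my reading of the documented before/after behaviour, not verified output.

library(quanteda)
toks <- tokens("London is not a bad city")
tokens_select(toks, "not", window = c(0, 1))
# expected to keep the match plus one token after it, i.e. "not" "a"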

@koheiw
Collaborator Author

koheiw commented Apr 13, 2021

I have no good idea how to remove the original tokens in the first example, so I will leave this branch for the moment.

@kbenoit
Collaborator

kbenoit commented Apr 13, 2021

Any rule: If a token is used in any compound, remove it (as it works now without a skip argument).

All rule: If a token is used in all compounds, remove it. Only applies if length(skip) > 1.

The first already works when skip is not used.
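A small base-R sketch of how the two rules differ, assuming (purely for illustration) that each compound is recorded as the set of token positions it uses; the positions correspond to "London", "not", and "bad" in the window = 2, skip = 1 example above.

positions_used <- list(c(1, 3), c(3, 5))   # e.g. "London_not" and "not_bad"
n_compounds <- length(positions_used)
counts <- table(unlist(positions_used))

# "Any" rule: drop a token used in at least one compound
drop_any <- as.integer(names(counts))

# "All" rule: drop a token only if it is used in every compound
drop_all <- as.integer(names(counts)[counts == n_compounds])

drop_any
#> [1] 1 3 5
drop_all
#> [1] 3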
