How to return tokens matching a dictionary lookup? #2063

kbenoit · 2021-02-21T11:57:06Z

This comes from quanteda/quanteda.sentiment#11, which is a more general question about how a function can return the set of original tokens matching a dictionary lookup, not just using tokens_select(), but rather returning the matches along with each key.

In the issue referred to above, a data.frame output was requested, although this could of course be a list, or a list by document.

Here's a workaround I cobbled together, but it would be more efficient to build this in as a function.

library("quanteda")
## Package version: 2.9.9000
## Unicode version: 10.0
## ICU version: 61.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.

dict <- dictionary(list(
  positive = c("good", "not bad"),
  negative = "not good"
))

toks <- tokens(c(
  d1 = "The good test was not good",
  d2 = "It's not good to be not bad"
))

toks2 <- toks %>%
  tokens_replace(rep(names(dict), lengths(dict)), unlist(dict, use.names = FALSE)) %>%
  tokens_select(dict) %>%
  tokens_compound(dict, concatenator = " ")

data.frame(
  key = as.character(tokens_lookup(toks2, dict, nested_scope = "dictionary")),
  token = as.character(toks2)
)
##        key    token
## 1 positive     good
## 2 negative not good
## 3 negative not good
## 4 positive  not bad

Created on 2021-02-21 by the reprex package (v1.0.0)


koheiw · 2021-02-21T12:20:27Z

> kwic(toks, dict)[,c("pattern", "keyword")]
   pattern  keyword
1 positive     good
2 negative not good
3 positive     good
4 negative not good
5 positive     good
6 positive  not bad

kbenoit · 2021-02-21T12:33:40Z

That's a good and quick ("kwic"? 😄) solution! But how would we deal with the nested dictionary issue, so that in d1, we don't match "not good" as "pattern = positive, keyword = good"?
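One possible way to handle this on top of the kwic() output, offered as a minimal sketch rather than a built-in feature: drop any match whose token span sits inside a longer match in the same document. It assumes the kwic result converts to a data.frame with `docname`, `from`, `to`, `pattern` and `keyword` columns; `is_nested` and `flat` are illustrative names, not part of quanteda.

```r
library("quanteda")

dict <- dictionary(list(
  positive = c("good", "not bad"),
  negative = "not good"
))

toks <- tokens(c(
  d1 = "The good test was not good",
  d2 = "It's not good to be not bad"
))

# One row per match, with token positions in `from` and `to`
kw <- as.data.frame(kwic(toks, dict))

# TRUE when match i lies inside a strictly longer match in the same document
is_nested <- vapply(seq_len(nrow(kw)), function(i) {
  any(kw$docname == kw$docname[i] &
      kw$from <= kw$from[i] &
      kw$to >= kw$to[i] &
      (kw$to - kw$from) > (kw$to[i] - kw$from[i]))
}, logical(1))

# Keep only the longest match at each position
flat <- kw[!is_nested, c("docname", "pattern", "keyword")]
flat
```

On the example above this should leave only the "good" in d1 that is not part of "not good", plus the three two-word matches. The pairwise comparison is quadratic in the number of matches, so it is only meant for small inputs.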

koheiw · 2021-02-21T13:02:37Z

We can consider upgrading the function to behave in a similar way to tokens_lookup().
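For reference, the behaviour being pointed to can be seen by comparing the two `nested_scope` settings of tokens_lookup(). This is a small sketch that assumes a quanteda version where the `nested_scope` argument is available (it is used in the reprex above); `by_key` and `by_dict` are illustrative names.

```r
library("quanteda")

dict <- dictionary(list(
  positive = c("good", "not bad"),
  negative = "not good"
))

toks <- tokens(c(
  d1 = "The good test was not good",
  d2 = "It's not good to be not bad"
))

# Default scope: longest-match priority applies only within each key,
# so "good" inside "not good" is still counted as positive
by_key <- tokens_lookup(toks, dict, nested_scope = "key")

# Dictionary scope: longest-match priority applies across keys,
# so "not good" is matched only as negative
by_dict <- tokens_lookup(toks, dict, nested_scope = "dictionary")

as.list(by_key)
as.list(by_dict)
```

The result holds only the keys, not the original tokens, which is the gap this issue is about.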

koheiw · 2023-10-19T07:43:25Z

@kbenoit should we do anything with this for v4.0?

kbenoit · 2023-11-06T18:20:44Z

It would be good to solve this but unnecessary for 4.0. I'll kick it down the road to 4.1. 🦵🥫

kbenoit added question dictionary labels Feb 21, 2021

kbenoit assigned kbenoit and koheiw Feb 21, 2021

kbenoit mentioned this issue Feb 21, 2021

Feature Request: parameter to return tokens with polarity quanteda/quanteda.sentiment#11

Closed

koheiw mentioned this issue Feb 23, 2021

Issue 1840 #2045

Merged

kbenoit added this to the v4 release milestone Apr 12, 2023

kbenoit modified the milestones: v4 release, v4.1 Nov 6, 2023

koheiw mentioned this issue Dec 7, 2023

Upgrading tokens_replace() to keep tokens and keys togather #2324

Closed
