How to return tokens matching a dictionary lookup? #2063

Open
kbenoit opened this issue Feb 21, 2021 · 5 comments

@kbenoit
Collaborator

kbenoit commented Feb 21, 2021

This comes from quanteda/quanteda.sentiment#11, which raises a more general question: how can a function return the set of original tokens matching a dictionary lookup, not just by selecting them with tokens_select(), but by returning each match along with its key?

In the issue referred to above, a data.frame output was requested, although this could of course be a list, or a list by document.

Here's a workaround I cobbled together, but it would be more efficient to build this in as a function.

```r
library("quanteda")
## Package version: 2.9.9000
## Unicode version: 10.0
## ICU version: 61.1
## Parallel computing: 12 of 12 threads used.
## See https://quanteda.io for tutorials and examples.

dict <- dictionary(list(
  positive = c("good", "not bad"),
  negative = "not good"
))

toks <- tokens(c(
  d1 = "The good test was not good",
  d2 = "It's not good to be not bad"
))

toks2 <- toks %>%
  tokens_replace(rep(names(dict), lengths(dict)), unlist(dict, use.names = FALSE)) %>%
  tokens_select(dict) %>%
  tokens_compound(dict, concatenator = " ")

data.frame(
  key = as.character(tokens_lookup(toks2, dict, nested_scope = "dictionary")),
  token = as.character(toks2)
)
##        key    token
## 1 positive     good
## 2 negative not good
## 3 negative not good
## 4 positive  not bad
```

Created on 2021-02-21 by the reprex package (v1.0.0)

@koheiw
Collaborator

koheiw commented Feb 21, 2021

```
> kwic(toks, dict)[,c("pattern", "keyword")]
   pattern  keyword
1 positive     good
2 negative not good
3 positive     good
4 negative not good
5 positive     good
6 positive  not bad
```

@kbenoit
Collaborator Author

kbenoit commented Feb 21, 2021

That's a good and quick ("kwic"? 😄) solution! But how would we deal with the nested dictionary issue, so that in d1, we don't match "not good" as "pattern = positive, keyword = good"?
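
One possible workaround with the existing kwic() output is to post-filter it, dropping any match whose token span lies inside a strictly longer match in the same document. This is only a sketch, not a built-in quanteda feature: it assumes the kwic object converts to a data frame with `docname`, `from`, `to`, `pattern` and `keyword` columns (as in recent quanteda releases), and `is_nested` and `res` are hypothetical names introduced here.

```r
library("quanteda")

dict <- dictionary(list(
  positive = c("good", "not bad"),
  negative = "not good"
))
toks <- tokens(c(
  d1 = "The good test was not good",
  d2 = "It's not good to be not bad"
))

# One row per match; `from` and `to` are token positions within the document
# (assumed column names, see the note above)
kw <- as.data.frame(kwic(toks, dict))

# A match is "nested" when a strictly longer match in the same document
# covers its whole span
is_nested <- vapply(seq_len(nrow(kw)), function(i) {
  any(kw$docname == kw$docname[i] &
        kw$from <= kw$from[i] &
        kw$to >= kw$to[i] &
        (kw$to - kw$from) > (kw$to[i] - kw$from[i]))
}, logical(1))

# Keep only the outermost matches
res <- kw[!is_nested, c("docname", "pattern", "keyword")]
res
```

On the example above, this should drop the bare "good" matches that sit inside "not good" while keeping the standalone "good" in d1. The pairwise comparison is quadratic in the number of matches, so it is meant for illustration rather than large corpora.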

@koheiw
Collaborator

koheiw commented Feb 21, 2021

We can consider upgrading the function to behave in a similar way to tokens_lookup().
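
For reference, the tokens_lookup() behaviour in question is controlled by its `nested_scope` argument, which the workaround at the top of this thread already uses. A minimal sketch of the difference, reusing the same example data (`by_key` and `by_dict` are hypothetical names):

```r
library("quanteda")

dict <- dictionary(list(
  positive = c("good", "not bad"),
  negative = "not good"
))
toks <- tokens(c(
  d1 = "The good test was not good",
  d2 = "It's not good to be not bad"
))

# Default scope: nested matches are only resolved within the same key,
# so "good" inside "not good" can still be counted under "positive"
by_key <- tokens_lookup(toks, dict, nested_scope = "key")

# Dictionary scope: nested matches are resolved across all keys,
# so the longer "not good" match takes precedence over the inner "good"
by_dict <- tokens_lookup(toks, dict, nested_scope = "dictionary")

print(by_key)
print(by_dict)
```

An upgraded kwic() would presumably need an equivalent option so that its `pattern` column agrees with what tokens_lookup() returns under dictionary scope.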

@koheiw koheiw mentioned this issue Feb 23, 2021
@kbenoit kbenoit added this to the v4 release milestone Apr 12, 2023
@koheiw
Collaborator

koheiw commented Oct 19, 2023

@kbenoit should we do anything with this for v4.0?

@kbenoit
Collaborator Author

kbenoit commented Nov 6, 2023

It would be good to solve this but unnecessary for 4.0. I'll kick it down the road to 4.1. 🦵🥫
