Keeping matches in which a stop word only occurs in the node is not reliable #289

Open
ChristophLeonhardt opened this issue Mar 26, 2024 · 0 comments

Issue

You can use the argument stoplist in kwic() or context() to drop matches based on a list of terms which must not occur in the context window of a query. This uses trim(). trim() checks whether a stop word occurs in a match and drops observations accordingly. While doing so, it also makes sure that the stop word occurs in the context and not in the node itself.

I think that this last mechanism does not work as expected. There seems to be a mismatch between the index created when identifying potential stop words in the match and the index used to determine which of the stop words are in the node instead of the context.

This refers to the following lines of code:

polmineR/R/trim.R

Lines 231 to 236 in 650c75f

.fn <- function(.SD){
  p_attr <- paste(p_attribute[1], "id", sep = "_")
  negatives <- which(.SD[[p_attr]] %in% stoplist_ids)
  negatives <- negatives[ -which(.SD[["position"]] == 0) ] # exclude node
  if (any(negatives)) return( NULL ) else return( .SD ) # this is the only difference
}

In consequence, this leads to

  1. matches being dropped although the stop word occurs only in the node
  2. matches being kept although the stop word occurs in the context window and not in the node

Example for Scenario 1

In the following example with GERMAPARLMINI, the result "Integrationspolitik" gets filtered out by the stoplist, although the stop word should only be applied to the context window, not to the node itself.

library(polmineR) # v0.8.9.9004
use("polmineR")

kwic("GERMAPARLMINI",
     query = '"Integration.*"',
     stoplist = ".*[Pp]olitik",
     regex = TRUE)

Example for Scenario 2

The second scenario seems to occur quite rarely, so the example is a bit artificial. It does happen, though.

We see that in the following example, two hits are returned. However, the second hit should have been dropped due to the stop word.

kwic("GERMAPARLMINI",
     query = '"Morgen"',
     positivelist = "in",
     stoplist = "der",
     cqp = TRUE,
     regex = TRUE)

Probable Cause

These hits are not filtered correctly because the vector indicating the position of stop words in the cpos table (i.e. negatives in the chunk quoted above) does not reliably align with the position of the nodes (i.e. the rows for which position == 0). In the first scenario, the check falsely assumes that the stop word occurred in the context window, while in the second scenario, the check suggests that the stop word occurred within the node.
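To make the misalignment concrete, here is a minimal sketch in base R. The toy tables, the column name word_id and the token ids are made up for illustration and are not taken from GERMAPARLMINI; they only mimic the two columns used in the chunk quoted above. The second table is one possible constellation that produces scenario 2, not necessarily the one behind the "Morgen" example.

```r
# Scenario 1: the stop word (id 99) occurs only in the node (position 0).
ctx1 <- data.frame(
  position = c(-2L, -1L, 0L, 1L, 2L),
  word_id  = c(11L, 12L, 99L, 13L, 14L)  # hypothetical token ids
)
stoplist_ids <- 99L

negatives1 <- which(ctx1[["word_id"]] %in% stoplist_ids)  # row index of stop word
node_rows1 <- which(ctx1[["position"]] == 0)              # row index of node

# The negative subscript removes *elements* of negatives1 by their position
# in that vector, not by their value. negatives1 has a single element, so
# dropping element 3 removes nothing and the stop word index survives.
remaining1 <- negatives1[-node_rows1]

# Scenario 2: the node is the first row, the stop word is in the context.
ctx2 <- data.frame(
  position = c(0L, 1L, 2L),
  word_id  = c(50L, 11L, 99L)
)
negatives2 <- which(ctx2[["word_id"]] %in% stoplist_ids)
node_rows2 <- which(ctx2[["position"]] == 0)

# Dropping element 1 removes the only stop word index, although that index
# points to a context row, so nothing is left for any() to flag.
remaining2 <- negatives2[-node_rows2]
```

In the first case the match would be dropped because remaining1 is non-empty; in the second case it would be kept because remaining2 is empty.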

Possible solution

I assume that it would suffice to omit the which() when creating the negatives vector. So instead of

negatives <- which(.SD[[p_attr]] %in% stoplist_ids)

This might work:

negatives <- .SD[[p_attr]] %in% stoplist_ids

This way, negatives would be a logical vector with one element per row of .SD, i.e. c(TRUE, FALSE, TRUE, TRUE, ...), and subsetting it as in the chunk above should result in a vector that can be evaluated by the final any().

This should also work regardless of the row order, as the p_attr column and the position column share the same order. Row order can be an issue if a positivelist is applied, since this changes the order in the cpos table.
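The proposed change can be checked on toy data with the following sketch. The helper name has_context_stopword, the column name word_id and the token ids are hypothetical and not part of polmineR; the function only isolates the two lines under discussion and has not been run against the package itself.

```r
# Hypothetical helper (not polmineR API): TRUE if a stop word occurs in a
# row other than the node, i.e. the match should be dropped.
has_context_stopword <- function(.SD, p_attr, stoplist_ids) {
  negatives <- .SD[[p_attr]] %in% stoplist_ids            # logical, one per row
  negatives <- negatives[-which(.SD[["position"]] == 0)]  # exclude node rows
  any(negatives)
}

stoplist_ids <- 99L  # made-up token id of the stop word

# Stop word only in the node: the match should be kept.
node_only <- data.frame(position = c(-1L, 0L, 1L),
                        word_id  = c(11L, 99L, 12L))

# Stop word in the context, node in the first row: the match should be dropped.
in_context <- data.frame(position = c(0L, 1L, 2L),
                         word_id  = c(50L, 11L, 99L))

# Rows in a different order, stop word in the context: still dropped.
reordered <- data.frame(position = c(1L, 0L, -1L),
                        word_id  = c(99L, 50L, 11L))

res_node_only  <- has_context_stopword(node_only,  "word_id", stoplist_ids)
res_in_context <- has_context_stopword(in_context, "word_id", stoplist_ids)
res_reordered  <- has_context_stopword(reordered,  "word_id", stoplist_ids)
```

One caveat of keeping the negative which() subscript: if a table ever contained no row with position == 0, the subscript would be empty and all elements would be removed. A variant such as any(negatives & .SD[["position"]] != 0) avoids that edge case, assuming the position column has no missing values.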

Discussion

In the first scenario, I think that dropping hits when a stop word occurs in the node instead of in the context could also be considered acceptable behavior. By allowing this, you would avoid the need to write more complex and potentially slow CQP queries or regular expressions such as negative look-aheads, etc. However, the documentation says that only tokens in the context window are considered, so the results should be consistent with that.

The behavior in the second scenario, however, seems unexpected and should be addressed in any case.
