Keeping matches in which a stop word only occurs in the node is not reliable #289

ChristophLeonhardt · 2024-03-26T11:22:58Z

Issue

You can use the argument stoplist in kwic() or context() to drop matches based on a list of terms which must not occur in the context window of a query. This uses trim(). trim() checks whether a stop word occurs in a match and drops observations accordingly. While doing so, it also makes sure that the stop word occurs in the context and not in the node itself.

I think that this last mechanism does not work as expected. There seems to be a mismatch between the index created when identifying potential stop words in the match and the index used to determine which of the stop words are in the node instead of the context.

This refers to the following lines of code:

polmineR/R/trim.R

Lines 231 to 236 in 650c75f

    
.fn <- function(.SD){
  p_attr <- paste(p_attribute[1], "id", sep = "_")
  negatives <- which(.SD[[p_attr]] %in% stoplist_ids)
  negatives <- negatives[ -which(.SD[["position"]] == 0) ] # exclude node
  if (any(negatives)) return( NULL ) else return( .SD ) # this is the only difference
}

As a consequence, two kinds of errors occur:

1. Matches are dropped although the stop word occurs only in the node.
2. Matches are kept although the stop word occurs in the context window and not in the node.

Example for Scenario 1

In the following example with GERMAPARLMINI, the result "Integrationspolitik" gets filtered out by the stoplist, although the stop word should only be applied to the context window and not to the node itself.

library(polmineR) # v0.8.9.9004
use("polmineR")

kwic("GERMAPARLMINI",
     query = '"Integration.*"',
     stoplist = ".*[Pp]olitik",
     regex = TRUE)

Example for Scenario 2

The second scenario seems to occur quite rarely, so the example is a bit artificial. It does happen, though.

We see that in the following example, two hits are returned. However, the second hit should have been dropped due to the stop word.

kwic("GERMAPARLMINI",
     query = '"Morgen"',
     positivelist = "in",
     stoplist = "der",
     cqp = TRUE,
     regex = TRUE)

Probable Cause

These hits are not filtered correctly because the two index vectors in the chunk quoted above do not refer to the same thing. negatives holds the row numbers of stop words in the cpos table, and which(.SD[["position"]] == 0) holds the row numbers of the node. The negative subsetting, however, removes elements of negatives by their position within negatives, not by their value, so the node rows are not reliably excluded. In the first scenario, the check falsely concludes that the stop word occurs in the context window, while in the second scenario, it falsely concludes that the stop word occurs within the node.
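The mismatch can be reproduced without polmineR. The following is a minimal sketch on plain vectors, not the package code: buggy_keep() is a hypothetical helper that mimics the quoted check, and the position values and row order are made up for illustration.

```r
# Toy re-implementation of the quoted check on plain vectors (not polmineR code).
# `position` mimics the position column; `is_stop` marks rows holding a stop word.
buggy_keep <- function(position, is_stop) {
  negatives <- which(is_stop)                    # row indices of stop words
  negatives <- negatives[-which(position == 0)]  # drops ELEMENTS of `negatives`, not rows
  # The original calls any() on these indices; for positive integers that is
  # TRUE whenever at least one index is left, so the length is tested here.
  length(negatives) == 0                         # TRUE = match is kept
}

# Scenario 1: stop word only in the node (row 3), yet the match is dropped.
# which(is_stop) is 3, and negatives[-3] removes nothing from a length-1 vector.
scenario_1 <- buggy_keep(position = c(-2, -1, 0, 1, 2),
                         is_stop  = c(FALSE, FALSE, TRUE, FALSE, FALSE))

# Scenario 2: node in row 1 (assumed reordered table), stop word in the context (row 4).
# negatives[-1] removes the only stop word index, so the match is kept.
scenario_2 <- buggy_keep(position = c(0, -2, -1, 1, 2),
                         is_stop  = c(FALSE, FALSE, FALSE, TRUE, FALSE))

print(c(scenario_1 = scenario_1, scenario_2 = scenario_2))
```

In this sketch, scenario_1 is FALSE (the match is wrongly dropped) and scenario_2 is TRUE (the match is wrongly kept), which corresponds to the two scenarios described above.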

Possible solution

I assume that it would suffice to omit the which() when creating the negatives vector. So instead of

negatives <- which(.SD[[p_attr]] %in% stoplist_ids)

This might work:

negatives <- .SD[[p_attr]] %in% stoplist_ids

This way, negatives would be a logical vector with one element per row of .SD, i.e. c(TRUE, FALSE, TRUE, TRUE, ...). Subsetting it as in the chunk above then removes exactly the node rows, and the result can be evaluated by the final any().

This should also work regardless of the row order, because the p_attr column and the position column share the same order. Row order matters here because applying a positivelist can change the order of rows in the cpos table.
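A minimal sketch of the proposed change on plain vectors, assuming the same toy data layout as the cpos table (keep_match_fixed() is a hypothetical helper, not part of polmineR):

```r
# Proposed variant: keep `negatives` as a logical vector aligned with the rows.
keep_match_fixed <- function(position, is_stop) {
  negatives <- is_stop                           # logical, one element per row
  negatives <- negatives[-which(position == 0)]  # now removes the node ROWS
  !any(negatives)                                # TRUE = match is kept
}

# Stop word only in the node (row 3): the match is kept.
node_only <- keep_match_fixed(position = c(-2, -1, 0, 1, 2),
                              is_stop  = c(FALSE, FALSE, TRUE, FALSE, FALSE))

# Node in row 1 (assumed reordered table), stop word in the context: dropped.
in_context <- keep_match_fixed(position = c(0, -2, -1, 1, 2),
                               is_stop  = c(FALSE, FALSE, FALSE, TRUE, FALSE))

print(c(node_only = node_only, in_context = in_context))
```

One caveat of this sketch: if a table contained no row with position == 0, -which(...) would be an empty index and the subset would be empty as well. Subsetting with the mask position != 0 instead would avoid that edge case, although every match should normally contain a node.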

Discussion

In the first scenario, I think that dropping hits when a stop word occurs in the node instead of in the context could also be considered acceptable behavior. By allowing this, you would avoid the need to write more complex and potentially slow CQP queries or regular expressions such as negative look-aheads, etc. However, the documentation says that only tokens in the context window are considered, so the results should be consistent with that.
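To illustrate the kind of negative look-ahead meant here, the following sketch filters node candidates in base R with a Perl-compatible regular expression. The token vector is made up, and whether CQP accepts the same pattern is not checked here.

```r
# Made-up node candidates for the query "Integration.*".
tokens <- c("Integration", "Integrationspolitik",
            "Integrationskurs", "Integrationspolitiker")

# Match tokens starting with "Integration" unless the whole token ends in
# "Politik" or "politik". The look-ahead (?!...) requires perl = TRUE.
pattern <- "^Integration(?!.*[Pp]olitik$).*$"
kept <- tokens[grepl(pattern, tokens, perl = TRUE)]

print(kept)
```

Such a pattern excludes "Integrationspolitik" at the query level, which is the workaround a user would need if stop words are never applied to the node.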

The behavior in the second scenario, by contrast, is unexpected and should be addressed in any case.
