Issue
You can use the argument stoplist in kwic() or context() to drop matches based on a list of terms which must not occur in the context window of a query. This uses trim(). trim() checks whether a stop word occurs in a match and drops observations accordingly. While doing so, it is also meant to make sure that the stop word occurs in the context and not in the node itself.
I think that this last mechanism does not work as expected. There seems to be a mismatch between the index created when identifying potential stop words in the match and the index used to determine which of the stop words are in the node instead of the context.
This refers to the following lines of code:
polmineR/R/trim.R
Lines 231 to 236 in 650c75f
```r
if (any(negatives)) return( NULL ) else return( .SD ) # this is the only difference
}
```
In consequence, this leads to
1. matches being dropped although the stop word occurs only in the node
2. matches being kept although the stop word occurs in the context window and not in the node
Example for Scenario 1
In the following example with GERMAPARLMINI, the result "Integrationspolitik" gets filtered out by the stoplist although the stop word should only be applied to the context window, not the node itself.
Example for Scenario 2
The second scenario seems to occur quite rarely, so the example is a bit artificial. It does happen, though.
We see that in the following example, two hits are returned. However, the second hit should have been dropped due to the stop word.
Probable Cause
These hits are not filtered correctly because the vector indicating the position of stop words in the cpos table (i.e. negatives in the chunk quoted above) does not reliably align with the position of the nodes (i.e. those rows for which position == 0). In the first scenario, the check falsely assumes that the stop words occurred in the context window while in the second scenario, the check suggests that the stop words occurred within the node.
Possible solution
I assume that it would suffice to omit the which() when creating the negatives vector. So instead of
This might work:
This way, negatives would be a logical vector with one element per row of .SD, i.e. c(TRUE, FALSE, TRUE, TRUE, ...) and subsetting it like in the chunk above should result in a vector that can be evaluated by this final any().
This should also work regardless of the row order as the p_attr column and the position column share the same order (which can be an issue if a positivelist is applied which changes the order in the cpos table).
Discussion
In the first scenario, I think that dropping hits when a stop word occurs in the node instead of in the context could also be considered acceptable behavior. By allowing this, you would avoid the need to write more complex and potentially slow CQP queries or regular expressions such as negative look-aheads, etc. However, the documentation says that only tokens in the context window are considered, so the results should be consistent with that.
For the second scenario, this seems to be unexpected and should be addressed in any case.
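The misalignment described under Probable Cause can be reproduced on a toy table without polmineR. The sketch below is a minimal illustration in base R, not the package code. It assumes that the current check builds an integer index with which() and then removes the node rows by negative indexing into that index vector. The function names drop_by_index() and drop_by_logical(), the column name word and the toy tokens are hypothetical.

```r
# Assumed shape of the current check (hypothetical reconstruction):
# integer row indices of stop words, then node rows removed by negative
# indexing into that (usually much shorter) index vector.
drop_by_index <- function(ctx, stoplist) {
  idx <- which(ctx$word %in% stoplist)
  node_rows <- which(ctx$position == 0L)
  idx <- idx[-node_rows]   # removes *elements* of idx, not row numbers
  length(idx) > 0          # stand-in for the final any()
}

# Proposed variant: a logical vector aligned row by row with position.
drop_by_logical <- function(ctx, stoplist) {
  is_stop <- ctx$word %in% stoplist
  any(is_stop[ctx$position != 0L])
}

# Scenario 1: the stop word occurs only in the node (row 3).
ctx1 <- data.frame(
  position = c(-2L, -1L, 0L, 1L, 2L),
  word = c("die", "neue", "Integrationspolitik", "der", "Regierung"),
  stringsAsFactors = FALSE
)
drop_by_index(ctx1, "Integrationspolitik")    # TRUE: hit is dropped
drop_by_logical(ctx1, "Integrationspolitik")  # FALSE: hit is kept

# Scenario 2: the node is the first row, the stop word is in the context.
ctx2 <- data.frame(
  position = c(0L, 1L, 2L),
  word = c("Integration", "und", "Asyl"),
  stringsAsFactors = FALSE
)
drop_by_index(ctx2, "Asyl")    # FALSE: hit is kept
drop_by_logical(ctx2, "Asyl")  # TRUE: hit is dropped
```

In the first table, the node sits in row 3 but the index vector has only one element, so the negative index removes nothing. In the second table, the node sits in row 1, so the negative index removes the only entry of the index vector even though that entry points to a context row.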