
get_token_stream() potentially returns token stream of the wrong length when using arguments subset, collapse and beautify #290

Open
ChristophLeonhardt opened this issue Apr 9, 2024 · 0 comments


I noticed that get_token_stream() occasionally returns token streams which are longer than expected when used with the subset argument.

As an example, I have the following subcorpus:

library(polmineR) # v0.8.9.9004
use("polmineR")

sc <- corpus("GERMAPARLMINI") |>
  subset(protocol_date == "2009-10-27") |>
  subset(speaker == "Volker Kauder")

Without subset

I retrieve the token stream for the subcorpus without subsetting it:

chars_with_stopwords <- get_token_stream(sc,
                                         p_attribute = "word",
                                         collapse = " ")

nchar(chars_with_stopwords) # 185

The returned character string is 185 characters long.

With subset

If I repeat the same process but add a subset argument to remove stop words and punctuation, the return value gets longer instead of shorter.

tokens_to_remove <- c(
  tm::stopwords("de"),
  polmineR::punctuation
)

chars_without_stopwords <- get_token_stream(
    sc,
    p_attribute = "word",
    collapse = " ",
    subset = {!word %in% tokens_to_remove}
)

nchar(chars_without_stopwords) # 270
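
For reference, filtering the plain token stream manually before collapsing shows what the subsetted result should look like (a sketch; it assumes get_token_stream() without collapse returns the token stream as a character vector):

# Sketch: drop stop words and punctuation from the plain token stream,
# then collapse manually. The result should be shorter than the
# 185-character string obtained without subsetting, not longer.
tokens <- get_token_stream(sc, p_attribute = "word")
chars_expected <- paste(tokens[!tokens %in% tokens_to_remove], collapse = " ")
nchar(chars_expected) # expected: fewer than 185 characters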

Issue

Looking at get_token_stream(), I think the issue is caused by the combination of the subset, collapse and beautify arguments (beautify is TRUE by default). With these arguments, the following line is the root of the problem:

whitespace <- rep(collapse, times = length(.Object))

The problem is that when tokens are removed via subset, the length of the unmodified input object no longer corresponds to the number of whitespace characters actually needed here. Then, in the final line

tokens <- paste(paste(whitespace, tokens, sep = ""), collapse = "")

whitespace is longer than tokens, so paste() silently recycles the remaining tokens until the length of whitespace is reached, which inflates the result.
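
The recycling can be reproduced with a minimal base R example (hypothetical lengths, chosen for illustration):

# The shorter vector (tokens) is silently recycled against the longer
# one (whitespace), so tokens reappear in the output.
whitespace <- rep(" ", times = 5)  # length of the unmodified input object
tokens <- c("a", "b", "c")         # tokens remaining after subsetting
paste(paste(whitespace, tokens, sep = ""), collapse = "")
#> " a b c a b"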

Potential Fix

If there is no reason to use the length of the unmodified input object here, changing .Object to tokens in the first line quoted above should be sufficient to address this.
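
That is, the line would become (a sketch of the proposed one-line change):

# Proposed fix: size the whitespace vector by the number of tokens that
# remain after subsetting, not by the length of the unmodified object.
whitespace <- rep(collapse, times = length(tokens))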
