decoding to AnnotatedPlainTextDocument fails when all tokens are removed by stopword list #291

Open
ChristophLeonhardt opened this issue Apr 9, 2024 · 0 comments


Scenario

I want to decode a document to an AnnotatedPlainTextDocument using a list of stopwords. If all tokens are removed when doing so, the process fails.

Example

As a minimal reproducible example, consider the following subcorpus:

library(polmineR)
use("polmineR")

x <- corpus("GERMAPARLMINI") |>
  subset(speaker == "Gerda Hasselfeldt") |>
  subset(protocol_date == "2009-11-11") |>
  subset(interjection == "speech") |>
  as.speeches(s_attribute_name = "speaker", s_attribute_date = "protocol_date", gap = 0) |>
  _[[15]]

(This subcorpus is chosen because it is very short.)

Now let's assume that we want to decode the subcorpus to an AnnotatedPlainTextDocument while removing stopwords:

tokens_to_remove <- c(
  "Bitte",
  "sehr",
  polmineR::punctuation
)

This fails because all tokens are removed:

doc <- decode(
  x,
  to = "AnnotatedPlainTextDocument",
  p_attributes = "word",
  mw = NULL,
  stoplist = tokens_to_remove,
  verbose = FALSE
)

Issue

The initial issue is that the data.table ts becomes empty if the stoplist is applied:

if (!is.null(stoplist)) ts <- ts[!ts[["word"]] %in% stoplist]
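
To illustrate the filtering step in isolation, here is a small sketch that does not touch polmineR at all. The toy table and its columns (`cpos`, `word`) are assumptions standing in for the internal `ts` object; only the filter expression is taken from the line quoted above.

```r
library(data.table)

# Toy stand-in for the internal token stream (column layout is an assumption).
ts <- data.table(cpos = 0:2, word = c("Bitte", "sehr", "."))

# Stoplist that happens to cover every token in the toy table.
stoplist <- c("Bitte", "sehr", ".")

# Same filter expression as in the decode() code quoted above.
ts_filtered <- if (!is.null(stoplist)) ts[!ts[["word"]] %in% stoplist] else ts

nrow(ts_filtered)   # 0: the table keeps its columns but has no rows left
names(ts_filtered)  # "cpos" "word"
```

Everything downstream then has to cope with a zero-row table, which is presumably where the annotation step breaks.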

This results in an error later, when the annotation object is created, since some slots in the object are not empty.

Discussion

I assume that the obvious part of the solution is to check whether ts is empty (i.e. whether nrow(ts) == 0L) after applying the list of stopwords. However, I am not sure what should be returned here.
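
A minimal sketch of such a check, assuming the filtering could be wrapped in a helper. `apply_stoplist()` is a hypothetical name and not part of polmineR, and returning `NULL` with a warning is only one of the two options discussed here, not a recommendation:

```r
library(data.table)

# Hypothetical helper (not a polmineR function): applies the stoplist and
# bails out early if no tokens are left.
apply_stoplist <- function(ts, stoplist = NULL) {
  if (!is.null(stoplist)) ts <- ts[!ts[["word"]] %in% stoplist]
  if (nrow(ts) == 0L) {
    warning("all tokens removed by stoplist, returning NULL")
    return(NULL)
  }
  ts
}

# Toy token table; the column layout is an assumption.
tokens <- data.table(cpos = 0:2, word = c("Bitte", "sehr", "."))

# Every token is on the stoplist: warning and NULL instead of a later error.
res_empty <- suppressWarnings(apply_stoplist(tokens, c("Bitte", "sehr", ".")))

# Only punctuation is removed: two rows remain.
res_some <- apply_stoplist(tokens, ".")
```

Callers would then need to handle `NULL` explicitly, e.g. with `if (is.null(res_empty)) ...`, which is exactly the compatibility question raised in the next paragraph.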

Normally, the return value would be an annotation object. Is returning NULL compatible with the usual workflows here or would it be better to return an empty AnnotatedPlainTextDocument instead?

This is somewhat related to issue #285.
