decoding to AnnotatedPlainTextDocument fails when all tokens are removed by stopword list #291

ChristophLeonhardt · 2024-04-09T17:53:03Z

Scenario

I want to decode a document to an AnnotatedPlainTextDocument using a list of stopwords. If all tokens are removed when doing so, the process fails.

Example

As a minimal reproducible example, consider the following subcorpus:

library(polmineR)
use("polmineR")

x <- corpus("GERMAPARLMINI") |>
  subset(speaker == "Gerda Hasselfeldt") |>
  subset(protocol_date == "2009-11-11") |>
  subset(interjection == "speech") |>
  as.speeches(s_attribute_name = "speaker", s_attribute_date = "protocol_date", gap = 0) |>
  _[[15]]

(The subcorpus was chosen because it is very short.)

Now let's assume that we want to decode the subcorpus to an AnnotatedPlainTextDocument while removing stopwords:

tokens_to_remove <- c(
  "Bitte",
  "sehr",
  polmineR::punctuation
)

This fails because all tokens are removed:

doc <- decode(
  x,
  to = "AnnotatedPlainTextDocument",
  p_attributes = "word",
  mw = NULL,
  stoplist = tokens_to_remove,
  verbose = FALSE
)

Issue

The initial issue is that the data.table ts becomes empty if the stoplist is applied:

polmineR/R/decode.R, line 102 in 650c75f:

if (!is.null(stoplist)) ts <- ts[!ts[["word"]] %in% stoplist]

This results in an error later when the annotation object is created since some slots in the object are not empty.
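The filtering step can be reproduced in isolation. The following is a minimal sketch using only data.table. The toy table and its `cpos` and `word` columns are assumptions for illustration, since the real `ts` inside `decode()` is built from the corpus and may carry further columns.

```r
library(data.table)

# Toy stand-in for the token stream (assumed layout, not the real `ts`)
ts <- data.table(cpos = 0:2, word = c("Bitte", "sehr", "."))
stoplist <- c("Bitte", "sehr", ".")

# Same filter expression as in decode.R: every token is in the stoplist,
# so no row survives
ts <- ts[!ts[["word"]] %in% stoplist]
nrow(ts)  # 0
```

The result is a zero-row data.table that still has its columns, so any later step that expects at least one token has nothing to work with.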

Discussion

I assume that the obvious part of the solution is to check whether ts is empty (i.e. whether nrow(ts) == 0L) after applying the list of stopwords. However, I am not sure what should be returned here.

Normally, the return value would be an annotation object. Is returning NULL compatible with the usual workflows here or would it be better to return an empty AnnotatedPlainTextDocument instead?
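To make the first option concrete, here is a rough sketch of such a check that warns and returns NULL. The helper `filter_tokens()` is hypothetical and not part of polmineR. It only illustrates the `nrow(ts) == 0L` guard on a toy table and does not settle whether NULL or an empty AnnotatedPlainTextDocument is the better return value.

```r
library(data.table)

# Hypothetical helper (not in polmineR): apply the stoplist, then bail out
# early if no tokens are left
filter_tokens <- function(ts, stoplist = NULL) {
  if (!is.null(stoplist)) ts <- ts[!ts[["word"]] %in% stoplist]
  if (nrow(ts) == 0L) {
    warning("All tokens removed by stoplist, returning NULL.")
    return(NULL)
  }
  ts
}

# Toy token stream (assumed layout) in which every token is a stopword
ts <- data.table(cpos = 0:2, word = c("Bitte", "sehr", "."))
res <- suppressWarnings(filter_tokens(ts, c("Bitte", "sehr", ".")))
is.null(res)
```

With a NULL return, callers that loop over many subcorpora would need to drop NULL entries before further processing, which is the compatibility question raised above.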

This is somewhat related to issue #285.
