Creating a subcorpus bundle while ignoring the values of the structural attribute #263

ChristophLeonhardt · 2023-09-06T18:19:11Z

Splitting a subcorpus using a structural attribute without values - at least in principle - seems to be possible now (but see issue #262).

While having a look at this, I encountered a different scenario in which splitting a corpus by the value of a structural attribute might not lead to the desired output.

Use case: There might be scenarios in which I want to ignore the distinct values of a structural attribute and split the corpus every time a structure changes regardless of the value of the structural attribute. An example would be to split GermaParl2 into paragraphs:

paragraphs <- corpus("GERMAPARL2") |>
  subset(protocol_year == 2000) |>
  subset(p_type == "speech") |> 
  split(s_attribute = "p")

The desired output would be a subcorpus bundle containing all paragraphs of type "speech" separately.

However, the structural attribute "p" has values (the type of the paragraph). So split() behaves as designed and splits the subcorpus by these values. Because of the preceding subset(), only one value ("speech") is left, so the result is a subcorpus bundle with a single subcorpus that contains all "speech" paragraphs.

I am not entirely sure whether this use case is too specific to warrant a change in polmineR. Maybe it should be addressed by improving the data instead. Alternatively, the mechanism introduced for structural attributes without values (such as "s", i.e. sentences in GermaParl2) could be used to create a "paragraph" bundle regardless of the actual values of the structural attribute.
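To make the difference between the two behaviours concrete, here is a minimal base-R sketch. It does not use polmineR and does not reflect its internals; the region table and its column names (cpos_left, cpos_right, p_type) are made up for illustration only.

```r
# Toy region table: one row per <p> region (hypothetical data, not from GermaParl2).
regions <- data.frame(
  cpos_left  = c(0L, 10L, 25L, 40L),
  cpos_right = c(9L, 24L, 39L, 55L),
  p_type     = c("speech", "speech", "speech", "speech"),
  stringsAsFactors = FALSE
)

# Value-based splitting: one group per distinct value of the attribute.
# After filtering on p_type == "speech" there is only one value left.
by_value <- split(regions, regions$p_type)

# Region-based splitting: one group per region, ignoring the value.
by_region <- split(regions, seq_len(nrow(regions)))

length(by_value)   # 1
length(by_region)  # 4
```

The second variant corresponds to the desired output: every paragraph becomes its own element, no matter which value the attribute carries.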


ablaette · 2023-09-07T08:39:24Z

In previous versions of polmineR, the argument values of split() for corpus objects was used to reduce the list of subcorpora in a bundle to a set defined by a character vector. Splitting by an s-attribute without values was not implemented.

I have implemented this now. The following is a very basic example that I use in the unit tests.

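library(polmineR)
use("GermaParl2") # activate the corpora included in the GermaParl2 package

# values = FALSE: one subcorpus per <p> region, ignoring the attribute's values
corpus("GERMAPARL2MINI") |>
  split(s_attribute = "p", values = FALSE, verbose = FALSE)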

This is a modification of your scenario:

paragraphs <- corpus("GERMAPARL2") |>
  subset(protocol_year == 2000) |>
  subset(p_type == "speech") |> 
  split(s_attribute = "p")

Here, I will have 107691 paragraphs. A lot, but plausible.

ChristophLeonhardt · 2023-09-07T09:46:25Z

Thank you very much for the quick response.

I think the final chunk is supposed to be:

paragraphs <- corpus("GERMAPARL2") |>
  subset(protocol_year == 2000) |>
  subset(p_type == "speech") |> 
  split(s_attribute = "p", values = FALSE)

This seems like a handy solution to the problem I described. As far as I am concerned, this issue can be closed.

ablaette pushed a commit that referenced this issue Sep 7, 2023

test for split() by s-attribute without values #263

4b12314

ablaette pushed a commit that referenced this issue Sep 7, 2023

split() for corpus if s_attr does not have values #263

204330f

ablaette pushed a commit that referenced this issue Sep 7, 2023

NEWS reports fix for #263

7119edf
