Creating a subcorpus bundle while ignoring the values of the structural attribute #263

Open
ChristophLeonhardt opened this issue Sep 6, 2023 · 2 comments

@ChristophLeonhardt
Contributor

Splitting a subcorpus using a structural attribute without values - at least in principle - seems to be possible now (but see issue #262).

While having a look at this, I encountered a different scenario in which splitting a corpus by the value of a structural attribute might not lead to the desired output.

Use case: There might be scenarios in which I want to ignore the distinct values of a structural attribute and split the corpus every time a structure changes regardless of the value of the structural attribute. An example would be to split GermaParl2 into paragraphs:

paragraphs <- corpus("GERMAPARL2") |>
  subset(protocol_year == 2000) |>
  subset(p_type == "speech") |> 
  split(s_attribute = "p")

The desired output would be a subcorpus bundle containing all paragraphs of type "speech" separately.

However, the structural attribute "p" has values (the type of the paragraph). So split() behaves as designed and splits the subcorpus by these values. Because of the preceding subset(), only one value is left, so the result is a subcorpus bundle containing a single subcorpus that holds all "speech" paragraphs.
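
To make the difference concrete, here is a toy sketch in base R. It is not polmineR's implementation: the region table, its column names and the corpus positions are made up for illustration. It only contrasts grouping regions by their value with treating every region as its own unit.

```r
# Hypothetical region table: one row per <p> region, with made-up corpus
# positions. After subsetting to p_type == "speech", all values are identical.
regions <- data.frame(
  cpos_left  = c(0L, 10L, 25L, 40L),
  cpos_right = c(9L, 24L, 39L, 55L),
  p_type     = c("speech", "speech", "speech", "speech")
)

# Splitting by value: regions that share a value land in the same group.
by_value <- split(regions, regions$p_type)

# Splitting by region: every region becomes its own group, values are ignored.
by_region <- split(regions, seq_len(nrow(regions)))

length(by_value)   # 1
length(by_region)  # 4
```

The first result corresponds to the behaviour described above (one group for the single remaining value), the second to the desired output (one unit per paragraph).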

I am not entirely sure whether this use case is too specific to warrant a change in polmineR. Maybe it should be addressed by improving the data instead. Alternatively, the mechanism introduced for structural attributes without values (such as "s", i.e. sentences in GermaParl2) could be reused to create a "paragraph" bundle regardless of the actual values of the structural attribute.

@ablaette
Collaborator

ablaette commented Sep 7, 2023

In previous versions of polmineR, the argument values of split() for corpus objects was used to reduce the list of subcorpora in a bundle to a set defined by a character vector. Splitting by an s-attribute without values was not implemented.

I have implemented this now. This is a very basic example that I use in the unit tests.

library(polmineR)
use("GermaParl2")

# Split into one subcorpus per <p> region, ignoring the attribute's values
corpus("GERMAPARL2MINI") |>
  split(s_attribute = "p", values = FALSE, verbose = FALSE)

This is a modification of your scenario:

paragraphs <- corpus("GERMAPARL2") |>
  subset(protocol_year == 2000) |>
  subset(p_type == "speech") |> 
  split(s_attribute = "p")

Here, I will have 107691 paragraphs. A lot, but plausible.

@ChristophLeonhardt
Contributor Author

ChristophLeonhardt commented Sep 7, 2023

Thank you very much for the quick response.

I think the final chunk is supposed to be:

paragraphs <- corpus("GERMAPARL2") |>
  subset(protocol_year == 2000) |>
  subset(p_type == "speech") |> 
  split(s_attribute = "p", values = FALSE)

This seems like a handy solution to the problem I described. As far as I am concerned, this issue can be closed.
