Creating a speech bundle without interjections #281

ChristophLeonhardt · 2023-11-20T13:44:06Z

Subsetting a speech bundle results in a subcorpus bundle with unexpected subcorpora, because the initial separation into speeches is not kept.

Hence the question: What is the most efficient way to create a speech bundle without interjections?

Scenario I: Splitting into speeches, then subsetting by paragraph type

Using GERMAPARL2 to create a speech bundle seems to work fine. The output is a subcorpus bundle with about 450 thousand subcorpora.

library(polmineR)

all_speeches <- corpus("GERMAPARL2") |>
  as.speeches(s_attribute_name = "speaker_name",
              s_attribute_date = "protocol_date")

Assumption: I want to omit all interjections from these speeches. I think the logical step would be a subset.

all_speeches_min <- all_speeches |>
  subset(p_type == "speech")

Expected output: A subcorpus bundle with the same subcorpora (assuming that there are no speeches which only contain interjections) but without paragraphs which are not of type "speech".

Observed output: A subcorpus bundle with about 4400 subcorpora.

It seems that the result contains one subcorpus per unique speaker rather than one per speech.
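One way to make this difference visible is to compare the two bundles by size and by member names. The following is a minimal sketch: compare_bundles() is a hypothetical helper, not part of polmineR, and it assumes only that length() and names() work on the bundle objects in the same way as they do on a named list.

```r
# Hypothetical helper (not a polmineR function): compares two bundles by the
# number of members and by member names. It uses only length() and names(),
# so it also works on plain named lists.
compare_bundles <- function(before, after) {
  list(
    n_before = length(before),                          # members before subsetting
    n_after  = length(after),                           # members after subsetting
    dropped  = setdiff(names(before), names(after)),    # names that disappeared
    added    = setdiff(names(after), names(before))     # names that are new
  )
}

# Usage with the objects from Scenario I (assumes both exist in the session):
# cmp <- compare_bundles(all_speeches, all_speeches_min)
# cmp$n_before
# cmp$n_after
# head(cmp$added)  # names present only after subsetting
```

If the names in added look like bare speaker names, that would support the impression that the split by speech has been lost.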

Scenario II: Subsetting by paragraph type, then splitting into speeches

In contrast, this seems to work.

all_speeches_2 <- corpus("GERMAPARL2") |>
  subset(p_type == "speech") |>
  as.speeches(s_attribute_name = "speaker_name",
              s_attribute_date = "protocol_date")

Discussion

Aside from the second approach being very slow, it is not obvious to me why the first approach should not work. Is the first scenario supposed to work at all? If it is, there might be a bug. If it is not supposed to work like that, some additional documentation would be useful.

Additional Remarks

The as.speeches() method also has a subset argument. However, as the documentation notes, it is currently only useful for speaker names (speaker) and dates (date) and does not work for other structural attributes.

This was tested using polmineR 0.8.9.9001.
