Creating a speech bundle without interjections #281

Open
ChristophLeonhardt opened this issue Nov 20, 2023 · 0 comments

Subsetting a speech bundle results in a subcorpus bundle with unexpected subcorpora as the initial separation into speeches is not kept.

Hence the question: What is the most efficient way to create a speech bundle without interjections?

Scenario I: Splitting into speeches, then subsetting by paragraph type

Using GERMAPARL2 to create a speech bundle seems to work fine. The output is a subcorpus bundle with about 450 thousand subcorpora.

library(polmineR)

# Split the whole corpus into speeches, keyed by speaker name and protocol date
all_speeches <- corpus("GERMAPARL2") |>
  as.speeches(s_attribute_name = "speaker_name",
              s_attribute_date = "protocol_date")

Assumption: I want to omit all interjections from these speeches. I think the logical step would be a subset.

# Keep only paragraphs of type "speech", i.e. drop interjections
all_speeches_min <- all_speeches |>
  subset(p_type == "speech")

Expected output: A subcorpus bundle with the same subcorpora (assuming that there are no speeches which only contain interjections) but without paragraphs which are not of type "speech".

Observed output: A subcorpus bundle with about 4400 subcorpora.

It seems that the result contains one subcorpus per unique speaker rather than one per speech.
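
The difference can be checked directly on the two bundles. The following is only a minimal sketch, assuming that `all_speeches` and `all_speeches_min` exist as created above and that `length()` and `names()` behave for these bundles as they do for other polmineR bundle objects. The variable names `n_before` and `n_after` are introduced here for illustration only.

```r
library(polmineR)

# Assumption: all_speeches and all_speeches_min were created as shown above.
# Number of subcorpora before and after subsetting
n_before <- length(all_speeches)
n_after <- length(all_speeches_min)
cat("before subsetting:", n_before, "\n")
cat("after subsetting: ", n_after, "\n")

# Inspect the first few names to see how the subcorpora are keyed
# in each bundle (per speech or per speaker)
head(names(all_speeches))
head(names(all_speeches_min))
```

If the subsetting kept the split into speeches, both counts should be of the same order of magnitude and the names should follow the same pattern.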

Scenario II: Subsetting by paragraph type, then splitting into speeches

In contrast, this seems to work.

# Drop interjections first, then split the remaining text into speeches
all_speeches_2 <- corpus("GERMAPARL2") |>
  subset(p_type == "speech") |>
  as.speeches(s_attribute_name = "speaker_name",
              s_attribute_date = "protocol_date")

Discussion

Aside from the second approach being very slow, it is not obvious to me why the first approach should not work. Is the first scenario supposed to work at all? If it is, this might be a bug. If it is not, some additional documentation would be useful.

Additional Remarks

The as.speeches() method also has a subset argument. As the documentation notes, however, it is currently only useful for speaker names (speaker) and dates (date) and does not work for other structural attributes.

This was tested using polmineR 0.8.9.9001.
