Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Incorpora MANIFESTO corpus example? #56

Open
ablaette opened this issue Feb 1, 2024 · 1 comment
Open

Incorpora MANIFESTO corpus example? #56

ablaette opened this issue Feb 1, 2024 · 1 comment

Comments

@ablaette
Copy link
Collaborator

ablaette commented Feb 1, 2024

This is an example I developed for building a CWB-indexed MARPOR corpus. Maybe incorporate it into vignette, or include it as a R Markdown template?

library(cwbtools)
library(polmineR)
library(RcppCWB)
library(manifestoR)
library(dplyr)
library(purrr)
library(tidytext)
library(data.table)

party_names <- c(
  "41112" = "GRUENE",
  "41113" = "GRUENE",
  "41221" = "PDS",
  "41222" = "LINKE",
  "41223" = "LINKE",
  "41320" = "SPD",
  "41420" = "FDP",
  "41521" = "CDU/CSU",
  "41952" = "Piraten",
  "41953" = "AfD",
  "41912" = "SSW"
)

mp_setapikey("~/.credentials/manifesto_apikey.txt")

tokstream <- mp_corpus(countryname == "Germany" & edate > as.Date("2017-10-01")) %>% 
  lapply(as.data.frame) %>% 
  lapply(as_tibble) %>% 
  imap(function(tbl, id) mutate(tbl, id = id)) %>% 
  bind_rows() %>% 
  mutate(yearmon = gsub("^.*_(\\d{4})(\\d{2})$", "\\1-\\2", id)) %>% 
  mutate(party = party_names[gsub("^(.*)_.*$", "\\1", id)]) %>% 
  unnest_tokens(word, text, to_lower = FALSE, strip_punct = FALSE) %>% 
  mutate(cpos = 0L:(nrow(.) - 1L))

mdata <- tokstream %>% 
  group_by(pos, yearmon, party) %>% 
  summarise(cpos_left = min(cpos), cpos_right = max(cpos), .groups = "rowwise") %>% 
  mutate(pos = as.character(pos)) %>% 
  relocate(cpos_left, cpos_right) %>% 
  arrange(cpos_left)
  
MARPOR <- CorpusData$new()
MARPOR$tokenstream <- tokstream %>% select(cpos, word) %>% as.data.table()
MARPOR$metadata <- mdata

data_dir_tmp <- fs::path(tempdir(), "data_dir")
registry_tmp <- fs::path(tempdir(), "registry")

dir.create(data_dir_tmp)
dir.create(registry_tmp)

MARPOR$encode(
  corpus = "MARPOR",
  encoding = "utf8",
  p_attributes = "word",
  s_attributes = c("pos", "yearmon", "party"),
  registry_dir = registry_tmp,
  data_dir = data_dir_tmp,
  method = "R",
  compress = FALSE
)
RcppCWB::cl_delete_corpus("MARPOR", registry = registry_tmp)
RcppCWB::cqp_reset_registry(registry_tmp)

polmineR::corpus("MARPOR") %>% 
  kwic(query = "Zuwanderung", s_attributes = "party")
@ablaette
Copy link
Collaborator Author

This is a relevant use case, but due to the restricted access to the Marpor Corpus, the workflow is not suited very well vor the vignette. So I prepared a blogpost that I can be finalized and published when cwbtools v0.4.0 is released at CRAN.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant