---
title: "Cooking with GermaParl: Spicy Queries"
date: "2023-11-23"
output: 
  xaringan::moon_reader:
    css: ["./css/default.css", "./css/metropolis.css", "./css/robot-fonts.css", './css/polminify.css']
    nature:
      countIncrementalSlides: false
editor_options: 
  chunk_output_type: console
---

```{r libraries, echo = FALSE, message = FALSE, warning = FALSE}
library(DiagrammeR)
library(xml2)
library(kableExtra)
library(dplyr)
library(purrr)
```

```{r load_polmineR, echo = FALSE, message = FALSE, warning = FALSE}
library(polmineR)
```

# Data formats and structures

* TEI-inspired XML: <br/>https://github.com/PolMine/GermaParlTEI
* Corpus Workbench (CWB): <br/>https://zenodo.org/record/7949074

The CWB data is indexed and linguistically annotated!

See the [landing page at Zenodo](https://zenodo.org/record/7949074) for further installation instructions!

---

# GermaParl: Data Preparation Workflow

```{r make_diagrammeR, echo = FALSE, fig.width = 11}
grViz("
  digraph {
  graph [layout = dot, rankdir = LR, nodesep = 0.5]
  

node [shape = cylinder style = filled fillcolor = Gray93]
RAW1[label = 'Raw Data 1']
RAW2[label = 'Raw Data 2']
RAW3[label = 'Raw Data 3']
XML[label = 'TEI/XML']
XML2[label = 'TEI/XML (enriched)']
VRT[label = 'VRT']
CWB[label = 'CWB']

node[shape = rect style=filled fillcolor = Gray]
PARSE[label = 'Preprocessing and Parsing\n(frappp)']
ENRICH[label = 'Enrich\n(frappp)']
ANNOTATE[label = 'Linguistic Annotation\n(bignlp/Stanford CoreNLP, TreeTagger)']
ENCODE[label = 'Encoding\n(Corpus Workbench)']


RAW1 -> PARSE
RAW2 -> PARSE
RAW3 -> PARSE

PARSE -> XML -> ENRICH-> XML2 -> ANNOTATE -> VRT -> ENCODE -> CWB
   }
")
```

---

# XML/TEI

* structural annotation (metadata on document and speaker level)
* quasi-standardization, inspired by the standards of the Text Encoding Initiative (TEI)

```{r read_xml, echo = FALSE}
xml2::read_xml(x = "~/Lab/github/GermaParlTEI/01/BT_01_003.xml") %>%
  as.character() %>%
  cat()
```

---

# The Verticalized Text format (VRT)

* linguistically annotated "verticalized text": one token per line with tab-separated annotations, structured by XML tags
* see the CWB Corpus Encoding Tutorial (https://cwb.sourceforge.io/files/CWB_Encoding_Tutorial/)

```{r read_vrt, echo = FALSE}
xml2::read_xml("~/Lab/tmp/GermaParl2/vrt/BT_01_003.vrt") %>%
  as.character() %>%
  strsplit(split = "\n") %>% 
  pluck(1) %>% 
  head(26) %>% 
  cat(sep = "\n")
```

---

# Positional attributes

Attributes at the token level are *positional attributes*:

* word: the "surface form" of the token
* xpos: part-of-speech annotation #1 (Stuttgart-Tübingen Tagset / STTS)
* upos: part-of-speech annotation #2 (Universal Dependencies Tagset)
* lemma: the lemmatized form of the token

```{r sample_ts, echo = FALSE, message = FALSE}
sample_ts <- partition("GERMAPARL2", protocol_date = "1949-09-15") %>%
  polmineR::decode(
    s_attributes = character(), 
    p_attributes = c("word", "xpos", "upos", "lemma"),
    to = "data.table",
    verbose = FALSE
  ) %>%
  .[1:5]

knitr::kable(
  sample_ts, 
  format = ifelse(knitr::is_html_output(), "html", "latex"),
  booktabs = TRUE, 
  escape = TRUE, 
  caption = "A token stream representation of the GermaParl corpus") %>%
  footnote(general = "Beginning of session 1/3 (1949-09-15)")
```

---

# Structural attributes

In addition to the token-level annotation, sentences, paragraphs and named entities are annotated. As these can span more than one token, they are encoded as *structural attributes*. The relevant structural attributes are:

* `p`: regions defining paragraphs (no distinct values)
* `p_type`: distinguishes between speeches and other kinds of text such as stage comments or interjections
* `s`: regions defining sentences (no distinct values)
* `ne_type`: the type of a region annotated as a named entity, i.e. "ORGANIZATION", "LOCATION", "MISC" or "PERSON"

The `p` and `s` annotation layers can be used to subset the corpus by sentence or paragraph, or to set boundaries for analyses like `kwic()`.
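
The idea of a structural attribute as a region spanning several tokens can be sketched in base R. This is only an illustration: the tokens, corpus positions and regions below are made up, and the code does not reflect how the CWB stores the data internally.

```{r region_sketch}
# Hypothetical token stream with corpus positions (cpos), not taken from GermaParl
toy_tokens <- data.frame(
  cpos = 0:6,
  word = c("Herr", "Konrad", "Adenauer", "sprach", "in", "Bonn", ".")
)

# Hypothetical named-entity regions: left and right corpus position plus a value
toy_regions <- data.frame(
  cpos_left  = c(1, 5),
  cpos_right = c(2, 5),
  ne_type    = c("PERSON", "LOCATION")
)

# Label every token with the type of the region it falls into (NA if none)
toy_tokens$ne_type <- NA_character_
for (i in seq_len(nrow(toy_regions))) {
  inside <- toy_tokens$cpos >= toy_regions$cpos_left[i] &
    toy_tokens$cpos <= toy_regions$cpos_right[i]
  toy_tokens$ne_type[inside] <- toy_regions$ne_type[i]
}
toy_tokens
```

The "PERSON" region covers two tokens, which is why this kind of annotation cannot be expressed as a single token-level value.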

---

# Structural attributes of GermaParl

```{r tree_structure}
corpus("GERMAPARL2") %>% tree_structure() # dev version of polmineR required
```

---

# CQP: The corpus query processor

`polmineR` exposes the powerful CQP query language for querying CWB corpora. The CQP syntax gives access to the linguistic annotation layers. The CWB development team provides a [comprehensive handbook](https://cwb.sourceforge.io/files/CQP_Manual.pdf).

Three tools need to be distinguished:

- the CQP command line tool that ships with the CWB
- RcppCWB, the R package that wraps the CWB's C code
- polmineR, the analysis toolkit built on top of RcppCWB

---

# CQP and regular expressions

```{r kwic_regular}
corpus("GERMAPARL2") |>
  kwic(
    query = '"[Dd]emokratisch.*"',
    cqp = TRUE,
    s_attribute = "protocol_date"
  ) 
```
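
A CQP regular expression has to match the entire token. As a rough illustration outside of CQP, this can be mimicked in base R by anchoring the pattern at both ends. The token vector is made up for this sketch.

```{r regex_sketch}
# Hypothetical tokens, not taken from the corpus
toy_words <- c("demokratisch", "Demokratische", "undemokratisch", "Demokratie")

# Anchors (^ and $) emulate CQP's whole-token matching in base R
toy_hits <- grepl("^[Dd]emokratisch.*$", toy_words)
toy_words[toy_hits]
```

Only the first two tokens match: "undemokratisch" fails because the pattern has to start at the beginning of the token.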


---

# Positional attributes

The CQP syntax also allows you to specify the positional attribute explicitly:

```{r kwic_cqp_1}
corpus("GERMAPARL2") |>
  kwic(query = '[word = "[dD]emokratisch.*"]', s_attribute = "protocol_date")
```

---

# Positional attributes

Note the single quotation marks around the entire query as well as the square brackets.

This makes queries more flexible. For example, the part-of-speech tag can be used as an additional condition.

The German word "Fliege" is ambiguous (just like the English word "fly"): it can denote an animal or a specific form of the verb "to fly". The CQP syntax makes it possible to differentiate between the two.

```{r kwic_comp_1}
corpus("GERMAPARL2") |>
  kwic(
    query = '[word = "Fliege" & xpos = "V.*"]',
    s_attribute = c("protocol_lp", "protocol_no")
  )
```

```{r kwic_comp_2}
corpus("GERMAPARL2") |>
  kwic(
    query = '[word = "Fliege" & xpos = "N.*"]',
    s_attribute = c("protocol_lp", "protocol_no")
  )
```
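
The logic of combining conditions on several positional attributes can be sketched on a toy table in base R. The rows are made up; "NN" and "VVFIN" are STTS tags for a common noun and a finite full verb.

```{r pos_sketch}
# Hypothetical annotated tokens, not taken from the corpus
toy_pos <- data.frame(
  word = c("Fliege", "Fliege", "fliegt"),
  xpos = c("NN", "VVFIN", "VVFIN")
)

# Same condition as in the CQP query: surface form AND part-of-speech pattern
toy_verbs <- toy_pos[toy_pos$word == "Fliege" & grepl("^V.*$", toy_pos$xpos), ]
toy_verbs
```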

---

# CQP and advanced queries

CQP queries can span several tokens and work with most `polmineR` methods.

```{r count_1}
corpus("GERMAPARL2") |>
  polmineR::count(
    query = '"freiheitlich" "-" [lemma = "demokratisch"] "Grundordnung"',
    breakdown = TRUE
  ) |>
  as.data.frame()
```

The CQP manual covers even more advanced queries.

```{r count_2, eval = FALSE}
corpus("GERMAPARL2") |>
  subset(protocol_lp %in% 5:6) |>
  polmineR::count(
    query = '(?longest)[xpos = "V.*"] []{0,5} [word = "Transformation"]',
    breakdown = TRUE
  )
```

---

# CQP and structural attributes

Linguistic annotation stored in structural attributes can be queried as well.

```{r count_ner, eval = FALSE}
corpus("GERMAPARL2") |>
  polmineR::count(
    query = '/region[ne_type,a]::a.ne_type="ORGANIZATION"',
    cqp = TRUE,
    breakdown = TRUE
  )
```

---

# Counting Named Entities

```{r, eval = FALSE}
x <- corpus("GERMAPARL2") %>% 
  subset(ne_type == "PERSON") %>% # comparison requires `==`, not `=`
  split(s_attribute = "ne_type") %>% 
  get_token_stream(p_attribute = "word") %>% 
  table()
```
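
What the final `table()` call does can be seen on a small vector. The names below are made up for this sketch and are not results from the corpus.

```{r ne_count_sketch}
# Hypothetical named-entity strings, not taken from the corpus
toy_persons <- c("Adenauer", "Schumacher", "Adenauer", "Heuss", "Adenauer")

# Count occurrences and put the most frequent entity first
toy_counts <- sort(table(toy_persons), decreasing = TRUE)
toy_counts
```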

---

# A time-series plot

```{r, message = FALSE}
library(dplyr)
library(xts)
library(lubridate) # we need lubridate::floor_date()

look_up <- '"freiheitlich" "-" [lemma = "demokratisch"] "Grundordnung"'

corpus("GERMAPARL2") %>% 
  dispersion(query = look_up, cqp = TRUE, s_attribute = "protocol_date") %>%
  as_tibble() %>%
  mutate(date = as.Date(protocol_date)) %>%
  mutate(year = floor_date(date, unit = "year")) %>%
  filter(!is.na(year)) %>%
  select(count, year) %>%
  group_by(year) %>%
  summarise(sum = sum(count)) %>%
  # braces keep magrittr from also passing the tibble as first argument
  { as.xts(x = .$sum, order.by = .$year) } %>%
  plot(
    main = "'FDGO' in Bundestag Plenary Debates (N/year)",
    xlab = "total per year",
    ylim = c(0, 100), col = "darkblue", main.timespan = FALSE
  )
```
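
The aggregation step in the pipeline above (from counts per date to counts per year) can be sketched in base R. The dates and counts are made up, and the column names only mirror those used above.

```{r aggregate_sketch}
# Hypothetical counts per protocol date, not results from the corpus
toy_disp <- data.frame(
  protocol_date = c("1949-09-15", "1949-11-03", "1950-02-10"),
  count = c(2, 1, 4)
)

# Derive the year from the date and sum the counts per year
toy_disp$year <- format(as.Date(toy_disp$protocol_date), "%Y")
toy_yearly <- aggregate(count ~ year, data = toy_disp, FUN = sum)
toy_yearly
```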