sentences_intro.html

<!DOCTYPE html>
<html lang="" xml:lang="">
  <head>
    <title>Cooking with GermaParl</title>
    <meta charset="utf-8" />
    <meta name="date" content="2023-12-14" />
    <script src="sentences_intro_files/header-attrs-2.25/header-attrs.js"></script>
    <link rel="stylesheet" href="css/default.css" type="text/css" />
    <link rel="stylesheet" href="css/metropolis.css" type="text/css" />
    <link rel="stylesheet" href="css/robot-fonts.css" type="text/css" />
    <link rel="stylesheet" href="css/polminify.css" type="text/css" />
  </head>
  <body>
    <textarea id="source">
class: center, middle, inverse, title-slide

.title[
# Cooking with GermaParl
]
.date[
### 2023-12-14
]

---


# Purpose and Motivation

* GermaParl2 comprises **rich structural annotation** on the level of protocols (such as the date or the legislative period) and the level of speakers (such as a speakers name or parliamentary group)

* as seen in previous cookbooks, these can be used to **create meaningful subcorpora** for substantive analysis

* but even beyond that, the corpus contains **annotation below the level of speeches** in forms of paragraph and sentence annotation

* sentences can provide natural **units of analysis** with semantic meaning (for a comprehensive discussion see Däubler et al. 2012)

* sentence annotations can be used for a variety of **use cases**

---

# Encoding of Sentences

* sentences are annotated using **Stanford CoreNLP** (https://stanfordnlp.github.io/CoreNLP/)

* sentences are encoded as the **structural attribute** `s` in GermaParl2

* in contrast to other annotations in GermaParl, the sentence annotation **does not have values**; they describe regions in terms of start and end positions of sentences

* `polmineR` indicates the missing values when called with `s_attributes()`:


```r
s_attributes("GERMAPARL2", "s")
```

```
## ! s-attribute `s` does not have values, returning NA
```

```
## [1] NA
```

---

# Sentences and the tree structure 


```r
corpus("GERMAPARL2") %&gt;% polmineR::tree_structure()
```

```
## protocol [lp│no│date│year│url│filetype]
##    | 
##    └─ speaker [who│name│party│parlgroup│role]
##       | 
##       └─ p [type]
##          | 
##          └─ s
##             | 
##             └─ ne [type]
```

---

# Splitting Objects into Sentences

`polmineR` makes it easy to split a (sub)corpus into sentences


```r
sentences &lt;- corpus("GERMAPARL2") |&gt;
  subset(protocol_date == "1949-12-14") |&gt;
  split(s_attribute = "s", values = FALSE)
```

* the `values` argument of `split()` makes missing **values** explicit, but this is not strictly necessary

* the output is a bundle of subcorpora, each containing a single sentence

* splitting by sentences can also be done for corpora with sentence annotation (caution: GermaParl2 is quite large)

---

# Splitting Objects into Sentences
 
* subcorpus bundles can be used as usual

* one example would be to decode the sentences as strings in their word order for further analysis

* this could be useful for word embeddings or classification tasks which rely on word order


```r
sentences_ts &lt;- get_token_stream(sentences)
```

* the sentence annotation is not always perfect though:


```r
sentences_ts[[693]]
```

```
## [1] "—"        "Ich"      "schließe" "die"      "23"       "."
```


```r
sentences_ts[[694]]
```

```
## [1] "Sitzung"    "des"        "Deutschen"  "Bundestags" "."
```

---

# Sentence-Term-Matrices

* the sentence bundle can also be used as input to create a **Document-Term-Matrix** (in this case a sentence-term-matrix)
* potentially useful for **machine learning approaches** which rely on a **Bag-of-Words** representation of sentences
* examples: Sentence Similarity, Weighting of Terms


```r
dtm &lt;- polmineR::as.DocumentTermMatrix(sentences, p_attribute = "word")
```

---

# Sentence-Term-Matrices


```r
tm::inspect(dtm)
```

```
## &lt;&lt;DocumentTermMatrix (documents: 695, terms: 3255)&gt;&gt;
## Non-/sparse entries: 13796/2248429
## Sparsity           : 99%
## Maximal term length: 32
## Weighting          : term frequency (tf)
## Sample             :
##        Terms
## Docs     , . daß den der des die in und zu
##   36563 38 1   0   0   0   0   2  0   0  0
##   36589  8 1   1   1   4   0   2  3   2  1
##   36619  9 1   2   3  12   0   2  2   2  2
##   36643  7 1   0   0   1   1   2  1   0  1
##   36674  8 1   3   1   3   0   4  3   4  1
##   36695 10 1   4   0   3   1   4  1   3  0
##   36717  5 1   3   0   2   1   0  2   1  1
##   36733  5 1   0   1   2   2   2  2   3  0
##   36777  5 1   1   0   4   2   7  1   4  0
##   36787  6 1   1   1   4   0   0  3   2  1
```

---


# Using Sentences as Context Windows

* the boundaries of sentences can be used to define **context windows** of query terms

* this can be useful to limit the analysis to relevant context words or to identify meaningful multi-word query terms

* `polmineR` provides two ways to make use of the sentence annotation in these scenarios:

#### 1) Sentence Annotation as a `boundary`:

* the maximum number of tokens in the context window is determined by the values of `left` and `right` but the context does not extend over the boundary of a sentence


```r
corpus("GERMAPARL2") |&gt;
  kwic(query = "Demokratie",
       boundary = "s",
       left = 20,
       right = 20)
```

---

# Using Sentences as Context Windows

#### 2) Sentence Annotation as Context

* the context window is determined by the **structural attribute** - here `s` - defined by `region` and a number of sentences in `left` and `right`
  

```r
corpus("GERMAPARL2") |&gt;
  kwic(query = "Demokratie",
       region = "s",
       left = 0,
       right = 0)
```

* the annotation of `left` and `right` determines **additional context** in terms of sentences `s`

* i.e. if `s` = 0, then the context window comprises of the same sentence as the query term

---

# Using Sentences as Context Windows

* changing the values of `left` and `right` to 1 adds one additional sentence as context


```r
corpus("GERMAPARL2") |&gt;
  kwic(query = "Demokratie",
       region = "s",
       left = 1,
       right = 1)
```

* this is equivalent to the following syntax:


```r
corpus("GERMAPARL2") |&gt;
  kwic(query = "Demokratie",
       left = c("s" = 1),
       right = c("s" = 1))
```

---

# Using Sentences as Context Windows

* this also applies to values passed to other parameters such as `positivelist` and `stoplist`: 


```r
corpus("GERMAPARL2") |&gt;
  kwic(query = "Demokratie",
       region = "s",
       left = 0,
       right = 0,
       positivelist = "Krise"
  )
```

```
## ... filtering by positivelist
```

```
## ... number of hits dropped due to positivelist: 37002
```

```
## ... update count statistics for slot cpos
```

**Note:** Sentences which contain a query term more than once show up in the output of `kwic` more than once

---

# Using Sentences in CQP Queries

* as noted in the CQP manual, "most linguistic queries should include the restriction within s to avoid crossing sentence boundaries" (https://cwb.sourceforge.io/files/CQP_Manual/4_2.html)

* this can be achieved with the syntax used in the following query (results on the next slide)


```r
sc &lt;- corpus("GERMAPARL2") |&gt; subset(protocol_lp == 15)

count(sc,
      query = '"Bundesministerium.*" []{1,5} [xpos = "NN"] within s',
      cqp = TRUE,
      breakdown = TRUE)
```

* it has to be noted that this can be computationally expensive and depending on the use case, the differences are subtle

---

# Using Sentences in CQP Queries

&lt;table&gt;
&lt;caption&gt;First five Query Matches in GermaParl2, 15th Legislative Period&lt;/caption&gt;
 &lt;thead&gt;
  &lt;tr&gt;
   &lt;th style="text-align:left;"&gt; query &lt;/th&gt;
   &lt;th style="text-align:left;"&gt; match &lt;/th&gt;
   &lt;th style="text-align:right;"&gt; count &lt;/th&gt;
   &lt;th style="text-align:right;"&gt; share &lt;/th&gt;
  &lt;/tr&gt;
 &lt;/thead&gt;
&lt;tbody&gt;
  &lt;tr&gt;
   &lt;td style="text-align:left;"&gt; &amp;quot;Bundesministerium.*&amp;quot; []{1,5} [xpos = &amp;quot;NN&amp;quot;] within s &lt;/td&gt;
   &lt;td style="text-align:left;"&gt; Bundesministeriums für Wirtschaft &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 84 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 6.78 &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td style="text-align:left;"&gt; &amp;quot;Bundesministerium.*&amp;quot; []{1,5} [xpos = &amp;quot;NN&amp;quot;] within s &lt;/td&gt;
   &lt;td style="text-align:left;"&gt; Bundesministeriums für Verkehr &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 71 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 5.73 &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td style="text-align:left;"&gt; &amp;quot;Bundesministerium.*&amp;quot; []{1,5} [xpos = &amp;quot;NN&amp;quot;] within s &lt;/td&gt;
   &lt;td style="text-align:left;"&gt; Bundesministeriums des Innern &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 67 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 5.41 &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td style="text-align:left;"&gt; &amp;quot;Bundesministerium.*&amp;quot; []{1,5} [xpos = &amp;quot;NN&amp;quot;] within s &lt;/td&gt;
   &lt;td style="text-align:left;"&gt; Bundesministeriums der Finanzen &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 66 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 5.33 &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td style="text-align:left;"&gt; &amp;quot;Bundesministerium.*&amp;quot; []{1,5} [xpos = &amp;quot;NN&amp;quot;] within s &lt;/td&gt;
   &lt;td style="text-align:left;"&gt; Bundesministeriums für Gesundheit &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 66 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 5.33 &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td style="text-align:left;"&gt; &amp;quot;Bundesministerium.*&amp;quot; []{1,5} [xpos = &amp;quot;NN&amp;quot;] within s &lt;/td&gt;
   &lt;td style="text-align:left;"&gt; Bundesministerium des Innern &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 61 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 4.92 &lt;/td&gt;
  &lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;


---

# Sampling at the sentence level


```r
packageVersion("polmineR")
```

```
## [1] '0.8.9.9001'
```

```r
demsent_ids &lt;- corpus("GERMAPARL2") %&gt;%
  hits(query = "Demokratie", s_attribute = "s", decode = FALSE) %&gt;%
  as.data.frame() %&gt;%
  pull(s)

demsents &lt;- corpus("GERMAPARL2") %&gt;%
  subset(s %in% !!demsent_ids) %&gt;%
  split(s_attribute = "s") %&gt;%
  get_token_stream(p_attribute = "word", collapse = " ")
```

* write it on disk and use it as input for ... whatsoever!


---

# References

Däubler, T., Benoit, K., Mikhaylov, S., &amp; Laver, M. (2012). Natural Sentences as Valid Units for Coded Political Texts. British Journal of Political Science, 42(4), 937–951. http://www.jstor.org/stable/23274173.
    </textarea>
<style data-target="print-only">@media screen {.remark-slide-container{display:block;}.remark-slide-scaler{box-shadow:none;}}</style>
<script src="https://remarkjs.com/downloads/remark-latest.min.js"></script>
<script>var slideshow = remark.create({
"countIncrementalSlides": false
});
if (window.HTMLWidgets) slideshow.on('afterShowSlide', function (slide) {
  window.dispatchEvent(new Event('resize'));
});
(function(d) {
  var s = d.createElement("style"), r = d.querySelector(".remark-slide-scaler");
  if (!r) return;
  s.type = "text/css"; s.innerHTML = "@page {size: " + r.style.width + " " + r.style.height +"; }";
  d.head.appendChild(s);
})(document);

(function(d) {
  var el = d.getElementsByClassName("remark-slides-area");
  if (!el) return;
  var slide, slides = slideshow.getSlides(), els = el[0].children;
  for (var i = 1; i < slides.length; i++) {
    slide = slides[i];
    if (slide.properties.continued === "true" || slide.properties.count === "false") {
      els[i - 1].className += ' has-continuation';
    }
  }
  var s = d.createElement("style");
  s.type = "text/css"; s.innerHTML = "@media print { .has-continuation { display: none; } }";
  d.head.appendChild(s);
})(document);
// delete the temporary CSS (for displaying all slides initially) when the user
// starts to view slides
(function() {
  var deleted = false;
  slideshow.on('beforeShowSlide', function(slide) {
    if (deleted) return;
    var sheets = document.styleSheets, node;
    for (var i = 0; i < sheets.length; i++) {
      node = sheets[i].ownerNode;
      if (node.dataset["target"] !== "print-only") continue;
      node.parentNode.removeChild(node);
    }
    deleted = true;
  });
})();
// add `data-at-shortcutkeys` attribute to <body> to resolve conflicts with JAWS
// screen reader (see PR #262)
(function(d) {
  let res = {};
  d.querySelectorAll('.remark-help-content table tr').forEach(tr => {
    const t = tr.querySelector('td:nth-child(2)').innerText;
    tr.querySelectorAll('td:first-child .key').forEach(key => {
      const k = key.innerText;
      if (/^[a-z]$/.test(k)) res[k] = t;  // must be a single letter (key)
    });
  });
  d.body.setAttribute('data-at-shortcutkeys', JSON.stringify(res));
})(document);
(function() {
  "use strict"
  // Replace <script> tags in slides area to make them executable
  var scripts = document.querySelectorAll(
    '.remark-slides-area .remark-slide-container script'
  );
  if (!scripts.length) return;
  for (var i = 0; i < scripts.length; i++) {
    var s = document.createElement('script');
    var code = document.createTextNode(scripts[i].textContent);
    s.appendChild(code);
    var scriptAttrs = scripts[i].attributes;
    for (var j = 0; j < scriptAttrs.length; j++) {
      s.setAttribute(scriptAttrs[j].name, scriptAttrs[j].value);
    }
    scripts[i].parentElement.replaceChild(s, scripts[i]);
  }
})();
(function() {
  var links = document.getElementsByTagName('a');
  for (var i = 0; i < links.length; i++) {
    if (/^(https?:)?\/\//.test(links[i].getAttribute('href'))) {
      links[i].target = '_blank';
    }
  }
})();</script>

<script>
slideshow._releaseMath = function(el) {
  var i, text, code, codes = el.getElementsByTagName('code');
  for (i = 0; i < codes.length;) {
    code = codes[i];
    if (code.parentNode.tagName !== 'PRE' && code.childElementCount === 0) {
      text = code.textContent;
      if (/^\\\((.|\s)+\\\)$/.test(text) || /^\\\[(.|\s)+\\\]$/.test(text) ||
          /^\$\$(.|\s)+\$\$$/.test(text) ||
          /^\\begin\{([^}]+)\}(.|\s)+\\end\{[^}]+\}$/.test(text)) {
        code.outerHTML = code.innerHTML;  // remove <code></code>
        continue;
      }
    }
    i++;
  }
};
slideshow._releaseMath(document);
</script>
<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
(function () {
  var script = document.createElement('script');
  script.type = 'text/javascript';
  script.src  = 'https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-MML-AM_CHTML';
  if (location.protocol !== 'file:' && /^https?:/.test(script.src))
    script.src  = script.src.replace(/^https?:/, '');
  document.getElementsByTagName('head')[0].appendChild(script);
})();
</script>
  </body>
</html>