/
cqp.Rmd
290 lines (216 loc) · 7.34 KB
/
cqp.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
---
title: "Cooking with GermaParl: Spicy Queries"
date: "2023-11-23"
output:
xaringan::moon_reader:
css: ["./css/default.css", "./css/metropolis.css", "./css/robot-fonts.css", './css/polminify.css']
nature:
countIncrementalSlides: false
editor_options:
chunk_output_type: console
---
```{r libraries, echo = FALSE, message = FALSE, warning = FALSE}
library(DiagrammeR)
library(xml2)
library(kableExtra)
library(dplyr)
library(purrr)
```
```{r load_polmineR, echo = FALSE, message = FALSE, warning = FALSE}
library(polmineR)
```
# Data formats and structures
* TEI-inspired XML: <br/>https://github.com/PolMine/GermaParlTEI
* Corpus Workbench (CWB): <br/>https://zenodo.org/record/7949074
The CWB data is indexed and linguistically annotated!
See the [landing page at Zenodo](https://zenodo.org/record/7949074) for further installation instructions!
---
# GermaParl: Data Preparation Workflow
```{r make_diagrammeR", echo = FALSE, fig.width = 11}
grViz("
digraph {
graph [layout = dot, rankdir = LR, nodesep = 0.5]
node [shape = cylinder style = filled fillcolor = Gray93]
RAW1[label = 'Raw Data 1']
RAW2[label = 'Raw Data 2']
RAW3[label = 'Raw Data 3']
XML[label = 'TEI/XML']
XML2[label = 'TEI/XML (enriched)']
VRT[label = 'VRT']
CWB[label = 'CWB']
node[shape = rect style=filled fillcolor = Gray]
PARSE[label = 'Preprocessing and Parsing\n(frappp)']
ENRICH[label = 'Enrich\n(frappp)']
ANNOTATE[label = 'Linguistic Annotation\n(bignlp/Stanford CoreNLP, TreeTagger)']
ENCODE[label = 'Encoding\n(Corpus Workbench)']
RAW1 -> PARSE
RAW2 -> PARSE
RAW3 -> PARSE
PARSE -> XML -> ENRICH-> XML2 -> ANNOTATE -> VRT -> ENCODE -> CWB
}
")
```
---
# XML/TEI
* structural annotation (metadata on document and speaker level)
* quasi standardization, inspired by the standards of the Text Encoding Initiative (TEI)
```{r read_xml, echo = FALSE}
xml2::read_xml(x = "~/Lab/github/GermaParlTEI/01/BT_01_003.xml") %>%
as.character() %>%
cat()
```
---
# The Verticalized Text format (VRT)
* linguistically annotated "verticalized text" (tab separated and annotated with XML tags)
* see Corpus Encoding Manual (https://cwb.sourceforge.io/files/CWB_Encoding_Tutorial/)
```{r read_vrt, echo = FALSE}
xml2::read_xml("~/Lab/tmp/GermaParl2/vrt/BT_01_003.vrt") %>%
as.character() %>%
strsplit(split = "\n") %>%
pluck(1) %>%
head(26) %>%
cat(sep = "\n")
```
---
# Positional attributes
Attributes at the token level are *positional attributes*:
* word: the "surface form" of the token
* xpos: Part-of-Speech-Annotation #1 (Stuttgart-Tübingen Tagset / STTS)
* upos: Part-of-Speech-Annotation #2 (Universal Dependencies Tagset)
* lemma: Lemmatized form of the token
```{r sample_ts, echo = FALSE, message = FALSE}
sample_ts <- partition("GERMAPARL2", protocol_date = "1949-09-15") %>%
polmineR::decode(
s_attributes = character(),
p_attributes = c("word", "xpos", "upos", "lemma"),
to = "data.table",
verbose = FALSE
) %>%
.[1:5]
knitr::kable(
sample_ts,
format = ifelse(knitr::is_html_output(), "html", "latex"),
booktabs = TRUE,
escape = TRUE,
caption = "A token stream representation of the GermaParl corpus") %>%
footnote(general = "Beginning of session 1/3 (1949-09-15)")
```
---
# Structural attributes
Aside from annotation on token level, sentences, paragraps and named entities are linguistically annotated. These can comprise more than one token and thus are annotated as *structural attributes*. The relevant structural attributes are:
* `p` - regions defining paragraphs (no distinct values).
* `p_type` distinguishes between speeches and other kinds of text such as stage comments or interjections
* `s` - regions defining sentences (no distinct values).
* `ne_type` describes the type of a region which has been annotated as a named entity. These can be "ORGANIZATION", "LOCATION", "MISC" or "PERSON".
The `p` and `s` annotation layers can be used to subset the corpus based on sentences and paragraphs or to set boundaries for analyses like `kwic()`.
---
# Structural attributes of GermaParl
```{r}
corpus("GERMAPARL2") %>% tree_structure() # dev version of polmineR required
```
---
# CQP: The corpus query processor
`polmineR` exposes the powerful CQP query language to query CWB corpora. Using the CQP syntax, the linguistic annotation layers are accessible. The CWB development team provides a [comprehensive handbook](https://cwb.sourceforge.io/files/CQP_Manual.pdf).
To be distinguished:
- CQP command line tool
- RcppCWB
- polmineR
---
# CQP and regular expressions
```{r kwic_regular}
corpus("GERMAPARL2") |>
kwic(
query = '"[Dd]emokratisch.*"',
cqp = TRUE,
s_attribute = "protocol_date"
)
```
---
# Positional attributes
... the CQP syntax allows to specify the positional attribute:
```{r kwic_cqp_1}
corpus("GERMAPARL2") |>
kwic(query = '[word = "[dD]emokratisch.*"]', s_attribute = "protocol_date")
```
---
# Positional attributes
Note the single quotation marks around the entire query as well as the squared brackets.
This introduces some flexibility to the query. For example, the part of speech tag can be used as additional information.
While the German word for "Fliege" is ambivalent (just like the English word) where it can describe an animal or a specific form of the verb "to fly", the CQP syntax allows to differentiate between the two.
```{r kwic_comp_1}
corpus("GERMAPARL2") |>
kwic(
query = '[word = "Fliege" & xpos = "V.*"]',
s_attribute = c("protocol_lp", "protocol_no")
)
```
```{r kwic_comp_2}
corpus("GERMAPARL2") |>
kwic(
query = '[word = "Fliege" & xpos = "N.*"]',
s_attribute = c("protocol_lp", "protocol_no")
)
```
---
# CQP and advanced queries
CQP queries can be more complex and work for most `polmineR` methods.
```{r count_1}
corpus("GERMAPARL2") |>
polmineR::count(
query = '"freiheitlich" "-" [lemma = "demokratisch"] "Grundordnung"',
breakdown = TRUE
) |>
as.data.frame()
```
The manual does provide some even more advanced queries.
```{r count_2, eval = FALSE}
corpus("GERMAPARL2") |>
subset(protocol_lp %in% 5:6) |>
count(
query = '(?longest)[xpos = "V.*"] []{0,5} [word = "Transformation"]',
breakdown = TRUE
)
```
---
# CQP and structural attributes
Linguistic annotations can be used in structural attributes as well.
```{r count_ner, eval = FALSE}
corpus("GERMAPARL2") |>
polmineR::count(
query = '/region[ne_type,a]::a.ne_type="ORGANIZATION"',
cqp = TRUE,
breakdown = TRUE
)
```
---
# Counting Named Entities
```{r, eval = FALSE}
x <- corpus("GERMAPARL2") %>%
subset(ne_type = "PERSON") %>%
split(s_attribute = "ne_type") %>%
get_token_stream(p_attribute = "word") %>%
table()
```
---
# A time-series plot
```{r, message = FALSE}
library(dplyr)
library(xts)
library(lubridate) # we need lubridate::floor_date()
look_up <- '"freiheitlich" "-" [lemma = "demokratisch"] "Grundordnung"'
corpus("GERMAPARL2") %>%
dispersion(query = look_up, cqp = TRUE, s_attribute = "protocol_date") %>%
as_tibble() %>%
mutate(date = as.Date(protocol_date)) %>%
mutate(year = floor_date(date, unit = "year")) %>%
filter(!is.na(year)) %>%
select(count, year) %>%
group_by(year) %>%
summarise(sum = sum(count)) %>%
as.xts(x = .$sum, order.by = .$year) %>%
plot(
main = "'FDGO' in Bundestag Plenary Debates (N/quarter)",
xlab = "total per year",
ylim = c(0, 100), col = "darkblue", main.timespan = FALSE
)
```