---
title: "How to add a sentence annotation to an indexed corpus"
author: "Andreas Blätte (andreas.blaette@uni-due.de)"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{How to add a sentence annotation to an indexed corpus}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options:
  chunk_output_type: console
---
## About
Adding an annotation of sentences as a structural attribute to an existing corpus is a frequent scenario. This vignette offers a basic recipe for a corpus that already includes a part-of-speech annotation. The GermaParlSample corpus serves as an example.
## Getting started
In addition to the **cwbtools** package, we use the [RcppCWB](https://CRAN.R-project.org/package=RcppCWB) package to decode the p-attribute 'pos'. The same could be achieved using the higher-level `get_token_stream()` method of the [polmineR](https://CRAN.R-project.org/package=polmineR) package, but we want to avoid adding a further dependency to this package.
```{r load_libs, eval = TRUE}
library(cwbtools)
library(RcppCWB)
```
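For the purposes of this vignette, we work with a temporary copy of the corpus we wish to augment. We therefore create a temporary registry directory and a temporary corpus directory, and point the environment variable CORPUS_REGISTRY to the temporary registry. The previous value of the variable is saved so that it can be restored at the end.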
```{r tmp_dirs_and_data, eval = TRUE}
registry_dir_tmp <- fs::path(tempdir(), "registry_dir_tmp")
corpus_dir_tmp <- fs::path(tempdir(), "corpus_dir_tmp")
dir.create(path = registry_dir_tmp)
dir.create(path = corpus_dir_tmp)
regdir_envvar <- Sys.getenv("CORPUS_REGISTRY")
Sys.setenv(CORPUS_REGISTRY = registry_dir_tmp)
```
The GermaParlSample corpus can be downloaded from [Zenodo](https://doi.org/10.5281/zenodo.3823245) as follows.
```{r get_germaparlsample, eval = TRUE}
is_corpus_available <- cwbtools::corpus_install(
  doi = "10.5281/zenodo.3823245",
  registry_dir = registry_dir_tmp, corpus_dir = corpus_dir_tmp,
  verbose = FALSE
)
```
```{r}
list.files(file.path(corpus_dir_tmp, "germaparlsample"))
```
## Recipe
We generate the data for the sentence annotation from the part-of-speech annotation that is already present.
First, we decode the p-attribute "pos".
```{r decode_pos, eval = is_corpus_available}
germaparl_size <- cl_attribute_size(
  corpus = "GERMAPARLSAMPLE",
  attribute = "word", attribute_type = "p"
)
cpos_vec <- seq.int(from = 0L, to = germaparl_size - 1L)
ids <- cl_cpos2id(corpus = "GERMAPARLSAMPLE", p_attribute = "pos", cpos = cpos_vec)
pos <- cl_id2str(corpus = "GERMAPARLSAMPLE", p_attribute = "pos", id = ids)
```
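The two-step decoding above (corpus positions to ids, ids to strings) can be pictured with plain R vectors. The following is only a toy sketch of the idea, not how RcppCWB works internally: the lexicon, the token stream and all `*_toy` names are made up for illustration.

```r
# Toy lexicon of a p-attribute: each distinct value has a zero-based id
lexicon_toy <- c("ART", "NN", "VVFIN", "$.")

# Toy token stream: the id stored at corpus positions 0 to 6
ids_by_cpos_toy <- c(0L, 1L, 2L, 3L, 1L, 2L, 3L)

# Step 1 (analogous to cl_cpos2id): corpus positions -> ids
cpos_toy <- 0:6
ids_toy <- ids_by_cpos_toy[cpos_toy + 1L] # R vectors are 1-based

# Step 2 (analogous to cl_id2str): ids -> strings
pos_toy <- lexicon_toy[ids_toy + 1L]
pos_toy
```

Corpus positions and ids are zero-based, whereas R vectors are one-based, which is why the sketch adds 1 when indexing.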
The tag for sentence-final punctuation in the Stuttgart-Tübingen Tagset (STTS) is `$.`. From this information, we can easily generate a region matrix with the start and end corpus positions of each sentence.
```{r sentence_regions, eval = is_corpus_available}
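# grep() returns 1-based indices, i.e. the corpus position of each
# sentence-final token plus one
sentence_end <- grep("\\$\\.", pos)
# Assign every corpus position to the sentence it belongs to
sentence_factor <- cut(
  x = cpos_vec, breaks = c(0L, sentence_end),
  include.lowest = TRUE, right = FALSE
)
sentences_cpos <- unname(split(x = cpos_vec, f = sentence_factor))
# One row per sentence: first and last corpus position
region_matrix <- do.call(
  rbind,
  lapply(sentences_cpos, function(cpos) c(cpos[1L], cpos[length(cpos)]))
)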
```
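To see how sentence-final tags are turned into regions, the same steps can be run on a toy part-of-speech vector that does not require a corpus. This is a minimal sketch; the seven-token vector and the `*_toy` names are invented for illustration.

```r
# Toy part-of-speech vector for seven tokens (corpus positions 0 to 6)
pos_toy <- c("ART", "NN", "VVFIN", "$.", "PPER", "VVFIN", "$.")
cpos_toy <- seq.int(from = 0L, to = length(pos_toy) - 1L)

# 1-based indices of the sentence-final tokens: 4 and 7
ends_toy <- grep("\\$\\.", pos_toy)

# Left-closed intervals [0, 4) and [4, 7] assign each cpos to a sentence
factor_toy <- cut(
  x = cpos_toy, breaks = c(0L, ends_toy),
  include.lowest = TRUE, right = FALSE
)

# First and last corpus position of every sentence
regions_toy <- do.call(
  rbind,
  lapply(
    unname(split(x = cpos_toy, f = factor_toy)),
    function(cpos) c(cpos[1L], cpos[length(cpos)])
  )
)
regions_toy
```

The first sentence spans corpus positions 0 to 3, the second 4 to 6. Because `grep()` returns one-based indices while corpus positions are zero-based, each break already points at the first token of the following sentence.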
This is what the first rows of the region matrix for the corpus look like.
```{r show_matrix, eval = is_corpus_available}
head(region_matrix)
```
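Before writing a region matrix to disk, it may be worth checking that the regions are well-formed. The helper below is a hypothetical sketch (`check_region_matrix()` is not part of cwbtools or RcppCWB) that tests three assumptions: regions lie within the corpus, start positions do not exceed end positions, and regions do not overlap.

```r
# Hypothetical helper, not part of cwbtools: basic sanity checks for a
# two-column matrix of zero-based start and end corpus positions
check_region_matrix <- function(m, corpus_size) {
  stopifnot(is.matrix(m), ncol(m) == 2L, nrow(m) >= 1L)
  starts <- m[, 1L]
  ends <- m[, 2L]
  c(
    within_corpus = all(starts >= 0L) && all(ends <= corpus_size - 1L),
    ordered = all(starts <= ends),
    non_overlapping = nrow(m) == 1L ||
      all(starts[-1L] > ends[-length(ends)])
  )
}

# Toy region matrices for a corpus of seven tokens
good_toy <- rbind(c(0L, 3L), c(4L, 6L))
bad_toy <- rbind(c(0L, 4L), c(4L, 6L)) # cpos 4 belongs to two regions

check_good <- check_region_matrix(good_toy, corpus_size = 7L)
check_bad <- check_region_matrix(bad_toy, corpus_size = 7L)
check_good
check_bad
```

For the corpus used in this vignette, the equivalent call would pass `region_matrix` and `germaparl_size`.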
And this is how the new annotation layer is written back to the corpus.
```{r write_annotation, eval = is_corpus_available}
s_attribute_encode(
  values = as.character(seq.int(from = 0L, to = nrow(region_matrix) - 1L)),
  data_dir = registry_file_parse(corpus = "GERMAPARLSAMPLE")[["home"]],
  s_attribute = "s",
  corpus = "GERMAPARLSAMPLE",
  region_matrix = region_matrix,
  method = "R",
  registry_dir = Sys.getenv("CORPUS_REGISTRY"),
  encoding = registry_file_parse(corpus = "GERMAPARLSAMPLE")[["properties"]][["charset"]],
  delete = TRUE,
  verbose = TRUE
)
```
## Checking the result
To see whether everything has worked, we get the left and right boundaries of the sentence that contains corpus position 60 and decode the tokens within these boundaries.
```{r check_annotation, eval = is_corpus_available}
left <- cl_cpos2lbound("GERMAPARLSAMPLE", cpos = 60, s_attribute = "s")
right <- cl_cpos2rbound("GERMAPARLSAMPLE", cpos = 60, s_attribute = "s")
ids <- cl_cpos2id("GERMAPARLSAMPLE", cpos = left:right, p_attribute = "word")
cl_id2str("GERMAPARLSAMPLE", p_attribute = "word", id = ids)
```
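Conceptually, looking up the boundaries of the region that contains a corpus position amounts to a search in the region matrix. The following toy sketch illustrates the idea with base R only; it is not how RcppCWB implements the lookup, and the matrix and the `*_toy` names are invented.

```r
# Toy region matrix: two sentences covering corpus positions 0 to 6
regions_toy <- rbind(c(0L, 3L), c(4L, 6L))
cpos_query <- 5L

# Row of the last region that starts at or before the queried position
row_toy <- findInterval(cpos_query, regions_toy[, 1L])

left_toy <- regions_toy[row_toy, 1L]
right_toy <- regions_toy[row_toy, 2L]

# The position is only inside the region if it does not exceed the end
stopifnot(cpos_query <= right_toy)
c(left = left_toy, right = right_toy)
```

Corpus position 5 falls into the second region, so the boundaries are 4 and 6.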
## Cleaning up
As a matter of housekeeping, we remove the temporary directories and restore the value of the environment variable CORPUS_REGISTRY.
```{r housekeeping, eval = is_corpus_available}
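# recursive = TRUE is required, otherwise directories are not deleted
unlink(corpus_dir_tmp, recursive = TRUE)
unlink(registry_dir_tmp, recursive = TRUE)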
Sys.setenv(CORPUS_REGISTRY = regdir_envvar)
```
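If an environment variable was not set at all beforehand, `Sys.getenv()` returns an empty string by default, and restoring that value leaves the variable defined but empty. A minimal sketch of a restore that distinguishes "unset" from "empty" is shown below. The variable name `CWB_DEMO_REGISTRY` is hypothetical and chosen so that the example does not touch CORPUS_REGISTRY.

```r
var_toy <- "CWB_DEMO_REGISTRY" # hypothetical variable name for the demo
invisible(Sys.unsetenv(var_toy)) # make sure the variable starts out unset

# unset = NA makes an unset variable distinguishable from an empty one
old_toy <- Sys.getenv(var_toy, unset = NA)

# Set the variable temporarily (name passed programmatically)
args_toy <- list(tempdir())
names(args_toy) <- var_toy
do.call(Sys.setenv, args_toy)
was_set_toy <- nzchar(Sys.getenv(var_toy))

# Restore: unset again if there was no previous value
if (is.na(old_toy)) {
  invisible(Sys.unsetenv(var_toy))
} else {
  names(old_toy) <- var_toy
  do.call(Sys.setenv, as.list(old_toy))
}
is.na(Sys.getenv(var_toy, unset = NA))
```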
## Next steps
Using the part-of-speech annotation is a basic approach to obtaining the data for annotating sentences. An alternative would be to use the NLP annotation machinery of an integrated tool such as Stanford CoreNLP or OpenNLP. But that is a different story.