---
title: "How to add a sentence annotation to an indexed corpus"
author: "Andreas Blätte (andreas.blaette@uni-due.de)"
date: "`r Sys.Date()`"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{How to add a sentence annotation to an indexed corpus}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
editor_options: 
  chunk_output_type: console
---

## About

Adding a sentence annotation as a structural attribute to an existing corpus is a frequent scenario. This vignette offers a basic recipe for a corpus that already includes a part-of-speech annotation. The GermaParlSample corpus serves as an example.


## Getting started

In addition to the **cwbtools** package, we use the [RcppCWB](https://CRAN.R-project.org/package=RcppCWB) package to decode the p-attribute 'pos'. The same could be achieved with the higher-level `get_token_stream()` method of the [polmineR](https://CRAN.R-project.org/package=polmineR) package, but we want to avoid adding another dependency to this package.

```{r load_libs, eval = TRUE}
library(cwbtools)
library(RcppCWB)
```

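For the purposes of this vignette, we work with a temporary copy of the corpus we wish to augment. We therefore create a temporary registry directory and a temporary corpus directory, and point the environment variable CORPUS_REGISTRY to the temporary registry. The previous value is saved so that it can be restored at the end.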

```{r tmp_dirs_and_data, eval = TRUE}
registry_dir_tmp <- fs::path(tempdir(), "registry_dir_tmp")
corpus_dir_tmp <- fs::path(tempdir(), "corpus_dir_tmp")

dir.create(path = registry_dir_tmp)
dir.create(path = corpus_dir_tmp)

regdir_envvar <- Sys.getenv("CORPUS_REGISTRY")
Sys.setenv(CORPUS_REGISTRY = registry_dir_tmp)
```

The GermaParlSample corpus can be downloaded from [Zenodo](https://doi.org/10.5281/zenodo.3823245) as follows.

```{r get_germaparlsample, eval = TRUE}
is_corpus_available <- cwbtools::corpus_install(
  doi = "10.5281/zenodo.3823245",
  registry_dir = registry_dir_tmp, corpus_dir = corpus_dir_tmp,
  verbose = FALSE
)
```

```{r}
list.files(file.path(corpus_dir_tmp, "germaparlsample"))
```


## Recipe

We generate the data for the sentence annotation from the part-of-speech annotation that is already present.

First, we decode the p-attribute "pos".

```{r decode_pos, eval = is_corpus_available}
germaparl_size <- cl_attribute_size(
  corpus = "GERMAPARLSAMPLE",
  attribute = "word", attribute_type = "p"
)
cpos_vec <- seq.int(from = 0L, to = germaparl_size - 1L)
ids <- cl_cpos2id(corpus = "GERMAPARLSAMPLE", p_attribute = "pos", cpos = cpos_vec)
pos <- cl_id2str(corpus = "GERMAPARLSAMPLE", p_attribute = "pos", id = ids)
```

In the Stuttgart-Tübingen Tagset (STTS), the part-of-speech tag for sentence-final punctuation is "$.". From this information, we can easily generate a region matrix with the start and end corpus positions of sentences.

```{r sentence_regions, eval = is_corpus_available}
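# grep() returns 1-based indices into `pos`, whereas corpus positions are
# 0-based: index i of a "$." token equals the cpos of the token following it
sentence_end <- grep("\\$\\.", pos)
# Left-closed intervals [start, next start) keep each "$." in its own sentence
sentence_factor <- cut(
  x = cpos_vec, breaks = c(0L, sentence_end),
  include.lowest = TRUE, right = FALSE
)
# One vector of corpus positions per sentence
sentences_cpos <- unname(split(x = cpos_vec, f = sentence_factor))
# First and last corpus position of each sentence, as a two-column matrix
region_matrix <- do.call(
  rbind,
  lapply(sentences_cpos, function(cpos) c(cpos[1L], cpos[length(cpos)]))
)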
```
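To make the index arithmetic easier to follow, here is the same approach applied to a small vector of made-up tags. The names ending in `_toy` and the tag sequence are assumptions for illustration only and are not taken from the corpus.

```r
# Made-up STTS tags for two short sentences (not from GermaParlSample)
pos_toy <- c("ART", "NN", "VVFIN", "$.", "PPER", "VVFIN", "ADV", "$.")
cpos_toy <- seq.int(from = 0L, to = length(pos_toy) - 1L)

# 1-based indices of the sentence-final tags: here 4 and 8
sentence_end_toy <- grep("\\$\\.", pos_toy)

# Intervals [0, 4) and [4, 8] over the 0-based corpus positions
sentence_factor_toy <- cut(
  x = cpos_toy, breaks = c(0L, sentence_end_toy),
  include.lowest = TRUE, right = FALSE
)
sentences_toy <- unname(split(x = cpos_toy, f = sentence_factor_toy))

# Two rows: corpus positions 0 to 3 and 4 to 7
region_matrix_toy <- do.call(
  rbind,
  lapply(sentences_toy, function(cpos) c(cpos[1L], cpos[length(cpos)]))
)
region_matrix_toy
```

Because the breaks are 1-based indices while the values being cut are 0-based positions, each left-closed interval ends exactly on a "$." token.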

Let us look at the first rows of the region matrix.

```{r show_matrix, eval = is_corpus_available}
head(region_matrix)
```
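Before writing regions to disk, it can be worth checking that they are well-formed. The following is a minimal sketch of such a check on a made-up matrix. `regions_are_valid()` is a hypothetical helper and not part of cwbtools or RcppCWB.

```r
# Hypothetical helper: TRUE if every region starts no later than it ends and
# each region starts after the previous one has ended (sorted, no overlap)
regions_are_valid <- function(m) {
  starts <- m[, 1L]
  ends <- m[, 2L]
  n <- nrow(m)
  all(starts <= ends) && (n < 2L || all(starts[-1L] > ends[-n]))
}

# Made-up region matrices: the first is fine, the second has overlapping rows
ok_toy <- matrix(c(0L, 3L, 4L, 7L), ncol = 2L, byrow = TRUE)
bad_toy <- matrix(c(0L, 5L, 4L, 7L), ncol = 2L, byrow = TRUE)

regions_are_valid(ok_toy)
regions_are_valid(bad_toy)
```

The same function could be applied to `region_matrix` before encoding the s-attribute.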

And this is how the new annotation layer is written back to the corpus.

```{r write_annotation, eval = is_corpus_available}
s_attribute_encode(
  values = as.character(seq.int(from = 0L, to = nrow(region_matrix) - 1L)),
  data_dir = registry_file_parse(corpus = "GERMAPARLSAMPLE")[["home"]],
  s_attribute = "s",
  corpus = "GERMAPARLSAMPLE",
  region_matrix = region_matrix,
  method = "R",
  registry_dir = Sys.getenv("CORPUS_REGISTRY"),
  encoding = registry_file_parse(corpus = "GERMAPARLSAMPLE")[["properties"]][["charset"]],
  delete = TRUE,
  verbose = TRUE
)
```


## Checking the result

To see whether everything has worked, we get the left and right boundaries of the sentence that contains corpus position 60 and decode the tokens within these boundaries.

```{r check_annotation, eval = is_corpus_available}
left <- cl_cpos2lbound("GERMAPARLSAMPLE", cpos = 60, s_attribute = "s")
right <- cl_cpos2rbound("GERMAPARLSAMPLE", cpos = 60, s_attribute = "s")
ids <- cl_cpos2id("GERMAPARLSAMPLE", cpos = left:right, p_attribute = "word")
cl_id2str("GERMAPARLSAMPLE", p_attribute = "word", id = ids)
```
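Conceptually, the two boundary lookups find the region that encloses a given corpus position. The sketch below mimics this in plain R on a made-up region matrix. `find_region()` is a hypothetical helper for illustration and makes no claim about how RcppCWB performs the lookup internally.

```r
# Made-up region matrix with two sentences: cpos 0 to 3 and cpos 4 to 7
regions_toy <- matrix(c(0L, 3L, 4L, 7L), ncol = 2L, byrow = TRUE)

# Hypothetical helper: return c(left, right) of the region containing `cpos`,
# or NULL if no region covers it. Assumes regions are sorted by start position.
find_region <- function(regions, cpos) {
  # Index of the last region whose start is <= cpos (0 if there is none)
  i <- findInterval(cpos, regions[, 1L])
  if (i == 0L || cpos > regions[i, 2L]) return(NULL)
  regions[i, ]
}

find_region(regions_toy, cpos = 5L)
```

For corpus position 5, the sketch returns the boundaries 4 and 7 of the second made-up sentence.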


## Cleaning up

As a matter of housekeeping, we remove the temporary directories and restore the value of the environment variable CORPUS_REGISTRY.

```{r housekeeping, eval = is_corpus_available}
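# recursive = TRUE is required, otherwise unlink() does not delete directories
unlink(corpus_dir_tmp, recursive = TRUE)
unlink(registry_dir_tmp, recursive = TRUE)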

Sys.setenv(CORPUS_REGISTRY = regdir_envvar)
```
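`Sys.getenv()` returns an empty string for a variable that is not set, so restoring with `Sys.setenv()` leaves a previously unset variable defined but empty. If that distinction matters, a small helper can unset the variable instead. The sketch below uses a hypothetical function `restore_envvar()` and a made-up variable name.

```r
# Hypothetical helper: restore an environment variable to a saved value,
# unsetting it if the saved value is the empty string
restore_envvar <- function(name, old_value) {
  if (nzchar(old_value)) {
    # Builds the call Sys.setenv(<name> = old_value) from a name held in a string
    do.call(Sys.setenv, stats::setNames(list(old_value), name))
  } else {
    Sys.unsetenv(name)
  }
}

# Usage with a made-up variable that is assumed not to be set beforehand
old <- Sys.getenv("CWB_DEMO_REGISTRY")
Sys.setenv(CWB_DEMO_REGISTRY = "/tmp/demo_registry")
restore_envvar("CWB_DEMO_REGISTRY", old)
```

Note that this treats a variable that was set to an empty string the same as one that was not set at all.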

## Next steps

Using the part-of-speech annotation is a basic approach to obtaining the data for annotating sentences. An alternative would be to use the NLP annotation machinery of an integrated tool such as Stanford CoreNLP or OpenNLP. But that is a different story to be told.