added more comprehensive README and introduced cli

PolMine · Apr 22, 2023 · 31aa59b · 31aa59b
1 parent 60543c9
commit 31aa59b
Show file tree

Hide file tree

Showing 11 changed files with 629 additions and 200 deletions.
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -1,14 +1,15 @@
 Package: LinkTools
 Type: Package
 Title: LinkTools
-Version: 0.0.1.9004
-Date: 2023-04-19
+Version: 0.0.1.9005
+Date: 2023-04-22
 Author: Christoph Leonhardt
-Maintainer: Christoph Leonhardt <someemail@someemailasdasdasdadsadasd.de>
-Description: This package facilitates the linkage of datasets via shared unique identifiers. Four steps are integrated into this package: a) the preparation of datasets which should be linked, i.e. the transformation into a comparable format and the assignment of shared unique identifiers, b) the merge of datasets based on these identifiers, c) the encoding or enrichment of the data with three output formats (data.table, XML or CWB). In addition, d), the package includes a wrapper for the Named Entity Linking of textual data based on DBPedia Spotlight. 
+Maintainer: Christoph Leonhardt <christoph.leonhardt@uni-due.de>
+Description: This package facilitates the linkage of datasets. Once finished, four steps should be integrated into this package: a) the preparation of datasets which should be linked, i.e. the transformation into a comparable format and the assignment of new information, in particular shared unique identifiers, b) the merge of datasets based on these identifiers, c) the encoding or enrichment of the data with different output formats (data.table, XML or CWB). In addition, d), the package includes a wrapper for the Named Entity Linking of textual data based on DBpedia Spotlight.
 Depends:
   R (>= 3.5.0)
 Imports:
+    cli,
     data.table,
     cwbtools,
     polmineR,

diff --git a/LinkTools_interactive_matching_gui_README.png b/LinkTools_interactive_matching_gui_README.png
diff --git a/NAMESPACE b/NAMESPACE
@@ -4,6 +4,12 @@ export(LTDataset)
 importFrom(R6,R6Class)
 importFrom(RcppCWB,cl_cpos2struc)
 importFrom(RcppCWB,cl_struc2str)
+importFrom(cli,cli_abort)
+importFrom(cli,cli_alert_info)
+importFrom(cli,cli_alert_success)
+importFrom(cli,cli_alert_warning)
+importFrom(cli,cli_progress_message)
+importFrom(cli,cli_progress_update)
 importFrom(cwbtools,registry_file_parse)
 importFrom(cwbtools,s_attribute_encode)
 importFrom(data.table,":=")

diff --git a/NEWS.md b/NEWS.md
@@ -1,3 +1,9 @@
+V0.0.1.9005
+* replaced `\code{}` tags in the documentation
+* reduced verbosity of intermediate steps
+* introduced a check if the external dataset contains NA values in significant columns before fuzzy matching
+* introduced messages from the `cli` package
+
 [2023-04-19]
 * addressed a quite comprehensive issue in `external_attribute_to_region_matrix()` that potentially obscured speakers which were not matched, making them unavailable for both the fuzzy matching and manual inspection (issue #14)
 * introduced tests

diff --git a/R/LTDataset.R b/R/LTDataset.R
diff --git a/R/LinkTools.R b/R/LinkTools.R
@@ -1,18 +1,18 @@
 #' R-package 'LinkTools'
-#' 
-#' Tools for linking data sets via shared unique identifiers.
-#' 
-#' Four steps are integrated into this package:
-#' 
+#'
+#' Tools for linking data sets.
+#'
+#' Once finalized, four steps should be integrated into this package:
+#'
 #' a) the preparation of data sets which should be linked, i.e. the
-#' transformation into a comparable format and the assignment of shared unique
-#' identifiers,
-#' 
+#' transformation into a comparable format and the assignment of new
+#' information, in particular shared unique identifiers,
+#'
 #' b) the merge of data sets based on these identifiers,
-#' 
-#' c) the encoding or enrichment of the data with three output formats
+#'
+#' c) the encoding or enrichment of the data with different output formats
 #' (data.table, XML or CWB).
-#' 
+#'
 #' d) the package includes a wrapper for the Named Entity Linking of textual
 #' data based on DBpedia Spotlight.
 #' @keywords package

diff --git a/README.Rmd b/README.Rmd
@@ -14,22 +14,194 @@ output: github_document
 
 ## About LinkTools
 
+### Motivation
 
-## Dependencies
+`LinkTools` facilitates the linkage of datasets by providing accessible approaches to two tasks: Record Linkage and Entity Linking. It is developed in the measure [Linking Textual Data](https://www.konsortswd.de/en/konsortswd/the-consortium/services/linking-textual-data/) in [KonsortSWD](https://www.konsortswd.de/en/) within the [National Research Data Infrastructure Germany](https://www.nfdi.de/?lang=en) (NFDI) and has the goal to make linking textual data more accessible.
 
-* running the Vignette requires the availability of the GermaParl CWB corpus
+### Purpose
 
+Once finalized, four steps should be integrated into this R package:
 
-## Installation
+* the preparation of datasets which should be linked, i.e. the transformation into a comparable format and the assignment of new information, in particular shared unique identifiers
+* the merge of datasets based on these identifiers
+* the encoding or enrichment of the data with different output formats (data.table, XML or CWB)
+* the package includes a wrapper for the Named Entity Linking of textual data based on DBpedia Spotlight
 
-```{r install_from_github, eval = FALSE}
-remotes::install_github("PolMine/LinkTools")
+A major focus of this package is the provision of an intuitive workflow with transparency and robustness. Documentation, validity and user experience as well as training and education are at the heart of this development. Consequently, the processes of linkage and linking should be designed as approachable as possible - using GUIs and feedback - as well as robust and repeatable.
+
+## Current Status
+
+The Record Linkage functionality is maturing and the current state is documented in the package vignette. Record Linkage with CWB corpora is currently implemented. Entity Linking using [DBpedia Spotlight](https://www.dbpedia-spotlight.org/) as the backend is currently in development.
+
+## Core functionality of LinkTools
+
+```{r load_LinkTools_package}
+library(LinkTools)
 ```
 
-```{r install_devversion_from_github, eval = FALSE}
-remotes::install_github("PolMine/LinkTools", ref = "dev")
+### Record Linkage
+
+The `LTDataset` class, an R6 class, is the main driver of the Record Linkage process within the `LinkTools` package. Aside from the wrangling of the textual data input, its core functionality includes the merge of metadata found in text corpora and external datasets. To this end, first, a strict merge of exact matches is performed. Thereafter, if observations could not be joined directly, a fuzzy matching approach is possible in which `LinkTools` suggests matches based on different measures of string distance. In this stage, the manual inspection of these suggestions and the addition of missing values is possible.
+
+The vignette shows this for larger data and a real external dataset. In the following a short artificial example is provided. It is assumed that the following two resources should be linked:
+
+### Text corpus
+
+```{r load_polmineR_and_germaparlmini}
+library(polmineR)
+use("polmineR")
 ```
 
-## Current Status
+GermaParlMini is a sample corpus of the larger GermaParl corpus of parliamentary debates. It is provided by the `polmineR` R package. It contains a number of metadata such as the speaker name and the party of a speaker. To show the capabilities of the tool, a small sample of this dataset is used.
+
+```{r subset_germaparlmini}
+germaparlmini_session <- polmineR::corpus("GERMAPARLMINI") |>
+  polmineR::subset(date == "2009-11-12")
+```
+
+The metadata used for linking looks like the following:
+
+```{r retrieve_attributes, echo = FALSE}
+speaker_s_attributes <- germaparlmini_session |>
+  polmineR::s_attributes(c("speaker", "party")) |>
+  data.table::setorder(speaker)
 
-* Currently implemented: Record Linkage with CWB corpora
+speaker_s_attributes |>
+  knitr::kable(format = "markdown")
+```
+
+### External Data
+
+To show the (fuzzy) matching possibilities of the package, an artificial dataset is created on the spot. It is created by modifying the speaker data found in the textual data by introducing some deviation. In consequence, it contains most of the speakers in the textual data shown above as well as the same metadata plus a variable called "ID" which represents the additional information we want to add to the textual data. To showcase the fuzzy matching, this artificial dataset also contains some typos in the names of the speakers as well as a differently named column for the speaker names.
+
+For a real dataset, please see the package vignette.
+
+```{r make_artificial_dataset, echo = FALSE}
+set.seed(24)
+artificial_id_data <- data.table::copy(speaker_s_attributes)
+
+# to show what happens if not all observations can be matched, remove one random
+# row.
+
+row_to_remove_idx <- sample(1:nrow(artificial_id_data), 1)
+artificial_id_data <- artificial_id_data[-row_to_remove_idx, ]
+
+# a single purpose function to mildly modify speaker names
+scramble_name <- function(x) {
+  to_scramble <- sample(c(1, 2), 1)
+  # only potentially modify about every second name
+  if (to_scramble == 1) {
+
+    # separate first and last elements in name column
+    name_elements_split <- unlist(strsplit(x, " "))
+    last_name <- name_elements_split[length(name_elements_split)]
+    first_name <- name_elements_split[1:(length(name_elements_split) - 1)]
+
+    # split into characters
+    name_as_chars <- unlist(strsplit(last_name, ""))
+
+    # to scramble area in the middle of the name, select two characters in the
+    # middle
+    min <- length(name_as_chars) - floor(length(name_as_chars) / 2)
+    max <- min + 1
+    middle_sector <- name_as_chars[min:max]
+    middle_sector_scrambled <- sample(middle_sector, length(middle_sector))
+
+    # get rest of the name. Short names might cause errors.
+    start_sector <- name_as_chars[1:(min-1)]
+    if (max == length(name_as_chars)) {
+      end_sector <- NULL
+    } else {
+      end_sector <- name_as_chars[(max+1):length(name_as_chars)]
+    }
+
+    # paste again
+    last_name_mod <- paste(c(start_sector,
+                             middle_sector_scrambled,
+                             end_sector),
+                           collapse = "")
+
+    name_scrambled <- paste(c(first_name, last_name_mod), collapse = " ")
+    return(name_scrambled)
+  } else {
+    return(x)
+  }
+}
+
+artificial_id_data[, speaker := scramble_name(speaker), by = seq_len(nrow(artificial_id_data))]
+artificial_id_data[, id := sprintf("ID_%s", 1:nrow(artificial_id_data))]
+data.table::setnames(artificial_id_data, old = "speaker", new = "name")
+
+artificial_id_data |>
+  knitr::kable(format = "markdown")
+```
+
+### Step 1: Instantiating the `LTDataset` class
+
+The `LTDataset` class is instantiated with the names of the two resources and some additional information. The package vignette shows some more arguments. Also see `?LTDataset` for more in-depth documentation of these arguments.
+
+```{r instantiate_LTDataset}
+LTD <- LTDataset$new(textual_data = germaparlmini_session,
+                     textual_data_type = "cwb",
+                     external_resource = artificial_id_data,
+                     attr_to_add = c("id_in_corpus" = "id"),
+                     match_by = c("speaker" = "name",
+                                  "party" = "party"),
+                     forced_encoding = "UTF-8")
+```
+
+### Step 2: Join strictly
+
+With the shared attributes provided in the `match_by` argument, the two datasets are joined.
+
+```{r join_data}
+LTD$join_textual_and_external_data()
+```
+
+After this join, a data.table object can be created. This can be used to inspect the results of the join directly. Each row in which the ID column is "NA" was not matched.
+
+```{r create_region_datatable}
+LTD$create_attribute_region_datatable()
+```
+
+```{r print_region_datatable}
+LTD$attrs_by_region_dt |>
+  unique() |>
+  data.table::setorder(speaker) |>
+  knitr::kable(format = "markdown")
+```
+
+
+### Step 3: Check and add missing values
+
+If there are cases in which a match was not possible - for example because of slight differences in the two datasets - manual annotation supported by fuzzy matching is possible.
+
+In an interactive process, realized with `shiny`, suggestions based on the fuzzy matching of attributes are provided and can be checked manually. Remaining missing attributes can be added to the metadata of the corpus.
+
+```{r check_missing, eval = FALSE}
+LTD$check_and_add_missing_values(modify = TRUE,
+                                 match_fuzzily_by = "speaker",
+                                 doc_dir = tempdir())
+```
+
+The screenshot below should serve as an illustration of this interactive element.
+
+![Interactive and manual control of suggestions in LinkTools v0.0.1.9005](LinkTools_interactive_matching_gui_README.png)
+
+A final functionality allows to write the new attribute back into the CWB corpus.
+
+## LinkTools and other Resources
+
+The purpose of `LinkTools` is to provide an integrated user experience for interested persons wanting to link textual data with other types of data. The package thus will handle different data types which are important in the realm of textual data, taking care of preprocessing and the enrichment of the data in a robust and transparent manner. It provides access to different existing approaches of linking and linkage, lowering barriers to make use of available resources.
+
+Focusing on textual data, the code base of the [PolMine project](https://polmine.github.io/) is at the core of `LinkTools`. In particular, the R package [polmineR](https://cran.r-project.org/package=polmineR) is used to manage large CWB corpora. In the future, a broader scope of input will be covered. The strict merging in the record linkage workflow is mainly realized with [data.table](https://cran.r-project.org/package=data.table). The fuzzy - or probabilistic - joins are mainly realized using the R packages [fuzzyjoin](https://cran.r-project.org/package=fuzzyjoin) and [stringdist](https://cran.r-project.org/package=stringdist).
+
+## Dependencies
+
+Running the Vignette requires the availability of the GermaParlMini CWB corpus and the `btmp` data package which provides the external data for linking.
+
+## Installation
+
+```{r install_from_github, eval = FALSE}
+remotes::install_github("PolMine/LinkTools")
+```