Commit: README augmented

Andreas Blätte authored and committed Jul 9, 2023

1 parent b54f663, commit 1f24cce
Showing 3 changed files with 34 additions and 5 deletions.
13 changes: 13 additions & 0 deletions README.Rmd
@@ -31,3 +31,16 @@ Exploration Workbench" presented at LREC 2014
The main function is `detect_duplicates()`.


## Related work

Near-duplicate detection is a standard NLP task. A wide range of algorithms
addresses it, and implementations are available in the programming languages
commonly used for NLP.

In the R context, the [textreuse](https://CRAN.R-project.org/package=textreuse)
package is the point of reference for duplicate detection. The *duplicates*
package targets a different scenario: large corpora that have already been
indexed with the Corpus Workbench (CWB). For such corpora, the hashing step
that is a selling point of textreuse has already been performed during
indexing, so the *duplicates* package does not replicate the requirements for
tokenizing and hashing the data.
16 changes: 16 additions & 0 deletions README.md
@@ -18,3 +18,19 @@ The package implements a procedure described by Fritz Kliche, André
Blessing, Ulrich Heid and Jonathan Sonntag in the paper “The eIdentity
Text Exploration Workbench” presented at LREC 2014 (see ). The main
function is `detect_duplicates()`.
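
A minimal, purely illustrative call could look as follows; everything beyond
the function name is an assumption, since the signature of
`detect_duplicates()` is not shown in this diff:

```r
library(duplicates)

# Hypothetical sketch: `x` and `threshold` are illustrative assumptions,
# not the documented interface of detect_duplicates().
dupl <- detect_duplicates(x = my_documents, threshold = 0.9)
```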

## Related work

Near-duplicate detection is a standard NLP task. A wide range of
algorithms addresses it, and implementations are available in the
programming languages commonly used for NLP.

In the R context, the
[textreuse](https://CRAN.R-project.org/package=textreuse) package is the
point of reference for duplicate detection. The *duplicates* package
targets a different scenario: large corpora that have already been
indexed with the Corpus Workbench (CWB). For such corpora, the hashing
step that is a selling point of textreuse has already been performed
during indexing, so the *duplicates* package does not replicate the
requirements for tokenizing and hashing the data.
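
For comparison, here is a minimal sketch of a textreuse workflow, assuming a
directory of plain-text files; the directory path and parameter values are
illustrative. It makes visible the tokenizing and hashing work that a
CWB-indexed corpus already has behind it:

```r
library(textreuse)

# textreuse tokenizes and minhashes the documents itself, the step that
# is already done for a CWB-indexed corpus.
minhash <- minhash_generator(n = 240, seed = 3552)
corpus <- TextReuseCorpus(
  dir = "texts/",                 # assumed directory of plain-text files
  tokenizer = tokenize_ngrams, n = 5,
  minhash_func = minhash
)
buckets <- lsh(corpus, bands = 80)       # locality-sensitive hashing
candidates <- lsh_candidates(buckets)    # candidate near-duplicate pairs
```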
10 changes: 5 additions & 5 deletions vignettes/vignette.Rmd
@@ -11,15 +11,15 @@ editor_options:
chunk_output_type: console
---

-```{r}
+```{r load_libraries}
library(polmineR)
library(duplicates)
```


## Prune vocabulary

-```{r}
+```{r prune_vocab}
use(pkg = "duplicates")
charcount <- corpus("REUTERS2") %>%
@@ -50,7 +50,7 @@ dupl <- docsimil(
## Run duplicate detection


-```{r}
+```{r duplicate_detection}
x <- corpus("REUTERS2") |>
split(s_attribute = "doc_id")
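# Note (polmineR semantics): split() yields one subcorpus per value of the
# s-attribute "doc_id", i.e. one object per document; docsimil() below then
# computes pairwise similarities between these documents.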
@@ -66,7 +66,7 @@ dupl <- docsimil(

## Write to corpus

-```{r}
+```{r get_annotation_data}
groups <- docgroups(dupl)
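# Assumption based on the names in this diff: docgroups() collapses the
# pairwise similarities in `dupl` into groups of near-duplicate documents,
# which duplicates_as_annotation_data() then turns into annotation data.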
annodata <- duplicates_as_annotation_data(
@@ -77,7 +77,7 @@ annodata <- duplicates_as_annotation_data(
```


-```{r}
+```{r encode}
library(cwbtools)
regdata <- cwbtools::registry_file_parse(corpus = "REUTERS2", registry = registry())
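# cwbtools::registry_file_parse() reads the registry entry of the REUTERS2
# corpus; presumably the parsed registry data is then amended and written
# back to declare the structural attribute carrying the duplicate annotation.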
