Commit: README augmented

Andreas Blätte authored and committed Jul 9, 2023

1 parent b54f663, commit 1f24cce
Showing 3 changed files with 34 additions and 5 deletions.
13 changes: 13 additions & 0 deletions README.Rmd
@@ -31,3 +31,16 @@ Exploration Workbench" presented at LREC 2014
The main function is `detect_duplicates()`.


## Related work

Near-duplicate detection is a standard NLP task. A wide range of algorithms
addresses it, and implementations are available in the programming languages
commonly used for NLP.

In the R context, the [textreuse](https://CRAN.R-project.org/package=textreuse)
package is the point of reference for duplicate detection. The *duplicates*
package targets a different scenario: large corpora that have already been
indexed with the Corpus Workbench (CWB). For such corpora, the hashing step
that is a selling point of textreuse has already been performed during
indexing, so the *duplicates* package does not replicate the requirements for
tokenizing and hashing the data.
16 changes: 16 additions & 0 deletions README.md
@@ -18,3 +18,19 @@ The package implements a procedure described by Fritz Kliche, André
Blessing, Ulrich Heid and Jonathan Sonntag in the paper “The eIdentity
Text Exploration Workbench” presented at LREC 2014 (see ). The main
function is `detect_duplicates()`.
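
A minimal, purely illustrative call could look as follows; everything beyond
the function name is an assumption, since the signature of
`detect_duplicates()` is not shown in this diff:

```r
library(duplicates)

# Hypothetical sketch: `x` and `threshold` are illustrative assumptions,
# not the documented interface of detect_duplicates().
dupl <- detect_duplicates(x = my_documents, threshold = 0.9)
```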

## Related work

Near-duplicate detection is a standard NLP task. A wide range of
algorithms addresses it, and implementations are available in the
programming languages commonly used for NLP.

In the R context, the
[textreuse](https://CRAN.R-project.org/package=textreuse) package is the
point of reference for duplicate detection. The *duplicates* package
targets a different scenario: large corpora that have already been
indexed with the Corpus Workbench (CWB). For such corpora, the hashing
step that is a selling point of textreuse has already been performed
during indexing, so the *duplicates* package does not replicate the
requirements for tokenizing and hashing the data.
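
For comparison, here is a minimal sketch of a textreuse workflow, assuming a
directory of plain-text files; the directory path and parameter values are
illustrative. It makes visible the tokenizing and hashing work that a
CWB-indexed corpus already has behind it:

```r
library(textreuse)

# textreuse tokenizes and minhashes the documents itself, the step that
# is already done for a CWB-indexed corpus.
minhash <- minhash_generator(n = 240, seed = 3552)
corpus <- TextReuseCorpus(
  dir = "texts/",                 # assumed directory of plain-text files
  tokenizer = tokenize_ngrams, n = 5,
  minhash_func = minhash
)
buckets <- lsh(corpus, bands = 80)       # locality-sensitive hashing
candidates <- lsh_candidates(buckets)    # candidate near-duplicate pairs
```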
10 changes: 5 additions & 5 deletions vignettes/vignette.Rmd
@@ -11,15 +11,15 @@ editor_options:
chunk_output_type: console
---

-```{r}
+```{r load_libraries}
library(polmineR)
library(duplicates)
```


## Prune vocabulary

-```{r}
+```{r prune_vocab}
use(pkg = "duplicates")
charcount <- corpus("REUTERS2") %>%
@@ -50,7 +50,7 @@ dupl <- docsimil(
## Run duplicate detection


-```{r}
+```{r duplicate_detection}
x <- corpus("REUTERS2") |>
split(s_attribute = "doc_id")
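# Note (polmineR semantics): split() yields one subcorpus per value of the
# s-attribute "doc_id", i.e. one object per document; docsimil() below then
# computes pairwise similarities between these documents.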
@@ -66,7 +66,7 @@ dupl <- docsimil(

## Write to corpus

-```{r}
+```{r get_annotation_data}
groups <- docgroups(dupl)
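# Assumption based on the names in this diff: docgroups() collapses the
# pairwise similarities in `dupl` into groups of near-duplicate documents,
# which duplicates_as_annotation_data() then turns into annotation data.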
annodata <- duplicates_as_annotation_data(
@@ -77,7 +77,7 @@ annodata <- duplicates_as_annotation_data(
```


-```{r}
+```{r encode}
library(cwbtools)
regdata <- cwbtools::registry_file_parse(corpus = "REUTERS2", registry = registry())
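# cwbtools::registry_file_parse() reads the registry entry of the REUTERS2
# corpus; presumably the parsed registry data is then amended and written
# back to declare the structural attribute carrying the duplicate annotation.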
