tokenvars

This package is currently highly experimental and is not yet easy to use. Even once it matures, it will remain mostly an infrastructure R package for a niche audience: developers writing R packages that build on quanteda.

quanteda has good support for metadata, but only at the corpus level (meta()) and the document level (docvars()). This package goes one level down and provides support for token-level metadata. Token-level metadata is useful for tagging individual tokens (e.g. parts of speech, relationships among tokens). It is also useful for storing higher-level information about tokens: for example, given the subword-tokenized sequence “_L”, “’”, “app”, “ar”, “tement”, you might want to record that “_L” comes from the French word “L’appartement”.
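
To make the subword example concrete, here is a minimal base R sketch of what such token-level metadata amounts to: one row per token, with a column recording the word each token came from. It uses a plain data frame rather than the tokenvars API, and the column names `token` and `word` are assumptions made for illustration.

```r
# Illustrative only: a plain data frame, not a tokenvars object.
# The column names `token` and `word` are assumptions made for this sketch.
subword <- data.frame(
  token = c("_L", "’", "app", "ar", "tement"),
  word  = rep("L’appartement", 5),
  stringsAsFactors = FALSE
)

# Group the subword tokens by the word they came from
pieces <- split(subword$token, subword$word)

# Number of subword tokens per original word
lengths(pieces)
```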

Installation

You can install the development version of tokenvars like so:

# Well, if you don't know how to do this, you probably shouldn't try this.

A demo of using token-level metadata

library(quanteda)
#> Package version: 3.3.1
#> Unicode version: 14.0
#> ICU version: 70.1
#> Parallel computing: 8 of 8 threads used.
#> See https://quanteda.io for tutorials and examples.
library(tokenvars)

corp <- corpus(c(d1 = "spaCy is great at fast natural language processing.",
                 d2 = "Mr. Smith spent two years in North Carolina."))

tok <- tokens(corp) %>% tokens_add_tokenvars()
tok
#> Tokens consisting of 2 documents and 1 docvar.
#> d1 :
#> t1>"spaCy" t2>"is" t3>"great" t4>"at" t5>"fast" t6>"natural" t7>"language" t8>"processing" t9>"." 
#> d2 :
#> t1>"Mr" t2>"." t3>"Smith" t4>"spent" t5>"two" t6>"years" t7>"in" t8>"North" t9>"Carolina" t10>"."

tokenvars(tok) ## nothing to see here
#> $d1
#> data frame with 0 columns and 9 rows
#> 
#> $d2
#> data frame with 0 columns and 10 rows

tokenvars(tok, "tag") <- list(c("NNP", "VBZ", "JJ", "IN", "JJ", "JJ", "NN", "NN", "."),
                              c("NNP", ".", "NNP", "VBD", "CD", "NNS", "IN", "NNP", "NNP", "."))
tokenvars(tok, "lemma") <- list(c("spaCy", "be", "great", "at", "fast", "natural", "language", "processing", "."),
                                c("Mr", ".", "Smith", "spend", "two", "year", "in", "North", "Carolina", "."))

tok
#> Tokens consisting of 2 documents and 1 docvar.
#> Token variables: (tag|lemma).
#> d1 :
#> t1>"spaCy"(NNP|spaCy) t2>"is"(VBZ|be) t3>"great"(JJ|great) t4>"at"(IN|at) t5>"fast"(JJ|fast) t6>"natural"(JJ|natural) t7>"language"(NN|language) t8>"processing"(NN|processing) t9>"."(.|.) 
#> d2 :
#> t1>"Mr"(NNP|Mr) t2>"."(.|.) t3>"Smith"(NNP|Smith) t4>"spent"(VBD|spend) t5>"two"(CD|two) t6>"years"(NNS|year) t7>"in"(IN|in) t8>"North"(NNP|North) t9>"Carolina"(NNP|Carolina) t10>"."(.|.)

tokenvars(tok)
#> $d1
#>   tag      lemma
#> 1 NNP      spaCy
#> 2 VBZ         be
#> 3  JJ      great
#> 4  IN         at
#> 5  JJ       fast
#> 6  JJ    natural
#> 7  NN   language
#> 8  NN processing
#> 9   .          .
#> 
#> $d2
#>    tag    lemma
#> 1  NNP       Mr
#> 2    .        .
#> 3  NNP    Smith
#> 4  VBD    spend
#> 5   CD      two
#> 6  NNS     year
#> 7   IN       in
#> 8  NNP    North
#> 9  NNP Carolina
#> 10   .        .

tokenvars(tok, field = "tag")
#> $d1
#> [1] "NNP" "VBZ" "JJ"  "IN"  "JJ"  "JJ"  "NN"  "NN"  "."  
#> 
#> $d2
#>  [1] "NNP" "."   "NNP" "VBD" "CD"  "NNS" "IN"  "NNP" "NNP" "."

tokenvars(tok, field = "lemma", docnames = "d2")
#> $d2
#>  [1] "Mr"       "."        "Smith"    "spend"    "two"      "year"    
#>  [7] "in"       "North"    "Carolina" "."
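
Since tokenvars(tok) prints above as a named list with one data frame per document, it can be flattened into a single long table with base R. The sketch below rebuilds that list by hand so that it runs without the package: `tv` stands in for the result of tokenvars(tok), and the `doc_id` and `token_id` columns are names invented here, not something the package provides.

```r
# Hand-built stand-in for the list printed by tokenvars(tok) above
# (assumption: a named list holding one data frame per document).
tv <- list(
  d1 = data.frame(
    tag   = c("NNP", "VBZ", "JJ", "IN", "JJ", "JJ", "NN", "NN", "."),
    lemma = c("spaCy", "be", "great", "at", "fast", "natural",
              "language", "processing", "."),
    stringsAsFactors = FALSE
  ),
  d2 = data.frame(
    tag   = c("NNP", ".", "NNP", "VBD", "CD", "NNS", "IN", "NNP", "NNP", "."),
    lemma = c("Mr", ".", "Smith", "spend", "two", "year", "in",
              "North", "Carolina", "."),
    stringsAsFactors = FALSE
  )
)

# Stack the per-document data frames, adding document and token identifiers
long <- do.call(rbind, lapply(names(tv), function(doc) {
  data.frame(doc_id = doc, token_id = seq_len(nrow(tv[[doc]])), tv[[doc]],
             stringsAsFactors = FALSE)
}))
rownames(long) <- NULL

# Lemmas of all tokens tagged as proper nouns, across both documents
long$lemma[long$tag == "NNP"]
```

A long table like this is convenient for filtering or tabulating across documents, at the cost of losing the per-document list structure.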

tokens_proximity

tokens_proximity() showcases how tokenvars can be used to calculate and manipulate token-level metadata. “proximity” is a token-level variable that records the distance between a target pattern and every other token.

txt1 <-
c("Turkish President Tayyip Erdogan, in his strongest comments yet on the Gaza conflict, said on Wednesday the Palestinian militant group Hamas was not a terrorist organisation but a liberation group fighting to protect Palestinian lands.",
"EU policymakers proposed the new agency in 2021 to stop financial firms from aiding criminals and terrorists. Brussels has so far relied on national regulators with no EU authority to stop money laundering and terrorist financing running into billions of euros.")
tok1 <- txt1 %>% tokens() %>%
    tokens_proximity(pattern = "turkish")
tok1
#> Tokens consisting of 2 documents and 1 docvar.
#> Token variables: (proximity).
#> text1 :
#> t1>"turkish"(1) t2>"president"(2) t3>"tayyip"(3) t4>"erdogan"(4) t5>","(5) t6>"in"(6) t7>"his"(7) t8>"strongest"(8) t9>"comments"(9) t10>"yet"(10) t11>"on"(11) t12>"the"(12) { ... and 26 more }
#> 
#> text2 :
#> t1>"eu"(44) t2>"policymakers"(44) t3>"proposed"(44) t4>"the"(44) t5>"new"(44) t6>"agency"(44) t7>"in"(44) t8>"2021"(44) t9>"to"(44) t10>"stop"(44) t11>"financial"(44) t12>"firms"(44) { ... and 31 more }

tokenvars(tok1, "proximity")
#> $text1
#>  [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
#> [26] 26 27 28 29 30 31 32 33 34 35 36 37 38
#> 
#> $text2
#>  [1] 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44
#> [26] 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44 44
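
The printed values are consistent with a simple rule: a token's proximity is its distance to a match of the pattern plus one, and a document without any match gets its token count plus one throughout (43 tokens, hence 44). The base R sketch below implements that rule for a single document. It is inferred from the output above, not taken from the package source; `proximity_sketch` is a hypothetical helper, and using the nearest match when there are several, as well as matching case-insensitively, are assumptions.

```r
# Hypothetical helper, not part of tokenvars.
# Assumptions: exact, case-insensitive matching; nearest match wins.
proximity_sketch <- function(toks, pattern) {
  hits <- which(tolower(toks) == tolower(pattern))
  n <- length(toks)
  if (length(hits) == 0L) {
    # No match in this document: every token gets n + 1
    return(rep(n + 1L, n))
  }
  # Distance to the nearest match, counted from 1 at the match itself
  vapply(seq_len(n), function(i) min(abs(i - hits)) + 1L, integer(1))
}

toks <- c("turkish", "president", "tayyip", "erdogan")
proximity_sketch(toks, "turkish")  # 1 2 3 4
proximity_sketch(toks, "tayyip")   # 3 2 1 2
proximity_sketch(toks, "hamas")    # 5 5 5 5
```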

A tokens object with proximity vectors can be converted to a (weighted) dfm (document-feature matrix). By default, each token is weighted by the inverse of its proximity.

dfm(tok1)
#> Document-feature matrix of: 2 documents, 64 features (45.31% sparse) and 0 docvars.
#>        features
#> docs    turkish president    tayyip erdogan         ,         in       his
#>   text1       1       0.5 0.3333333    0.25 0.2666667 0.16666667 0.1428571
#>   text2       0       0   0            0    0         0.02272727 0        
#>        features
#> docs    strongest  comments yet
#>   text1     0.125 0.1111111 0.1
#>   text2     0     0         0  
#> [ reached max_nfeat ... 54 more features ]

You can supply a different weight function. For example, identity keeps the proximity values as they are, without inverting them.

dfm(tok1, weight_function = identity)
#> Document-feature matrix of: 2 documents, 64 features (45.31% sparse) and 0 docvars.
#>        features
#> docs    turkish president tayyip erdogan  , in his strongest comments yet
#>   text1       1         2      3       4 20  6   7         8        9  10
#>   text2       0         0      0       0  0 44   0         0        0   0
#> [ reached max_nfeat ... 54 more features ]

Or any custom function:

dfm(tok1, weight_function = function(x) { 1 / x^2 })
#> Document-feature matrix of: 2 documents, 64 features (45.31% sparse) and 0 docvars.
#>        features
#> docs    turkish president    tayyip erdogan          ,           in        his
#>   text1       1      0.25 0.1111111  0.0625 0.04444444 0.0277777778 0.02040816
#>   text2       0      0    0          0      0          0.0005165289 0         
#>        features
#> docs    strongest   comments  yet
#>   text1  0.015625 0.01234568 0.01
#>   text2  0        0          0   
#> [ reached max_nfeat ... 54 more features ]
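
The “,” column in the three dfm() outputs above suggests how the cells are filled: the weight function is applied to each token's proximity, and the results are summed per feature. The comma occurs at positions 5 and 15 of text1, which gives 1/5 + 1/15, 5 + 15 and 1/25 + 1/225 respectively. The base R sketch below reproduces those three numbers on the first 15 tokens of text1. `weighted_counts` is a hypothetical helper written for this illustration and is not how the package's dfm method is implemented.

```r
# First 15 tokens of text1; the pattern "turkish" is token 1,
# so each token's proximity equals its position.
toks <- c("turkish", "president", "tayyip", "erdogan", ",", "in", "his",
          "strongest", "comments", "yet", "on", "the", "gaza", "conflict", ",")
prox <- as.numeric(seq_along(toks))

# Hypothetical helper: apply the weight function, then sum per feature
weighted_counts <- function(toks, prox, weight_function = function(x) 1 / x) {
  vapply(split(weight_function(prox), toks), sum, numeric(1))
}

weighted_counts(toks, prox)[[","]]                       # 1/5 + 1/15
weighted_counts(toks, prox, identity)[[","]]             # 5 + 15
weighted_counts(toks, prox, function(x) 1 / x^2)[[","]]  # 1/25 + 1/225
```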
