lexvarsdatr

An R package: some tools for investigating lexical variation from both behavioral and distributional perspectives. Including:

A collection of psycholinguistic/behavioral data sets, &
A few functions for extracting semantic associations and network structures from term-feature matrices.

Installation

library(devtools)
devtools::install_github("jaytimm/lexvarsdatr")
library(lexvarsdatr)

Usage

Behavioral data

Behavioral data included in the package: Response times in lexical decision & naming, concreteness ratings, age-of-acquisition (AoA) ratings, and word association norms. Sources are presented below:

Data	Source
Lexical decision and naming	Balota, D. A., Yap, M. J., Hutchison, K. A., Cortese, M. J., Kessler, B., Loftis, B., … & Treiman, R. (2007). The English lexicon project. Behavior research methods, 39(3), 445-459.
Concreteness ratings	Brysbaert, M., Warriner, A. B., & Kuperman, V. (2014). Concreteness ratings for 40 thousand generally known English word lemmas. Behavior research methods, 46(3), 904-911.
AoA ratings	Kuperman, V., Stadthagen-Gonzalez, H., & Brysbaert, M. (2012). Age-of-acquisition ratings for 30,000 English words. Behavior Research Methods, 44(4), 978-990.
Word association	Nelson, D. L., McEvoy, C. L., & Schreiber, T. A. (2004). The University of South Florida free association, rhyme, and word fragment norms. Behavior Research Methods, Instruments, & Computers, 36(3), 402-407.

Response times in lexical decision/naming, concreteness ratings, and AoA ratings have been collated into a single data frame, lex_behav_data. Approximately 18K word forms are included in all three data sets.

library(tidyverse)
lexvarsdatr::lvdr_behav_data %>% na.omit %>% head()

##            Word             Pron NMorph   POS lexdecRT lexdecSD  nmgRT
## 7        abacus        "a.b@.k@s      1    NN   964.40   489.00 792.69
## 9       abandon       @.b"an.4@n      1 VB|NN   695.72   220.41 623.96
## 14  abandonment  @.b"an.4@n.m@nt      2    NN   771.09   229.53 794.70
## 26   abbreviate     @.br"i.vi.et      3    VB   795.03   316.55 708.44
## 27  abbreviated @.br"i.vi.%e4.@d      4 JJ|VB   698.45   170.37 695.63
## 28 abbreviation @.br%i.vi."e.Sn=      4    NN   728.91   163.59 714.93
##     nmgSD aoaRating aoaSD concRating concSD freqSUBTLEX
## 7  200.19      8.69  3.77       4.52   1.12          12
## 9   98.25      8.32  2.75       2.54   1.45         413
## 14 256.30     10.27  2.57       2.54   1.29          49
## 26 156.29      9.95  2.07       2.59   1.53           1
## 27 201.23     10.50  1.79       3.10   1.54          16
## 28 149.43      9.11  2.37       3.07   1.51          12

The South Florida word association data can be accessed via lvdr_association. A description of variables included in the normed data set, as well as methodologies, can be found here. Word association data is also available as as sparse matrix, lvdr_association_sparse.

Functions

To demonstrate the utility of the functions included in the package, we first create a simple count-based term-feature co-occurrence matrix using US Presidential State of the Union (SOTU) addresses – made available in TIF format via the sotu package. A fairly small corpus at ~2 million words.

Here, we work within the text2vec framework. Window size of co-occurrence, 5x5. For simplicity, we tokenize at the word-level.

library(sotu)
t2v_ents <- text2vec::itoken(sotu::sotu_text, 
                            preprocessor = toupper, 
                            tokenizer = text2vec::word_tokenizer, 
                            ids = 1:236)

vocab <- text2vec::create_vocabulary(t2v_ents, 
                                    stopwords = toupper(tm::stopwords())) 

pruned_vocab <- text2vec::prune_vocabulary(
  vocab, term_count_min = 10, doc_proportion_max = 0.95) %>%
  filter(!grepl('[0-9]', term))

tcm <- text2vec::create_tcm(t2v_ents, 
                           vectorizer = text2vec::vocab_vectorizer(pruned_vocab), 
                           skip_grams_window = 5L,
                           skip_grams_window_context = "symmetric",
                           weight = c(1,1,1,1,1)) #No weight

§ Build PPMI Matrix

The lvdr_calc_ppmi function transforms a count-based co-occurrence matrix to a positive-pointwise mutual information matrix, modified from this SO post.

tcm_ppmi <- tcm %>% 
  lexvarsdatr::lvdr_calc_ppmi(make_symmetric = TRUE)

§ Get collocates, neighbors, etc.

The lvdr_get_closest function can be used to extract the n highest scoring features associated with a term (or set of terms) from a term-feature matrix. Assumes a column-oriented matrix (dgCMatrix) as input. data.table dependency. Modified from the udpipe::as.cooccurrence() function.

Per the SOTU PPMI co-occurrence matrix created above, we extract the ten strongest collocates of the term VIOLENCE. Output is a simple data frame.

lexvarsdatr::lvdr_get_closest(tfm = tcm_ppmi, 
                              #lexvarsdatr::lvdr_association_sparse, 
                              target = 'VIOLENCE', 
                              n = 10) %>%
  knitr::kable(row.names = FALSE)

term	feature	cooc
VIOLENCE	RIOT	6.588764
VIOLENCE	UNDERLIES	6.455232
VIOLENCE	UNPLEASANT	6.337449
VIOLENCE	UNRESTRAINED	6.337449
VIOLENCE	SYMPATHIZE	6.232089
VIOLENCE	PREACH	6.049767
VIOLENCE	INTIMIDATION	5.895617
VIOLENCE	SUPPORTERS	5.895617
VIOLENCE	PROCLAIM	5.701460
VIOLENCE	MEDIA	5.672473
VIOLENCE	SUPPRESSING	5.644302

The function can also be used to extract nearest neighbors from a cosine similarity matrix. To demonstrate, we (1) consolidate feature set to 150 latent dimensions via singular-value decomposition, and then (2) construct cosine-based, term-term similarity matrix.

tcm_svd <-  irlba::irlba (tcm_ppmi, nv = 150)

tcm_svd1 <- as.matrix(data.matrix(tcm_svd$u))
dimnames(tcm_svd1) <- list(rownames(tcm_ppmi), 
                           c(1:length(tcm_svd$d)))

# Create cosine similarity matrix
cos_sim <- text2vec::sim2(x = tcm_svd1, 
                          method = 'cosine', 
                          norm = 'l2')

Per matrix, we extract the five nearest neighbors (ie, ~synonyms) for the terms TARIFF and SCIENCE.

#library(data.table)
lexvarsdatr::lvdr_get_closest(tfm = cos_sim, 
                              target = c('TARIFF','SCIENCE'), 
                              n = 5) %>%
  knitr::kable(row.names = FALSE)

term	feature	cooc
SCIENCE	RESEARCH	0.5654030
SCIENCE	TECHNOLOGY	0.5614927
SCIENCE	SCIENTIFIC	0.4302310
SCIENCE	SPACE	0.4027100
SCIENCE	TECHNOLOGICAL	0.3857384
TARIFF	TAXATION	0.4863266
TARIFF	AD	0.4095864
TARIFF	PROTECTIVE	0.4077748
TARIFF	REVENUE	0.3908531
TARIFF	IMPORTATIONS	0.3863141

§ Build network structure

The lvdr_extract_network function extracts the network structure for a term (or set of terms) from a term-feature matrix (again, as dgCMatrix). The function is built on lvdr_get_closest(). Output is a list that includes a node data frame and an edges data frame, structured to play nice with the tidygraph and ggraph plotting paradigms.

The number of nodes (per term) to include in the network is specified by the n parameter, ie, the n highest scoring features associated with a term from a term-feature matrix. Term-nodes and feature-nodes are distinguished in the output for visualization purposes. If multiple terms are specified, nodes are filtered to the strongest (ie, primary) term-feature relationships (to remove potential duplicates).

Edges include the n-highest scoring term-feature associations for specified terms, as well as the n most frequent node-node associations per node (term & feature).

network <- lexvarsdatr::lvdr_extract_network (tfm = tcm_ppmi, 
                                              target = toupper(c('enemy', 'ally', 
                                                                 'friend', 'partner')),
                                              n = 15)

Quick note: Algorithms like GloVe, SVD & word2vec abstract over the term-feature associations that underlie (distributionally-derived) semantic relationships. Visualizing the network structure of semantically related terms based in actual co-occurrence can help shed light on the sources of relatedness in ways that, eg, latent dimensions cannot.

The plot below illustrates the network structure (based on the PPMI term-feature matrix for the SOTU corpus) for a set of semantically related terms: ENEMY, ALLY, FRIEND, and PARTNER. Terms are identified as triangles; features as circles. Color is used to specify primary term-feature relationships. Circle size specifies the (relative) strength of association between primary term and feature.

set.seed(66)
network %>%
  tidygraph::as_tbl_graph() %>%
  ggraph::ggraph() +
  
  ggraph::geom_edge_link(color = 'darkgray') + 
  ggraph::geom_node_point(aes(size = value, 
                              color = term,
                              shape = group)) +
  
  ggraph::geom_node_text(aes(label = toupper(label), 
                             filter = group == 'term'), 
                             repel = TRUE, size = 4) +
  
  ggraph::geom_node_text(aes(label = tolower(label), 
                             filter = group == 'feature'), 
                             repel = TRUE, size = 3) +
  ggthemes::scale_color_stata()+
  ggtitle('sotu co-occurrence network') +
  theme(legend.position = "none")

Another take using the word association data set, lvdr_association_sparse:

network2 <- lexvarsdatr::lvdr_extract_network(
  tfm = lexvarsdatr::lvdr_association_sparse,
  target = toupper(c('enemy', 'ally', 
                     'friend', 'partner')), 
  n = 15)

set.seed(11)
network2 %>%
  tidygraph::as_tbl_graph() %>%
  ggraph::ggraph() +
  
  ggraph::geom_edge_link(color = 'darkgray') + #alpha = 0.8
  ggraph::geom_node_point(aes(size = value, 
                              color = term,
                              shape = group)) +
  
  ggraph::geom_node_text(aes(label = toupper(label), 
                             filter = group == 'term'), 
                             repel = TRUE, size = 4) +
  
  ggraph::geom_node_text(aes(label = tolower(label), 
                             filter = group == 'feature'), 
                             repel = TRUE, size = 3) +
  ggthemes::scale_color_stata()+
  ggtitle('word association norms network') +
  theme(legend.position = "none")

Name		Name	Last commit message	Last commit date
Latest commit History 111 Commits
R		R
README_files/figure-markdown_github		README_files/figure-markdown_github
data-raw		data-raw
data		data
man		man
psycho		psycho
.Rbuildignore		.Rbuildignore
.gitignore		.gitignore
DESCRIPTION		DESCRIPTION
NAMESPACE		NAMESPACE
README.Rmd		README.Rmd
README.md		README.md
lexvarsdatr.Rproj		lexvarsdatr.Rproj

jaytimm/lexvarsdatr

Folders and files

Latest commit

History

Repository files navigation

lexvarsdatr

Installation

Usage

Behavioral data

Functions

§ Build PPMI Matrix

§ Get collocates, neighbors, etc.

§ Build network structure

About

Topics

Resources

Stars

Watchers

Forks

Languages