---
output:
  md_document:
    variant: markdown_github
---


# lexvarsdatr

An R package with tools for investigating lexical variation from both behavioral and distributional perspectives. It includes:

1. A collection of psycholinguistic/behavioral data sets, and

2. A few functions for extracting semantic associations and network structures from term-feature matrices.


## Installation

```{r eval = FALSE, message=FALSE, warning=FALSE, paged.print=FALSE}
library(devtools)
devtools::install_github("jaytimm/lexvarsdatr")
library(lexvarsdatr) 
```


## Usage

### Behavioral data

The package includes the following behavioral data: response times in lexical decision and naming, concreteness ratings, age-of-acquisition (AoA) ratings, and word association norms. Sources are listed below:

```{r echo=FALSE, message=FALSE, warning=FALSE}
Data <- c("Lexical decision and naming","Concreteness ratings","AoA ratings", "Word association")
Source <- c("Balota, D. A., Yap, M. J., Hutchison, K. A., Cortese, M. J., Kessler, B., Loftis, B., ... & Treiman, R. (2007). The English Lexicon Project. *Behavior Research Methods*, 39(3), 445-459.", "Brysbaert, M., Warriner, A. B., & Kuperman, V. (2014). Concreteness ratings for 40 thousand generally known English word lemmas. *Behavior Research Methods*, 46(3), 904-911.", "Kuperman, V., Stadthagen-Gonzalez, H., & Brysbaert, M. (2012). Age-of-acquisition ratings for 30,000 English words. *Behavior Research Methods*, 44(4), 978-990.", "Nelson, D. L., McEvoy, C. L., & Schreiber, T. A. (2004). The University of South Florida free association, rhyme, and word fragment norms. *Behavior Research Methods, Instruments, & Computers*, 36(3), 402-407.")

knitr::kable(data.frame(Data=Data, Source=Source))
```


Response times in lexical decision/naming, concreteness ratings, and AoA ratings have been collated into a single data frame, `lvdr_behav_data`. Approximately 18K word forms are included in all three data sets.

```{r message=FALSE, warning=FALSE}
library(tidyverse)
lexvarsdatr::lvdr_behav_data %>% na.omit %>% head()
```


The **South Florida word association data** can be accessed via `lvdr_association`. A description of the variables included in the normed data set, as well as the methodologies, can be found [here](http://w3.usf.edu/FreeAssociation/). The word association data is also available as a sparse matrix, `lvdr_association_sparse`.


### Functions 

To demonstrate the functions included in the package, we first create a simple count-based term-feature co-occurrence matrix from US Presidential State of the Union (SOTU) addresses, made available in text interchange format (TIF) via the `sotu` package. At ~2 million words, it is a fairly small corpus.

Here, we work within the `text2vec` framework. The co-occurrence window spans five tokens on either side of the target term (5x5). For simplicity, we tokenize at the word level.

```{r message=FALSE, warning=FALSE}
library(sotu)
t2v_ents <- text2vec::itoken(sotu::sotu_text, 
                            preprocessor = toupper, 
                            tokenizer = text2vec::word_tokenizer, 
                            ids = seq_along(sotu::sotu_text)) # one id per address

vocab <- text2vec::create_vocabulary(t2v_ents, 
                                    stopwords = toupper(tm::stopwords())) 

pruned_vocab <- text2vec::prune_vocabulary(
  vocab, term_count_min = 10, doc_proportion_max = 0.95) %>%
  filter(!grepl('[0-9]', term))

tcm <- text2vec::create_tcm(t2v_ents, 
                           vectorizer = text2vec::vocab_vectorizer(pruned_vocab), 
                           skip_grams_window = 5L,
                           skip_grams_window_context = "symmetric",
                           weights = rep(1, 5)) # equal weight at every window position
```
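
To make the windowing concrete, here is a minimal base-R sketch of symmetric, unweighted co-occurrence counting on a toy token vector. The helper `count_cooccurrence()` is hypothetical: it is not part of `text2vec` or `lexvarsdatr`, and `text2vec` may store its counts differently. It only illustrates what "`window` tokens on either side" means.

```r
# Hypothetical helper (illustration only): count how often each pair of
# tokens occurs within `window` positions of one another, weighting every
# position in the window equally.
count_cooccurrence <- function(tokens, window = 5L) {
  terms <- sort(unique(tokens))
  counts <- matrix(0, nrow = length(terms), ncol = length(terms),
                   dimnames = list(terms, terms))
  n <- length(tokens)
  for (i in seq_len(n)) {
    # Positions inside the window, excluding the target position itself
    lo <- max(1L, i - window)
    hi <- min(n, i + window)
    context <- setdiff(lo:hi, i)
    for (j in context) {
      counts[tokens[i], tokens[j]] <- counts[tokens[i], tokens[j]] + 1
    }
  }
  counts
}

toy_tokens <- c("PEACE", "AND", "WAR", "AND", "PEACE")
toy_counts <- count_cooccurrence(toy_tokens, window = 1L)
toy_counts
```

Because every target-context pair is counted from both directions, the resulting matrix is symmetric.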


### § Build PPMI Matrix

The `lvdr_calc_ppmi` function transforms a count-based co-occurrence matrix into a positive pointwise mutual information (PPMI) matrix. The implementation is modified from this [SO post](https://stackoverflow.com/questions/43354479/how-to-efficiently-calculate-ppmi-on-a-sparse-matrix-in-r).

```{r}
tcm_ppmi <- tcm %>% 
  lexvarsdatr::lvdr_calc_ppmi(make_symmetric = TRUE)
```
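
For readers unfamiliar with the transformation, the sketch below computes PPMI on a toy sparse matrix: each non-zero count is compared with the count expected if term and feature were independent, and negative log-ratios are set to zero. The helper `ppmi_sketch()` is hypothetical and is not the package's implementation; it assumes natural logarithms and does not reproduce the `make_symmetric` option.

```r
library(Matrix)

# Hypothetical helper (illustration only): PPMI for a sparse count matrix.
# PMI(term, feature) = log( P(term, feature) / (P(term) * P(feature)) )
# PPMI keeps positive PMI values and replaces the rest with zero.
ppmi_sketch <- function(counts) {
  total <- sum(counts)
  row_totals <- Matrix::rowSums(counts)
  col_totals <- Matrix::colSums(counts)
  # Work on the non-zero cells only, so the result stays sparse
  cells <- summary(counts)  # data frame with columns i, j, x
  pmi <- log((cells$x * total) /
               (row_totals[cells$i] * col_totals[cells$j]))
  ppmi <- Matrix::sparseMatrix(i = cells$i, j = cells$j,
                               x = as.numeric(pmax(pmi, 0)),
                               dims = dim(counts),
                               dimnames = dimnames(counts))
  Matrix::drop0(ppmi)  # drop cells whose PMI was clipped to zero
}

toy <- Matrix::sparseMatrix(
  i = c(1, 1, 2, 2), j = c(1, 2, 1, 2), x = c(4, 1, 1, 4),
  dimnames = list(c("WAR", "PEACE"), c("ENEMY", "FRIEND")))
toy_ppmi <- ppmi_sketch(toy)
toy_ppmi
```

In the toy matrix, WAR and ENEMY co-occur more often than chance predicts and keep a positive score, while WAR and FRIEND co-occur less often than chance and are set to zero.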


### § Get collocates, neighbors, etc.

The `lvdr_get_closest` function extracts the `n` highest-scoring features associated with a term (or set of terms) from a term-feature matrix. It assumes a column-oriented sparse matrix (`dgCMatrix`) as input and depends on `data.table`. The function is modified from `udpipe::as.cooccurrence()`.

Using the SOTU PPMI co-occurrence matrix created above, we extract the ten strongest **collocates** of the term VIOLENCE. The output is a simple data frame.


```{r message=FALSE, warning=FALSE}
lexvarsdatr::lvdr_get_closest(tfm = tcm_ppmi, 
                              #lexvarsdatr::lvdr_association_sparse, 
                              target = 'VIOLENCE', 
                              n = 10) %>%
  knitr::kable(row.names = FALSE)
```
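
The underlying idea is a simple row lookup followed by a sort. The sketch below shows it on a toy matrix; `top_features()` and its output column names are hypothetical and are not guaranteed to match what `lvdr_get_closest()` returns.

```r
library(Matrix)

# Hypothetical helper (illustration only): the n highest-scoring features
# for a single term, taken from one row of a sparse term-feature matrix.
top_features <- function(tfm, target, n = 10) {
  scores <- tfm[target, ]       # one row as a named numeric vector
  scores <- scores[scores > 0]  # ignore empty cells
  top <- utils::head(sort(scores, decreasing = TRUE), n)
  data.frame(term = rep(target, length(top)),
             feature = names(top),
             value = unname(top),
             stringsAsFactors = FALSE)
}

# Toy scores; the values are made up for the example
toy_tfm <- Matrix::sparseMatrix(
  i = c(1, 1, 1, 2), j = c(1, 2, 3, 2), x = c(0.2, 1.5, 0.9, 0.4),
  dimnames = list(c("VIOLENCE", "TARIFF"), c("CRIME", "DOMESTIC", "MOB")))
toy_top <- top_features(toy_tfm, target = "VIOLENCE", n = 2)
toy_top
```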


The function can also be used to extract **nearest neighbors** from a cosine similarity matrix. To demonstrate, we (1) reduce the feature set to 150 latent dimensions via singular value decomposition, and then (2) construct a cosine-based, term-term similarity matrix.


```{r}
tcm_svd <- irlba::irlba(tcm_ppmi, nv = 150)

# Left singular vectors serve as the latent term representations
tcm_svd1 <- as.matrix(tcm_svd$u)
dimnames(tcm_svd1) <- list(rownames(tcm_ppmi), 
                           seq_along(tcm_svd$d))

# Create cosine similarity matrix
cos_sim <- text2vec::sim2(x = tcm_svd1, 
                          method = 'cosine', 
                          norm = 'l2')
```
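
As a rough illustration of step (2), cosine similarity between row vectors can be computed by scaling each row to unit length and taking dot products. The helper `cosine_rows()` below is hypothetical and written in base R; it is not how `text2vec::sim2()` is implemented, and it assumes no row is all zeros.

```r
# Hypothetical helper (illustration only): cosine similarity between all
# pairs of rows in a dense numeric matrix.
cosine_rows <- function(m) {
  unit <- m / sqrt(rowSums(m^2))  # scale each row to unit L2 length
  unit %*% t(unit)                # dot products of the unit vectors
}

# Toy two-dimensional "embeddings"; the values are made up
toy_embed <- rbind(TARIFF  = c(1, 0),
                   DUTIES  = c(2, 0),
                   SCIENCE = c(0, 3))
toy_cos <- cosine_rows(toy_embed)
toy_cos
```

TARIFF and DUTIES point in the same direction and so have a similarity of 1 despite their different lengths, while TARIFF and SCIENCE are orthogonal and score 0.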


From this similarity matrix, we extract the five **nearest neighbors** (i.e., rough synonyms) for the terms TARIFF and SCIENCE.

```{r message=FALSE, warning=FALSE}
#library(data.table)
lexvarsdatr::lvdr_get_closest(tfm = cos_sim, 
                              target = c('TARIFF','SCIENCE'), 
                              n = 5) %>%
  knitr::kable(row.names = FALSE)
```


### § Build network structure

The `lvdr_extract_network` function extracts the network structure for a term (or set of terms) from a term-feature matrix (again, as `dgCMatrix`).  The function is built on `lvdr_get_closest()`.  Output is a list that includes a `node` data frame and an `edges` data frame, structured to play nice with the `tidygraph` and `ggraph` plotting paradigms.

The number of **nodes** (per term) to include in the network is specified by the `n` parameter, i.e., the `n` highest-scoring features associated with a term in the term-feature matrix. Term-nodes and feature-nodes are distinguished in the output for visualization purposes. If multiple terms are specified, nodes are filtered to the strongest (i.e., primary) term-feature relationships, which removes potential duplicates.

**Edges** include the `n`-highest scoring term-feature associations for specified terms, as well as the `n` most frequent node-node associations per node (term & feature).  


```{r}
network <- lexvarsdatr::lvdr_extract_network (tfm = tcm_ppmi, 
                                              target = toupper(c('enemy', 'ally', 
                                                                 'friend', 'partner')),
                                              n = 15)
```
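
The node filtering described above can be sketched in base R on a toy score matrix. This is an illustration under stated assumptions, not the package's implementation: the column names (`from`, `to`, `label`, `term`, `group`, `value`) are chosen to mirror the aesthetics used in the plotting code in this README, and the sketch covers only term-feature edges, not the additional node-node associations.

```r
# Toy term-feature scores (dense for readability); the values are made up
toy_scores <- rbind(
  ENEMY  = c(FOE = 0.9, WAR = 0.7, TRUST = 0.0, TRADE = 0.1),
  FRIEND = c(FOE = 0.2, WAR = 0.0, TRUST = 0.8, TRADE = 0.6))

n <- 3

# Edges: the n highest-scoring features for each term
edge_list <- lapply(rownames(toy_scores), function(term) {
  top <- utils::head(sort(toy_scores[term, ], decreasing = TRUE), n)
  data.frame(from = term, to = names(top), value = unname(top),
             stringsAsFactors = FALSE)
})
toy_edges <- do.call(rbind, edge_list)

# Feature nodes: keep each feature once, assigned to its strongest term
ordered <- toy_edges[order(-toy_edges$value), ]
feature_nodes <- ordered[!duplicated(ordered$to), ]

toy_nodes <- rbind(
  data.frame(label = rownames(toy_scores), term = rownames(toy_scores),
             group = "term", value = NA_real_, stringsAsFactors = FALSE),
  data.frame(label = feature_nodes$to, term = feature_nodes$from,
             group = "feature", value = feature_nodes$value,
             stringsAsFactors = FALSE))
toy_nodes
```

Here FOE and TRADE appear among the top features of both terms, but each is kept as a single node: FOE is assigned to ENEMY and TRADE to FRIEND, the term with the higher score in each case.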


**Quick note**: Algorithms like `GloVe`, `SVD` & `word2vec` abstract over the term-feature associations that underlie (distributionally derived) semantic relationships. Visualizing the network structure of semantically related terms based on actual co-occurrence can help shed light on the sources of relatedness in ways that, e.g., latent dimensions cannot.

**The plot below** illustrates the network structure (based on the PPMI term-feature matrix for the SOTU corpus) for a set of semantically related terms: ENEMY, ALLY, FRIEND, and PARTNER.  Terms are identified as triangles; features as circles. Color is used to specify primary term-feature relationships. Circle size specifies the (relative) strength of association between primary term and feature.

```{r fig.height=7, message=FALSE, warning=FALSE}
set.seed(66)
network %>%
  tidygraph::as_tbl_graph() %>%
  ggraph::ggraph() +
  
  ggraph::geom_edge_link(color = 'darkgray') + 
  ggraph::geom_node_point(aes(size = value, 
                              color = term,
                              shape = group)) +
  
  ggraph::geom_node_text(aes(label = toupper(label), 
                             filter = group == 'term'), 
                             repel = TRUE, size = 4) +
  
  ggraph::geom_node_text(aes(label = tolower(label), 
                             filter = group == 'feature'), 
                             repel = TRUE, size = 3) +
  ggthemes::scale_color_stata()+
  ggtitle('sotu co-occurrence network') +
  theme(legend.position = "none")
```


**Another take** using the word association data set, `lvdr_association_sparse`:


```{r fig.height=7, message=FALSE, warning=FALSE}
network2 <- lexvarsdatr::lvdr_extract_network(
  tfm = lexvarsdatr::lvdr_association_sparse,
  target = toupper(c('enemy', 'ally', 
                     'friend', 'partner')), 
  n = 15)

set.seed(11)
network2 %>%
  tidygraph::as_tbl_graph() %>%
  ggraph::ggraph() +
  
  ggraph::geom_edge_link(color = 'darkgray') + #alpha = 0.8
  ggraph::geom_node_point(aes(size = value, 
                              color = term,
                              shape = group)) +
  
  ggraph::geom_node_text(aes(label = toupper(label), 
                             filter = group == 'term'), 
                             repel = TRUE, size = 4) +
  
  ggraph::geom_node_text(aes(label = tolower(label), 
                             filter = group == 'feature'), 
                             repel = TRUE, size = 3) +
  ggthemes::scale_color_stata()+
  ggtitle('word association norms network') +
  theme(legend.position = "none")
```