-
Notifications
You must be signed in to change notification settings - Fork 1
/
README.Rmd
219 lines (140 loc) · 9.48 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
---
output:
md_document:
variant: markdown_github
---
# lexvarsdatr
An R package: some tools for investigating lexical variation from both behavioral and distributional perspectives. Including:
1. A collection of psycholinguistic/behavioral data sets, &
2. A few functions for extracting semantic associations and network structures from term-feature matrices.
## Installation
```{r eval = FALSE, message=FALSE, warning=FALSE, paged.print=FALSE}
library(devtools)
devtools::install_github("jaytimm/lexvarsdatr")
library(lexvarsdatr)
```
## Usage
### Behavioral data
Behavioral data included in the package: Response times in lexical decision & naming, concreteness ratings, age-of-acquisition (AoA) ratings, and word association norms. Sources are presented below:
```{r echo=FALSE, message=FALSE, warning=FALSE}
Data <- c("Lexical decision and naming","Concreteness ratings","AoA ratings", "Word association")
Source <- c( "Balota, D. A., Yap, M. J., Hutchison, K. A., Cortese, M. J., Kessler, B., Loftis, B., ... & Treiman, R. (2007). The English lexicon project. *Behavior research methods*, 39(3), 445-459.","Brysbaert, M., Warriner, A. B., & Kuperman, V. (2014). Concreteness ratings for 40 thousand generally known English word lemmas. *Behavior research methods*, 46(3), 904-911.","Kuperman, V., Stadthagen-Gonzalez, H., & Brysbaert, M. (2012). Age-of-acquisition ratings for 30,000 English words. *Behavior Research Methods*, 44(4), 978-990.","Nelson, D. L., McEvoy, C. L., & Schreiber, T. A. (2004). The University of South Florida free association, rhyme, and word fragment norms. *Behavior Research Methods, Instruments, & Computers*, 36(3), 402-407.")
knitr::kable(data.frame(Data=Data, Source=Source))
```
Response times in lexical decision/naming, concreteness ratings, and AoA ratings have been collated into a single data frame, `lex_behav_data`. Approximately 18K word forms are included in all three data sets.
```{r message=FALSE, warning=FALSE}
library(tidyverse)
lexvarsdatr::lvdr_behav_data %>% na.omit %>% head()
```
The **South Florida word association data** can be accessed via `lvdr_association`. A description of variables included in the normed data set, as well as methodologies, can be found [here](http://w3.usf.edu/FreeAssociation/). Word association data is also available as as sparse matrix, `lvdr_association_sparse`.
### Functions
To demonstrate the utility of the functions included in the package, we first create a simple count-based term-feature co-occurrence matrix using US Presidential State of the Union (SOTU) addresses -- made available in TIF format via the `sotu` package. A fairly small corpus at ~2 million words.
Here, we work within the `text2vec` framework. Window size of co-occurrence, 5x5. For simplicity, we tokenize at the word-level.
```{r message=FALSE, warning=FALSE}
library(sotu)
t2v_ents <- text2vec::itoken(sotu::sotu_text,
preprocessor = toupper,
tokenizer = text2vec::word_tokenizer,
ids = 1:236)
vocab <- text2vec::create_vocabulary(t2v_ents,
stopwords = toupper(tm::stopwords()))
pruned_vocab <- text2vec::prune_vocabulary(
vocab, term_count_min = 10, doc_proportion_max = 0.95) %>%
filter(!grepl('[0-9]', term))
tcm <- text2vec::create_tcm(t2v_ents,
vectorizer = text2vec::vocab_vectorizer(pruned_vocab),
skip_grams_window = 5L,
skip_grams_window_context = "symmetric",
weight = c(1,1,1,1,1)) #No weight
```
### § Build PPMI Matrix
The `lvdr_calc_ppmi` function transforms a count-based co-occurrence matrix to a positive-pointwise mutual information matrix, modified from this [SO post](https://stackoverflow.com/questions/43354479/how-to-efficiently-calculate-ppmi-on-a-sparse-matrix-in-r).
```{r}
tcm_ppmi <- tcm %>%
lexvarsdatr::lvdr_calc_ppmi(make_symmetric = TRUE)
```
### § Get collocates, neighbors, etc.
The `lvdr_get_closest` function can be used to extract the `n` highest scoring features associated with a term (or set of terms) from a term-feature matrix. Assumes a column-oriented matrix (`dgCMatrix`) as input. `data.table` dependency. Modified from the `udpipe::as.cooccurrence()` function.
Per the SOTU PPMI co-occurrence matrix created above, we extract the ten strongest **collocates** of the term VIOLENCE. Output is a simple data frame.
```{r message=FALSE, warning=FALSE}
lexvarsdatr::lvdr_get_closest(tfm = tcm_ppmi,
#lexvarsdatr::lvdr_association_sparse,
target = 'VIOLENCE',
n = 10) %>%
knitr::kable(row.names = FALSE)
```
The function can also be used to extract **nearest neighbors** from a cosine similarity matrix. To demonstrate, we (1) consolidate feature set to 150 latent dimensions via singular-value decomposition, and then (2) construct cosine-based, term-term similarity matrix.
```{r}
tcm_svd <- irlba::irlba (tcm_ppmi, nv = 150)
tcm_svd1 <- as.matrix(data.matrix(tcm_svd$u))
dimnames(tcm_svd1) <- list(rownames(tcm_ppmi),
c(1:length(tcm_svd$d)))
# Create cosine similarity matrix
cos_sim <- text2vec::sim2(x = tcm_svd1,
method = 'cosine',
norm = 'l2')
```
Per matrix, we extract the five **nearest neighbors** (ie, ~synonyms) for the terms TARIFF and SCIENCE.
```{r message=FALSE, warning=FALSE}
#library(data.table)
lexvarsdatr::lvdr_get_closest(tfm = cos_sim,
target = c('TARIFF','SCIENCE'),
n = 5) %>%
knitr::kable(row.names = FALSE)
```
### § Build network structure
The `lvdr_extract_network` function extracts the network structure for a term (or set of terms) from a term-feature matrix (again, as `dgCMatrix`). The function is built on `lvdr_get_closest()`. Output is a list that includes a `node` data frame and an `edges` data frame, structured to play nice with the `tidygraph` and `ggraph` plotting paradigms.
The number of **nodes** (per term) to include in the network is specified by the `n` parameter, ie, the `n` highest scoring features associated with a term from a term-feature matrix. Term-nodes and feature-nodes are distinguished in the output for visualization purposes. If multiple terms are specified, nodes are filtered to the strongest (ie, primary) term-feature relationships (to remove potential duplicates).
**Edges** include the `n`-highest scoring term-feature associations for specified terms, as well as the `n` most frequent node-node associations per node (term & feature).
```{r}
network <- lexvarsdatr::lvdr_extract_network (tfm = tcm_ppmi,
target = toupper(c('enemy', 'ally',
'friend', 'partner')),
n = 15)
```
**Quick note**: Algorithms like `GloVe`, `SVD` & `word2vec` abstract over the term-feature associations that underlie (distributionally-derived) semantic relationships. Visualizing the network structure of semantically related terms based in actual co-occurrence can help shed light on the sources of relatedness in ways that, eg, latent dimensions cannot.
**The plot below** illustrates the network structure (based on the PPMI term-feature matrix for the SOTU corpus) for a set of semantically related terms: ENEMY, ALLY, FRIEND, and PARTNER. Terms are identified as triangles; features as circles. Color is used to specify primary term-feature relationships. Circle size specifies the (relative) strength of association between primary term and feature.
```{r fig.height=7, message=FALSE, warning=FALSE}
set.seed(66)
network %>%
tidygraph::as_tbl_graph() %>%
ggraph::ggraph() +
ggraph::geom_edge_link(color = 'darkgray') +
ggraph::geom_node_point(aes(size = value,
color = term,
shape = group)) +
ggraph::geom_node_text(aes(label = toupper(label),
filter = group == 'term'),
repel = TRUE, size = 4) +
ggraph::geom_node_text(aes(label = tolower(label),
filter = group == 'feature'),
repel = TRUE, size = 3) +
ggthemes::scale_color_stata()+
ggtitle('sotu co-occurrence network') +
theme(legend.position = "none")
```
**Another take** using the word association data set, `lvdr_association_sparse`:
```{r fig.height=7, message=FALSE, warning=FALSE}
network2 <- lexvarsdatr::lvdr_extract_network(
tfm = lexvarsdatr::lvdr_association_sparse,
target = toupper(c('enemy', 'ally',
'friend', 'partner')),
n = 15)
set.seed(11)
network2 %>%
tidygraph::as_tbl_graph() %>%
ggraph::ggraph() +
ggraph::geom_edge_link(color = 'darkgray') + #alpha = 0.8
ggraph::geom_node_point(aes(size = value,
color = term,
shape = group)) +
ggraph::geom_node_text(aes(label = toupper(label),
filter = group == 'term'),
repel = TRUE, size = 4) +
ggraph::geom_node_text(aes(label = tolower(label),
filter = group == 'feature'),
repel = TRUE, size = 3) +
ggthemes::scale_color_stata()+
ggtitle('word association norms network') +
theme(legend.position = "none")
```