GitHub - jaytimm/corpuslingr: A library of functions enabling complex corpus search in context (KWIC), search aggregation, bag-of-words building & keyphrase extraction.

corpuslingr: some corpus linguistics in r

A library of functions that streamlines two sets of common text-corpus tasks:

annotated corpus search of grammatical constructions and complex lexical patterns in context, and
detailed summary and aggregation of corpus search results.

search

Grammatical constructions and complex lexical patterns are formalized here (in terms of an annotated corpus) as patterns comprised of:

different types of elements (eg, form, lemma, or part-of-speech),
contiguous and/or non-contiguous elements,
positionally fixed and/or free (ie, optional) elements, or
any combination thereof.

Under the hood, corpuslingr search is regex-based & (informally) tuple-based --- akin to the RegexpParser function in Python's Natural Language Toolkit (NLTK). Regex character matching is streamlined with a simple "corpus querying language" modeled after the more intuitive and transparent syntax used in the online BYU suite of English corpora. This allows for convenient specification of search patterns comprised of form, lemma, & pos, with all of the functionality of regex metacharacters and repetition quantifiers.

At present, part-of-speech search is based on English-specific part-of-speech tags. In theory, search functionality could be made more language-generic by utilizing universal part-of-speech tags when building tuples. However, language-specific search will utlimately be more powerful/insightful.

summary

Summary functions allow users to:

aggregate search results by text & token frequency,
view search results in context (kwic),
create word embeddings/co-occurrence vectors for each search term, and
specify how search results are aggregated.

Importantly, both search and aggregation functions can be easily applied to multiple (ie, any number of) search queries.

utility

While still in development, the package should be useful to linguists and digital humanists interested in having BYU corpora-like search & summary functionality when working with (moderately-sized) personal corpora, as well as researchers interested in performing finer-grained, more qualitative analyses of language use and variation in context.

A simple shiny demo of search & summary functionaity is available here.

Here, we walk through a simple workflow from corpus creation using quicknews, corpus annotation using the cleanNLP package, and annotated corpus search & summary using corpuslingr.

library(tidyverse)
library(cleanNLP)
library(corpuslingr) #devtools::install_github("jaytimm/corpuslingr")
library(quicknews) #devtools::install_github("jaytimm/quicknews")
library(DT)

Corpus preparation & annotation

To demo the search functionality of corpuslingr, we first build a small corpus of current news articles using my quicknews package. We apply the gnews_get_meta/gnews_scrape_web functions across multiple Google News sections to build out the corpus some, and to add a genre-like dimension to the corpus.

topics <- c('nation','world', 'sports', 'science')

corpus <- lapply(topics, function (x) {
    quicknews::qnews_get_meta (language="en", country="us", type="topic", search=x)})%>%
  bind_rows() %>%
  quicknews::qnews_scrape_web ()

clr_prep_corpus

This function performs two tasks. It eliminates unnecessary whitespace from the text column of a corpus data frame object. Additionally, it attempts to trick annotators into treating hyphenated words as a single token. With the exception of Stanford's CoreNLP (via cleanNLP), annotators tend to treat hyphenated words as multiple word tokens. For linguists interested in word formation processes, eg, this is disappointing. There is likely a less hacky way to do this.

corpus <- clr_prep_corpus (corpus, hyphenate = TRUE)

Annotate via cleanNLP and udpipe

For demo purposes, we use udpipe (via cleanNLP) to annotate the corpus data frame object.

cleanNLP::cnlp_init_udpipe(model_name="english",feature_flag = FALSE, parser = "none") 
ann_corpus <- cleanNLP::cnlp_annotate(corpus$text, as_strings = TRUE, doc_ids = corpus$doc_id)

clr_set_corpus()

This function prepares the annotated corpus for complex search (as defined above) by building <token~lemma~pos> tuples and setting tuple onsets/offsets. Additionally, column names are homogenized using the naming conventions established in the spacyr package. Lastly, the function splits the corpus into a list of data frames by document. This is ultimately a search convenience.

Including text metadata in the meta parameter enables access to text characteristics when aggregating search results.

lingr_corpus <- ann_corpus$token %>%
  clr_set_corpus(doc_var='id', 
                  token_var='word', 
                  lemma_var='lemma', 
                  tag_var='pos', 
                  pos_var='upos',
                  sentence_var='sid',
                  meta = corpus[,c('doc_id','source','search')])

Some example tuple-ized text:

paste(lingr_corpus$corpus[[1]]$tup[200:204], collapse= " ")
## [1] "<intervene~intervene~VB> <during~during~IN> <a~a~DT> <Presidential~presidential~JJ> <foreign~foreign~JJ>"

clr_desc_corpus()

A simple function for describing an annotated corpus, providing some basic aggregate statistics at the corpus, genre, and text levels.

summary <- corpuslingr::clr_desc_corpus(lingr_corpus,doc="doc_id", 
                        sent="sentence_id", tok="token",upos='pos', genre="search")

Corpus summary:

summary$corpus
##    n_docs textLength textType textSent
## 1:     59      61110     8980     2640

By genre:

summary$genre
##           search n_docs textLength textType textSent
## 1:  topic_nation     11      12341     2965      563
## 2:  topic_sports     18      25469     4105     1207
## 3:   topic_world     16      14673     3562      609
## 4: topic_science     14       8627     2426      368

By text:

head(summary$text)
##    doc_id textLength textType textSent
## 1:      1       1124      471       44
## 2:      2       1592      586       83
## 3:      3        755      353       26
## 4:      4        171      110        8
## 5:      5        228      144       15
## 6:      6        317      183       22

Search & aggregation functions

Basic search syntax

The search syntax utilized here is modeled after the syntax implemented in the BYU suite of corpora. A full list of part-of-speech syntax can be viewed here.

library(knitr)
corpuslingr::clr_ref_search_egs %>% kable(escape=FALSE, format = "html")

type	search\_syntax	example
Simple form search	lime	lime
Simple lemma search	DRINK	drinks, drank, drinking
Lemma with POS search	BARK~VERB	barked, barking
Simple phrasal search	in the long run	in the long run
Phrasal search - POS/form	ADJ and ADJ	happy and healthy, political and economical
Phrasal search inc noun phrase	VERB NPHR into VBG	trick someone into believing
Phrasal search inc noun phrase	VERB PRP$ way PREP NPHR	make its way through the Senate
Suffix search	\*tion	defenestration, nation, retaliation
Infix search	\break\	breakable, heartbreaking
Optional search w/ parens and ?	MD (NEG)? HAVE been	should have been, might not have been
Multiple term search w parens and \|	PRON (HOPE\| WISH\| DESIRE)	He hoped, they wish
Multiple term search w parens and \|	House (Republicans\| Democrats)	House Republicans, House Democrats
Indeterminate wildcard search w brackets and min/max	NPHR BE (\*){1,4} ADJ	He was very, very happy; I'm not sure
Multiple optional search	(President)? (Bill)? Clinton	Clinton, President Clinton, Bill Clinton

clr_search_gramx()

Search for all instantiations of a particular lexical pattern/grammatical construction devoid of context. This function enables fairly quick search.

search1 <- "ADJ and (ADV)? ADJ"

lingr_corpus %>%
  corpuslingr::clr_search_gramx(search=search1)%>%
  select(doc_id, search, token, tag)%>% 
  slice(1:15)
## # A tibble: 15 x 4
##    doc_id search       token                       tag       
##    <chr>  <chr>        <chr>                       <chr>     
##  1 1      topic_nation healthy and prepared        JJ CC JJ  
##  2 1      topic_nation dedicated and outstanding   JJ CC JJ  
##  3 3      topic_world  harsh and brutal            JJ CC JJ  
##  4 3      topic_world  blind and deaf              JJ CC JJ  
##  5 12     topic_world  dangerous and unstable      JJ CC JJ  
##  6 12     topic_world  dangerous and unstable      JJ CC JJ  
##  7 13     topic_world  serious and irregular       JJ CC JJ  
##  8 16     topic_sports third and fourth            JJ CC JJ  
##  9 16     topic_sports second and third            JJ CC JJ  
## 10 16     topic_sports better and worse            JJR CC JJR
## 11 16     topic_sports straight and narrow         JJ CC JJ  
## 12 16     topic_sports frustrating and unavailable JJ CC JJ  
## 13 16     topic_sports effective and healthy       JJ CC JJ  
## 14 17     topic_sports second and third            JJ CC JJ  
## 15 18     topic_sports contrite and apologetic     JJ CC JJ

clr_get_freq()

A simple function for calculating text and token frequencies of search term(s). The agg_var parameter allows the user to specify how frequency counts are aggregated.

search2 <- "VERB into"

lingr_corpus %>%
  corpuslingr::clr_search_gramx(search=search2)%>%
  corpuslingr::clr_get_freq(agg_var = c('lemma'), toupper=TRUE)%>%
  head()
##         lemma txtf docf
## 1:  TURN INTO    6    3
## 2: CRASH INTO    3    1
## 3:  LOOK INTO    3    3
## 4: FORCE INTO    2    2
## 5:   GET INTO    2    2
## 6:    GO INTO    2    2

Having included metadata in the call to clr_set_corpus, we can aggregate search results, eg, by Google News topic:

search3 <- "SHOT~NOUN| BALL~NOUN| PLAY~VERB"

lingr_corpus %>%
  corpuslingr::clr_search_gramx(search=search3)%>%
  corpuslingr::clr_get_freq(agg_var = c('search','token','tag'), toupper=TRUE)%>%
  slice(1:15)
## # A tibble: 15 x 5
##    search        token   tag    txtf  docf
##    <chr>         <chr>   <chr> <int> <int>
##  1 topic_sports  SHOT    NN       25     2
##  2 topic_sports  PLAY    VB       21     8
##  3 topic_sports  SHOTS   NNS      15     2
##  4 topic_sports  BALL    NN       12     2
##  5 topic_sports  PLAYED  VBD      10     5
##  6 topic_sports  PLAYING VBG      10     4
##  7 topic_sports  PLAYED  VBN       9     7
##  8 topic_sports  PLAY    VBP       6     1
##  9 topic_sports  PLAYS   VBZ       6     4
## 10 topic_world   PLAYED  VBN       3     1
## 11 topic_nation  PLAYING VBG       1     1
## 12 topic_nation  PLAYS   VBZ       1     1
## 13 topic_science PLAY    VB        1     1
## 14 topic_science PLAYING VBG       1     1
## 15 topic_world   PLAY    VB        1     1

clr_search_context()

A function that returns search terms with user-specified left and right contexts (LW and RW). Output includes a list of two data frames: a BOW (bag-of-words) data frame object and a KWIC (keyword in context) data frame object.

Note that generic noun phrases can be included as a search term (regex below), and can be specified in the query using NPHR.

clr_ref_nounphrase
## [1] "(?:(?:DET )?(?:ADJ )*)?(?:((NOUNX )+|PRON ))"

search4 <- 'NPHR BE (NEG)? VBN'

found_egs <- corpuslingr::clr_search_context(search=search4,corp=lingr_corpus,LW=15, RW = 15)

clr_context_kwic()

Access KWIC object:

found_egs %>%
  corpuslingr::clr_context_kwic(include=c('search', 'source'))%>% 
  DT::datatable(selection="none",class = 'cell-border stripe', rownames = FALSE,width="100%", escape=FALSE)

clr_context_bow()

A function for accessing/aggregating BOW object. The parameters agg_var and content_only can be used to specify how collocates are aggregated and whether only content words are included, respectively.

search5 <- "White House"

corpuslingr::clr_search_context(search=search5,corp=lingr_corpus, LW=20, RW=20)%>%
  corpuslingr::clr_context_bow(content_only = TRUE, agg_var = c('searchLemma', 'lemma'))%>%
  head()
##    searchLemma    lemma cofreq
## 1: WHITE HOUSE  JACKSON     10
## 2: WHITE HOUSE      SAY      7
## 3: WHITE HOUSE    TRUMP      7
## 4: WHITE HOUSE DOCUMENT      3
## 5: WHITE HOUSE     JUST      3
## 6: WHITE HOUSE      KIM      3

clr_search_keyphrases()

Function for extracting key phrases from each text comprising a corpus based on tf-idf weights. The methods and logic underlying this function are described in more detail here.

The regex for key phrase search:

clr_ref_keyphrase
## [1] "(ADJ )*(NOUNX )+((PREP )(ADJ )*(NOUNX )+)?"

The user can specify the number of key phrases to extract, how to aggregate key phrases, how to output key phrases, and whether or not to use jitter to break ties among top n key phrases.

library(knitr)
lingr_corpus %>%
  corpuslingr::clr_search_keyphrases(n=5, 
                                     key_var ='lemma', 
                                     flatten=TRUE, 
                                     jitter=TRUE, 
                                     include = c('doc_id','search','source')) %>%
  slice(1:10) %>%
  kable(escape=FALSE, format = "html")

doc\_id	search	source	keyphrases
1	topic\_nation	abcnews.go.com	Jackson \| Secret Service \| allegation \| document \| Tester
2	topic\_sports	mmajunkie.com	Bellator \| MMA \| BMMA \| fight \| Danis
3	topic\_world	people.com	Otto \| complaint \| North Korea \| son \| family
4	topic\_sports	profootballtalk.nbcsports.com	Patriots \| Brown \| no. \| contribution \| pm EDT Getty
5	topic\_sports	profootballtalk.nbcsports.com	Bradford \| Rosen \| league \| pm \| injury history
6	topic\_sports	profootballtalk.nbcsports.com	Petty \| jet \| pm \| talent evaluator \| Namath
7	topic\_sports	profootballtalk.nbcsports.com	nuke \| only medium of expression \| Seahawks article \| existence \| football field
8	topic\_world	time.com	India \| Modi \| China \| Beijing \| world
9	topic\_world	bbc.com	chemical \| ban \| restriction \| be \| environment
10	topic\_world	bbc.com	Ms Rudd \| target \| Guardian \| home secretary \| memo

Name		Name	Last commit message	Last commit date
Latest commit History 478 Commits
R		R
data-raw		data-raw
data		data
man		man
.Rbuildignore		.Rbuildignore
.gitignore		.gitignore
DESCRIPTION		DESCRIPTION
NAMESPACE		NAMESPACE
README-unnamed-chunk-18-1.png		README-unnamed-chunk-18-1.png
README.Rmd		README.Rmd
README.md		README.md
corpuslingr.Rproj		corpuslingr.Rproj

jaytimm/corpuslingr

Folders and files

Latest commit

History

Repository files navigation