bescea: Instant text search engine

License: MIT

In a few minutes, build a quick, smart Shiny app to search through your own text data. bescea is best suited to searching a set of short comments for a query or theme. Input data should be an R data frame with one id column and one text column.
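
As a minimal sketch of the expected input shape, the data frame below has one id column and one text column. The column names textid and post_text mirror the sneapsters examples later in this README; the object name comments and the rows themselves are made up for illustration.

```r
# Hypothetical input data: one unique id column and one text column.
comments <- data.frame(
  textid = c("c001", "c002", "c003"),
  post_text = c(
    "The new dashboard makes weekly reporting much faster.",
    "Login fails whenever I reset my password.",
    "Would love more statistics and analytics tutorials."
  ),
  stringsAsFactors = FALSE  # keep text as character, not factor
)

# Basic sanity checks before handing the data to a search app:
# the text column is character and the ids are unique.
stopifnot(
  is.character(comments$post_text),
  !anyDuplicated(comments$textid)
)
```

A data frame like this could then be passed as data, with text_field = "post_text" and unique_id = "textid".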

Installation

devtools::install_github("harryahlas/bescea")

If you have not used R's reticulate package, please see the requirements section below prior to installing.

Run bescea

library(bescea)
besceaApp(data = sneapsters,         # Data frame, each document is a row/observation.
          text_field = "post_text",  # Text field from data frame
          unique_id = "textid")      # Unique identifier from data frame

Shiny App

The code above first tokenizes your text using spaCy, then builds FastText and Whoosh models. Finally, it launches a Shiny app in your browser so you can search your text. You can download an .xlsx file of your results by clicking the Download button.

Move the Smart Search - Exact Match slider to fine tune between FastText smart searching and Whoosh exact matching. This can dial in your results more precisely depending on whether you are looking for general ideas or exact words.
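
One way to picture the slider is as a weight that blends two relevance scores per document. The sketch below is an illustration only, not bescea's actual implementation: the function name blend_scores, the score vectors, and the linear weighting are all assumptions made for this example.

```r
# Hypothetical illustration: blend a "smart" (embedding-based) score with an
# "exact" (keyword-based) score using a slider weight between 0 and 1.
blend_scores <- function(smart, exact, exact_weight) {
  stopifnot(length(smart) == length(exact),
            exact_weight >= 0, exact_weight <= 1)
  (1 - exact_weight) * smart + exact_weight * exact
}

# Made-up scores for three documents, already scaled to [0, 1]
smart <- c(doc1 = 0.9, doc2 = 0.4, doc3 = 0.6)
exact <- c(doc1 = 0.1, doc2 = 1.0, doc3 = 0.5)

# Slider near "smart": the document with the best smart score ranks first
names(sort(blend_scores(smart, exact, 0.2), decreasing = TRUE))

# Slider near "exact": the document with the best exact score ranks first
names(sort(blend_scores(smart, exact, 0.8), decreasing = TRUE))
```

With these made-up numbers, the top-ranked document changes as the weight moves, which is the kind of trade-off the slider exposes.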

Longer queries tend to be more successful than short queries. If a query like "data science" isn't working, try adding supplemental words to your query, such as "data science statistics code analytics".

Build Model

You can build and save your own fastText model for reuse in later searches. Reusing a saved model shortens startup time and often improves results. This example uses the sneapsters data set to build a model.

besceaBuildModel(data = sneapsters, 
                 text_field = "post_text",
                 unique_id = "textid", 
                 min_word_count = 3,              # Only consider tokens with at least n occurrences in the corpus
                 epochs = 30,                     # Number of fastText training epochs. More is generally better but takes longer.
                 modelname = "my_fasttext_model") # Your model name, to be referred to when loading new data
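
To make the min_word_count comment concrete, here is a rough base-R sketch of frequency filtering on a toy corpus. It uses simple whitespace splitting rather than the spaCy tokenization that bescea performs, and the variable names are invented for the example.

```r
# Toy corpus; whitespace tokenization is a simplification for illustration.
docs <- c("great riff great tone", "tone matters", "great amp")
tokens <- unlist(strsplit(tolower(docs), "\\s+"))

# Count occurrences of each token across the whole corpus
counts <- table(tokens)

# Keep only tokens seen at least `min_word_count` times
min_word_count <- 2
kept <- names(counts[counts >= min_word_count])
kept
```

Raising the threshold drops rare tokens, so on a small data set a high value may leave few words to learn from.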

Run Using Prior Model

You can also use a model that you have already built, perhaps one based on a large corpus. The example below loads the model built in the previous example, which saves time because the app does not have to train a new model.

besceaApp(data = sneapsters, 
          text_field = "thread_text",
          unique_id = "textid",
          modelname = "my_fasttext_model")  # Name of already built model

Requirements

Requires RStudio with the reticulate and tidyverse R packages, and Python with the pandas, spacy 2.3 (3+ will not work), rank_bm25, tqdm, numpy, gensim, and nmslib modules. The re and pickle modules are also used; both ship with Python's standard library. spaCy's en_core_web_sm model must also be installed.
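
Before installing, it may help to check which of these Python modules reticulate can already see. The sketch below assumes reticulate is installed and pointed at the Python environment you intend to use; it only reports availability and does not check versions or the en_core_web_sm model.

```r
library(reticulate)

# Python modules named in the requirements above
# (re and pickle ship with Python's standard library).
modules <- c("pandas", "re", "spacy", "rank_bm25", "tqdm",
             "pickle", "numpy", "gensim", "nmslib")

# TRUE/FALSE per module for the Python that reticulate is using
available <- vapply(modules, py_module_available, logical(1))
print(available)

# List anything that still needs installing
not_installed <- modules[!available]
if (length(not_installed) > 0) {
  message("Missing Python modules: ", paste(not_installed, collapse = ", "))
}
```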

Acknowledgements

This work is inspired by articles from Josh Taylor (https://twitter.com/josh_taylor_01).
