GitHub - h2researchgroup/classification: This code takes in JSTOR OCR raw text and expert-generated dictionaries, creates predictive models based on hand-coded training data (per perspective), uses these to predict remaining doc labels, and visualizes overall trends.

Towards Computational Literature Reviews

A collaborative data science project headed by the Massive Data Institute's Jaren Haber and UC Berkeley professor Heather Haveman.

Table of Contents

About The Project
Guide to Codebase
Data And Data Processing
Model Training
Analysis
Acknowledgments

About The Project

Literature reviews are a vital part of research in many academic fields: they help determine what we know and what we don’t. For interdisciplinary studies, literature reviews are especially challenging because the number of publications and publication outlets is growing exponentially, to a volume that a single person cannot easily manage. Our solution is to explore how well modern machine learning techniques can review the literature automatically at a large scale.

This code takes raw JSTOR OCR text and creates predictive models, built on pretrained transformer models, from hand-coded training data (one model per perspective). We then use these models to predict labels for papers at a large scale and visualize overall trends across perspectives over time.

Guide to Codebase

Transformer-Based Approaches

Scripts to train a Longformer Model on Azure:
- modeling/Azure Files/Longformer-CV.py
- modeling/Azure Files/run.ipynb
Notebook to interactively train a BERT model:
- modeling/BERT-cross-validate.ipynb
Notebook to label all 65k unlabeled articles using a Longformer model:
- modeling/Longformer-Labeling.ipynb
Text preprocessing code (updated to retain stopwords for transformers):
- preprocess/preprocess_article_text_full.py
- preprocess/textprocess.ipynb

Machine Learning with `scikit-learn`, `keras`, and `gensim`

Notebooks to build, evaluate, and optimize models with scikit-learn:
- modeling/classifier_gridsearch.ipynb
- modeling/evaluate_basic_classifiers_balanced.ipynb
Script to build and evaluate CNN and MLP models with keras:
- modeling/mlp_train.py
Notebook to build and evaluate models with scikit-learn and word embeddings:
- modeling/word_embedding_classification_cnn.ipynb
Notebook to build and evaluate CNN and MLP models with keras and word embeddings:
- modeling/word_embedding_classification_mean.ipynb

Utilities

Notebook to select and compile sample of articles across classes using model predictions:
- modeling/sample_articles_with_models.ipynb
Scripts with functions to assist in reading text files and text preprocessing:
- preprocess/clean_text.py
- preprocess/text_to_file.py
Notebook to load and merge datasets to assemble datasets for each perspective:
- preprocess/assemble_coded_articles.ipynb
Notebook creating a csv of our file names:
- modeling/grab filenames.ipynb
CSV logs of model hyperparameters and results for each perspective, using the average word embedding as the feature vector:

Data And Data Processing

The data consists of approximately 70,000 JSTOR academic articles focusing on (1) sociology and (2) management and organizational behavior. Articles that are not full-length, are not in English, or are book reviews are excluded. Our models are then trained and evaluated on hundreds of hand-labeled articles for each sociological perspective.

We preprocessed the text minimally. BERT assigns meaning to each word based on its surrounding words through a process called self-attention, so stop words and grammatical structure carry useful signal. Preprocessing therefore keeps stop words and removes HTML and LaTeX tags from the JSTOR articles.

The script that processes the JSTOR files and saves the preprocessed files is preprocess/preprocess_article_text_full.py.
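As a rough illustration of the tag removal described above, the sketch below strips HTML tags and LaTeX commands with regular expressions while leaving every word, stop words included, in place. The patterns and the `strip_markup` name are assumptions made for this example; the actual rules are in preprocess/preprocess_article_text_full.py and may differ.

```python
import re

# Hypothetical patterns for illustration only; not taken from the repository.
HTML_TAG = re.compile(r"</?[A-Za-z][^<>]*>")          # e.g. <p>, </div>, <a href="...">
LATEX_ENV = re.compile(r"\\(?:begin|end)\{[^{}]*\}")  # e.g. \begin{equation}
LATEX_CMD = re.compile(r"\\[A-Za-z]+\*?")             # e.g. \textbf, \section*


def strip_markup(text: str) -> str:
    """Remove HTML tags and LaTeX commands; keep all words, including stop words."""
    text = HTML_TAG.sub(" ", text)
    text = LATEX_ENV.sub(" ", text)
    text = LATEX_CMD.sub(" ", text)
    # Drop leftover LaTeX braces (this also removes braces in ordinary text).
    text = text.replace("{", " ").replace("}", " ")
    # Collapse the whitespace introduced by the substitutions.
    return " ".join(text.split())


if __name__ == "__main__":
    print(strip_markup("<p>The firm is \\textbf{an} organization.</p>"))
```

Regular expressions are a blunt tool for markup; a real pipeline might prefer an HTML parser for heavily nested tags.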

Model Training

Our current training script uses the Longformer as implemented by the Hugging Face library. For a given sociological perspective, the script trains and evaluates the model with cross-validation on that perspective's dataset of labeled papers. An interactive example of this code for BERT is found in modeling/BERT-cross-validate.ipynb.
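The cross-validation loop can be sketched without any model code. The example below is a dependency-free approximation, not the logic of Longformer-CV.py: `stratified_folds`, `cross_validate`, and the `train_and_score` callback are hypothetical names, and stratifying the folds by label is an assumption. The callback stands in for the place where a transformer would be fine-tuned on the training fold and scored on the held-out fold.

```python
import random
from collections import defaultdict
from statistics import mean
from typing import Callable, Iterator, List, Sequence, Tuple


def stratified_folds(
    labels: Sequence[int], k: int = 5, seed: int = 0
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train_idx, test_idx) pairs whose class ratios track the full dataset."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds: List[List[int]] = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's indices round-robin across the k folds.
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)
    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(
            idx for j, fold in enumerate(folds) if j != i for idx in fold
        )
        yield train_idx, test_idx


def cross_validate(
    texts: Sequence[str],
    labels: Sequence[int],
    train_and_score: Callable[[list, list, list, list], float],
    k: int = 5,
) -> float:
    """Average the score returned by train_and_score over k stratified folds."""
    scores = []
    for train_idx, test_idx in stratified_folds(labels, k):
        scores.append(
            train_and_score(
                [texts[i] for i in train_idx],
                [labels[i] for i in train_idx],
                [texts[i] for i in test_idx],
                [labels[i] for i in test_idx],
            )
        )
    return mean(scores)
```

In practice, `train_and_score` would fine-tune a fresh copy of the pretrained model for each fold so that no fold sees weights trained on its own test articles.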

To train our models, we took advantage of our student Microsoft Azure computation credits. We used the platform’s machine learning servers to search the space of hyperparameters for our model to maximize cross-validation accuracy, utilizing their powerful cloud GPUs and well-used hyperparameter search framework. The files used to run these experiments are found in modeling/Azure Files/.

We have about 700 labeled training examples for each perspective, with about twice as many negative examples as positive ones. To address this class imbalance, we oversample: we bootstrap from the minority class (label 1) until the ratio of majority to minority class is 1:1. We then apply the same procedure to the test data.
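A minimal sketch of this oversampling step follows. It assumes that "bootstrap" means keeping every original example and drawing extra minority examples with replacement until both classes are the same size; the `oversample_minority` name and its signature are hypothetical and do not come from the repository.

```python
import random
from typing import List, Sequence, Tuple


def oversample_minority(
    texts: Sequence[str],
    labels: Sequence[int],
    minority_label: int = 1,
    seed: int = 0,
) -> Tuple[List[str], List[int]]:
    """Add minority examples, sampled with replacement, until classes are 1:1."""
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    deficit = len(majority) - len(minority)
    if deficit <= 0 or not minority:
        # Already balanced, or nothing to sample from.
        return list(texts), list(labels)
    extra = rng.choices(minority, k=deficit)  # bootstrap draw with replacement
    keep = list(range(len(labels))) + extra
    rng.shuffle(keep)  # avoid a block of duplicates at the end
    return [texts[i] for i in keep], [labels[i] for i in keep]


if __name__ == "__main__":
    x = ["neg"] * 6 + ["pos"] * 3
    y = [0] * 6 + [1] * 3
    balanced_x, balanced_y = oversample_minority(x, y)
    print(balanced_y.count(0), balanced_y.count(1))
```

Because duplicated minority examples are exact copies, the split into training and test folds would normally happen before oversampling so that no copy of a test article ends up in the training fold.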

Analysis

We first ran the models over ~65K unlabeled JSTOR articles from 1970 to 2016 to obtain the predicted probability that each article belongs to each of the 4 perspectives. To start our analysis, we divided the articles into 2 primary subjects, Sociology and Management & Organizational Behavior, each with 3 perspectives (demographic, cultural, and relational). Sociology articles are filtered down to organizational sociology: those whose predicted organizational score is greater than a threshold of 0.7.

To analyze trends across perspectives, we calculated the proportion of articles belonging to each perspective and primary subject and obtained the line graph above. The years 1970 and 2016 are outliers: each contains fewer than 15 articles, which produces the sharp fluctuations in some perspectives in those two years.
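The filtering and proportion steps can be sketched as follows. Only the 0.7 organizational threshold comes from the text. The record field names, the 0.5 cutoff for counting an article under a perspective, and the choice of denominator (all retained articles in the same subject and year) are assumptions made for this illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping, Tuple

ORG_THRESHOLD = 0.7       # from the text: cutoff for organizational sociology
PERSPECTIVE_CUTOFF = 0.5  # assumption: the README does not state this value
PERSPECTIVES = ("demographic", "cultural", "relational")


def yearly_proportions(
    articles: Iterable[Mapping[str, object]],
) -> Dict[Tuple[str, int], Dict[str, float]]:
    """Return {(subject, year): {perspective: share of articles above the cutoff}}."""
    totals = defaultdict(int)
    hits = defaultdict(lambda: defaultdict(int))
    for art in articles:
        # Keep only sociology articles predicted to be organizational sociology.
        if art["subject"] == "sociology" and art["organizational"] <= ORG_THRESHOLD:
            continue
        key = (art["subject"], art["year"])
        totals[key] += 1
        for perspective in PERSPECTIVES:
            if art[perspective] > PERSPECTIVE_CUTOFF:
                hits[key][perspective] += 1
    return {
        key: {p: hits[key][p] / n for p in PERSPECTIVES}
        for key, n in totals.items()
    }
```

Years with very few retained articles, such as 1970 and 2016 here, yield proportions that swing widely, so a plotting step may want to flag or drop groups below a minimum count.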

Overall, we see that the demographic perspective is the most common in sociology articles, while the relational perspective is the most common in management & OB articles. Demographic management articles seem to be in gradual decline starting in 2010, and cultural management articles seem to be in gradual growth starting in 1970. All other categories fluctuate over the years but show no general growth or decline from 1970 to 2016.

Acknowledgments
