MAD Topic Model

Wordless authorship recognition using topic models. MAD implements a multivalent supervised Latent Dirichlet Allocation algorithm that operates on vocabularies of n-gram stylistic and stylometric features, such as part-of-speech tags and syllable counts. MAD can be used for both author classification and exploratory analysis as the generated topic models reveal hidden structure among authors' writing styles.

Dependencies

See requirements.txt for a list of Python dependencies.

In addition, the following NLTK libraries are required for feature extraction:

cmudict
punkt
maxent_treebank_pos_tagger

Finally, the GSL library for C++ is necessary for running the actual MAD model.

Data

The data folder contains the final scraped datasets used in the paper.

For the scrapers' source code, refer to the scrapers folder. READMEs in each of the subfolders explain how to begin the programs. Project Gutenberg texts were downloaded manually because of the low number of authors needed for that portion of the project.

The slda_input_files folder contains files in a generalized format ready to be read into the SLDA model. Input files have been generated for several combinations of word types and values of $n$. These input files are generated with pipeline.py.

Models

The MAD model is implemented in models/slda. Our implementation is based on that of Chong Wang. The algorithm is mostly contained in slda.cpp, with help from opt.cpp and dirichlet.c for computing difficult gradients.

settings.h contains a list of settings, allowing for different inference schemes (variational, stochastic, and the original fixed point-variational combination), regularization (including L1 (not fully tested, but based on liblbfgs), L2, and smoothing for the Dirichlet MLE) and per-topic vocabulary distributions. settings.h also controls the number of iterations the algorithm should run. settings.h has difficulty reading the settings.txt input file, so it is best to hard code the desired settings into settings.h. A supplemental README is provided within models/slda with information on how to run the software from command line.

Feature Extraction

Python modules are provided for extracting the necessary stylistic features from text, such as part-of-speech tags and syllable counts. The majority of the feature extraction functions can be found in features/analyzer.py. In addition, a novel technique for extracting meter 8-grams is provided in features/meter.py. To allow for further analysis (after n-gram generation), features/extract.py provides functions for identifying text snippets within documents that match the given stylistic n-gram.

Name		Name	Last commit message	Last commit date
Latest commit History 229 Commits
data		data
features		features
gutenberg		gutenberg
hlda		hlda
liblbfgs-1.10		liblbfgs-1.10
models		models
pipeline		pipeline
poster		poster
report		report
scrapers		scrapers
slda_input_files		slda_input_files
visualization		visualization
.gitignore		.gitignore
README.md		README.md
converter.py		converter.py
pipeline.py		pipeline.py
requirements.txt		requirements.txt
test.py		test.py

dmrd/mad_topic_model

Folders and files

Latest commit

History

Repository files navigation

MAD Topic Model

Dependencies

Data

Models

Feature Extraction

About

Resources

Stars

Watchers

Forks

Languages