A Generative Probabilistic Model for NLP

inurutdinov/eaa

Endogenous Attention Allocation

This repository contains the description and implementation of Endogenous Attention Allocation (EAA), a generative probabilistic model for natural language processing (NLP). I developed the model as part of my master's thesis, Strategic Issue Selection and Ideological Polarization: Evidence from the Congressional Record Data, which I defended at the New Economic School, Moscow, in 2013.

Essentially, EAA modifies the Latent Dirichlet Allocation (LDA) model by incorporating document-level features. On the one hand, such features can improve the interpretability of the resulting topics. On the other hand, they can be of substantive interest in their own right, e.g., if one wants to understand how topics vary across individuals and over time. The model uses nonconjugate priors. The optimization procedure relies on variational approximations and yields empirical Bayes estimates of the parameters. The code uses a combination of Python and Cython, resulting in relatively fast performance. I thank Radim Řehůřek, whose gensim library provided inspiration for certain parts of the code.
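To make the idea of an LDA-style model with document-level features more concrete, here is a minimal sketch of one possible generative process. It is not the EAA specification (see eaa.pdf for that). It assumes, purely for illustration, a logistic-normal prior in which each topic's score is linear in the document's features plus Gaussian noise. All names (`generate_document`, `coef`, `topic_word`, `noise_sd`) and all numbers are hypothetical.

```python
import math
import random


def softmax(scores):
    """Map real-valued scores to probabilities that sum to one."""
    top = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def generate_document(features, coef, topic_word, n_words, noise_sd, rng):
    """Draw topic proportions and words for one document.

    Assumption (not taken from the thesis): topic scores are linear in the
    document features plus Gaussian noise, i.e. a logistic-normal prior.
    """
    n_topics = len(coef)
    scores = [
        sum(c * x for c, x in zip(coef[k], features)) + rng.gauss(0.0, noise_sd)
        for k in range(n_topics)
    ]
    theta = softmax(scores)  # document-specific topic proportions
    words = []
    for _ in range(n_words):
        # Pick a topic for this word, then a word id from that topic.
        z = rng.choices(range(n_topics), weights=theta)[0]
        w = rng.choices(range(len(topic_word[z])), weights=topic_word[z])[0]
        words.append(w)
    return theta, words


# Hypothetical toy setup: 3 topics, 2 document features, 4-word vocabulary.
coef = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
topic_word = [
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.1, 0.4, 0.4],
]
theta, words = generate_document(
    features=[1.0, 0.0],
    coef=coef,
    topic_word=topic_word,
    n_words=20,
    noise_sd=0.5,
    rng=random.Random(0),
)
```

Because the topic proportions pass through a softmax of Gaussian scores rather than coming from a Dirichlet draw, the prior is not conjugate to the multinomial word likelihood. This is one common reason for turning to variational approximations in models of this kind.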

I wrote the thesis during 2012-2013. EAA was developed independently of what later became known as the Structural Topic Model, a related approach that also builds on LDA. My primary goal was to understand when and why legislators in the U.S. Congress prioritized certain issues over others, as reflected in their speeches on the House or Senate floor. I conceptualized issue selection as a discrete choice problem in a random utility setting, with substantive political issues roughly corresponding to the topics of speech transcripts. The model was trained and validated on Congressional Record speech transcripts from the 110th Congress (2007-2008). A slightly edited version of the thesis, which describes the algorithm itself, the setting, the data, and the main results, accompanies the code (eaa.pdf). The repository also contains additional scripts for preprocessing and merging the raw data.
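As a generic illustration of the random utility framing, and not of the thesis's actual specification, the sketch below simulates a chooser who picks the issue with the highest utility after adding independent Gumbel noise. It is a standard result that the resulting choice shares match multinomial logit (softmax) probabilities. The utility values and function names are hypothetical.

```python
import math
import random


def softmax(utilities):
    """Multinomial logit choice probabilities implied by the utilities."""
    top = max(utilities)
    exps = [math.exp(u - top) for u in utilities]
    total = sum(exps)
    return [e / total for e in exps]


def simulate_choices(utilities, n_draws, rng):
    """Share of draws in which each option has the highest noisy utility."""
    counts = [0] * len(utilities)
    for _ in range(n_draws):
        noisy = []
        for u in utilities:
            # Standard Gumbel draw via inverse transform; the max() guards
            # against log(0) in the unlikely event that random() returns 0.
            uniform = max(rng.random(), 1e-300)
            noisy.append(u - math.log(-math.log(uniform)))
        counts[noisy.index(max(noisy))] += 1
    return [c / n_draws for c in counts]


# Hypothetical deterministic utilities for three issues.
utilities = [1.0, 0.0, -0.5]
shares = simulate_choices(utilities, 20000, random.Random(0))
probs = softmax(utilities)
```

With enough draws, `shares` should be close to `probs`, which is what links a discrete choice over issues to softmax-style topic proportions.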

I stopped developing EAA in 2013. To run the algorithm on the Congressional Record data, execute the script run_eaa.py (make sure to edit the paths and parameters in user_config.py as necessary). Note: with modern versions of the required Python libraries, the code may need some modifications.