discourse_atoms

This GitHub repository accompanies the research article "Integrating topic modeling and word embedding to characterize violent deaths" by Alina Arseniev-Koehler, Susan Cochran, Vickie Mays, Kai-Wei Chang, and Jacob Gates Foster. Published in PNAS: https://www.pnas.org/doi/10.1073/pnas.2108801119. Please cite this paper if you reuse any of the code. The code was written in Python 3 on Windows.

Paper Abstract: There is an escalating need for methods to identify latent patterns in text data from many domains. We introduce a method to identify topics in a corpus and represent documents as topic sequences. Discourse atom topic modeling (DATM) draws on advances in theoretical machine learning to integrate topic modeling and word embedding, capitalizing on their distinct capabilities. We first identify a set of vectors (“discourse atoms”) that provide a sparse representation of an embedding space. Discourse atoms can be interpreted as latent topics; through a generative model, atoms map onto distributions over words. We can also infer the topic that generated a sequence of words. We illustrate our method with a prominent example of underutilized text: the US National Violent Death Reporting System (NVDRS). The NVDRS summarizes violent death incidents with structured variables and unstructured narratives. We identify 225 latent topics in the narratives (e.g., preparation for death and physical aggression); many of these topics are not captured by existing structured variables. Motivated by known patterns in suicide and homicide by gender and recent research on gender biases in semantic space, we identify the gender bias of our topics (e.g., a topic about pain medication is feminine). We then compare the gender bias of topics to their prevalence in narratives of female versus male victims. Results provide a detailed quantitative picture of reporting about lethal violence and its gendered nature. Our method offers a flexible and broadly applicable approach to model topics in text data.

The Discourse Atom Topic Model builds directly on a generative model for word embeddings themselves, proposed by Sanjeev Arora and colleagues:

  • Arora, Sanjeev, et al. "A latent variable model approach to PMI-based word embeddings." Transactions of the Association for Computational Linguistics 4 (2016): 385-399.
  • Arora, Sanjeev, Yingyu Liang, and Tengyu Ma. "A simple but tough-to-beat baseline for sentence embeddings." (2016).
  • Arora, Sanjeev, et al. "Linear algebraic structure of word senses, with applications to polysemy." Transactions of the Association for Computational Linguistics 6 (2018): 483-495.
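In the latent variable model from the first reference above, the probability of emitting word w given a discourse vector c is proportional to exp(<c, v_w>), where v_w is the word's embedding. The sketch below illustrates how an atom could map onto a distribution over words under that assumption. It is not the code from this repository: the function name `atom_word_distribution`, the toy vocabulary, and all of the numbers are invented for illustration.

```python
import math


def atom_word_distribution(atom, word_vectors):
    """Softmax over the vocabulary of the inner products <atom, v_w>."""
    scores = {
        word: sum(a * b for a, b in zip(atom, vector))
        for word, vector in word_vectors.items()
    }
    # Subtract the maximum score before exponentiating for numerical stability.
    max_score = max(scores.values())
    exp_scores = {word: math.exp(s - max_score) for word, s in scores.items()}
    total = sum(exp_scores.values())
    return {word: e / total for word, e in exp_scores.items()}


# Toy 3-dimensional embedding (hypothetical words and values).
toy_vectors = {
    "medication": [0.9, 0.1, 0.0],
    "overdose": [0.8, 0.2, 0.1],
    "argument": [0.0, 0.9, 0.3],
    "weapon": [0.1, 0.7, 0.6],
}
toy_atom = [1.0, 0.0, 0.0]  # hypothetical atom vector

probs = atom_word_distribution(toy_atom, toy_vectors)
top_words = sorted(probs, key=probs.get, reverse=True)
print(top_words[:2])
```

Reading off the highest-probability words for an atom is one way to interpret it as a topic.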

The code in this repository (still in development) implements the Discourse Atom Topic Model.
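The abstract describes discourse atoms as vectors that provide a sparse representation of the embedding space, meaning each word vector is approximated by a weighted combination of only a few atoms. As a rough illustration of that idea only, the sketch below runs plain matching pursuit against a fixed, hand-written set of atoms. This is a swapped-in simplification, not the repository's implementation: it does not learn the atoms (dictionary-learning algorithms such as K-SVD alternate a sparse-coding step like this with updates to the atoms), and the helper names and toy values are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def matching_pursuit(vector, atoms, n_nonzero):
    """Greedily approximate `vector` using at most `n_nonzero` steps over unit-norm atoms.

    Returns a dict {atom index: coefficient} and the remaining residual.
    """
    residual = list(vector)
    coefficients = {}
    for _ in range(n_nonzero):
        # Pick the atom most correlated with what is still unexplained.
        best = max(range(len(atoms)), key=lambda i: abs(dot(residual, atoms[i])))
        weight = dot(residual, atoms[best])
        if abs(weight) < 1e-12:
            break  # nothing left to explain
        coefficients[best] = coefficients.get(best, 0.0) + weight
        residual = [r - weight * a for r, a in zip(residual, atoms[best])]
    return coefficients, residual


# Hypothetical atoms (normalized to unit length) and a hypothetical word vector.
atoms = [normalize(a) for a in ([1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 0])]
word_vector = [2.0, 0.0, 0.5]

coefficients, residual = matching_pursuit(word_vector, atoms, n_nonzero=2)
print(coefficients)
```

In this toy case only two of the four atoms receive nonzero weight, which is the sense in which the representation is sparse.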

We cannot share the data used in this paper, but it is available from the Centers for Disease Control and Prevention (CDC). Users may apply to the CDC directly for data access: https://www.cdc.gov/violenceprevention/datasources/nvdrs/dataaccess.html. Before running our analyses, we combined the narrative fields in the data (from law enforcement and from coroners/medical examiners) and cleaned the text extensively (removing punctuation, correcting spelling, and expanding abbreviations to full text). This makes the corpus more uniform across the many stylistic choices of the public health workers who wrote the narratives.
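For readers preparing their own corpus, the cleaning steps listed above might look roughly like the following. This is a minimal sketch and not the pipeline used for the paper: the lookup tables, the function names, and the example narratives are all invented, and lowercasing is an added assumption that the description above does not mention.

```python
import re

# Hypothetical lookup tables; a real project would need much larger,
# corpus-specific lists.
SPELLING_FIXES = {"recieved": "received"}
ABBREVIATIONS = {"v": "victim", "gsw": "gunshot wound", "le": "law enforcement"}


def clean_narrative(text):
    """Lowercase, strip punctuation, fix spelling, and expand abbreviations."""
    text = text.lower()  # assumption: case is not meaningful
    text = re.sub(r"[^\w\s]", " ", text)  # replace punctuation with spaces
    tokens = text.split()
    tokens = [SPELLING_FIXES.get(token, token) for token in tokens]
    tokens = [ABBREVIATIONS.get(token, token) for token in tokens]
    return " ".join(tokens)


def combine_and_clean(le_narrative, cme_narrative):
    """Join the two narrative fields for one incident, then clean the result."""
    return clean_narrative(le_narrative + " " + cme_narrative)


# Invented example text, not taken from the NVDRS.
cleaned = combine_and_clean(
    "V recieved a GSW.",
    "LE responded; V was found at home.",
)
print(cleaned)
```

Expanding abbreviations after tokenizing keeps the lookup simple, but it would miss multi-word or punctuated abbreviations, which need their own handling.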