Lexicon Mining, Language Visualization and Semiotic Squares in Python

February 21, 2018 Talk to the Puget Sound Python Programming Group

Please see Kessler-Puppy-2018-02-21.pptx for introductory slides and a brief survey of the psychological literature on the importance of function words in lexicon mining.

The two notebooks used are written in Python 3.6. Please run

$ pip install scattertext spacy gensim

before using them.

The first notebook, Class-Association-Scores.ipynb, demonstrates how to use Scattertext to visualize term-category associations. The notebook will motivate and introduce the "Fightin' Words" formula: the Log-Odds-Ratio with an Informative Dirichlet Prior (Monroe et al. 2008). The notebook goes on to discuss Scaled F-Score and the Dense Rank Difference. Data will be used from Pang et al. (2002).
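As a rough illustration of the "Fightin' Words" score, here is a minimal pure-Python sketch of the z-scored log-odds-ratio with a Dirichlet prior, following the formulation in Monroe et al. (2008). It is not the notebook's code and does not use Scattertext. The function name `log_odds_dirichlet_z`, the `prior_scale` parameter, and the choice to use the pooled counts of both categories as the prior are assumptions made for this example; an informative prior would normally come from a separate background corpus.

```python
import math
from collections import Counter


def log_odds_dirichlet_z(counts_a, counts_b, prior_scale=1.0):
    """Z-scored log-odds-ratio with a Dirichlet prior (after Monroe et al. 2008).

    counts_a, counts_b: Counter mapping term -> count in each category.
    prior_scale: hypothetical knob that scales the prior pseudo-counts.
    Assumption: the prior for each term is its pooled count across both
    categories, standing in for counts from a background corpus.
    """
    vocab = set(counts_a) | set(counts_b)
    prior = {w: prior_scale * (counts_a[w] + counts_b[w]) for w in vocab}

    n_a = sum(counts_a.values())  # total tokens in category A
    n_b = sum(counts_b.values())  # total tokens in category B
    a_0 = sum(prior.values())     # total prior pseudo-count

    z_scores = {}
    for w in vocab:
        y_a, y_b, a_w = counts_a[w], counts_b[w], prior[w]
        # Smoothed log-odds of the term within each category.
        log_odds_a = math.log((y_a + a_w) / (n_a + a_0 - y_a - a_w))
        log_odds_b = math.log((y_b + a_w) / (n_b + a_0 - y_b - a_w))
        delta = log_odds_a - log_odds_b
        # Approximate variance of the difference in log-odds.
        variance = 1.0 / (y_a + a_w) + 1.0 / (y_b + a_w)
        z_scores[w] = delta / math.sqrt(variance)
    return z_scores


# Toy data, invented for illustration only.
positive = Counter("a great great film".split())
negative = Counter("a terrible terrible film".split())

scores = log_odds_dirichlet_z(positive, negative)
for term, z in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{term:10s} {z:+.3f}")
```

Terms with large positive scores are associated with the first category, large negative scores with the second, and terms used equally in both land near zero. Dividing by the estimated standard deviation is what keeps rare terms from dominating the ranking.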

The second notebook, Explore-Headlines.ipynb, shows how to use Scattertext to visualize the interactions between a number of document categories. The example used will be headlines posted to Facebook accounts from a variety of publishers in 2016. The data is taken verbatim from Max Woolf's data set, available at https://github.com/minimaxir/clickbait-cluster under the MIT license.

I've included a notebook exploring toxic comment classification from a recent Kaggle competition: Toxic-Comments.

References

Cindy K. Chung and James W. Pennebaker. 2012. Counting Little Words in Big Data: The Psychology of Communities, Culture, and History. EASP.
Susan C. Herring and Anna Martinson. 2004. Assessing Gender Authenticity in Computer-Mediated Language Use: Evidence From an Identity Game. Journal of Language and Social Psychology.
Dan Jurafsky, Victor Chahuneau, Bryan Routledge, and Noah Smith. 2014. Narrative framing of consumer sentiment in online restaurant reviews. First Monday.
Jason S. Kessler. 2017. Scattertext: a Browser-Based Tool for Visualizing how Corpora Differ. ACL System Demonstrations.
L. McInnes and J. Healy. 2018. UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction. ArXiv e-prints 1802.03426.
Burt L. Monroe, Michael P. Colaresi, and Kevin M. Quinn. 2008. Fightin’ words: Lexical feature selection and evaluation for identifying the content of political conflict. Political Analysis.
M. L. Newman, C. J. Groom, L. D. Handelman, and J. W. Pennebaker. 2008. Gender Differences in Language Use: An Analysis of 14,000 Text Samples.
Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. 2002. Thumbs up? Sentiment Classification using Machine Learning Techniques, EMNLP.
James W. Pennebaker, Carla J. Groom, Daniel Loew, and James M. Dabbs. 2004. Testosterone as a Social Inhibitor: Two Case Studies of the Effect of Testosterone Treatment on Language. J Abnorm Psychol.
