GitHub - nealcaren/quant-text-fall-2014

**Soci 950 - Words to Numbers: Quantitative Text Analysis**
Fall 2014
Hamilton Hall 150
Monday, Wednesday 1-2:15pm
Neal Caren neal.caren@unc.edu

There's nothing new about sociologists using text as data. Traditionally, source materials have been primary sources (e.g., interviews) or secondary sources (e.g., newspapers), and scholars have used programs like NVivo or ATLAS.ti to help make sense of the data. Over the last decade, however, researchers from a variety of disciplines have increasingly turned to a more algorithmic analysis of texts. This new focus on the quantitative analysis of text (along with network analysis and agent-based modeling) forms the basis of computational social science. Not coincidentally, the ability of social scientists to collect text corpora has also grown over the last decade. This combination of new methods and new sources of data presents a unique opportunity for social scientists to find new answers to old questions and to start asking new questions.

The primary learning goal for this course is for students to develop the ability to apply appropriate quantitative textual analysis techniques to a social scientific question. In other words, you should be able to write a publishable paper that involves the quantitative analysis of text. Specifically, I expect that by the end of the semester you will be:

Able to collect, store and manipulate data from text files, web pages, and web application programming interfaces (APIs);
Familiar with the major methods of text analysis;
Knowledgeable about relevant machine learning techniques;
Able to apply relevant analytic methods to appropriate social scientific questions.

Between most class meetings you will have to do some combination of the following. First, you will be reading contemporary examples of social scientific research that employs the relevant methods. Second, you will be reading the code of worked examples. Quite often, this code will be in the form of IPython notebooks. Third, you will have to produce some code yourself. For the first few days, this code will take advantage of the Codecademy Python MOOC. After that, you'll be writing your own code that you will either bring to class or email to me ahead of class. Finally, we'll spend the last section of the course working on a pair of studies. For each, you and your partner are responsible for presenting your code and findings.

Half of your grade will be based on the daily homework assignments, which are marked with an H on the syllabus. The other half of your grade is based on the two projects. For each project, you will be expected to present your findings and hand in a well-commented IPython notebook so that someone can replicate your findings. The first project involves an analysis of political emails. The second involves contemporary newspaper data. In both cases, you will develop an interesting sociological puzzle, collect the appropriate raw data, and then analyze the data to explore your puzzle. You may work with a partner, but you can't have the same partner for both projects. Depending on the flow of the course and student interest, we might end up doing something else for one or both of these projects.

I've put together a list of online Python tutorials that are accessible to social scientists. You might find some helpful when looking for additional information about a topic.

Wednesday, 8/20 - Introductions

Monday, 8/25 - The Basics

Read The Need for Openness in Data Journalism by Brian Keegan
Download and install the Continuum Python distribution.
Codecademy. Sign up for the Python course. Do the first six lessons -- up to, and including, PygLatin. Note: Feel free to skip the Boolean Operators section.
H - Email me your "Great job finishing" certificate from codecademy.

Wednesday, 8/27 - More Python

Read Light, Ryan. "From Words to Networks and Back: Digital Text, Computational Social Science, and the Case of Presidential Inaugural Addresses." Social Currents (2014)
Codecademy. Next two sections: Functions (Functions & Taking a Vacation) and Lists & Dictionaries (Python Lists and Dictionaries and A Day at the Supermarket).
H - Email me your "Great job finishing" certificate from codecademy.

Wednesday, 9/3 - Even more Python

Codecademy. A few more sections: Lists and Functions (Lists and Functions and Battleship!); Loops (Loops and Practice makes Perfect); Exam Statistics; and Advanced Topics in Python (feel free to skip the Lambdas section).
H - Email me your "Great job finishing" certificate from codecademy.

Monday, 9/8 - Getting Data when they want to give it to you

Codecademy. Take the Placekitten API course. Then, pick another API class that uses Python, such as NPR, Sunlight Foundation, or NHTSA. Take that one.
H - Email me your "Great job finishing" certificate from codecademy.

Wednesday, 9/10 - More on APIs

Sushi Bars and Yelp. Sign up to be a developer on Yelp.
H - Bring an IPython notebook to class that lists the coffee bars in your home town. Feel free to copy and paste from me. Note: You might need to install oauth2.

Monday, 9/15 - Twitter API

Mining Twitter.
Get your own consumer key, consumer secret, access token, and access token secret from Twitter.
H - Bring an IPython notebook to class that does something cool with tweepy.

Wednesday, 9/17 - Getting Data when they may not want to give it to you

Web Scraping in Python.
H - Bring an IPython notebook to class that downloads a web page.
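
A minimal sketch of the download step, using only the Python 3 standard library. The function name `fetch_page`, the timeout, and the User-Agent string are illustrative assumptions, not part of the course materials.

```python
from urllib.request import Request, urlopen


def fetch_page(url, timeout=10):
    """Download a URL and return its body as text."""
    # Some servers reject requests without a User-Agent header.
    # The value below is a placeholder chosen for this sketch.
    request = Request(url, headers={"User-Agent": "soci950-homework-sketch"})
    with urlopen(request, timeout=timeout) as response:
        raw = response.read()
        # Use the charset the server declares, falling back to UTF-8.
        charset = response.headers.get_content_charset() or "utf-8"
    return raw.decode(charset, errors="replace")
```

Calling `fetch_page` with the address of a page you are allowed to download returns its HTML as a string, which can then be saved to disk or parsed.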

Monday, 9/22 - More web scraping

Pinterest Analysis about the NYFW Fall 2014
H - Bring in an IPython notebook that scrapes the UNC sociology faculty emails.
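
One possible approach to pulling email addresses out of a downloaded page is a regular expression. The sketch below assumes the page is already available as a string; the HTML snippet and the addresses in it are made up for illustration, and the pattern is deliberately loose rather than a full email validator.

```python
import re

# Loose pattern for email addresses. It will miss obfuscated forms
# such as "name at unc dot edu".
EMAIL_PATTERN = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(html):
    """Return unique, lowercased addresses in order of first appearance."""
    seen = []
    for address in EMAIL_PATTERN.findall(html):
        address = address.lower()
        if address not in seen:
            seen.append(address)
    return seen


# Hypothetical snippet standing in for a downloaded faculty page.
sample_html = """
<li><a href="mailto:Jane.Doe@example.edu">Jane Doe</a></li>
<li>John Roe, <a href="mailto:jroe@example.edu">jroe@example.edu</a></li>
"""

emails = extract_emails(sample_html)
print(emails)
```

Deduplicating matters here because an address often appears twice on a page, once in the `mailto:` link and once in the visible text.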

Wednesday, 9/24 - Counting words

Read Ryan C. Black, Sarah A. Treul, Timothy R. Johnson, and Jerry Goldman. Emotions, oral arguments, and Supreme Court decision making. The Journal of Politics, 73(2):572–581, April 2011.
Read Monroe, Burt, Michael Colaresi, and Kevin Quinn. 2008. “Fightin’ Words: Lexical Feature Selection and Evaluation for Identifying the Content of Political Conflict”. Political Analysis 16(4)
Using Python to see how the Times writes about men and women
H - Bring an IPython notebook to class that loads a corpus and tokenizes the texts.
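
A minimal sketch of tokenizing a small corpus and counting words with the standard library. The two documents are invented for illustration, and the regular expression is one simple tokenization choice among many (it keeps letters and apostrophes and drops everything else).

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase a text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


# Hypothetical two-document corpus. In practice each string would be
# read from a file.
corpus = {
    "doc1": "The senator said she would vote. The vote is Monday.",
    "doc2": "He said the bill wouldn't pass.",
}

tokens_by_doc = {name: tokenize(text) for name, text in corpus.items()}

# Pool the tokens from every document into one frequency table.
word_counts = Counter(
    token for tokens in tokens_by_doc.values() for token in tokens
)

print(tokens_by_doc["doc2"])
print(word_counts.most_common(3))
```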

Monday, 9/29 - Sentiment Analysis

Read Golder, Scott A., and Michael W. Macy. "Diurnal and seasonal mood vary with work, sleep, and daylength across diverse cultures." Science 333.6051 (2011): 1878-1881.
Read Dodds, Peter and Christopher Danforth. 2009. “Measuring the Happiness of Large-Scale Written Expression: Songs, Blogs, and Presidents”. Journal of Happiness Studies 11, 4. 441-456
Sentiment analysis
H - Bring an IPython notebook with a function that returns the sentiment of a text.
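
A sketch of one simple, dictionary-based way to score sentiment: count positive and negative words and normalize by text length. The word lists here are tiny and made up for illustration; a real analysis would load a much larger lexicon. This is not necessarily the scoring used in the readings.

```python
import re

# Tiny hypothetical lexicons, for illustration only.
POSITIVE_WORDS = {"good", "great", "happy", "love", "wonderful"}
NEGATIVE_WORDS = {"bad", "awful", "sad", "hate", "terrible"}


def sentiment(text):
    """Return (positive - negative) word count divided by total words.

    Scores fall between -1 and 1. Empty text scores 0.0.
    """
    words = re.findall(r"[a-z']+", text.lower())
    if not words:
        return 0.0
    positive = sum(word in POSITIVE_WORDS for word in words)
    negative = sum(word in NEGATIVE_WORDS for word in words)
    return (positive - negative) / len(words)


print(sentiment("What a great, happy day"))
print(sentiment("I hate this awful weather"))
```

Dividing by the number of words keeps long and short texts comparable, at the cost of ignoring negation ("not good") and intensity.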

Wednesday, 10/1 - Classification

Basic principles of machine learning
Predicting customer churn with scikit-learn
Text Classification with Naïve Bayes
H - Bring an IPython notebook that predicts text categories based on a given dataset.
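
To make the idea behind Naïve Bayes text classification concrete, here is a from-scratch sketch of a multinomial Naïve Bayes classifier with add-one smoothing, written with the standard library rather than scikit-learn. The function names, training texts, and labels are all hypothetical.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


def train(documents, labels):
    """Count documents per label and words per label."""
    label_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for text, label in zip(documents, labels):
        word_counts[label].update(tokenize(text))
    vocabulary = {word for counts in word_counts.values() for word in counts}
    return label_counts, word_counts, vocabulary


def predict(text, model):
    """Return the label with the highest log posterior."""
    label_counts, word_counts, vocabulary = model
    total_docs = sum(label_counts.values())
    scores = {}
    for label, doc_count in label_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(doc_count / total_docs)  # log prior
        for word in tokenize(text):
            if word not in vocabulary:
                continue  # ignore words never seen in training
            # Add-one smoothing avoids log(0) for words unseen in this label.
            likelihood = (word_counts[label][word] + 1) / (
                total_words + len(vocabulary)
            )
            score += math.log(likelihood)
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical training data, made up for illustration.
train_texts = [
    "donate now to support our campaign",
    "chip in five dollars before the deadline",
    "the committee will meet on monday",
    "minutes from the last committee meeting",
]
train_labels = ["fundraising", "fundraising", "meeting", "meeting"]

model = train(train_texts, train_labels)
print(predict("please donate before the deadline", model))
print(predict("committee meeting on monday", model))
```

Working in log probabilities avoids numerical underflow when many small word probabilities are multiplied together.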

Monday, 10/6 - Feature Selection

Testing and Validation in Scikit-Learn
Machine Learning with Scikit-Learn: Validation and Model Selection
H - Bring an IPython notebook that improves on your classification machine using validation methods.
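
A sketch of k-fold cross-validation written from scratch, to show what validation helpers do under the hood. The function names and the majority-class baseline are assumptions for illustration. The folds are formed by taking every k-th item, so real data should usually be shuffled first.

```python
from collections import Counter


def k_fold_splits(n_items, k):
    """Yield (train_indices, test_indices) pairs for k roughly equal folds."""
    indices = list(range(n_items))
    for fold in range(k):
        test = indices[fold::k]  # every k-th item, starting at `fold`
        test_set = set(test)
        train = [i for i in indices if i not in test_set]
        yield train, test


def cross_validate(items, labels, k, fit, predict):
    """Return the mean held-out accuracy of a classifier across k folds."""
    accuracies = []
    for train, test in k_fold_splits(len(items), k):
        model = fit([items[i] for i in train], [labels[i] for i in train])
        correct = sum(predict(model, items[i]) == labels[i] for i in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / len(accuracies)


# Baseline "classifier" that always predicts the most common training label.
def fit_majority(train_items, train_labels):
    return Counter(train_labels).most_common(1)[0][0]


def predict_majority(model, item):
    return model


# Hypothetical data, made up for illustration.
texts = ["t%d" % i for i in range(10)]
labels = ["a"] * 7 + ["b"] * 3

baseline_accuracy = cross_validate(
    texts, labels, k=5, fit=fit_majority, predict=predict_majority
)
print(baseline_accuracy)
```

A majority-class baseline like this gives a floor to compare against: a text classifier is only useful if its cross-validated accuracy clearly beats it.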

Wednesday, 10/8 - Machine Learning Algorithms

Read D’Orazio, Vito, et al. "Separating the Wheat from the Chaff: Applications of Automated Document Classification Using Support Vector Machines." Political Analysis 22.2 (2014): 224-242.
Supervised Learning In-Depth: SVMs and Random Forests
Basic Random Forest Model
H - Bring an IPython notebook that improves on your classification machine using different classification algorithms.

Monday, 10/13 - Review

Wednesday, 10/15 - Review

Monday, 10/20 - Text clustering

https://de.dariah.eu/tatom/working_with_text.html
Unsupervised Learning In-depth: PCA and K-Means
H - Text clustering homework.
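
Clustering methods such as k-means depend on a measure of how similar two documents are. The sketch below computes cosine similarity between raw term-count vectors, which is one common choice. The documents are invented, and this is a building block rather than a full clustering algorithm.

```python
import math
import re
from collections import Counter


def term_counts(text):
    """Represent a text as a bag of lowercase word counts."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_similarity(counts_a, counts_b):
    """Cosine of the angle between two term-count vectors.

    Returns 0.0 if either vector is empty.
    """
    dot = sum(counts_a[word] * counts_b[word] for word in counts_a)
    norm_a = math.sqrt(sum(v * v for v in counts_a.values()))
    norm_b = math.sqrt(sum(v * v for v in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical documents, made up for illustration.
docs = [
    "the court heard oral arguments",
    "oral arguments before the court",
    "the team won the game",
]
vectors = [term_counts(doc) for doc in docs]

# Compare every pair of documents once.
for i in range(len(docs)):
    for j in range(i + 1, len(docs)):
        print(i, j, round(cosine_similarity(vectors[i], vectors[j]), 2))
```

The two court documents share most of their words and score much higher than either does against the sports sentence, whose only overlap is the stop word "the". That is one reason stop-word removal or TF-IDF weighting is often applied before clustering.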

Wednesday, 10/22 - Topic Models in Mallet

Monday, 10/27 - Topic Models in Python

Wednesday, 10/29 - Project I - Analyzing Political Discourse

Monday, 11/3

Read Bail, Christopher A. "The Fringe Effect Civil Society Organizations and the Evolution of Media Discourse about Islam since the September 11th Attacks." American Sociological Review 77.6 (2012): 855-879.

Wednesday, 11/5

Read Spirling, Arthur. "US treaty making with American Indians: Institutional change and relative power, 1784–1911." American Journal of Political Science 56.1 (2012): 84-97.

Monday, 11/10

Read Grimmer, Justin, and Brandon M. Stewart. "Text as data: The promise and pitfalls of automatic content analysis methods for political texts." Political Analysis (2013): mps028.

Wednesday, 11/12

Student presentations

Monday, 11/17 - Project II - Realtime Sociology

Install storytracker.

Wednesday, 11/19

Read Grimmer, Justin. "Measuring Representational Style in the House: The Tea Party, Obama and Legislators’ Changing Expressed Priorities." (2014).

Monday, 11/24

Read Levy, Karen EC, and Michael Franklin. "Driving Regulation: Using Topic Models to Examine Political Contention in the US Trucking Industry." Social Science Computer Review (2013): 0894439313506847.

Monday, 12/1

TBA

Wednesday, 12/3

Student presentations

Repository files (37 commits):
.ipynb_checkpoints
times_data
Gender_Cites.ipynb
Gender_Cites_Pandas.ipynb
README.md
Topic_Modeling_Options.ipynb
day_10.ipynb
day_11.ipynb
day_12.ipynb
day_13.ipynb
day_15.ipynb
day_2.ipynb
day_2_answers.ipynb
day_3.ipynb
day_3_answers.ipynb
day_4.ipynb
day_4_answers.ipynb
day_5.ipynb
day_5_answers.ipynb
day_6.ipynb
day_7.ipynb
day_8.ipynb
donor_emails.txt
emails.json
liwc.py
liwc_sample.ipynb
negative.csv
npr.ipynb
pandas_titanic.ipynb
positive.csv
read.md
soc_articles_details.tsv
stop_words.ipynb
stopwords.pickle
times_abortion.json
topic_modeling.ipynb
topic_modeling_in_python.ipynb
