As a final project for a Data Visualization class, we analyzed senior theses.
Past senior theses were obtained from the Tufts Digital Library archive and grouped by subject (Biology, Psychology, Economics, English, and International Relations).
In general, theses appear to have a similar structure across departments, but disciplines differ in magnitude. Specifically, Biology and Economics theses use longer words than average, and International Relations (IR) and English theses are much longer than the average Tufts thesis.
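
The two measures mentioned above (thesis length and word length) can be computed from plain text roughly as follows. This is a minimal sketch, not the code used in the project: the function names, the word-matching regex, and the choice to average per-thesis means within each department are all assumptions made for illustration.

```python
import re
from collections import defaultdict
from statistics import mean

# Assumption: a "word" is a run of letters, optionally with one internal apostrophe.
WORD_RE = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?")


def thesis_stats(text):
    """Return (word_count, mean_word_length) for one thesis's plain text."""
    words = WORD_RE.findall(text)
    if not words:
        return 0, 0.0
    return len(words), mean(len(w) for w in words)


def department_averages(theses):
    """Average the per-thesis stats within each department.

    `theses` is an iterable of (department, text) pairs (hypothetical input shape).
    Returns {department: (mean_word_count, mean_word_length)}.
    """
    by_dept = defaultdict(list)
    for dept, text in theses:
        by_dept[dept].append(thesis_stats(text))
    return {
        dept: (mean(count for count, _ in stats), mean(length for _, length in stats))
        for dept, stats in by_dept.items()
    }
```

Calling department_averages with a list of (department, text) pairs gives one pair of averages per department, which can then be compared against the average over all theses.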
To see visualizations and further results, navigate to http://denalirao.github.io/thesis-Visualizations/ .
thesis_parser.py is the script used to extract text from the theses.