zackgalbreath/MailingListData

* Author: Zack Galbreath
* Software used: Mac OS X 10.6, Python 2.6.1, SQLite 3.6.12, PycURL 7.19.0,
                 BeautifulSoup 3.2.0, Pipermail 0.09 (Mailman edition).
* Project home: git://github.com/zackgalbreath/MailingListData.git
* MD5 checksum of the "official" database file is:
  c17555700f3c9f0a019be6a537f83bc6 
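
One way to check a local copy against the checksum above is sketched below.
This is an illustration only, not part of the project's scripts: the helper
names md5_of_file and matches_official are hypothetical, and the sketch assumes
the checksum refers to MailingListData.db exactly as distributed.

```python
import hashlib

# Checksum quoted in this README for the "official" database file.
EXPECTED_MD5 = "c17555700f3c9f0a019be6a537f83bc6"


def md5_of_file(path, chunk_size=65536):
    """Return the hex MD5 digest of the file at path, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel stops when read() returns an empty bytes object.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_official(path):
    """True if the file at path has the checksum listed in this README."""
    return md5_of_file(path) == EXPECTED_MD5
```

Calling matches_official("MailingListData.db") should return False once the
database has been regenerated or edited, since any change to the file changes
its digest.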


== Contents ==

README: The document you're currently reading
MailingListData.{db,csv,xlsx}: My dataset, in SQLite, comma-separated values
                               (CSV), and Microsoft Excel 2007-2008 format
collectData.py: Generates the dataset and stores it in an SQLite database
convertSQLiteToCSV.py: Converts the SQLite database to CSV format


== Usage notes ==

* Running collectData.py will overwrite an existing MailingListData.db.
  If you modify the database and wish to keep your changes, you should rename
  it or save it somewhere else before running the script again.

* Similarly, convertSQLiteToCSV.py overwrites MailingListData.csv.
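
Because both scripts overwrite their output, it may help to copy the existing
files aside first. The following is a minimal sketch, not something the
project ships: backup_if_exists is a hypothetical helper, and the ".bak"
naming scheme is an assumption made for illustration.

```python
import os
import shutil
import time


def backup_if_exists(path):
    """Copy path to a timestamped sibling file.

    Returns the backup path, or None if there was nothing to back up.
    """
    if not os.path.exists(path):
        return None
    stamp = time.strftime("%Y%m%d-%H%M%S")
    backup_path = "%s.%s.bak" % (path, stamp)
    # copy2 also preserves the modification time of the original file.
    shutil.copy2(path, backup_path)
    return backup_path


if __name__ == "__main__":
    # File names taken from this README; run this before either script.
    for name in ("MailingListData.db", "MailingListData.csv"):
        print(name, "->", backup_if_exists(name))
```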


== Details about the columns == 

* When recording Message_Subject, any tab character was converted to a single
  space. 

* A Received_Reply of "self only" means that the original author was the only
  person to reply to the message.

* Time_of_Day is recorded in EDT (Eastern Daylight Time) using a 24-hour clock.

* Message_Length is the number of characters in the HTML source of Archive_URL.

* Any_Attachments is set to "yes" when we detect a "non-text" attachment that is
  not a "pgp-signature".  Note that Pipermail considers C++ source code
  (MIME-type text/x-c++src) to be a "non-text" attachment.

* You should be able to read each original email in its entirety at its
  Archive_URL.  This list of URLs was taken from: 
  http://www.itk.org/pipermail/insight-users/2011-August/thread.html
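
To see how these columns are laid out in the SQLite file, the schema can be
read directly from the database. The sketch below is an illustration under
stated assumptions: describe_database is a hypothetical helper, and since this
README does not name the table, the code lists every table it finds instead of
guessing a name.

```python
import sqlite3


def describe_database(path):
    """Return {table_name: [column_name, ...]} for every table in the file."""
    # Note: sqlite3.connect creates an empty database if path does not exist.
    connection = sqlite3.connect(path)
    try:
        tables = [row[0] for row in connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
        schema = {}
        for table in tables:
            # Quote the identifier so unusual table names are handled safely.
            quoted = '"' + table.replace('"', '""') + '"'
            # Each PRAGMA table_info row is (cid, name, type, notnull, dflt_value, pk).
            columns = connection.execute("PRAGMA table_info(%s)" % quoted)
            schema[table] = [column[1] for column in columns]
        return schema
    finally:
        connection.close()


if __name__ == "__main__":
    for table, columns in describe_database("MailingListData.db").items():
        print(table, columns)
```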

== About ==

Data science project, fall 2011.
