GitHub - sarahjanehong/birdy: Processing raw text from Twitter to perform feature extraction on tweets for classification using the WEKA machine learning package.

There are two parts to this project, twtt.py and buildarff.py

Preprocessing, tokenizing, and tagging parts-of-speech

twtt.py takes in unprocessed tweets and removes noise by

removing all HTML tags and attributes
convert HTML character codes to ASCII equivalent (e.g. "<" is changed to "<")
all URLs are removed
@ and # tags are removed
each sentence in each tweet in on a new line
each tweet is delimited with '|'
ellipsis (...) and combinations of ! and ? are not split (e.g. "?!?!?!?!???")
each token, including punctuation and clitics, is separated by a space (e.g. "we've" changed to "we 've")
tagging each token with its part-of-speech

To run twtt.py, from the terminal, type: python twtt.py TWEETS OUTPUT

where ...

TWEETS is the input file containing tweets (unprocessed) from Twitter from a particular user
OUTPUT is the output file (*.twt) where each tweet is parsed, tokenized and tagged

e.g. python justinbieber justinbieber.twt

Feature extraction

buildarff.py takes .twt files to build an ARFF datafile. The produced ARFF datafile gathers feature information that will be used for the classification of tweets.

The features computed for each .twt files are the number of

first person pronouns
second person pronouns
third person pronouns
coordinating conjunctions
past-tense verbs
future-tense verbs
commas
colons and semi-colons
dashes
parentheses
common nouns
proper nouns
adverbs
wh-words
slang
all-caps

as well as the average length of sentences for a tweet, the average length of tokens, and the number of sentences in a tweet.

The WEKA machine learning package will be used to classify tweets given the .arff files.

To run buildarff.py, from the terminal, type: python buildarff.py -X CLASS:FILE CLASS:FILE+FILE+FILE ... OUTPUT

where ...

X denotes the number of tweets to read; exclude this if you wish to read all the tweets from each FILE
CLASS denotes the class (optional, else takes the file name)
FILE is the parsed, tokenized and tagged file (.twt) produced from twtt.py; if there is more than one file for a given class, this is denoted by a + in between each file
OUTPUT the generated arff file (*.arff)

e.g. python buildarff.py pop:justinbeiber.twt+katyperry.twt+ladygaga.twt news:cnn.twt+bbcnews.twt+nytimes.twt POPvsNEWS.arff

Name		Name	Last commit message	Last commit date
Latest commit History 32 Commits
examples		examples
tagger		tagger
tweets		tweets
word_rules		word_rules
.gitignore		.gitignore
README.md		README.md
buildarff.py		buildarff.py
helper.py		helper.py
pickledlexicon		pickledlexicon
to-do		to-do
twtt.py		twtt.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

examples

examples

tagger

tagger

tweets

tweets

word_rules

word_rules

.gitignore

.gitignore

README.md

README.md

buildarff.py

buildarff.py

helper.py

helper.py

pickledlexicon

pickledlexicon

to-do

to-do

twtt.py

twtt.py

Repository files navigation

About

Releases

Packages

Languages

sarahjanehong/birdy

Folders and files

Latest commit

History

Repository files navigation

About

Resources

Stars

Watchers

Forks

Languages