tawlk_text_extractor

MATLAB-based text extractor for TAWLK-format social media data files

README

This code extracts text from social media data files gathered using the software Kral. The source files are ASCII text and formatted as follows:

The file 'script extract unique content.m' extracts just the post_content to an output file, with a little pre-processing to remove duplicate posts and http://t.co/* links. Removing the links helps identify unique posts, since retweets of the same post may carry different t.co links.
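The pre-processing described above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the sample posts and the variable names `posts`, `no_links`, and `unique_posts` are hypothetical, and it assumes the post_content fields have already been read into a cell array of strings.

```matlab
% Hypothetical sample posts; in practice these would be the post_content
% fields read from a Kral data file.
posts = { ...
    'RT @user: Earthquake in Guatemala http://t.co/abc123', ...
    'RT @user: Earthquake in Guatemala http://t.co/xyz789', ...
    'Felt the shaking here too'};

% Remove every http://t.co/* link (the link runs up to the next whitespace).
no_links = regexprep(posts, 'http://t\.co/\S*', '');

% Collapse leftover runs of whitespace and trim the ends, so that posts
% differing only in their links compare as equal.
no_links = strtrim(regexprep(no_links, '\s+', ' '));

% Keep the first occurrence of each post, preserving the original order.
unique_posts = unique(no_links, 'stable');
```

Stripping the links before calling `unique` is what lets the two retweets above collapse into a single entry; with the links left in place they would be treated as distinct posts.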

Once you have extracted the data, a word cloud generator can display it nicely. For example, the contents of the output file can be pasted into http://www.wordle.net/create.

A few example files are included. They contain earthquake data gathered on November 7, 2012: one file from well before the Guatemala earthquake (baseline), one from just before it, one at peak data, and one from a few hours afterwards. The timestamp in each file name indicates the time, and the files are for 30-minute increments. PDFs of the Wordle outputs for the 'script extract unique content.m' file are also included.

Enjoy!

More information on Kral, and the related company Tawlk, can be found at http://www.tawlk.com or in the paper "Hybrid Browser / Server Collection of Streaming Social Media Data for Scalable Real-Time Analysis", RAMSS 2012 workshop at ICWSM-12, 4 June 2012, by Lance Reagan Vick, Titus Soporan, Daniel Robert Lewis and Jane Brooks Zurn. http://www.aaai.org/ocs/index.php/ICWSM/ICWSM12/paper/view/4787
