
Tag cloud generator that extracts hot keywords from Twitter page of a Persian news agency


makbn/twitter_persian_news_tagcloud


Twitter Persian news tagcloud extraction

Final project of the Information Retrieval course.

TPNT is a tag cloud generator that extracts hot keywords from the Twitter page of a major Persian news agency, covering economic and social news, for each month of a year.


Dependencies

  • GetOldTweets-java v1.2.0
  • Lucene 7.2.1

News agency

How to Run

This project has two main steps. First, tweets are stored in a CSV file with the help of the Crawler class. This class needs some options to work properly:

Flag  Description                                    Required
-i    ID of the Twitter page                         yes
-s    Start date of extraction, format: YYYY-MM-DD   yes
-e    End date of extraction, format: YYYY-MM-DD     no
-m    Limit on the number of retrieved tweets        no
-p    Path of the CSV file                           no
-n    Name of the CSV file                           no

An example that retrieves tweets from @TasnimNews_Fa between 2018-06-01 and 2018-07-01 and writes them to the $PWD/result/ path:

java -cp ProjectNews.jar ir.ac.um.ce.projectnews.crawler.Crawler -i Tasnimnews_Fa -s 2018-06-01 -e 2018-07-01 -p result/
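
For readers who want to see how such flags can be read in Java, here is a minimal sketch. It is not the project's Crawler code: the class name CrawlerOptions and its field names are hypothetical, and optional flags are simply left as null when omitted because the defaults used by the real class are not documented here.

```java
import java.time.LocalDate;
import java.util.HashMap;
import java.util.Map;

// Hypothetical options holder; the real Crawler class may parse its flags differently.
final class CrawlerOptions {
    final String pageId;     // -i, required
    final LocalDate start;   // -s, required, YYYY-MM-DD
    final LocalDate end;     // -e, optional (null when omitted)
    final Integer maxTweets; // -m, optional (null when omitted)
    final String path;       // -p, optional (null when omitted)
    final String fileName;   // -n, optional (null when omitted)

    private CrawlerOptions(Map<String, String> flags) {
        pageId = flags.get("-i");
        // LocalDate.parse expects ISO dates such as 2018-06-01
        start = LocalDate.parse(flags.get("-s"));
        end = flags.containsKey("-e") ? LocalDate.parse(flags.get("-e")) : null;
        maxTweets = flags.containsKey("-m") ? Integer.valueOf(flags.get("-m")) : null;
        path = flags.get("-p");
        fileName = flags.get("-n");
    }

    // Reads "flag value" pairs and checks that the two required flags are present.
    static CrawlerOptions parse(String[] args) {
        if (args.length % 2 != 0) {
            throw new IllegalArgumentException("every flag needs a value");
        }
        Map<String, String> flags = new HashMap<>();
        for (int i = 0; i < args.length; i += 2) {
            flags.put(args[i], args[i + 1]);
        }
        if (!flags.containsKey("-i") || !flags.containsKey("-s")) {
            throw new IllegalArgumentException("-i and -s are required");
        }
        return new CrawlerOptions(flags);
    }
}
```

Passing the arguments of the command above to CrawlerOptions.parse would yield the page ID Tasnimnews_Fa, the two dates, and the path result/, with the tweet limit and file name left unset.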

The next step is indexing the docs. After removing stop words from the docs, we use the Searcher and Classifier classes plus a bag of words to create queries that estimate the correlation of each doc with a context. Finally, we use the most correlated words to generate a tag cloud.
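
The idea of dropping stop words and keeping the strongest terms can be illustrated with a much simpler stand-in. The sketch below does not use Lucene or the project's Searcher and Classifier classes; it swaps the correlation queries for plain term-frequency counting, and the names TagCloudSketch, countTerms and topTerms are hypothetical.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;

// Simplified stand-in: ranks words by raw frequency instead of Lucene-based correlation.
final class TagCloudSketch {

    // Counts every token that is not a stop word across all docs.
    static Map<String, Integer> countTerms(List<String> docs, Set<String> stopWords) {
        Map<String, Integer> counts = new HashMap<>();
        for (String doc : docs) {
            // Split on anything that is not a letter, combining mark, digit or
            // zero-width non-joiner (U+200C appears inside many Persian words).
            String[] tokens = doc.toLowerCase(Locale.ROOT)
                    .split("[^\\p{L}\\p{M}\\p{N}\\u200c]+");
            for (String token : tokens) {
                if (token.isEmpty() || stopWords.contains(token)) {
                    continue;
                }
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Returns the n most frequent terms; ties are broken alphabetically.
    static List<String> topTerms(Map<String, Integer> counts, int n) {
        return counts.entrySet().stream()
                .sorted(Map.Entry.<String, Integer>comparingByValue().reversed()
                        .thenComparing(Map.Entry.<String, Integer>comparingByKey()))
                .limit(n)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
                "the market price rises",
                "price of gold and the market");
        Set<String> stopWords = new HashSet<>(Arrays.asList("the", "of", "and"));
        System.out.println(topTerms(countTerms(docs, stopWords), 2)); // prints [market, price]
    }
}
```

In a tag cloud, the counts returned by countTerms could drive the font size of each word, while topTerms limits how many words are drawn.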

Contributors