Project for the Language Technology Course of the Department of Computer Engineering & Informatics.
Python scripts that crawl news from online newspapers, then clean and vectorize the texts to build an inverted index, stored as an XML file, for searching articles by keyword.
- Crawls the news that appears on the front page of online newspapers (NBC News, IFLS) and saves the articles in a database.
- Tokenizes each article from the database and then finds the POS tag for every token.
- Saves the POS-tagged articles in a JSON file.
- Reads the JSON file with the POS tags.
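- Removes tokens whose tags belong to closed_class_category (function words such as determiners and prepositions).
- Lemmatizes the remaining tokens, whose tags belong to open_class_category.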
- Joins the lemmatized words of each article and removes punctuation.
- Vectorizes the articles and calculates the TF-IDF value of each word.
- Creates an inverted index as an XML file for later article searches.
- Given one or more query words, returns the URLs of the most relevant articles, ranked by relevance.
- Runs queries automatically to measure response time.
- Built with: Python, VSC, XAMPP, MySQL, NLTK, scikit-learn, Beautiful Soup
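
The vectorization, inverted-index, query, and timing steps listed above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it uses a hand-written TF-IDF (plain `tf * ln(N / df)`) instead of scikit-learn so that it runs with the standard library only, so the weights will differ from what scikit-learn produces. The sample articles, the example URLs, the XML element names (`inverted_index`, `lemma`, `document`), and all function names are assumptions made for the example.

```python
import math
import time
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict

# Hypothetical stand-in for the lemmatized articles (URL -> cleaned text).
ARTICLES = {
    "https://example.com/a1": "nasa launch rocket moon",
    "https://example.com/a2": "rocket engine test fail",
    "https://example.com/a3": "moon eclipse visible tonight",
}


def build_inverted_index(articles):
    """Map each lemma to {url: TF-IDF weight} (simple, unsmoothed TF-IDF)."""
    n_docs = len(articles)
    term_counts = {url: Counter(text.split()) for url, text in articles.items()}

    # Document frequency: in how many articles each lemma occurs.
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())

    index = defaultdict(dict)
    for url, counts in term_counts.items():
        total = sum(counts.values())
        for term, count in counts.items():
            tf = count / total
            idf = math.log(n_docs / doc_freq[term])  # 0 if the lemma is in every article
            index[term][url] = tf * idf
    return index


def index_to_xml(index):
    """Serialize the index as an XML tree (element names are assumptions)."""
    root = ET.Element("inverted_index")
    for term in sorted(index):
        lemma = ET.SubElement(root, "lemma", name=term)
        for url, weight in index[term].items():
            ET.SubElement(lemma, "document", id=url, weight=f"{weight:.6f}")
    return ET.ElementTree(root)


def search(index, query):
    """Return article URLs ordered by the summed weight of the query words."""
    scores = defaultdict(float)
    for term in query.lower().split():
        for url, weight in index.get(term, {}).items():
            scores[url] += weight
    return sorted(scores, key=scores.get, reverse=True)


def average_response_time(index, queries):
    """Run every query once and return the mean time per query in seconds."""
    start = time.perf_counter()
    for query in queries:
        search(index, query)
    return (time.perf_counter() - start) / len(queries)


index = build_inverted_index(ARTICLES)
print(search(index, "rocket moon"))
print(average_response_time(index, ["rocket", "moon eclipse", "nasa"]))
```

Calling `index_to_xml(index).write("inverted_index.xml")` would write the tree to disk; the file name here is also only an example. Summing the weights of the query words is one simple ranking choice; cosine similarity against a query vector would be an alternative.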