karavokyrismichail/NLP---Language_Technology_Project

NLP --- Language_Technology_Project

Project for the Language Technology Course of the Department of Computer Engineering & Informatics.

Description

Python scripts that crawl news articles from online newspapers, then clean and vectorize the texts to build an inverted index, stored as an XML file, for searching articles by keyword.

Scripts

crawler

  • Crawls the news articles that appear on the front pages of online newspapers (NBC News, IFLS) and saves them in a database.
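
The crawl-and-store step could be sketched roughly as follows. This is a minimal illustration, not the project's code: to stay dependency-free it uses the standard library's html.parser and an in-memory sqlite3 database in place of Beautiful Soup and MySQL, and the front page, the news.example.com URLs, the articles table, and the helper names are all hypothetical.

```python
import sqlite3
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the absolute URLs of all <a href="..."> links on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            url = urljoin(self.base_url, href)
            if url not in self.links:  # skip links that repeat on the page
                self.links.append(url)


def extract_links(html, base_url):
    """Return the de-duplicated article links found in a front page."""
    parser = LinkCollector(base_url)
    parser.feed(html)
    return parser.links


def save_articles(conn, articles):
    """Insert (url, body) pairs, ignoring URLs that are already stored."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles (url TEXT PRIMARY KEY, body TEXT)"
    )
    conn.executemany(
        "INSERT OR IGNORE INTO articles (url, body) VALUES (?, ?)", articles
    )
    conn.commit()


# Hypothetical front page; a real crawler would download it first.
FRONT_PAGE = """
<html><body>
  <a href="/science/article-1">First story</a>
  <a href="/science/article-2">Second story</a>
  <a href="/science/article-1">First story again</a>
</body></html>
"""

links = extract_links(FRONT_PAGE, "https://news.example.com")
conn = sqlite3.connect(":memory:")  # stand-in for the MySQL database
save_articles(conn, [(url, "placeholder article text") for url in links])
stored = [row[0] for row in conn.execute("SELECT url FROM articles ORDER BY url")]
```

Keying the table on the URL means a re-crawl of the same front page does not store an article twice.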

pos_tagging

  • Tokenizes each article in the database and assigns a POS tag to every token.
  • Saves the POS-tagged articles in a JSON file.

create_inverted_index

  • Reads the JSON file with the POS tags.
  • Removes tokens with closed_class_category tags.
  • Lemmatizes tokens with open_class_category tags.
  • Joins the lemmatized words of each article and removes punctuation.
  • Vectorizes the articles and calculates the TF-IDF value of each word.
  • Creates an inverted index as an XML file for later article searches.
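
The steps above can be sketched as follows, under several assumptions. The project uses NLTK and scikit-learn; this sketch instead computes a plain tf × ln(N / df) weight by hand so that it runs with the standard library only, which means its values differ from scikit-learn's smoothed and normalized TF-IDF. Lemmatization is left out (tokens are assumed to be lemmas already), the closed-class tag set is an assumed subset of the Penn Treebank tags, and the XML layout (inverted_index, lemma, document) is hypothetical.

```python
import math
import string
import xml.etree.ElementTree as ET
from collections import Counter

# Penn Treebank tags treated here as closed-class (an assumed subset).
CLOSED_CLASS_TAGS = {
    "CC", "CD", "DT", "EX", "IN", "MD", "PDT", "POS", "PRP",
    "PRP$", "RP", "TO", "UH", "WDT", "WP", "WP$", "WRB",
}


def content_words(tagged_tokens):
    """Keep lower-cased open-class tokens and drop punctuation."""
    words = []
    for token, tag in tagged_tokens:
        if tag in CLOSED_CLASS_TAGS:
            continue
        word = token.lower().strip(string.punctuation)
        if word:
            words.append(word)
    return words


def tf_idf(docs):
    """Return {doc_id: {word: weight}} with weight = tf * ln(N / df)."""
    n_docs = len(docs)
    df = Counter()
    for words in docs.values():
        df.update(set(words))  # count each word once per document
    weights = {}
    for doc_id, words in docs.items():
        counts = Counter(words)
        weights[doc_id] = {
            word: (count / len(words)) * math.log(n_docs / df[word])
            for word, count in counts.items()
        }
    return weights


def build_inverted_index(weights):
    """Turn per-document weights into a word -> documents XML tree."""
    postings = {}
    for doc_id, word_weights in weights.items():
        for word, weight in word_weights.items():
            postings.setdefault(word, []).append((doc_id, weight))
    root = ET.Element("inverted_index")
    for word in sorted(postings):
        lemma = ET.SubElement(root, "lemma", name=word)
        for doc_id, weight in postings[word]:
            ET.SubElement(lemma, "document", id=doc_id, weight=f"{weight:.4f}")
    return root


# Hypothetical POS-tagged articles, keyed by URL.
tagged_articles = {
    "https://news.example.com/a": [
        ("The", "DT"), ("rocket", "NN"), ("launch", "NN"), (".", "."),
    ],
    "https://news.example.com/b": [
        ("A", "DT"), ("rocket", "NN"), ("engine", "NN"), ("test", "NN"), (".", "."),
    ],
}
docs = {url: content_words(tokens) for url, tokens in tagged_articles.items()}
index = build_inverted_index(tf_idf(docs))
xml_text = ET.tostring(index, encoding="unicode")
```

With this unsmoothed weighting, a word that occurs in every article gets a weight of zero, so it contributes nothing to ranking; a smoothed IDF keeps such words slightly above zero.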

queries_test

  • Given one or more query words, returns the URLs of the most relevant articles, ranked by relevance.
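
One plausible way to answer such a query is to sum, per article, the stored weights of the query words and sort by that score. The sketch below assumes a hypothetical XML layout (inverted_index, lemma, document with id and weight attributes) and made-up URLs and weights; the project's actual file format and scoring may differ.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical inverted index; element and attribute names are assumptions.
INDEX_XML = """
<inverted_index>
  <lemma name="rocket">
    <document id="https://news.example.com/a" weight="0.20"/>
    <document id="https://news.example.com/b" weight="0.50"/>
  </lemma>
  <lemma name="launch">
    <document id="https://news.example.com/a" weight="0.40"/>
    <document id="https://news.example.com/c" weight="0.10"/>
  </lemma>
</inverted_index>
"""


def load_index(xml_text):
    """Parse the XML into {lemma: {url: weight}}."""
    index = {}
    for lemma in ET.fromstring(xml_text.strip()).iter("lemma"):
        index[lemma.get("name")] = {
            doc.get("id"): float(doc.get("weight"))
            for doc in lemma.iter("document")
        }
    return index


def search(index, query, top_k=10):
    """Rank URLs by the summed weight of the query words they contain."""
    scores = defaultdict(float)
    for word in query.lower().split():
        for url, weight in index.get(word, {}).items():
            scores[url] += weight
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [url for url, _ in ranked[:top_k]]


index = load_index(INDEX_XML)
results = search(index, "rocket launch")
```

For the lookup to hit, query words would need the same normalization (lower-casing, lemmatization) that was applied when the index was built.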

test_time

  • Runs queries automatically to measure response time.
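
A timing harness of this kind might look like the sketch below. The function names, the repeat count, and the stand-in search function are assumptions for illustration; in the project the timed call would be the real query routine.

```python
import statistics
import time


def time_queries(search_fn, queries, repeats=5):
    """Return {query: mean seconds per call}, measured with perf_counter."""
    timings = {}
    for query in queries:
        samples = []
        for _ in range(repeats):
            start = time.perf_counter()
            search_fn(query)
            samples.append(time.perf_counter() - start)
        timings[query] = statistics.mean(samples)
    return timings


def fake_search(query):
    """Stand-in for the real search; it only sorts the query words."""
    return sorted(query.split())


queries = ["rocket", "rocket launch", "climate change report"]
timings = time_queries(fake_search, queries)
for query, seconds in timings.items():
    print(f"{query!r}: {seconds * 1e6:.1f} microseconds on average")
```

Averaging several repeats per query smooths out one-off delays such as the first read of the index file.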

Tech stack

  • Python, VSC, XAMPP, MySQL, NLTK, scikit-learn, Beautiful Soup

About

System for Crawling and Indexing Websites in an Inverted Index.
