NLP --- Language_Technology_Project

Project for the Language Technology Course of the Department of Computer Engineering & Informatics.

Description

Python scripts that crawl news from online newspapers, then clean and vectorize the texts to build an inverted index, stored as an XML file, for searching articles by keyword.

Scripts

crawler

Crawls the news articles that appear on the front pages of online news sites (NBC News, IFLS) and saves them in a database.
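The crawl-and-store step can be sketched roughly as follows. This is not the project's code: it uses the standard library's html.parser and sqlite3 as stand-ins for Beautiful Soup and MySQL, and it parses an inline HTML sample instead of fetching a live page. The names LinkCollector, extract_links, save_articles and the articles table layout are hypothetical.

```python
import sqlite3
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects the href of every <a> tag (stand-in for Beautiful Soup)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the front-page URL.
                self.links.append(urljoin(self.base_url, href))


def extract_links(html, base_url):
    parser = LinkCollector(base_url)
    parser.feed(html)
    # Drop duplicates while keeping first-seen order.
    return list(dict.fromkeys(parser.links))


def save_articles(conn, rows):
    """Store (url, body) pairs; the table layout is hypothetical."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS articles (url TEXT PRIMARY KEY, body TEXT)"
    )
    conn.executemany("INSERT OR IGNORE INTO articles VALUES (?, ?)", rows)
    conn.commit()


# Inline sample standing in for a downloaded front page.
FRONT_PAGE = (
    '<a href="/news/a1">One</a> '
    '<a href="/news/a2">Two</a> '
    '<a href="/news/a1">One again</a>'
)
links = extract_links(FRONT_PAGE, "https://example.com/")

conn = sqlite3.connect(":memory:")  # SQLite stands in for MySQL here
save_articles(conn, [(url, "article text placeholder") for url in links])
stored = conn.execute("SELECT COUNT(*) FROM articles").fetchone()[0]
```

Using the URL as the primary key means a re-crawl of the same front page does not insert duplicate rows.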

pos_tagging

Tokenizes each article in the database and assigns a POS tag to every token.
Saves the POS-tagged articles in a JSON file.
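NLTK's pos_tag returns a list of (token, tag) tuples, so one possible way to store the tagged articles is shown below. The sample is hand-written with Penn Treebank tags so the sketch runs without NLTK data. The JSON layout (article URL mapped to a list of token/tag pairs) and the function names are assumptions, not the project's actual format.

```python
import json

# Hand-written sample in the shape nltk.pos_tag() returns: (token, tag) tuples.
tagged_articles = {
    "https://example.com/news/a1": [
        ("The", "DT"), ("cat", "NN"), ("sat", "VBD"), (".", "."),
    ],
}


def dump_tagged(tagged):
    """Serialize {url: [(token, tag), ...]} to a JSON string."""
    return json.dumps(tagged, indent=2)


def load_tagged(text):
    """Read the JSON back; JSON has no tuple type, so pairs are rebuilt."""
    return {
        url: [tuple(pair) for pair in pairs]
        for url, pairs in json.loads(text).items()
    }


serialized = dump_tagged(tagged_articles)
restored = load_tagged(serialized)
```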

create_inverted_index

Reads the JSON file with the POS tags.
Removes tokens whose tags belong to closed_class_category.
Lemmatizes tokens whose tags belong to open_class_category.
Joins the lemmatized words of each article and removes punctuation.
Vectorizes the articles and calculates the TF-IDF value of each word.
Creates an inverted index as an XML file for later article searches.
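The steps above can be sketched as follows, under several assumptions. The project vectorizes with scikit-learn; this sketch instead computes TF-IDF by hand as tf * log(N / df), which gives different values from scikit-learn's defaults (smoothed IDF plus normalization). Lemmatization is left out because it needs NLTK's WordNet data. The CLOSED_CLASS_TAGS set and the XML element names (inverted_index, lemma, document) are hypothetical.

```python
import math
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict

# Hypothetical choice of closed-class Penn Treebank tags to discard.
CLOSED_CLASS_TAGS = {"DT", "IN", "CC", "PRP", "PRP$", "TO", "MD", "."}


def open_class_words(tagged_tokens):
    """Keep lowercased tokens whose tag is not in the closed-class set."""
    return [w.lower() for w, tag in tagged_tokens if tag not in CLOSED_CLASS_TAGS]


def build_inverted_index(docs):
    """docs maps doc id -> list of words; returns {word: {doc_id: tfidf}}."""
    n_docs = len(docs)
    doc_freq = Counter()
    for words in docs.values():
        doc_freq.update(set(words))  # count each word once per document

    index = defaultdict(dict)
    for doc_id, words in docs.items():
        for word, count in Counter(words).items():
            tf = count / len(words)
            idf = math.log(n_docs / doc_freq[word])
            index[word][doc_id] = tf * idf
    return index


def index_to_xml(index):
    """Serialize the index as word -> documents, one <lemma> per word."""
    root = ET.Element("inverted_index")
    for word in sorted(index):
        lemma = ET.SubElement(root, "lemma", name=word)
        for doc_id, weight in index[word].items():
            ET.SubElement(lemma, "document", id=doc_id, weight=f"{weight:.4f}")
    return ET.tostring(root, encoding="unicode")


tagged = {
    "a1": [("The", "DT"), ("cat", "NN"), ("sleeps", "VBZ")],
    "a2": [("The", "DT"), ("dog", "NN"), ("sleeps", "VBZ")],
}
docs = {doc_id: open_class_words(tokens) for doc_id, tokens in tagged.items()}
index = build_inverted_index(docs)
xml_text = index_to_xml(index)
```

With this unsmoothed formula, a word that occurs in every article (here "sleeps") gets a weight of zero, so it does not help separate one article from another.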

queries_test

Given one or more words, returns the URLs of the most relevant articles, ranked by relevance.
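One plausible way to rank articles for a multi-word query is to sum each article's TF-IDF weights over the query words and sort by the total. The sketch below assumes this scoring rule and a hypothetical XML layout (inverted_index, lemma, document elements); the project's actual schema and ranking may differ. Query words are only lowercased here, not lemmatized.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical index layout; the project's actual XML schema may differ.
INDEX_XML = """
<inverted_index>
  <lemma name="cat">
    <document id="https://example.com/a1" weight="0.35"/>
    <document id="https://example.com/a3" weight="0.10"/>
  </lemma>
  <lemma name="dog">
    <document id="https://example.com/a2" weight="0.25"/>
    <document id="https://example.com/a3" weight="0.30"/>
  </lemma>
</inverted_index>
""".strip()


def load_index(xml_text):
    """Parse the XML into {word: {url: weight}}."""
    index = {}
    for lemma in ET.fromstring(xml_text).iter("lemma"):
        index[lemma.get("name")] = {
            doc.get("id"): float(doc.get("weight"))
            for doc in lemma.iter("document")
        }
    return index


def search(index, query, top_k=10):
    """Return URLs ordered by the summed weight of the query words."""
    scores = defaultdict(float)
    for word in query.lower().split():
        for url, weight in index.get(word, {}).items():
            scores[url] += weight
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [url for url, _ in ranked[:top_k]]


index = load_index(INDEX_XML)
results = search(index, "cat dog")
```

In this sample the article a3 ranks first for "cat dog" because it is the only one that contains both words, even though its weight for each single word is lower than that of a1 for "cat".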

test_time

Runs queries automatically to measure response time.
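A timing harness along these lines could generate random queries from a vocabulary and average the elapsed time per query. The function time_queries, its parameters, and the dummy_search placeholder are hypothetical; in practice the real search function would be passed in instead.

```python
import random
import time


def time_queries(search_fn, vocabulary, n_queries=100, words_per_query=2, seed=0):
    """Run random queries through search_fn and return the mean time in seconds."""
    rng = random.Random(seed)  # fixed seed so runs use the same queries
    timings = []
    for _ in range(n_queries):
        query = " ".join(rng.choices(vocabulary, k=words_per_query))
        start = time.perf_counter()
        search_fn(query)
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)


def dummy_search(query):
    # Placeholder standing in for the real query function.
    return sorted(query.split())


mean_seconds = time_queries(dummy_search, ["cat", "dog", "news"], n_queries=50)
```

time.perf_counter is used rather than time.time because it is a monotonic, high-resolution clock intended for measuring short durations.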

Tech stack

Python, VSC, XAMPP, MySQL, NLTK, scikit-learn, Beautiful Soup

karavokyrismichail/NLP---Language_Technology_Project
