Korean-NLP-Project

Last Edit: Sept 12, 2017

Overview

A data science project that applies NLP techniques to Korean news articles. It aims for quick, unsupervised, automatic, and dynamic topic clustering of the news articles retrieved for a keyword. The articles are retrieved from a local Elasticsearch database that is indexed with Korean news articles and can be updated in real time.

Key technologies include: Elasticsearch, KoNLPy, Word2Vec, and HDBSCAN.
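
As a rough illustration of the keyword retrieval described above, the sketch below builds an Elasticsearch search body that matches a keyword and favors recent articles through a Gaussian date decay. This is an assumption about how such a query could look, not the query this project actually uses: the field names `body` and `date`, the `30d` decay scale, and the helper name `build_keyword_query` are all hypothetical.

```python
from typing import Any, Dict


def build_keyword_query(keyword: str,
                        size: int = 1000,
                        text_field: str = "body",    # hypothetical field name
                        date_field: str = "date",    # hypothetical field name
                        decay_scale: str = "30d") -> Dict[str, Any]:
    """Build a search body that matches a keyword and favors recent articles."""
    return {
        "size": size,  # upper bound on the number of articles returned
        "query": {
            "function_score": {
                # Full-text match on the article text.
                "query": {"match": {text_field: keyword}},
                # Gaussian decay: the score shrinks as the article gets older.
                "functions": [
                    {"gauss": {date_field: {"origin": "now", "scale": decay_scale}}}
                ],
                # Multiply the text relevance by the recency factor.
                "boost_mode": "multiply",
            }
        },
    }


if __name__ == "__main__":
    import json
    print(json.dumps(build_keyword_query("경제"), ensure_ascii=False, indent=2))
```

The function only builds a plain dictionary, so it can be inspected or adjusted before it is handed to whichever Elasticsearch client call performs the search.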

Package requirements

Python 3
Doc2Vec models: Download models. Place the files inside /models
Basic: numpy, sklearn, pandas, beautifulsoup4, matplotlib
Elasticsearch: pip install elasticsearch
Gensim: conda install gensim
HDBSCAN: pip install hdbscan OR conda install -c conda-forge hdbscan
KoNLPy: Instructions here
with Mecab-ko: Mecab for Windows (requires some environment variable tweaking)
- Direct repo link just in case
- For non-Windows Python 3
Networkx: pip install networkx
PyTagCloud: Instructions
- Add Korean font to python3/site-packages/pytagcloud and edit .json file
Googletrans: Instructions (optional)

Basic process:

Make sure Elasticsearch is running and the database is up to date
Input a keyword, which retrieves up to 1,000 relevant articles (the ranking accounts for time relevancy)
A pre-trained Doc2Vec model is loaded and used to infer the vectors of the retrieved articles
The vectors are labeled into clusters using HDBSCAN (density-based clustering)
Optional visualization of the clusters
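
The vector inference and clustering steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual code: the function names, the `min_cluster_size` value, and the model path are hypothetical, and the articles are assumed to be already tokenized (for example with a KoNLPy tagger) before they reach this point.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence


def infer_and_cluster(tokenized_articles: Sequence[List[str]],
                      model_path: str,
                      min_cluster_size: int = 5) -> List[int]:
    """Infer one Doc2Vec vector per article and label the vectors with HDBSCAN."""
    # Imported here so the grouping helper below works without these packages.
    import numpy as np
    import hdbscan
    from gensim.models.doc2vec import Doc2Vec

    model = Doc2Vec.load(model_path)  # pre-trained model, e.g. a file under /models
    # infer_vector expects a list of tokens and returns one vector per article.
    vectors = np.vstack([model.infer_vector(tokens) for tokens in tokenized_articles])
    clusterer = hdbscan.HDBSCAN(min_cluster_size=min_cluster_size)
    # fit_predict returns one integer label per row; -1 marks noise points.
    return clusterer.fit_predict(vectors).tolist()


def group_by_cluster(article_ids: Sequence[Hashable],
                     labels: Sequence[int]) -> Dict[int, list]:
    """Group article ids by cluster label (label -1 collects the noise articles)."""
    groups = defaultdict(list)
    for article_id, label in zip(article_ids, labels):
        groups[label].append(article_id)
    return dict(groups)


if __name__ == "__main__":
    # Hand-written labels so the grouping step can be tried without a trained model.
    print(group_by_cluster(["a", "b", "c", "d"], [0, -1, 0, 1]))
```

HDBSCAN does not need the number of clusters up front and leaves outliers unassigned, which fits the goal of unsupervised clustering on whatever articles a keyword happens to return.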

Model Flowchart

A more in-depth explanation of the model can be found here.

Data source

The news articles are from Naver's news hub: http://news.naver.com/main/officeList.nhn
The selected news outlets for this project are as follows:
- Outlet Name: Source ID
- 국민일보: 005
- 동아일보: 020
- 문화일보: 021
- 세계일보: 022
- 조선일보: 023
- 중앙일보: 025
- 한겨레: 028
- 경향신문: 032
- 서울신문: 081
- 한국일보: 469
For the initial bootstrapping of my database and model training, I scraped about a year's worth of articles from these sources (from Aug 2016 to Aug 2017), except for 조선일보 (023), where I only scraped six months' worth of data.

Future Works

Allow the Elasticsearch address to be passed in as parameters (currently only localhost:9200)
Add internet disconnect detection for scraper
Improve search algorithm
Use the untapped sentiment data for analysis
Add automatic and continuous Doc2Vec training
Implement the Phraser model (model exists but isn't implemented due to uncertainty of its performance)
Improve text processing
- Improve author name extraction algorithm
- Improve noise and template filtering algorithm (ex. delete all '동아일보', '포토' articles)
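
For the first item in the list above, one possible approach is to read the Elasticsearch host and port from command-line arguments while keeping localhost:9200 as the default. The sketch below is only an illustration of that idea: the flag names `--es-host` and `--es-port` and the function `parse_es_address` are hypothetical and do not exist in this project.

```python
import argparse
from typing import List, Optional


def parse_es_address(argv: Optional[List[str]] = None) -> str:
    """Return the Elasticsearch URL built from optional command-line flags."""
    parser = argparse.ArgumentParser(description="Elasticsearch connection options")
    parser.add_argument("--es-host", default="localhost",
                        help="Elasticsearch host (default: localhost)")
    parser.add_argument("--es-port", type=int, default=9200,
                        help="Elasticsearch port (default: 9200)")
    # argv=None makes argparse read sys.argv; a list can be passed for testing.
    args = parser.parse_args(argv)
    return f"http://{args.es_host}:{args.es_port}"


if __name__ == "__main__":
    print(parse_es_address())
```

Running the script with no flags keeps the current localhost:9200 behavior, so existing usage would not change.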

Contact

If you have any questions regarding this project, feel free to email me.

Email: inhorha5+github@gmail.com
