newsReports

News reports mining

Preprocessing steps:

Extract news report contents from LexisNexis (preprocessing_p1.py)
Apply Named Entity Recognition to the contents (Java)
Extract street coordinates from news contents in the GDelt data sets: use a rule-based algorithm plus regular expressions to find the address -> concatenate it with the city, state, and country names provided by GDelt -> geocode the result.
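
The address step above can be sketched as follows. This is a minimal illustration, not the code in this repository: the regular expression, the suffix list, and the function names `extract_addresses` and `build_geocoder_query` are assumptions made for the example, and the final geocoding call is left out because the README does not say which geocoding service is used.

```python
import re

# Hypothetical rule: a house or block number, an optional "block of",
# one to three capitalized words, then a common street suffix.
ADDRESS_RE = re.compile(
    r"\b\d{1,5}\s+(?:block\s+of\s+)?(?:[A-Z][A-Za-z]*\s+){1,3}"
    r"(?:Street|St|Avenue|Ave|Road|Rd|Boulevard|Blvd|Drive|Dr)\b"
)


def extract_addresses(text):
    """Return every substring of `text` that matches the address rule."""
    return [m.group(0) for m in ADDRESS_RE.finditer(text)]


def build_geocoder_query(address, city, state, country):
    """Join the address with location fields (e.g. those from GDelt),
    skipping any field that is missing or blank."""
    parts = [address, city, state, country]
    return ", ".join(p.strip() for p in parts if p and p.strip())


sentence = "Police responded to the 1400 block of Rhode Island Avenue NW on Tuesday."
for addr in extract_addresses(sentence):
    # The resulting string is what would be sent to a geocoder.
    print(build_geocoder_query(addr, "Washington", "DC", "United States"))
```

A pattern this simple misses addresses without a leading number (for example intersections), so a real rule set would likely need more cases.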

Data: Washington Post articles (2015-2016, ~1000 docs)
Addresses (file: addresses): 'WP_address.csv', 'WP_address_window_1.csv' (with context window size = 1)

Modeling:

Address labeling -> 'WP_address.csv'
Expand the context window by k (k must be defined at the beginning of the script) -> 'expand_labeling_window.py'
Extract features and label the data (preparing the data frame for modeling) -> 'features_and_labeling.py'
Use machine learning algorithms to look for addresses -> 'model.py'
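
As an illustration of the window-expansion step, here is a minimal sketch. It assumes that "context window" means the k tokens on each side of a labeled address span; the function name `expand_window` and the token-index representation are hypothetical and may differ from what 'expand_labeling_window.py' actually does.

```python
def expand_window(tokens, start, end, k):
    """Return the address span tokens[start:end] together with up to
    k tokens of context on each side, clipped at the document edges."""
    if k < 0:
        raise ValueError("k must be non-negative")
    lo = max(0, start - k)
    hi = min(len(tokens), end + k)
    return tokens[lo:hi]


tokens = "Police responded to 22 Main St on Tuesday".split()
# The address "22 Main St" occupies token positions 3..5 (end index 6 is exclusive).
print(expand_window(tokens, 3, 6, 1))  # ['to', '22', 'Main', 'St', 'on']
```

With k = 0 the function returns only the address tokens, which would correspond to the unexpanded labeling.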

Repository files:

crime reports/selected data
gedelt_data
news_java_ner
README.md
WP_address.csv
WP_address_window_1.csv
address_extraction.py
address_extraction2.py
address_labeld.py
crime_report_extraction.py
emnlp2016-submission.pdf
emnlp2016.pdf
expand_labeling_window.py
features_and_labeling.py
gdelt_Streets.py
model.py
modeling_semi_.py
preprocessing.py
preprocessing_news_documents.py
