News reports mining
Preprocessing steps:
- Extract news report contents from LexisNexis (preprocessing_p1.py)
- Apply Named Entity Recognition to the content (Java)
- Extract street coordinates from news contents in GDelt data sets: use a rule-based algorithm plus regular expressions to find the address -> concatenate it with the city, state, and country names provided by GDelt -> geocode the result.
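
The address step above can be sketched roughly as follows. This is a minimal illustration, not the project's actual rules: the pattern ADDRESS_RE, the suffix list, and the helper names extract_addresses and build_geocoding_query are all hypothetical, and the pattern assumes US-style addresses (a number, an optional "block of", capitalized street words, a street-type suffix, and an optional quadrant).

```python
import re

# Hypothetical street-type suffixes; the real rule set may differ.
STREET_SUFFIXES = r"(?:Street|St|Avenue|Ave|Road|Rd|Boulevard|Blvd|Drive|Dr|Lane|Ln|Place|Pl|Court|Ct)"

# Number, optional "block of", 1-4 capitalized words, a suffix, optional quadrant.
ADDRESS_RE = re.compile(
    r"\b\d{1,5}\s+(?:block\s+of\s+)?"
    r"(?:[A-Z][A-Za-z]*\s+){1,4}"
    + STREET_SUFFIXES
    + r"\b(?:\s+(?:NW|NE|SW|SE)\b)?"
)


def extract_addresses(text):
    """Return every substring of text that looks like a street address."""
    return [m.group(0) for m in ADDRESS_RE.finditer(text)]


def build_geocoding_query(street, city, state, country):
    """Join the street with the location names, skipping empty or missing parts."""
    parts = [street, city, state, country]
    return ", ".join(p.strip() for p in parts if p and p.strip())


if __name__ == "__main__":
    report = "Police responded to the 1400 block of Rhode Island Avenue NW on Tuesday."
    for street in extract_addresses(report):
        # City, state, and country stand in for the values GDelt provides.
        print(build_geocoding_query(street, "Washington", "DC", "United States"))
```

The resulting query string is what would be handed to a geocoder; which geocoding service is used is not specified here.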
Data: Washington Post (2015-2016, ~1000 docs). Addresses (file: addresses): 'WP_address.csv', 'WP_address_window_1.csv' (with context window size = 1)
Modeling:
- Address labeling -> 'WP_address.csv'
- Expand the context window by k (define k at the beginning of the script) -> 'labeling_expand_window.py'
- Extract features and label the data (preparing the data frame for modeling) -> 'features_and_labeling.py'
- Use machine learning algorithms to look for addresses -> 'model.py'
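
As an illustration of the context-window step in the list above, here is a minimal sketch. It assumes the window is counted in whitespace tokens on each side of a candidate address; the scripts may instead count sentences or characters. The names K and expand_window are hypothetical and are not taken from 'labeling_expand_window.py'.

```python
K = 1  # context window size, defined once at the top of the script


def expand_window(tokens, start, end, k=K):
    """Split tokens into (left context, span, right context).

    tokens[start:end] is the candidate address; up to k tokens are taken
    on each side and clipped at the boundaries of the document.
    """
    if k < 0:
        raise ValueError("k must be non-negative")
    left = tokens[max(0, start - k):start]
    right = tokens[end:end + k]
    return left, tokens[start:end], right


if __name__ == "__main__":
    tokens = "Police found the car at 221 Baker Street late Monday".split()
    # The candidate address occupies tokens 5..7 ("221 Baker Street").
    print(expand_window(tokens, 5, 8))
    print(expand_window(tokens, 5, 8, k=2))
```

Returning the left and right context separately keeps open the option of building different features for the words before and after the candidate.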