Springboard Capstone Project 2 | James Flint | mail@jamesflint.net | 2018-03-28
Building on the work done in my first Capstone Project, this project uses CurationCorp's labelled news database to create a topic classifier with multilayer, CNN, LSTM and VDCNN neural nets, and makes the best-performing model available via an online API.
My client is CurationCorp.com. This project is the first step in a machine-learning-based classification, tagging and auto-summary project that should enable the company to automate much of its editorial process, reserving human intervention for final edit and sign-off rather than low-level textual processing tasks.
CurationCorp has a clean, human-curated database of 43,502 summarised and labelled news articles, to which I have access. For raw source data, I'll be using a sample of data from a standard news database (LexisNexis Moreover, format: CSV) or the news aggregator dataset at http://archive.ics.uci.edu/ml/datasets/News+Aggregator.
- Data wrangling
- Compare classifiers:
  a. A multi-layer neural net (NN)
  b. A convolutional neural net (CNN)
  c. A long short-term memory neural net (LSTM)
  d. A very deep convolutional neural net (VDCNN)
- Build a prediction API
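All four models consume the same input: fixed-length integer sequences built from the article text. A minimal sketch of that shared preprocessing step is below; it is pure Python written to mirror what Keras's `Tokenizer` and `pad_sequences` do in the actual notebooks, not the project's own code, and the `vocab` mapping shown is a made-up example.

```python
def texts_to_padded_sequences(texts, vocab, maxlen):
    """Map each word to an integer id and pad/truncate to a fixed length.

    Id 0 is reserved for padding and out-of-vocabulary words, matching the
    convention Keras's Tokenizer + pad_sequences use in the real pipeline.
    """
    seqs = []
    for text in texts:
        # Lowercase, split on whitespace, look up each word (0 if unknown),
        # and cut the sequence off at maxlen tokens.
        ids = [vocab.get(word, 0) for word in text.lower().split()][:maxlen]
        # Right-pad short sequences with zeros up to maxlen.
        seqs.append(ids + [0] * (maxlen - len(ids)))
    return seqs

# Toy vocabulary for illustration only.
vocab = {"markets": 1, "rally": 2}
padded = texts_to_padded_sequences(["Markets rally today"], vocab, 5)
```

The fixed-length output is what lets a single padded matrix be fed to each of the four architectures in turn for a like-for-like comparison.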
- Code for all the above
- An executable online API & manual interface (folder: "topic_api")
- A paper describing the project process, methodologies, trade-offs and decision points (file: "Capstone Project 2 FINAL - Report - James Flint 20180330.pdf")
- A slide deck presenting the project to CurationCorp and suggesting strategic implementation of the technology (file: "Capstone Project 2 FINAL - Presentation - James Flint 20180330.pdf")
- A results matrix (excel file) containing the results of all trials (file: "Results matrix.xlsx")
- All the Jupyter notebooks used in the project (folder: "notebooks")
- Test articles for use with the manual interface (file: "Test articles for online form.rtf")
I owe a huge thank you to all at the excellent Springboard, whose Data Science Career Track course culminated (for me) in this project, and in particular to my tutor Jan Zikeš, without whose expertise and encouragement I doubt it would ever have been finished!
Along the way I begged, borrowed and stole code from many, many places in order to complete this project. The main sources are listed below, but there were others, including (of course) Stack Overflow. To anyone whose contribution I may have omitted here, my apologies; please let me know and I'll add you in!
- Tokenizing text data in Keras
- Very Deep Convolutional Networks for Text Classification (paper)
- Keras implementation of a VDCNN model (code)
- Keras API 1
- Keras API 2
- Calculating the F1 metric in Keras
- Using GloVe in Python
- Text Classification using CNNs
- Global Vectors for Word Representation
- Build a CNN in 11 lines
- How to Develop a Bidirectional LSTM For Sequence Classification
- Text Generation With LSTM Recurrent Neural Networks
- Miguel Grinberg: The Flask Mega-Tutorial
- How to deploy a Python Flask app on Heroku
- Implementing a RESTful Web API with Python & Flask
- Adit Deshpande: A Beginner's Guide To Understanding Convolutional Neural Networks
- Reuters-21578 text classification with Gensim and Keras
- Classifying Yelp Reviews
To run the API in the "topic_api" folder locally, set up a virtual environment running Python 3.6. Also install the Heroku CLI; this should handle the installation of the other dependencies listed in "requirements.txt". To run the application, execute:
$ python main.py
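With the app running, the prediction endpoint can be called over HTTP. The sketch below builds such a request using only the standard library; note that the `/predict` path and the `{"text": ...}` JSON schema are assumptions for illustration here, so check the code in "topic_api" for the real interface before using it.

```python
import json
from urllib import request

def build_predict_request(article_text, url="http://localhost:5000/predict"):
    """Build a POST request carrying an article for topic classification.

    NOTE: the endpoint path and JSON field name are hypothetical; the
    actual topic_api routes may differ.
    """
    payload = json.dumps({"text": article_text}).encode("utf-8")
    return request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
    )

req = build_predict_request("Apple unveils new iPhone at keynote event.")
# Send with: urllib.request.urlopen(req) once the Flask app is running.
```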
This work is licensed under a Creative Commons Attribution 3.0 Unported License.