jamesflint/business-news-classification-engine-part-2

Business News Classification Engine (Part 2)

Springboard Capstone Project 2 | James Flint | mail@jamesflint.net | 2018-03-28

Problem

Building on the work done in my first Capstone Project, use CurationCorp’s labelled news database to build a topic classifier, comparing multilayer, CNN, LSTM and VDCNN neural nets, and make the best solution available via an online API.

Client

My client is CurationCorp.com. This project is the first step in a machine-learning-based classification, tagging and auto-summary project that should enable the company to automate much of its editorial process and reserve human intervention for final edit and sign-off instead of low-level textual processing tasks.

Data

CurationCorp has a clean, human-curated database of 43,502 summarised and labelled news articles, to which I have access. For raw source data, I’ll be using a sample of data from a standard news database (LexisNexis Moreover, format: CSV) or the news aggregator dataset at http://archive.ics.uci.edu/ml/datasets/News+Aggregator.

Approach

  1. Data wrangling
  2. Compare classifiers
     a. A multi-layer neural net (NN)
     b. A convolutional neural net (CNN)
     c. A long short-term memory neural net (LSTM)
     d. A very deep convolutional neural net (VDCNN)
  3. Build a prediction API
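
As a rough illustration of step 2, the comparison can be pictured as scoring every candidate on the same held-out articles and ranking the results. The sketch below is not the project's code: the function names, the toy articles and the two stand-in classifiers are all hypothetical, and in the real project the callables would wrap the trained NN, CNN, LSTM and VDCNN models.

```python
from collections import Counter
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]          # (article text, topic label)
Classifier = Callable[[str], str]  # maps article text to a predicted label


def accuracy(classify: Classifier, test_set: List[Example]) -> float:
    """Fraction of test articles whose predicted label matches the true one."""
    correct = sum(1 for text, label in test_set if classify(text) == label)
    return correct / len(test_set)


def compare(classifiers: Dict[str, Classifier],
            test_set: List[Example]) -> List[Tuple[str, float]]:
    """Score every classifier on the same test set, best first."""
    scores = [(name, accuracy(fn, test_set)) for name, fn in classifiers.items()]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


def majority_baseline(train_set: List[Example]) -> Classifier:
    """Hypothetical stand-in: always predicts the most frequent training label."""
    most_common = Counter(label for _, label in train_set).most_common(1)[0][0]
    return lambda text: most_common


def keyword_classifier(text: str) -> str:
    """Hypothetical stand-in: a single hand-written keyword rule."""
    return "energy" if "oil" in text.lower() else "finance"


# Toy data, invented for illustration only (not from the CurationCorp database).
TRAIN = [
    ("Central bank holds rates", "finance"),
    ("Lender posts quarterly loss", "finance"),
    ("Oil output cut extended", "energy"),
]
TEST = [
    ("Oil prices rise after producers meet", "energy"),
    ("Bank raises interest rates", "finance"),
    ("Oil major reports record profit", "energy"),
    ("Bond yields fall", "finance"),
]

ranking = compare(
    {"majority": majority_baseline(TRAIN), "keyword": keyword_classifier},
    TEST,
)
for name, score in ranking:
    print(f"{name}: {score:.2f}")
```

Keeping the test set fixed across all candidates is what makes the scores comparable; the same ranking could then decide which model is served by the prediction API in step 3.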

Deliverables contained in this repo

  • Code for all the above
  • An executable online API & manual interface (folder: "topic_api")
  • A paper describing the project process, methodologies, trade-offs and decision points (file: "Capstone Project 2 FINAL - Report - James Flint 20180330.pdf")
  • A slide deck presenting the project to CurationCorp and suggesting strategic implementation of the technology (file: "Capstone Project 2 FINAL - Presentation - James Flint 20180330.pdf")
  • A results matrix (Excel file) containing the results of all trials (file: "Results matrix.xlsx")
  • All the Jupyter notebooks used in the project (folder: "notebooks")
  • Test articles for use with the manual interface (file: "Test articles for online form.rtf")

Attributions

I owe a huge thank you to all at the excellent Springboard, whose Data Science Career Track course culminated (for me) in this project, and in particular to my tutor Jan Zikeš, without whose expertise and encouragement I doubt it would ever have been finished!

On the way I begged, borrowed and stole code from many, many places in order to complete this project. The main sources are listed below, but there were others, including (of course) Stack Overflow. To anyone whose contribution I may have omitted here, my apologies; please let me know and I'll add you in!

Running Locally

To run the API in the "topic_api" folder locally, set up a virtual environment running Python 3.6. Also, install the Heroku CLI; this should handle the installation of the other dependencies listed in "requirements.txt". To run the application, execute:

$ python main.py

License

This work is licensed under a Creative Commons Attribution 3.0 Unported License.

About

Springboard Data Science Capstone submission - Capstone Project 2 - FINAL SUBMISSION
