Springboard Capstone Project 2 | James Flint | mail@jamesflint.net | 2018-03-28
Building on the work done in my first Capstone Project, this project uses CurationCorp's labelled news database to create a topic classifier with multilayer, CNN, LSTM and VDCNN neural nets, and makes the best-performing model available via an online API.
My client is CurationCorp.com. This project is the first step in a machine-learning-based classification, tagging and auto-summary project that should enable the company to automate much of its editorial process, reserving human intervention for final edit and sign-off rather than low-level textual processing tasks.
CurationCorp has a clean, human-curated database of 43,502 summarised and labelled news articles, to which I have access. For raw source data, I'll be using a sample of data from a standard news database (LexisNexis Moreover, format: CSV) or the news aggregator dataset at http://archive.ics.uci.edu/ml/datasets/News+Aggregator.
- Data wrangling
- Compare classifiers:
  a. A multi-layer neural net (NN)
  b. A convolutional neural net (CNN)
  c. A long short-term memory neural net (LSTM)
  d. A very deep convolutional neural net (VDCNN)
- Build a prediction API
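All four models consume the same input: fixed-length integer sequences built from the article text. A minimal sketch of that shared preprocessing step is below; it is pure Python written to mirror what Keras's `Tokenizer` and `pad_sequences` do in the actual notebooks, not the project's own code, and the `vocab` mapping shown is a made-up example.

```python
def texts_to_padded_sequences(texts, vocab, maxlen):
    """Map each word to an integer id and pad/truncate to a fixed length.

    Id 0 is reserved for padding and out-of-vocabulary words, matching the
    convention Keras's Tokenizer + pad_sequences use in the real pipeline.
    """
    seqs = []
    for text in texts:
        # Lowercase, split on whitespace, look up each word (0 if unknown),
        # and cut the sequence off at maxlen tokens.
        ids = [vocab.get(word, 0) for word in text.lower().split()][:maxlen]
        # Right-pad short sequences with zeros up to maxlen.
        seqs.append(ids + [0] * (maxlen - len(ids)))
    return seqs

# Toy vocabulary for illustration only.
vocab = {"markets": 1, "rally": 2}
padded = texts_to_padded_sequences(["Markets rally today"], vocab, 5)
```

The fixed-length output is what lets a single padded matrix be fed to each of the four architectures in turn for a like-for-like comparison.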
- Code for all the above
- An executable online API & manual interface (folder: "topic_api")
- A paper describing the project process, methodologies, trade-offs and decision points (file: "Capstone Project 2 FINAL - Report - James Flint 20180330.pdf")
- A slide deck presenting the project to CurationCorp and suggesting strategic implementation of the technology (file: "Capstone Project 2 FINAL - Presentation - James Flint 20180330.pdf")
- A results matrix (excel file) containing the results of all trials (file: "Results matrix.xlsx")
- All the Jupyter notebooks used in the project (folder: "notebooks")
- Test articles for use with the manual interface (file: "Test articles for online form.rtf")
I owe a huge thank you to all at the excellent Springboard, whose Data Science Career Track course culminated (for me) in this project, and in particular to my tutor Jan Zikeš, without whose expertise and encouragement I doubt it would ever have been finished!
Along the way I begged, borrowed and stole code from many, many places in order to complete this project. The main sources are listed below, but there were others, including (of course) Stack Overflow. To anyone whose contribution I may have omitted here, my apologies; please let me know and I'll add you in!
- Tokenizing text data in Keras
- Very Deep Convolutional Networks for Text Classification (paper)
- Keras implementation of a VDCNN model (code)
- Keras API 1
- Keras API 2
- Calculating the F1 metric in Keras
- Using GloVe in Python
- Text Classification using CNNs
- Global Vectors for Word Representation
- Build a CNN in 11 lines
- How to Develop a Bidirectional LSTM For Sequence Classification
- Text Generation With LSTM Recurrent Neural Networks
- Miguel Grinberg: The Flask Mega-Tutorial
- How to deploy a Python Flask app on Heroku
- Implementing a RESTful Web API with Python & Flask
- Adit Deshpande: A Beginner's Guide To Understanding Convolutional Neural Networks
- Reuters-21578 text classification with Gensim and Keras
- Classifying Yelp Reviews
To run the API in the "topic_api" folder locally, set up a virtual environment running Python 3.6. Also install the Heroku CLI; this should handle the installation of the other dependencies listed in "requirements.txt". To run the application, execute:
$ python main.py
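With the app running, the prediction endpoint can be called over HTTP. The sketch below builds such a request using only the standard library; note that the `/predict` path and the `{"text": ...}` JSON schema are assumptions for illustration here, so check the code in "topic_api" for the real interface before using it.

```python
import json
from urllib import request

def build_predict_request(article_text, url="http://localhost:5000/predict"):
    """Build a POST request carrying an article for topic classification.

    NOTE: the endpoint path and JSON field name are hypothetical; the
    actual topic_api routes may differ.
    """
    payload = json.dumps({"text": article_text}).encode("utf-8")
    return request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
    )

req = build_predict_request("Apple unveils new iPhone at keynote event.")
# Send with: urllib.request.urlopen(req) once the Flask app is running.
```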
This work is licensed under a Creative Commons Attribution 3.0 Unported License.