grantat/paywall-classify

Paywall-classify

A TensorFlow- and Puppeteer-based app that takes a list of URIs (maximum of 10 per request), gathers a thumbnail screenshot of each site, and then classifies each requested URI as a paywall page or a content page.
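Because each request accepts at most 10 URIs, a longer list has to be split into batches on the client side. The following is a minimal sketch of that batching step only. The helper name `chunk_uris` and the constant `MAX_URIS_PER_REQUEST` are hypothetical and not part of this repository, and the sketch does not assume anything about the server's request format, which this README does not document.

```python
from typing import Iterator, List

# Limit stated in this README: at most 10 URIs per request.
MAX_URIS_PER_REQUEST = 10


def chunk_uris(uris: List[str], size: int = MAX_URIS_PER_REQUEST) -> Iterator[List[str]]:
    """Yield consecutive batches of at most `size` URIs.

    Hypothetical client-side helper; it only groups the URIs and
    does not send anything to the server.
    """
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(uris), size):
        yield uris[start:start + size]
```

For example, a list of 23 URIs would be split into batches of 10, 10, and 3, each of which could then be submitted as a separate request.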

Installation

This app uses Python 3 (TensorFlow) and Node.js 8 (Puppeteer), so building it with Docker is recommended. To build the image:

$ docker build -t paywall-classify .

To run the server:

$ docker run -it --rm -p 5000:5000 paywall-classify

The server is then accessible at http://0.0.0.0:5000/.
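To confirm that the container is up before sending requests, one option is to check whether anything is listening on the published port. This is a minimal sketch using only the Python standard library. The function name `is_port_open` is hypothetical, and the sketch assumes port 5000 is published to the host as in the `docker run` command above. It only tests the TCP connection and makes no assumption about the server's routes.

```python
import socket


def is_port_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to (host, port) succeeds.

    Hypothetical helper: it checks reachability only and does not
    exchange any data with the server.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False
```

Calling `is_port_open("127.0.0.1", 5000)` on the host machine should return `True` once the container is running and `False` otherwise.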

Data

The images used to train the image classifier are included in the Docker image build, but can also be downloaded from http://www.cs.odu.edu/~gatkins/public_data/paywall-training-images.tgz. The dataset consists of 122 paywall_page images and 119 content_page images.
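The two classes are close to balanced, which matters when interpreting classifier accuracy. As a small worked example, the sketch below computes each class's share and the accuracy of a trivial classifier that always predicts the larger class. The counts come from this README; the function names are hypothetical and the calculation is illustrative, not a result reported by the project.

```python
from typing import Dict

# Image counts per class, as stated in this README.
CLASS_COUNTS: Dict[str, int] = {"paywall_page": 122, "content_page": 119}


def class_balance(counts: Dict[str, int]) -> Dict[str, float]:
    """Return each class's fraction of the whole dataset."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def majority_baseline(counts: Dict[str, int]) -> float:
    """Accuracy of always predicting the most frequent class."""
    return max(counts.values()) / sum(counts.values())
```

With 241 images in total, the majority-class baseline is 122/241, or roughly 50.6%, so any accuracy figure for the classifier should be read against that near-chance floor.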
