DocumentCloud Summarize Add-On

This is an add-on for DocumentCloud that generates summaries for the documents given to it. It's a work in progress; pull requests and comments are welcome.

Setup

From the project root directory:

Set up a virtual environment with python3.9 -m venv venv.
Create a .env file and put NLTK_DATA=nltk_data in it.
Put your DocumentCloud credentials in .env as USER and PASSWORD. (Keep your .env safe.)
Run venv/bin/pip install -r requirements.txt.
Run make install-nltk.
To run the local server: pip install -U flask, then flask run.
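
Before starting the server, it can help to confirm that .env contains the keys named in the steps above. The following is a minimal sketch using only the standard library; the helper names parse_env and missing_keys are hypothetical and not part of this project, and the parser handles only simple KEY=VALUE lines.

```python
from pathlib import Path

# Keys the setup steps above ask you to put in .env.
REQUIRED_KEYS = ("NLTK_DATA", "USER", "PASSWORD")


def parse_env(text):
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional quotes around the value.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def missing_keys(values, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty."""
    return [key for key in required if not values.get(key)]


if __name__ == "__main__":
    env_path = Path(".env")
    text = env_path.read_text() if env_path.exists() else ""
    missing = missing_keys(parse_env(text))
    print("missing keys:", missing if missing else "none")
```

Run it from the project root. It only reports key names and never prints the values, so credentials stay out of the terminal history.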

Testing

Here's how you can test this add-on locally.

Set up a venv: python3.9 -m venv venv then source venv/bin/activate.
Install pytest: pip install -U pytest.
Install the production dependencies: pip install -r requirements.txt.
Try running it locally: python tools/try-summarize.py.
- Check the output in the terminal to make sure that the summary looks reasonable.

How it works

Here are the steps the add-on executes when summarizing:

The text is broken up into sentences.
Each sentence is encoded into an embedding (a vector) via the Universal Sentence Encoder.
The embeddings are grouped into clusters with K-means.
The embeddings nearest to each cluster centroid are collected.
Any of these embeddings whose sentences show signs of being garbage are filtered out.
The sentences corresponding to the remaining embeddings are formatted and returned as the summary.
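
The steps above can be sketched in a few dozen lines. This is an illustration under stated assumptions, not the add-on's actual code: embed is a toy hashed bag-of-words stand-in for the Universal Sentence Encoder, the sentence splitter is a simple regular expression, the K-means loop is a bare-bones NumPy version, and looks_like_garbage is a hypothetical heuristic. All function names here are made up for the example.

```python
import re
import zlib

import numpy as np


def split_sentences(text):
    """Naive splitter on ., ! and ?; a stand-in for a real sentence tokenizer."""
    return [p for p in re.split(r"(?<=[.!?])\s+", text.strip()) if p]


def embed(sentences, dim=64):
    """Toy stand-in for the Universal Sentence Encoder: hashed bag of words."""
    vectors = np.zeros((len(sentences), dim))
    for i, sentence in enumerate(sentences):
        for word in re.findall(r"[a-z]+", sentence.lower()):
            vectors[i, zlib.crc32(word.encode()) % dim] += 1.0
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / np.maximum(norms, 1e-9)  # unit length, avoiding divide-by-zero


def kmeans(points, k, iterations=20, seed=0):
    """Plain K-means: assign points to the nearest centroid, then re-average."""
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)].copy()
    for _ in range(iterations):
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        for j in range(k):
            members = points[labels == j]
            if len(members):  # leave an empty cluster's centroid where it is
                centroids[j] = members.mean(axis=0)
    return centroids


def nearest_indices(points, centroids):
    """Index of the point closest to each centroid, deduplicated and sorted."""
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return sorted(set(distances.argmin(axis=0).tolist()))


def looks_like_garbage(sentence):
    """Hypothetical heuristic: very short or mostly non-alphabetic text."""
    letters = sum(ch.isalpha() for ch in sentence)
    return len(sentence.split()) < 3 or letters / max(len(sentence), 1) < 0.5


def summarize_sketch(text, k=3):
    """Run the six steps and return the selected sentences as one string."""
    sentences = split_sentences(text)
    if not sentences:
        return ""
    vectors = embed(sentences)
    centroids = kmeans(vectors, min(k, len(sentences)))
    picked = nearest_indices(vectors, centroids)
    return " ".join(sentences[i] for i in picked if not looks_like_garbage(sentences[i]))
```

Sorting the selected indices keeps the summary sentences in their original document order, which is a choice made for this sketch. The number of clusters k caps the summary length at k sentences.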

The summarize function does the actual summarization, while Summarize.main handles getting the text out of the DocumentCloud documents, putting the summaries in a file, and uploading that file to DocumentCloud for the user to read.
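
The division of labor described above can be illustrated with plain-Python stand-ins. This sketch does not use the DocumentCloud client library and does not reproduce Summarize.main: FakeDocument, write_summaries, and run are hypothetical names, and the summarizer and uploader are passed in as ordinary callables so the flow can be run without credentials.

```python
from dataclasses import dataclass
from pathlib import Path
import os
import tempfile


@dataclass
class FakeDocument:
    """Hypothetical stand-in for a DocumentCloud document."""
    title: str
    full_text: str


def write_summaries(documents, summarize_fn, out_path):
    """Summarize each document's text and write all summaries to one file."""
    with open(out_path, "w", encoding="utf-8") as out:
        for doc in documents:
            out.write(f"{doc.title}\n{summarize_fn(doc.full_text)}\n\n")
    return out_path


def run(documents, summarize_fn, upload_fn, out_path="summaries.txt"):
    """Get text, summarize, write the file, then hand it to the uploader."""
    path = write_summaries(documents, summarize_fn, out_path)
    upload_fn(path)
    return path


if __name__ == "__main__":
    docs = [FakeDocument("Memo", "First sentence. Second sentence.")]
    out_file = os.path.join(tempfile.mkdtemp(), "summaries.txt")
    # Stand-in summarizer (first 15 characters) and uploader (just prints).
    run(docs, lambda text: text[:15], lambda p: print("would upload", p), out_file)
    print(Path(out_file).read_text())
```

Keeping the summarizer a pure text-in, text-out function is what makes it possible to try it locally without touching DocumentCloud at all.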
