parla-document-processor

This repository contains scripts for pre-processing PDF files for later use in the exploratory project Parla. It offers a generic way of importing/registering and processing PDF documents. For Parla's use case, the publicly accessible PDF documents of "Schriftliche Anfragen" (written parliamentary inquiries) and "Hauptausschussprotokolle" (main committee minutes) are processed.

Prerequisites

Features

  • Register relevant documents from various data sources (see ./src/importers). Registering a document means storing its download URL and any available metadata in the database.

  • Process registered documents by

    1. Downloading the PDF
    2. Extracting text content from the PDF (either directly or via OCR)
    3. Generating a summary of the PDF content via OpenAI
    4. Generating a list of tags describing the PDF content via OpenAI
    5. Generating embedding vectors of each PDF page via OpenAI
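The steps above can be sketched as a small pipeline. All helper names and types below are illustrative placeholders, not the repository's actual API; a real implementation would call a PDF library and the OpenAI API where the stubs are:

```typescript
// Illustrative sketch of the processing pipeline described above.
// All names and types are hypothetical, not the repository's actual API.

interface RegisteredDocument {
  id: number;
  downloadUrl: string; // must be publicly accessible
}

interface ProcessedDocument {
  id: number;
  pages: string[];        // extracted text, one entry per PDF page
  summary: string;
  tags: string[];
  embeddings: number[][]; // one embedding vector per page
}

// Stubbed steps; real versions would use a PDF/OCR library and OpenAI calls.
function extractPages(pdf: Uint8Array): string[] {
  return ["page 1 text"]; // placeholder: direct text extraction or OCR
}
function summarize(pages: string[]): string {
  return pages.join(" ").slice(0, 100); // placeholder for an OpenAI call
}
function generateTags(pages: string[]): string[] {
  return ["placeholder-tag"]; // placeholder for an OpenAI call
}
function embed(page: string): number[] {
  return [0, 0, 0]; // placeholder for an OpenAI embeddings call
}

function processDocument(doc: RegisteredDocument, pdf: Uint8Array): ProcessedDocument {
  const pages = extractPages(pdf); // steps 1–2: download + extract
  return {
    id: doc.id,
    pages,
    summary: summarize(pages),     // step 3: summary
    tags: generateTags(pages),     // step 4: tags
    embeddings: pages.map(embed),  // step 5: per-page embedding vectors
  };
}
```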

Limitations

  • Only PDF documents are supported
  • The download URLs of the documents must be publicly accessible
  • Documents with more pages than MAX_PAGES_LIMIT (64 in .env.sample) will not be processed
  • Documents with a content length of more than 15,000 tokens will not be summarized (set via environment variable)
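The page and token limits act as simple guards before the expensive OpenAI calls. A minimal sketch, assuming both limits are inclusive; the function names and constants are illustrative, not the repository's actual code:

```typescript
// Illustrative guards mirroring the limitations above; names are hypothetical.
const MAX_PAGES_LIMIT = 64;            // documents above this page count are skipped
const MAX_SUMMARY_TOKEN_LIMIT = 15000; // documents above this token count are not summarized

function shouldProcess(pageCount: number, limit: number = MAX_PAGES_LIMIT): boolean {
  return pageCount <= limit;
}

function shouldSummarize(tokenCount: number, limit: number = MAX_SUMMARY_TOKEN_LIMIT): boolean {
  return tokenCount <= limit;
}
```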

Environment variables

See .env.sample

SUPABASE_URL=
SUPABASE_SERVICE_ROLE_KEY=
SUPABASE_DB_CONNECTION=
OPENAI_API_KEY=
OPENAI_MODEL=
OPENAI_EMBEDDING_MODEL=
PROCESSING_DIR=. # Directory for storing temporary processing files
ALLOW_DELETION=false # If false, documents with missing embeddings will not be deleted from the database
MAX_PAGES_LIMIT=64 # Documents with more pages than this will not be processed
MAX_DOCUMENTS_TO_PROCESS=1000 # Maximum number of documents to process in one run
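Loading these variables with sensible defaults could look like the sketch below. The `loadConfig` helper is illustrative and not part of the repository; it only shows how the optional variables fall back to the defaults from .env.sample:

```typescript
// Illustrative .env loading with defaults; not the repository's actual config code.
interface Config {
  processingDir: string;
  allowDeletion: boolean;
  maxPagesLimit: number;
  maxDocumentsToProcess: number;
}

function loadConfig(env: Record<string, string | undefined>): Config {
  return {
    processingDir: env.PROCESSING_DIR ?? ".",
    allowDeletion: env.ALLOW_DELETION === "true",
    maxPagesLimit: Number(env.MAX_PAGES_LIMIT ?? "64"),
    maxDocumentsToProcess: Number(env.MAX_DOCUMENTS_TO_PROCESS ?? "1000"),
  };
}
```

In a real run the argument would be `process.env`, e.g. after `dotenv` has populated it.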

Run locally

⚠️ Warning: Running these scripts on many PDF documents will incur significant OpenAI API costs. ⚠️

  • Set up a .env file based on .env.sample
  • Run npm ci to install the dependencies
  • Run npx tsx ./src/run_import.ts to register the documents
  • Run npx tsx ./src/run_process.ts to process all unprocessed documents

Periodically regenerate indices

The indices on the processed_document_chunks and processed_document_summaries tables need to be regenerated when new data arrives, because the lists parameter of the index should grow with the number of rows (see https://github.com/pgvector/pgvector). To do this, we use the pg_cron extension (https://github.com/citusdata/pg_cron). To schedule the regeneration of the indices, we create two jobs that call functions defined in the API and database definition: https://github.com/technologiestiftung/parla-api.

select cron.schedule (
    'regenerate_embedding_indices_for_chunks',
    '30 5 * * *',
    $$ SELECT * from regenerate_embedding_indices_for_chunks() $$
);

select cron.schedule (
    'regenerate_embedding_indices_for_summaries',
    '30 5 * * *',
    $$ SELECT * from regenerate_embedding_indices_for_summaries() $$
);
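The reason for rebuilding is that the pgvector README recommends choosing the IVFFlat lists parameter as roughly rows / 1000 for tables up to about one million rows, and sqrt(rows) beyond that, so the appropriate value changes as the tables grow. A small illustrative helper (not part of this repository) for that heuristic:

```typescript
// Heuristic for pgvector's IVFFlat `lists` parameter, following the
// pgvector README: lists ≈ rows / 1000 (up to ~1M rows), sqrt(rows) above.
function suggestedLists(rowCount: number): number {
  if (rowCount <= 1_000_000) {
    return Math.max(1, Math.round(rowCount / 1000));
  }
  return Math.round(Math.sqrt(rowCount));
}
```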

Related repositories

Contributors ✨

Thanks goes to these wonderful people (emoji key):

Fabian Morón Zirfas
💻 🤔

Jonas Jaszkowic
💻 🤔 🚇

This project follows the all-contributors specification. Contributions of any kind welcome!

Credits

Made by

A project by

Supported by

Related Projects
