This repository contains scripts for pre-processing PDF files for later use in the exploratory project Parla. It offers a generic way of importing/registering and processing PDF documents. For the Parla use case, the publicly accessible PDF documents of "Schriftliche Anfragen" and "Hauptausschussprotokolle" are used.
- A running and accessible Supabase database with the schema defined in https://github.com/technologiestiftung/parla-api
- OpenAI API Key
- Register relevant documents from various data sources, see `./src/importers`. Registering documents means storing their download URL and possible metadata in the database (a minimal sketch follows this list).
- Process registered documents (sketched below) by
  - Downloading the PDF
  - Extracting text content from the PDF (either directly or via OCR)
  - Generating a summary of the PDF content via OpenAI
  - Generating a list of tags describing the PDF content via OpenAI
  - Generating an embedding vector for each PDF page via OpenAI
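For illustration, a registration step might look like the following sketch, using the `@supabase/supabase-js` client. The `registered_documents` table name and its columns are assumptions; the actual importers live in `./src/importers`.

```ts
import { createClient } from "@supabase/supabase-js";

const supabase = createClient(
  process.env.SUPABASE_URL!,
  process.env.SUPABASE_SERVICE_ROLE_KEY!,
);

// Hypothetical importer step: store the download URL and metadata of a
// publicly accessible PDF so a later processing run can pick it up.
async function registerDocument(
  url: string,
  metadata: Record<string, unknown>,
) {
  const { error } = await supabase
    .from("registered_documents") // table name is an assumption
    .insert({ source_url: url, metadata });
  if (error) throw error;
}

await registerDocument("https://example.org/schriftliche-anfrage.pdf", {
  source: "Schriftliche Anfragen",
});
```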
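The OpenAI-backed processing steps could be sketched roughly as below with the official `openai` package. The model fallbacks, prompt, and page-splitting are assumptions, not the repository's actual implementation.

```ts
import OpenAI from "openai";

const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });

// Hypothetical processing step: embed each extracted page and
// summarize the whole document.
async function processPages(pages: string[]) {
  // One embedding vector per PDF page.
  const embeddings = await Promise.all(
    pages.map(async (content) => {
      const res = await openai.embeddings.create({
        model: process.env.OPENAI_EMBEDDING_MODEL ?? "text-embedding-ada-002",
        input: content,
      });
      return res.data[0].embedding;
    }),
  );

  // Summary of the full document content via the chat completions API.
  const completion = await openai.chat.completions.create({
    model: process.env.OPENAI_MODEL ?? "gpt-3.5-turbo",
    messages: [
      { role: "user", content: `Summarize this document:\n${pages.join("\n")}` },
    ],
  });

  return { embeddings, summary: completion.choices[0].message.content };
}
```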
- Only PDF documents are supported
- The download URL of the documents must be publicly accessible
- Documents with > 100 pages will not be processed (configurable via the MAX_PAGES_LIMIT environment variable)
- Documents with a content length of > 15000 tokens will not be summarized (set via an environment variable)
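For illustration only, such limits might be enforced with simple guard clauses before the expensive OpenAI calls. The summary token constant is an assumption, since only `MAX_PAGES_LIMIT` appears in `.env.sample`.

```ts
// Guard clauses driven by the environment variables described below.
const MAX_PAGES_LIMIT = Number(process.env.MAX_PAGES_LIMIT ?? 64);
const MAX_SUMMARY_TOKENS = 15000; // assumption: could equally come from an env variable

// Skip processing entirely for overly long documents.
function shouldProcess(pageCount: number): boolean {
  return pageCount <= MAX_PAGES_LIMIT;
}

// Process the document, but skip the summarization step.
function shouldSummarize(tokenCount: number): boolean {
  return tokenCount <= MAX_SUMMARY_TOKENS;
}
```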
See `.env.sample`:

```
SUPABASE_URL=
SUPABASE_SERVICE_ROLE_KEY=
SUPABASE_DB_CONNECTION=
OPENAI_API_KEY=
OPENAI_MODEL=
OPENAI_EMBEDDING_MODEL=
PROCESSING_DIR=. // Directory for storing temporary processing files
ALLOW_DELETION=false // If set to true, documents with missing embeddings will be deleted from the database
MAX_PAGES_LIMIT=64 // Documents with more pages than this will not be processed
MAX_DOCUMENTS_TO_PROCESS=1000 // Maximum number of documents to process in one run
```
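A defensive way to load and validate these variables in TypeScript, assuming the `dotenv` package is used; the `requireEnv` helper is illustrative, not part of this repository.

```ts
import "dotenv/config"; // loads variables from .env into process.env

// Fail fast if a required variable is missing.
function requireEnv(name: string): string {
  const value = process.env[name];
  if (!value) throw new Error(`Missing required environment variable: ${name}`);
  return value;
}

const config = {
  supabaseUrl: requireEnv("SUPABASE_URL"),
  supabaseServiceRoleKey: requireEnv("SUPABASE_SERVICE_ROLE_KEY"),
  openAiApiKey: requireEnv("OPENAI_API_KEY"),
  maxPagesLimit: Number(process.env.MAX_PAGES_LIMIT ?? 64),
  maxDocumentsToProcess: Number(process.env.MAX_DOCUMENTS_TO_PROCESS ?? 1000),
};
```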
- Set up a `.env` file based on `.env.sample`
- Run `npm ci` to install dependencies
- Run `npx tsx ./src/run_import.ts` to register the documents
- Run `npx tsx ./src/run_process.ts` to process all unprocessed documents
The indices on the `processed_document_chunks` and `processed_document_summaries` tables need to be regenerated upon arrival of new data, because the `lists` parameter of the underlying indices should grow with the number of rows (see https://github.com/pgvector/pgvector). To do this, we use the `pg_cron` extension: https://github.com/citusdata/pg_cron. To schedule the regeneration of the indices, we create two jobs which use functions defined in the API and database definition: https://github.com/technologiestiftung/parla-api.
```sql
select cron.schedule (
  'regenerate_embedding_indices_for_chunks',
  '30 5 * * *',
  $$ SELECT * FROM regenerate_embedding_indices_for_chunks() $$
);

select cron.schedule (
  'regenerate_embedding_indices_for_summaries',
  '30 5 * * *',
  $$ SELECT * FROM regenerate_embedding_indices_for_summaries() $$
);
```
- API and database definition: https://github.com/technologiestiftung/parla-api
- Parla frontend: https://github.com/technologiestiftung/parla-frontend
Thanks goes to these wonderful people (emoji key):
- Fabian Morón Zirfas 💻 🤔
- Jonas Jaszkowic 💻 🤔 🚇
This project follows the all-contributors specification. Contributions of any kind welcome!