parla-document-processor

This repository contains scripts for pre-processing PDF files for later use in the exploratory project Parla. It offers a generic way of importing/registering and processing PDF documents. For Parla's use case, the publicly accessible PDF documents of "Schriftliche Anfragen" (written parliamentary inquiries) and "Hauptausschussprotokolle" (main committee minutes) are processed.

Prerequisites

Features

  • Register relevant documents from various data sources (see ./src/importers). Registering a document means storing its download URL and any available metadata in the database.

  • Process registered documents by

    1. Downloading the PDF
    2. Extracting text content from the PDF (either directly or via OCR)
    3. Generating a summary of the PDF content via OpenAI
    4. Generating a list of tags describing the PDF content via OpenAI
    5. Generating embedding vectors of each PDF page via OpenAI
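The steps above can be sketched as a small pipeline. All helper names and types below are illustrative placeholders, not the repository's actual API; a real implementation would call a PDF library and the OpenAI API where the stubs are:

```typescript
// Illustrative sketch of the processing pipeline described above.
// All names and types are hypothetical, not the repository's actual API.

interface RegisteredDocument {
  id: number;
  downloadUrl: string; // must be publicly accessible
}

interface ProcessedDocument {
  id: number;
  pages: string[];        // extracted text, one entry per PDF page
  summary: string;
  tags: string[];
  embeddings: number[][]; // one embedding vector per page
}

// Stubbed steps; real versions would use a PDF/OCR library and OpenAI calls.
function extractPages(pdf: Uint8Array): string[] {
  return ["page 1 text"]; // placeholder: direct text extraction or OCR
}
function summarize(pages: string[]): string {
  return pages.join(" ").slice(0, 100); // placeholder for an OpenAI call
}
function generateTags(pages: string[]): string[] {
  return ["placeholder-tag"]; // placeholder for an OpenAI call
}
function embed(page: string): number[] {
  return [0, 0, 0]; // placeholder for an OpenAI embeddings call
}

function processDocument(doc: RegisteredDocument, pdf: Uint8Array): ProcessedDocument {
  const pages = extractPages(pdf); // steps 1–2: download + extract
  return {
    id: doc.id,
    pages,
    summary: summarize(pages),     // step 3: summary
    tags: generateTags(pages),     // step 4: tags
    embeddings: pages.map(embed),  // step 5: per-page embedding vectors
  };
}
```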

Limitations

  • Only PDF documents are supported
  • The download URLs of the documents must be publicly accessible
  • Documents with more pages than MAX_PAGES_LIMIT (64 in .env.sample) will not be processed
  • Documents with a content length of more than 15,000 tokens will not be summarized (set via environment variable)
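The page and token limits act as simple guards before the expensive OpenAI calls. A minimal sketch, assuming both limits are inclusive; the function names and constants are illustrative, not the repository's actual code:

```typescript
// Illustrative guards mirroring the limitations above; names are hypothetical.
const MAX_PAGES_LIMIT = 64;            // documents above this page count are skipped
const MAX_SUMMARY_TOKEN_LIMIT = 15000; // documents above this token count are not summarized

function shouldProcess(pageCount: number, limit: number = MAX_PAGES_LIMIT): boolean {
  return pageCount <= limit;
}

function shouldSummarize(tokenCount: number, limit: number = MAX_SUMMARY_TOKEN_LIMIT): boolean {
  return tokenCount <= limit;
}
```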

Environment variables

See .env.sample

SUPABASE_URL=
SUPABASE_SERVICE_ROLE_KEY=
SUPABASE_DB_CONNECTION=
OPENAI_API_KEY=
OPENAI_MODEL=
OPENAI_EMBEDDING_MODEL=
PROCESSING_DIR=. # Directory for storing temporary processing files
ALLOW_DELETION=false # If false, documents with missing embeddings will not be deleted from the database
MAX_PAGES_LIMIT=64 # Documents with more pages than this will not be processed
MAX_DOCUMENTS_TO_PROCESS=1000 # Maximum number of documents to process in one run
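Loading these variables with sensible defaults could look like the sketch below. The `loadConfig` helper is illustrative and not part of the repository; it only shows how the optional variables fall back to the defaults from .env.sample:

```typescript
// Illustrative .env loading with defaults; not the repository's actual config code.
interface Config {
  processingDir: string;
  allowDeletion: boolean;
  maxPagesLimit: number;
  maxDocumentsToProcess: number;
}

function loadConfig(env: Record<string, string | undefined>): Config {
  return {
    processingDir: env.PROCESSING_DIR ?? ".",
    allowDeletion: env.ALLOW_DELETION === "true",
    maxPagesLimit: Number(env.MAX_PAGES_LIMIT ?? "64"),
    maxDocumentsToProcess: Number(env.MAX_DOCUMENTS_TO_PROCESS ?? "1000"),
  };
}
```

In a real run the argument would be `process.env`, e.g. after `dotenv` has populated it.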

Run locally

⚠️ Warning: Running these scripts on many PDF documents will incur significant OpenAI API costs. ⚠️

  • Set up a .env file based on .env.sample
  • Run npm ci to install the dependencies
  • Run npx tsx ./src/run_import.ts to register the documents
  • Run npx tsx ./src/run_process.ts to process all unprocessed documents

Periodically regenerate indices

The indices on the processed_document_chunks and processed_document_summaries tables need to be regenerated when new data arrives, because the lists parameter of the index should grow with the number of rows (see https://github.com/pgvector/pgvector). To do this, we use the pg_cron extension (https://github.com/citusdata/pg_cron). To schedule the regeneration of the indices, we create two jobs that call functions defined in the API and database definition: https://github.com/technologiestiftung/parla-api.

select cron.schedule (
    'regenerate_embedding_indices_for_chunks',
    '30 5 * * *',
    $$ SELECT * from regenerate_embedding_indices_for_chunks() $$
);

select cron.schedule (
    'regenerate_embedding_indices_for_summaries',
    '30 5 * * *',
    $$ SELECT * from regenerate_embedding_indices_for_summaries() $$
);
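The reason for rebuilding is that the pgvector README recommends choosing the IVFFlat lists parameter as roughly rows / 1000 for tables up to about one million rows, and sqrt(rows) beyond that, so the appropriate value changes as the tables grow. A small illustrative helper (not part of this repository) for that heuristic:

```typescript
// Heuristic for pgvector's IVFFlat `lists` parameter, following the
// pgvector README: lists ≈ rows / 1000 (up to ~1M rows), sqrt(rows) above.
function suggestedLists(rowCount: number): number {
  if (rowCount <= 1_000_000) {
    return Math.max(1, Math.round(rowCount / 1000));
  }
  return Math.round(Math.sqrt(rowCount));
}
```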

Related repositories

Contributors ✨

Thanks goes to these wonderful people (emoji key):

Fabian Morón Zirfas
💻 🤔

Jonas Jaszkowic
💻 🤔 🚇

This project follows the all-contributors specification. Contributions of any kind welcome!

Credits

Made by

A project by

Supported by

Related Projects
