SSRQ Retro Lab

Experiments on the application of generative AI for the retro-digitisation of printed editions

This repository contains code (Python scripts as well as Jupyter notebooks) and data of retrodigitized units from the collection of Swiss Law Sources (SLS). The data is used for various experiments to evaluate the quality of the digitization process, improve the quality of OCR results, and develop a workflow for the retrodigitization of the SLS collection. Furthermore, it demonstrates potential ways for further use of the data by employing advanced methods such as topic modeling or named entity recognition.

Table of Contents

  • Background
  • Data and Code
  • Experiments
  • Demo
  • To-Dos
  • Talks
  • Authors
  • References

Background

Swiss Law Sources

The Swiss Law Sources were established at the end of the 19th century by the Swiss Lawyers' Association with the aim of making the sources of Switzerland's legal history accessible to an interested public. The collection of legal sources is nowadays supported by a foundation established in 1980, which also hosts the ongoing research project under the direction of Pascale Sutter. About 15 years ago, the foundation decided to start digitizing the collection of Swiss law sources. The result of this process is the online platform "SSRQ Online", which makes all scanned volumes available to the public as PDFs. The PDFs have been processed with OCR software, but no correction or other post-processing (e.g. annotation of named entities) has been done so far. They are therefore just the starting point for a long journey of further processing and analysis.

Idea of the 'Retro Lab'

The idea of the 'Retro Lab' is to use the digitized volumes of the SLS collection as a test bed for various experiments. Different methods and tools are used to evaluate the quality of the digitization process, to improve the quality of the OCR results and to develop a workflow for the retrodigitization of the SLS collection. A special focus lies on the use of generative AI models such as GPT-3.5/4 to create an advanced processing pipeline in which most of the hard work is done by the AI.

Data and Code

Data

The data is stored in the folder data. It contains the following subfolders:

  • export: Contains a ground truth transcription of 53 pages from two volumes. This transcription was created in Transkribus and exported as a TXT file.
  • ZG: Contains the OCR results of the volume "ZG" (Zug) as a PDF file. The OCR results were created with the OCR software ABBYY FineReader. Furthermore, it contains training and validation data as TXT and JSON files.

Code

The code of the project is divided into two parts:

  1. Utility code, organized in Python modules (everything beneath src)
  2. Analysis code, organized in Jupyter notebooks (everything beneath notebooks)

All dependencies are listed in the pyproject.toml file. The code is written for Python >= 3.11. Virtual environments are managed with Hatch. To create a new virtual environment, run hatch env create in the root directory of the project. To activate the environment, run hatch env shell. The environment will have all dependencies installed.

Note: You will need a valid API key for the OpenAI API to run the notebooks.

Experiments

v1 of the experiments

For the first iteration of the experiments, take a look at the v1 branch.

v2 of the experiments

The second iteration of the experiments takes a slightly different approach. Instead of relying solely on the extracted plain text and using a Large Language Model (LLM) for all further processing (such as the recognition of individual documents), a mixed approach is used that combines 'classical' methods with an LLM. To this end, a pipeline is built from Python scripts that calls the LLM only for the parts where it is really needed. The pipeline is shown in the following figure:

Pipeline

Each component is validated by a simple set of tests, which are located in the tests folder.

No Langchain – why? Langchain is a powerful but also complex framework. Most of its features are not needed for these experiments. Instead, a custom pipeline (chain) is created that is tailored to the needs of the experiments.

Pipeline Components

The text extraction component is responsible for extracting the plain text (as an HTML string) from the PDF(s). The document/article number is used as an input parameter. Besides the PDF, it uses the XML table of contents, which is the basis for "SSRQ Online". It returns an object with the following structure:

from typing import TypedDict

class TextExtractionResult(TypedDict):
    entry: VolumeEntry        # metadata of the volume (project-defined data class)
    pages: tuple[str, ...]    # extracted text of each page as an HTML string

The VolumeEntry is a simple data class which contains the metadata of the volume. The pages attribute is a tuple of strings, where each string represents the extracted text of a page as an HTML string. The extraction and HTML conversion are handled by PyMuPDF.
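
As an illustration, the following is a minimal sketch of how pages could be pulled out of a volume PDF as HTML with PyMuPDF (the function name and page selection are assumptions, not the project's actual code):

import fitz  # PyMuPDF

def extract_pages_as_html(pdf_path: str, page_numbers: list[int]) -> tuple[str, ...]:
    # Open the volume PDF and render each requested page as an HTML string,
    # mirroring the structure of TextExtractionResult.pages.
    with fitz.open(pdf_path) as doc:
        return tuple(doc[number].get_text("html") for number in page_numbers)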

The next component takes the extracted HTML string(s) and tries to extract the relevant HTML elements for the requested article. Like the first component, it does not use an LLM.

Shortcomings of this component:

  • Relies on the correct structure of the HTML string
  • Relies on the OCR results of the PDF
  • Quick & dirty implementation to find the relevant HTML elements

Returns the following object:

class HTMLTextExtractionResult(TextExtractionResult):
    article: tuple[Selector, ...]    # HTML nodes that belong to the requested article
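
A rough sketch of how such a selection could look, assuming Selector refers to parsel.Selector (the XPath expression is purely illustrative and much simpler than the actual implementation):

from parsel import Selector

def find_article_nodes(pages: tuple[str, ...], article_number: int) -> tuple[Selector, ...]:
    # Collect all paragraph nodes whose text mentions the requested article number.
    nodes: list[Selector] = []
    for page_html in pages:
        selector = Selector(text=page_html)
        nodes.extend(selector.xpath(f"//p[contains(., '{article_number}')]"))
    return tuple(nodes)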

The classification component uses the nodes extracted in the previous step and tries to classify each of them. It uses the default GPT-4 model from OpenAI. The classification is done with a prompt that contains few-shot examples.

Returns the following object:

from pydantic import BaseModel

class StructuredArticle(BaseModel):
    article_number: int
    date: str
    references: list[str]
    summary: list[str]
    text: list[str]
    title: str
The component is tested against a few examples. The accuracy of these test cases is above 90%. See the test cases for more details.
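
To make the approach concrete, here is a heavily simplified sketch of such a few-shot classification call with the OpenAI Python client (the prompt, the example messages and the JSON parsing are assumptions; the project's actual prompt is more elaborate):

from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

def classify_article(node_texts: list[str]) -> StructuredArticle:
    # Few-shot setup: a system instruction plus one worked example,
    # followed by the text of the extracted nodes to classify.
    messages = [
        {"role": "system", "content": "Classify the given OCR lines into the fields of a StructuredArticle and answer with JSON only."},
        {"role": "user", "content": "1. Ratsbeschluss\n1519 Mai 3.\n..."},  # illustrative example input
        {"role": "assistant", "content": '{"article_number": 1, "date": "1519 Mai 3.", "references": [], "summary": [], "text": ["..."], "title": "Ratsbeschluss"}'},  # illustrative example output
        {"role": "user", "content": "\n".join(node_texts)},
    ]
    response = client.chat.completions.create(model="gpt-4", messages=messages)
    return StructuredArticle.model_validate_json(response.choices[0].message.content)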

The correction component uses the structured article created in the previous step and tries to correct the OCR results. It uses a fine-tuned GPT-3.5 model for this task. The data used for fine-tuning can be found in the data section. Some validation is done in a Jupyter notebook.

It returns the following object:

class StructuredCorrectedArticle(StructuredArticle):
    corrected_references: CorrectedOCRText
    corrected_summary: CorrectedOCRText
    corrected_text: CorrectedOCRText
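
A minimal sketch of how the correction call to the fine-tuned model could look (the model identifier is a placeholder, not the real fine-tune ID, and the instruction text is an assumption):

from openai import OpenAI

client = OpenAI()

def correct_ocr_lines(lines: list[str]) -> list[str]:
    # Send the OCR text of one article to the fine-tuned correction model
    # and return the corrected lines.
    response = client.chat.completions.create(
        model="ft:gpt-3.5-turbo:ssrq::placeholder",  # placeholder fine-tune ID
        messages=[
            {"role": "system", "content": "Correct the OCR errors in the given text. Do not change the wording."},
            {"role": "user", "content": "\n".join(lines)},
        ],
    )
    return response.choices[0].message.content.splitlines()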

Things left open:

  • Correct summary and references
  • Implement better validation for the OCR correction

As the last processing step, some Named Entity Recognition (NER) is done. The NER is backed by spacy-llm, which uses a GPT-4 model and parses the output into a structured spaCy document. Some simple validation is done here.

It returns a tuple, which contains the StructuredCorrectedArticle and the spacy.Doc.
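
A hedged sketch of how spacy-llm can be wired up for this step (the labels, registry versions and example sentence are assumptions; the project's actual configuration may differ):

import spacy

nlp = spacy.blank("de")
nlp.add_pipe(
    "llm",
    config={
        "task": {"@llm_tasks": "spacy.NER.v2", "labels": ["PERSON", "LOCATION", "ORGANISATION"]},
        "model": {"@llm_models": "spacy.GPT-4.v2"},
    },
)

doc = nlp("Hans Müller verkauft dem Kloster Zug einen Weinberg.")  # illustrative sentence
print([(ent.text, ent.label_) for ent in doc.ents])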

Last but not least, the result is converted into a TEI XML file. The TEI XML file is created from a simple template, which is filled with the data of the StructuredCorrectedArticle and the spacy.Doc. The template can be found here.
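
How this template-based serialization could look in principle, assuming a Jinja2-style template (the template content and the selected fields are illustrative and much shorter than a real TEI file):

from jinja2 import Template

TEI_TEMPLATE = Template("""<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader><fileDesc><titleStmt><title>{{ title }}</title></titleStmt></fileDesc></teiHeader>
  <text><body><div n="{{ article_number }}">
  {% for paragraph in paragraphs %}<p>{{ paragraph }}</p>
  {% endfor %}</div></body></text>
</TEI>""")

def to_tei(article: StructuredCorrectedArticle) -> str:
    # Fill the template with the corrected article; the named entities from the
    # spacy.Doc would be added as inline markup in the real pipeline.
    return TEI_TEMPLATE.render(
        title=article.title,
        article_number=article.article_number,
        paragraphs=article.text,
    )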

Demo

The following video shows a demo of the complete process in a simple UI built with gradio. To speed up the demo, an article is used that has already been processed; the results for this article are retrieved from the cache. The cache is implemented with the diskcache library.

pipeline-demo.mov
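
The interplay of gradio and diskcache can be sketched roughly as follows (run_pipeline, the cache directory and the UI layout are assumptions, not the project's actual code):

import gradio as gr
from diskcache import Cache

cache = Cache("cache")  # persistent on-disk cache

def process_article(article_number: str) -> str:
    # Return the cached result if the article was already processed,
    # otherwise run the (expensive) pipeline and store the result.
    cached = cache.get(article_number)
    if cached is not None:
        return cached
    result = run_pipeline(article_number)  # hypothetical pipeline entry point
    cache.set(article_number, result)
    return result

demo = gr.Interface(fn=process_article, inputs="text", outputs="text")
demo.launch()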

To-Dos

This experiment is a first prototype; it is not ready for production use, and some things are left open:

  • Implement better validation for all components
  • Implement a 'Human in the Loop' for all steps in the pipeline
  • Improve performance by bundling requests and/or using concurrent requests to external services (like the OpenAI API)
  • Implement checks for the prompts sent to the LLM (e.g. check the length of the prompt)
  • ...

Talks

The work done here will be presented in the context of the following talks:

  • Bastian Politycki, Pascale Sutter, Christian Sonder: „Datenschätze heben. Ein Bericht zur Digitalisierung der Sammlung Schweizerischer Rechtsquellen“. Editions als Transformation. Plenartagung der AG für germanistische Edition, 21.–24. Februar 2024, Bergische Universität Wuppertal. Slides will be linked here after the talk.
  • Bastian Politycki: „Anwendung generativer KI zur Digitalisierung gedruckter Editionen am Beispiel der Sammlung Schweizerischer Rechtsquellen“. W8: Generative KI, LLMs und GPT bei digitalen Editionen, DHd2024 Passau, 26.02.2024–01.03.2024. Slides will be linked here after the talk.

Authors

Bastian Politycki – University of St. Gallen / Swiss Law Sources

References

Tools used

For a complete list, see pyproject.toml.
