Browser-based app for segmenting & OCRing PDF pages based on whitespace rules. To assist researchers (especially in the humanities) with turning their materials into machine-actionable datasets.

lizfischer/document-segmentation

PDF Segmentation Application

About

Document segmentation is the process of breaking a digital (or digitized) document into its constituent parts: for example, splitting a scanned library catalog into individual records. Segmentation is a vital step in many digital humanities projects, because data are often trapped in PDF scans or photographs of materials that are not machine-actionable. Most OCR tools do some degree of layout analysis, but they do not allow enough customization of that analysis to be useful here. For example, they may find the paragraph boundaries but fail to detect when a new "meaningful unit" of content begins.

This application is a browser-based tool for segmenting non-OCRed PDFs into individual, machine-readable text files. It takes advantage of the huge role whitespace plays in human understanding of a page of text, walking the user through creating custom rules about which bits of whitespace indicate a meaningful break in content, then acting on those rules to automate the separation of even very long documents.
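
To make the idea of a whitespace rule concrete, here is a minimal sketch in Python. It is an illustration only and is not taken from this application's code: the page representation (a list of pixel rows, where 1 means ink and 0 means blank), the function names, and the `min_gap` parameter are all assumptions made for the example. The rule it encodes is "a run of at least `min_gap` blank rows marks a break between segments."

```python
from typing import List, Tuple

def find_whitespace_gaps(rows: List[List[int]], min_gap: int) -> List[Tuple[int, int]]:
    """Return (start, end) row ranges, end exclusive, of blank runs at least min_gap rows tall."""
    gaps = []
    run_start = None  # first row of the blank run currently being scanned, if any
    for y, row in enumerate(rows):
        blank = not any(row)  # a row is blank when it contains no ink pixels
        if blank and run_start is None:
            run_start = y
        elif not blank and run_start is not None:
            if y - run_start >= min_gap:
                gaps.append((run_start, y))
            run_start = None
    # Close a blank run that extends to the bottom of the page.
    if run_start is not None and len(rows) - run_start >= min_gap:
        gaps.append((run_start, len(rows)))
    return gaps

def split_into_segments(rows: List[List[int]], min_gap: int) -> List[Tuple[int, int]]:
    """Return (start, end) row ranges, end exclusive, of the content between qualifying gaps."""
    segments = []
    cursor = 0
    for start, end in find_whitespace_gaps(rows, min_gap):
        if start > cursor:
            segments.append((cursor, start))
        cursor = end
    if cursor < len(rows):
        segments.append((cursor, len(rows)))
    return segments

# Hypothetical 9-row page: two inked rows, a 1-row gap, one inked row,
# a 3-row gap, then two more inked rows.
page = [
    [0, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
    [1, 1, 1, 0],
    [0, 1, 0, 0],
]

print(split_into_segments(page, min_gap=2))
print(split_into_segments(page, min_gap=1))
```

Raising `min_gap` ignores small gaps such as ordinary line spacing, while lowering it splits on every blank row; choosing that kind of threshold per document is the sort of decision a custom whitespace rule captures.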

This software was developed from Summer 2022 to Spring 2023 as part of my dissertation project, “Network Visualization and the Labor of Reference Work: Three Case Studies touching Medieval and Early-Modern Book History.”

License

This software is released under the GNU General Public License v3.0. See LICENSE for details.

Credit

Getting Started

Requirements

  • Docker
  • A web browser (tested with Firefox and Chrome; your mileage may vary with other browsers)

Installation

  • Clone this repository, or download and unzip the code in a location of your choosing
  • Open a terminal/command prompt and navigate to this repository
  • Run the command: docker compose up
  • In your browser, navigate to http://127.0.0.1:5000/

User Guide

See the repository wiki for a guide to using the tool.

Accuracy

Preliminary ground-truth testing using inputs from four different source documents indicates that this whitespace-based method of segmentation performs on average 57% better than textual pattern recognition (through regular expressions) alone. The 57% figure corresponds to the relative improvement in F1 score, from 56.49% to 88.89%.

Metric      Regex Only   Whitespace Segmentation
Precision   45.12%       91.67%
Recall      75.51%       86.27%
F1          56.49%       88.89%
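
As a quick check on how these figures relate, the sketch below recomputes F1 as the harmonic mean of precision and recall, then computes the relative improvement of the whitespace method over regex alone. The function and variable names are chosen for this example, and the inputs are the averaged percentages reported above, so small rounding differences are to be expected.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both in the same units)."""
    return 2 * precision * recall / (precision + recall)

# Averaged precision and recall from the results above, in percent.
regex_f1 = f1_score(45.12, 75.51)
whitespace_f1 = f1_score(91.67, 86.27)

# Relative improvement of whitespace segmentation over regex alone, in percent.
relative_gain = (whitespace_f1 - regex_f1) / regex_f1 * 100

print(f"Regex-only F1:  {regex_f1:.2f}%")
print(f"Whitespace F1:  {whitespace_f1:.2f}%")
print(f"Relative gain:  {relative_gain:.0f}%")
```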
