PDF Segmentation Application

About

Document segmentation is the process of breaking a digital (or digitized) document into its constituent parts: for example, splitting a scanned library catalog into individual records. Segmentation is a vital step in many digital humanities projects, because data are often trapped in PDF scans or photographs of materials that are not machine-actionable. Most OCR tools do some degree of layout analysis, but they do not allow enough customization of that analysis to be useful here. For example, they may find the paragraph boundaries but fail to detect where a new "meaningful unit" of content begins.

This application is a browser-based tool for segmenting non-OCRed PDFs into individual, machine-readable text files. It takes advantage of the huge role whitespace plays in human understanding of a page of text, walking the user through creating custom rules about which bits of whitespace indicate a meaningful break in content, then acting on those rules to automate the separation of even very long documents.
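The whitespace idea can be illustrated with a minimal sketch. This is not the application's actual code: the function names, the row-by-row "ink count" representation of a page, and the single minimum-height rule are all assumptions made for illustration. Each page is reduced to a list of dark-pixel counts per row, runs of blank rows are collected as gaps, and any gap at least as tall as a user-chosen threshold is treated as a break between segments.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start_row, end_row), end exclusive


def blank_row_runs(row_ink: List[int], blank_threshold: int = 0) -> List[Span]:
    """Return spans of consecutive rows whose ink count is <= blank_threshold.

    Hypothetical helper: row_ink[i] stands for the number of dark pixels in row i.
    """
    gaps: List[Span] = []
    start = None
    for i, ink in enumerate(row_ink):
        if ink <= blank_threshold:
            if start is None:
                start = i  # a new run of blank rows begins here
        elif start is not None:
            gaps.append((start, i))  # the run ended on the previous row
            start = None
    if start is not None:
        gaps.append((start, len(row_ink)))  # run reaches the bottom of the page
    return gaps


def split_on_gaps(row_ink: List[int], min_gap_height: int) -> List[Span]:
    """Split a page into segments at every gap at least min_gap_height rows tall.

    Shorter gaps (ordinary line spacing, for instance) stay inside a segment.
    """
    segments: List[Span] = []
    seg_start = 0
    for gap_start, gap_end in blank_row_runs(row_ink):
        if gap_end - gap_start >= min_gap_height:
            if gap_start > seg_start:
                segments.append((seg_start, gap_start))
            seg_start = gap_end  # next segment begins after the gap
    if seg_start < len(row_ink):
        segments.append((seg_start, len(row_ink)))
    return segments


# Toy page: two text blocks separated by a three-row gap, with a
# one-row gap inside the first block and one blank row at the bottom.
page = [5, 7, 0, 6, 4, 0, 0, 0, 9, 8, 0]
print(split_on_gaps(page, min_gap_height=3))  # [(0, 5), (8, 11)]
```

With a threshold of 3, the one-row gap is ignored and only the three-row gap splits the page. Lowering the threshold to 1 would split on every blank row, which shows why the rule needs to be tuned per document.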

This software was developed from Summer 2022 to Spring 2023 as part of my dissertation project, “Network Visualization and the Labor of Reference Work: Three Case Studies touching Medieval and Early-Modern Book History.”

License

This software is released under the GNU General Public License v3.0. See LICENSE for details.

Credit

Getting Started

Requirements

Docker
A web browser -- tested with Firefox and Chrome; your mileage may vary with other browsers.

Installation

Clone this repository, or download and unzip the code in a location of your choosing
Open a terminal/command prompt and navigate to this repository
Run the command docker compose up
In your browser, navigate to http://127.0.0.1:5000/

User Guide

See the repository wiki for a guide to using the tool.

Accuracy

Preliminary ground-truth testing using inputs from four different source documents indicates that this whitespace-based method of segmentation performs on average 57% better than textual pattern recognition (through regular expressions) alone. That figure matches the relative gain in F1 score in the table below (88.89% versus 56.49%).

Metric	Regex Only	Whitespace Segmentation
Precision	45.12%	91.67%
Recall	75.51%	86.27%
F1	56.49%	88.89%
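The F1 scores in the table can be reproduced from the listed precision and recall, since F1 is their harmonic mean. The sketch below does that arithmetic and also computes the relative F1 gain, on the assumption that the "57% better" figure refers to that gain (the README does not say so explicitly). The function names are hypothetical and do not come from the repository.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def relative_gain(new: float, old: float) -> float:
    """Improvement of `new` over `old`, as a percentage of `old`."""
    return (new - old) / old * 100


# Precision and recall copied from the table above.
regex_f1 = f1_score(45.12, 75.51)
whitespace_f1 = f1_score(91.67, 86.27)

print(f"Regex only F1:          {regex_f1:.2f}%")       # 56.49%
print(f"Whitespace F1:          {whitespace_f1:.2f}%")  # 88.89%
print(f"Relative gain in F1:    {relative_gain(whitespace_f1, regex_f1):.0f}%")  # 57%
```

Note that the gain is relative, not a difference in percentage points: the absolute gap between the two F1 scores is about 32 points.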
