truthandtransparency/scrubber

Scrubber

A tool to "scrub" PDFs of metadata, fingerprints, watermarks, and other identifying data and optimize them for publication to the web.

Scrubber is a wrapper combining several other open source tools including PDF Redact Tools, OCRmyPDF, and QPDF. It was originally written with the intent to protect whistleblowers submitting documents to the TTF.

Scrubber achieves this goal in the following steps:

  1. Use pdf-redact-tools to convert each page of the PDF into an individual image and then combine the images back into a single PDF.
  • This is done in an effort to remove any digital watermarks or fingerprints.
  • The user has the option to just produce the images, redact them separately, and merge them later.
  2. Make the PDF searchable by adding a text layer with ocrmypdf.
  • ocrmypdf also compresses the PDF, reducing its file size.
  3. Optimize the PDF for hosting online by linearizing it with qpdf.
  4. Remove any EXIF data created in the previous 3 steps using exiftool.
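
The four steps above can be sketched as a dry run that only prints one command per stage. This is an illustration, not the contents of scrub.sh: the function name plan_scrub, the intermediate file names, and the exact flags are assumptions based on each tool's commonly documented usage (for example, that pdf-redact-tools --sanitize writes a file ending in -final.pdf).

```shell
#!/usr/bin/env bash
# Dry-run sketch: print the command for each stage instead of running it.
# File names and flags are assumptions, not taken from scrub.sh.
plan_scrub() {
  local input="$1" output="$2" lang="${3:-eng}"
  local base="${input%.pdf}"
  # 1. Rasterize every page and rebuild the PDF from the images.
  echo "pdf-redact-tools --sanitize $input"
  # 2. Add a searchable text layer (assumes step 1 wrote <name>-final.pdf).
  echo "ocrmypdf -l $lang ${base}-final.pdf ${base}-ocr.pdf"
  # 3. Linearize the result for web delivery.
  echo "qpdf --linearize ${base}-ocr.pdf $output"
  # 4. Strip metadata written by the earlier tools.
  echo "exiftool -all= -overwrite_original $output"
}

plan_scrub /data/leak.pdf /data/clean.pdf
```

Printing the commands rather than executing them keeps the sketch runnable on a machine without any of the four tools installed.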

PREREQUISITES:

You must have Docker installed. It can be installed on CentOS, Debian, Fedora, Ubuntu, macOS, and Windows.

USAGE:

The simplest way to run Scrubber is with the following command:

./scrubber.sh -i /path/to/input/pdf -o /path/to/desired/output/pdf

That will pull down the official Scrubber Docker container and work its magic. If you'd rather not pull the container from Docker Hub and would prefer to build it locally, you can run:

./scrubber.sh --local -i /path/to/input/pdf -o /path/to/desired/output/pdf

*NOTE: Currently Scrubber requires that absolute filepaths be passed to it. This will be fixed in the near future.
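
Until relative paths are supported, one possible workaround is to resolve them in the calling shell before handing them to Scrubber. The helper below is a minimal sketch and not part of the project; the name abs_path is hypothetical, and it assumes the parent directory of the file already exists.

```shell
#!/usr/bin/env bash
# abs_path: print an absolute version of a file path.
# The file itself need not exist, but its parent directory must.
abs_path() {
  local dir name
  dir="$(CDPATH= cd -- "$(dirname -- "$1")" && pwd)" || return 1
  name="$(basename -- "$1")"
  # Avoid a double slash when the parent directory is the root.
  if [ "$dir" = "/" ]; then
    printf '/%s\n' "$name"
  else
    printf '%s/%s\n' "$dir" "$name"
  fi
}

# Example call (commented out because it needs Docker and scrubber.sh):
# ./scrubber.sh -i "$(abs_path input.pdf)" -o "$(abs_path output.pdf)"
abs_path "report.pdf"
```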

A full list of command options is as follows:

Command:
[] = required parameters
<> = optional parameters

scrubber.sh <options> -i|--input [input_file_path] -o|--output [output_file_path] 

Options:
 --local: Build docker image locally rather than pulling from Docker Hub
 -l, --language <language>: Language for PDF OCR (https://github.com/tesseract-ocr/tesseract/blob/master/doc/tesseract.1.asc#languages)
 -a, --achromatic: Remove color from PDF to avoid Reality Winner's situation. Cannot be set with -m or --merge
 -r, --redact: Just separate the pages of the PDF into PNGs for redactions
 -m, --merge: Just merge the PNGs
 -h, --help: Print this help text
 -i, --input: PDF to scrub
 -o, --output: Filepath for clean PDF
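
The help text says -a cannot be combined with -m. As an illustration of how a wrapper script might enforce that rule, here is a minimal sketch. The function name check_mode is hypothetical and this is not the project's actual argument parser. It is also deliberately simple: it only scans for the flags and does not handle option values that happen to look like flags.

```shell
#!/usr/bin/env bash
# check_mode: return 0 if the flag combination is allowed, 1 otherwise.
# Hypothetical helper, not taken from scrubber.sh.
check_mode() {
  local achromatic=0 merge=0 arg
  for arg in "$@"; do
    case "$arg" in
      -a|--achromatic) achromatic=1 ;;
      -m|--merge)      merge=1 ;;
    esac
  done
  if [ "$achromatic" -eq 1 ] && [ "$merge" -eq 1 ]; then
    echo "error: -a/--achromatic cannot be combined with -m/--merge" >&2
    return 1
  fi
  return 0
}

check_mode -a -i /path/to/in.pdf -o /path/to/out.pdf && echo "flags ok"
```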

FILES

Dockerfile

The file used to build the Docker image. More info can be read here.

scrub.sh

The bash script that actually performs the 4 steps outlined above. This script is copied into the Docker container at /usr/local/bin/scrub.

scrubber.sh

A bash wrapper around the appropriate docker commands.
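
To give a rough idea of what such a wrapper might assemble, the sketch below prints a docker run command that bind-mounts the input and output directories. It is not the contents of scrubber.sh. The image name, the /in and /out mount points, and the arguments passed to scrub inside the container are all hypothetical placeholders. Docker's -v option expects an absolute host path for a bind mount, which is one plausible reason for the absolute-path requirement.

```shell
#!/usr/bin/env bash
# IMAGE is a hypothetical placeholder, not the project's published image name.
IMAGE="${SCRUBBER_IMAGE:-example/scrubber}"

# docker_cmd: print (not run) a docker invocation that mounts the input
# directory read-only and the output directory read-write.
# The printed line is for display only; paths are not shell-quoted.
docker_cmd() {
  local input="$1" output="$2"
  local in_dir out_dir
  in_dir="$(dirname "$input")"
  out_dir="$(dirname "$output")"
  printf 'docker run --rm -v %s:/in:ro -v %s:/out %s scrub -i /in/%s -o /out/%s\n' \
    "$in_dir" "$out_dir" "$IMAGE" "$(basename "$input")" "$(basename "$output")"
}

docker_cmd /home/me/docs/leak.pdf /home/me/clean/leak.pdf
```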

common_functions.sh

A bash script containing functions used in both scrub.sh and scrubber.sh.

LICENSING

Scrubber is licensed under the GNU General Public License v3.0.

DISCLAIMER

There is absolutely no guarantee that Scrubber will completely clean all potentially identifying information from a document. It should not be treated as a "silver bullet".

ROADMAP
