autoarchiver

autoarchiver is a scanning, OCRing and archiving-solution built on top of various Linux libraries and utilities.

It offers a date and tag-based storage hierarchy on top of a regular file-system making it easy to navigate, maintain and backup through standard backup technologies and utilities like rsync, AWS S3, etc.

features

Main features:

Command-line only.
Minimal syntax. Provide only what is at minimum needed.
Automatic date-extraction from scanned documents.
Archive existing documents into same hierarchy, and get text indexing included for free.
Easily locate documents based on tags or content.

dependencies

autoarchiver depends on the following software:

Python3 (only standard modules required)
SANE/XSANE (for scanning)
ImageMagick (for OCR pre-processing)
Tesseract & Leptonica (for OCR)
Exact-image (for PDF processing)

Most of these should be available as packages on Debian-based distros. As of Ubuntu 16.04 this also includes Tesseract, but on earlier distries Tesseract may need to be built from source in order to support the HOCR-output used by Exact-image.

installation

On Ubuntu 16.04 you can use the following commands to install all dependencies and setup autoarchiver for simple CLI usage:

$ sudo apt install python3 sane-utils imagemagick tesseract-ocr
$ cd $HOME
$ git clone https://github.com/josteink/autoarchiver
$ mkdir -p $HOME/bin
$ export PATH=$PATH:$HOME/bin
$ ln -s $HOME/autoarchiver/archive.py $HOME/bin/da

usage

To use autoarchiver simply invoke the archive.py (or whatever short-hand alias you've created) from the command-line:

$ ./archive.py --help
usage: archive.py [-h] [--date DATE] [--file FILE] [tags [tags ...]]

positional arguments:
  tags                  The tags to apply to the document.

optional arguments:
  -h, --help            show this help message and exit
  --date DATE, -d DATE  Date of the archived document. Use when auto-detection
                        fails.
  --file FILE, -f FILE  The file to archive. If omitted, document will be
                        retrieved from scanner.

Archived documents will be stored in $HOME/DocumentArchive sorted on date, and stored with tags.

Both a plain-text OCRed representation (easily searched by grep) and a PDF containing the originally scanned document merged with the OCRed data will be available in a per-archived document folder.

Date will be attempted detected from the document. Overriding detection can be done from the command-line. You can use as minimal syntax as possible "31-12" will be interpreted as December 31th, this year. Etc.

searching/querying

Based on the filesystem layout you can easily identity documents based either on the tags used to archive it, the period it was archived or the content of the scanned document using standard Linux command-line tools like find and grep.

A simple tool for these kind of queries is included in the form of a shell-script called da-search.

Name		Name	Last commit message	Last commit date
Latest commit History 62 Commits
.github/workflows		.github/workflows
.gitignore		.gitignore
README.md		README.md
archive.py		archive.py
da-search.sh		da-search.sh
makefile		makefile
tests.py		tests.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

.github/workflows

.github/workflows

.gitignore

.gitignore

README.md

README.md

archive.py

archive.py

da-search.sh

da-search.sh

makefile

makefile

tests.py

tests.py

Repository files navigation

autoarchiver

features

dependencies

installation

usage

searching/querying

About

Releases

Packages

Contributors 2

Languages

josteink/autoarchiver

Folders and files

Latest commit

History

Repository files navigation

autoarchiver

features

dependencies

installation

usage

searching/querying

About

Resources

Stars

Watchers

Forks

Languages