This repository has been archived by the owner on Dec 18, 2019. It is now read-only.

Work directory organization

Jerome Flesch edited this page Nov 9, 2016 · 4 revisions

workdir|rootdir = ~/papers

Global organisation

The work directory contains one folder per document.

The folder names are (usually) the scan/import date of the document: YYYYMMDD_hhmm_ss[_<idx>]. The suffix 'idx' is optional and is just a number added in case of name collision.
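As an illustration of the naming scheme, here is a minimal Python sketch that recognises such folder names. The function name `parse_doc_folder_name` and the regular expression are assumptions made for this example, not part of the application. Since folder names are only usually dates, any other name returns `None`.

```python
import re
from datetime import datetime

# Pattern for YYYYMMDD_hhmm_ss with an optional _<idx> suffix.
DOC_ID_RE = re.compile(r"(\d{8}_\d{4}_\d{2})(?:_(\d+))?")


def parse_doc_folder_name(name):
    """Return (datetime, idx) for a date-style folder name, or None otherwise."""
    match = DOC_ID_RE.fullmatch(name)
    if match is None:
        return None  # folder names are only *usually* dates
    try:
        stamp = datetime.strptime(match.group(1), "%Y%m%d_%H%M_%S")
    except ValueError:
        return None  # right shape, but not a valid date/time
    idx = int(match.group(2)) if match.group(2) else None
    return stamp, idx
```

For example, `parse_doc_folder_name("20130505_1518_00")` yields the scan date with no index, while a name carrying a collision suffix also yields the index as an integer.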

In every folder you have:

  • For image documents:
    • paper.<X>.jpg : A page in JPG format (X starts at 1)
    • paper.<X>.words (optional) : An hOCR file containing all the words found on the page by the OCR. It is required for indexing, and can be regenerated with the options "Redo OCR (...)".
    • paper.<X>.thumb.jpg (optional, generated automatically) : A thumbnail version of the page (faster to load)
    • labels (optional) : a text file containing the labels applied to this document
    • extra.txt (optional) : extra keywords added by the user
  • For PDF documents:
    • doc.pdf : the document
    • labels (optional) : a text file containing the labels applied to this document
    • extra.txt (optional) : extra keywords added by the user
    • paper.<X>.words (optional) : An hOCR file containing all the words found on the page by the OCR. Some PDFs contain garbage instead of the real text, so running the OCR on them can sometimes be useful.
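The layout above can be inspected with a short script. The following Python sketch relies only on the file names listed above. The function name `describe_document` and the keys of the returned dictionary are hypothetical, chosen for this example.

```python
import re
from pathlib import Path

# Matches paper.<X>.jpg but not paper.<X>.thumb.jpg
PAGE_RE = re.compile(r"paper\.(\d+)\.jpg")


def describe_document(folder):
    """Summarise one document folder based on the file names it contains."""
    folder = Path(folder)
    names = {p.name for p in folder.iterdir() if p.is_file()}
    pages = sorted(int(m.group(1)) for m in map(PAGE_RE.fullmatch, names) if m)
    if "doc.pdf" in names:
        kind = "pdf"
    elif pages:
        kind = "image"
    else:
        kind = "unknown"
    return {
        "kind": kind,
        "image_pages": pages,  # page numbers of JPG pages (empty for PDFs)
        # JPG pages without a .words file cannot be indexed yet
        "pages_missing_ocr": [n for n in pages if f"paper.{n}.words" not in names],
        "has_labels": "labels" in names,
        "has_extra": "extra.txt" in names,
    }
```

A non-empty `pages_missing_ocr` list points at pages for which the OCR would need to be redone before they can be indexed.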

Here is an example of a work directory organisation:

$ find ~/papers
/home/jflesch/papers
/home/jflesch/papers/20130505_1518_00
/home/jflesch/papers/20130505_1518_00/paper.1.jpg
/home/jflesch/papers/20130505_1518_00/paper.1.thumb.jpg
/home/jflesch/papers/20130505_1518_00/paper.1.words
/home/jflesch/papers/20130505_1518_00/paper.2.jpg
/home/jflesch/papers/20130505_1518_00/paper.2.thumb.jpg
/home/jflesch/papers/20130505_1518_00/paper.2.words
/home/jflesch/papers/20130505_1518_00/paper.3.jpg
/home/jflesch/papers/20130505_1518_00/paper.3.thumb.jpg
/home/jflesch/papers/20130505_1518_00/paper.3.words
/home/jflesch/papers/20130505_1518_00/labels
/home/jflesch/papers/20110726_0000_01
/home/jflesch/papers/20110726_0000_01/paper.1.jpg
/home/jflesch/papers/20110726_0000_01/paper.1.thumb.jpg
/home/jflesch/papers/20110726_0000_01/paper.1.words
/home/jflesch/papers/20110726_0000_01/paper.2.jpg
/home/jflesch/papers/20110726_0000_01/paper.2.thumb.jpg
/home/jflesch/papers/20110726_0000_01/paper.2.words
/home/jflesch/papers/20110726_0000_01/extra.txt
/home/jflesch/papers/20130106_1309_44
/home/jflesch/papers/20130106_1309_44/doc.pdf
/home/jflesch/papers/20130106_1309_44/paper.1.words
/home/jflesch/papers/20130106_1309_44/paper.2.words
/home/jflesch/papers/20130106_1309_44/labels
/home/jflesch/papers/20130106_1309_44/extra.txt

hOCR files

With Tesseract, the hOCR file can be obtained with the following command:

tesseract paper.<X>.jpg paper.<X> -l <lang> hocr && mv paper.<X>.html paper.<X>.words

For example:

tesseract paper.1.jpg paper.1 -l fra hocr && mv paper.1.html paper.1.words
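The same step can be scripted. The Python sketch below wraps the command shown above. The function names `tesseract_command` and `redo_ocr` are hypothetical. It also assumes that, depending on the Tesseract version, the output may be written with a .hocr extension instead of .html, so it looks for both.

```python
import shutil
import subprocess
from pathlib import Path


def tesseract_command(page_jpg, lang):
    """Build the argument list for the tesseract call shown above."""
    page_jpg = Path(page_jpg)
    out_base = page_jpg.with_suffix("")  # paper.1.jpg -> paper.1
    return ["tesseract", str(page_jpg), str(out_base), "-l", lang, "hocr"]


def redo_ocr(page_jpg, lang):
    """Run tesseract, then rename its output to paper.<X>.words."""
    page_jpg = Path(page_jpg)
    subprocess.run(tesseract_command(page_jpg, lang), check=True)
    words = page_jpg.with_suffix(".words")
    # Assumption: the output extension is .html (as in the command above)
    # or .hocr, depending on the Tesseract version installed.
    for ext in (".html", ".hocr"):
        produced = page_jpg.with_suffix(ext)
        if produced.exists():
            shutil.move(str(produced), str(words))
            return words
    raise FileNotFoundError(f"no hOCR output found for {page_jpg}")
```

Calling `redo_ocr("paper.1.jpg", "fra")` from inside a document folder would mirror the example command, provided `tesseract` is installed and on the PATH.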

Label files

Here is an example of the content of a label file:

facture,#0000b1588c61
logement,#f6b6ffff0000

Each line is always [label],[color]. A given label should always have the same color in every document.
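A label file in this format is simple to read. The following Python sketch parses it and reports labels that appear with more than one color. Both function names are hypothetical, and splitting on the last comma (so that a label containing commas would survive) is an assumption of this example, not documented behaviour.

```python
def parse_labels(text):
    """Parse label-file content into a list of (label, color) tuples."""
    labels = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        # Split on the last comma; raises ValueError if there is none
        label, color = line.rsplit(",", 1)
        labels.append((label, color))
    return labels


def find_color_conflicts(all_documents_labels):
    """Return {label: colors} for labels seen with more than one color."""
    seen = {}
    for labels in all_documents_labels:
        for label, color in labels:
            seen.setdefault(label, set()).add(color)
    return {label: colors for label, colors in seen.items() if len(colors) > 1}
```

`find_color_conflicts` takes one parsed list per document, so it can be fed the result of `parse_labels` for every `labels` file in the work directory.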