CrawlyOEG/PDFExtractor

PdfExtractor

PdfExtractor is a library for extracting the resources of a PDF: text, images, and tables.

© 2018 Jorge Galán - OEG-UPM. Available under Apache License 2.0. See LICENSE.

Features

  • Extract the text and images of a PDF, using PDFBox.
  • Extract the tables of a PDF, using Tabula.
  • Choose which parts of the PDF to extract: by bookmarks, by pages, or the whole document.

Download

Download a release of the PdfExtractor jar from our releases page.

Usage

PdfExtractor provides a command line application:

$ java -jar PdfExtractor.jar --help
usage: PDFExtractor [-b <NUMBERS>] [-f] [-h] [-i <input PDF or FOLDER>]
       [-o <output FOLDER>] [-p <NUMBERS>] [-r]
Mised argument
 -b,--bookmark <NUMBERS>            [OPTIONAL] ¡NOT AT SAME THAN -p! By
                                    default, the extractor extract all of
                                    them.
                                    If the PDF has BOOKMARKS, we extract
                                    all content from selected. Using comma
                                    separated or list of ranges to
                                    listExamples: --bookmark 1-3,5-7,
                                    --bookmark 3.
 -f,--fix                           [EXPERIMENTAL] Force PDF to be
                                    extracted adjunting words, deleting
                                    files, deleting footers, .. By
                                    default, disabled
 -h,--help                          Indicate how yo use the program.
 -i,--input <input PDF or FOLDER>   [REQUIRED] Absolute Pdf or folder with
                                    PDF location path. Ex:
                                    /Users/thoqbk/table.pdf
 -o,--output <output FOLDER>        Absolute output file. By default the
                                    folder on i or the parent. Ex:
                                    /Users/thoqbk/results
 -p,--pages <NUMBERS>               [OPTIONAL] ¡NOT AT SAME THAN -p! By
                                    default, the extractor extract all of
                                    them.
                                    Using comma separated or list of
                                    ranges to list to select
                                    pagesExamples: --pages 1-3,5-7,
                                    --pages 3.
 -r,--resources                     Try to extract all resources from PDF
                                    (text, image and tables). By default,
                                    disabled

The --fix option tries to join the parts of split words, remove headers and footers, and remove tables from the final text. The --bookmark and --pages options cannot be used together.
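
The NUMBERS format accepted by --bookmark and --pages (comma-separated values and ranges, such as 1-3,5-7) can be illustrated with a short sketch. This is not PdfExtractor's own parser: the class PageRangeSketch and its parse method are hypothetical names, and the handling of whitespace, trailing commas, and descending ranges is an assumption rather than documented behavior.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of the PdfExtractor code base.
class PageRangeSketch {

    // Expands a spec such as "1-3,5-7" into the individual numbers it covers.
    static List<Integer> parse(String spec) {
        List<Integer> numbers = new ArrayList<>();
        for (String part : spec.split(",")) {
            String token = part.trim();
            if (token.isEmpty()) {
                continue; // assumption: empty entries are ignored
            }
            int dash = token.indexOf('-');
            if (dash < 0) {
                // A single number, e.g. "3".
                numbers.add(Integer.parseInt(token));
                continue;
            }
            // A range, e.g. "5-7", taken as inclusive on both ends.
            int start = Integer.parseInt(token.substring(0, dash).trim());
            int end = Integer.parseInt(token.substring(dash + 1).trim());
            if (start > end) {
                // Assumption: descending ranges are rejected.
                throw new IllegalArgumentException("Descending range: " + token);
            }
            for (int n = start; n <= end; n++) {
                numbers.add(n);
            }
        }
        return numbers;
    }

    public static void main(String[] args) {
        System.out.println(parse("1-3,5-7")); // prints [1, 2, 3, 5, 6, 7]
        System.out.println(parse("3"));       // prints [3]
    }
}
```

Under these assumptions, --pages 1-3,5-7 would select pages 1, 2, 3, 5, 6 and 7, and --pages 3 would select page 3 only.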

Building from Source

Clone this repo and run:

mvn clean compile assembly:single

The built jar is then available in the project's target folder.

OEG Laboratory STARS4ALL