GitHub - gawati/pdf-to-xml: PDF to XML converter

This is a fork of the pdfminer tool that focuses on extracting semantic XML from OCR-ed PDFs.

It extracts PDF content page by page and marks up words and lines with distinct tags.

Installation

python lc_setup.py install

You can also install it within a virtualenv.

Running

python lc_pdf2txt.py

The script provides various options. Of particular interest are the XML-specific options that this fork adds:

-B make_brief

Disables character-level font glyph output, for cases where that is too verbose.

-t xml

Sets the output type to XML.

lc_pdf2txt.py -B -t xml -o test.xml ./akn_mu_act_1923-10-13_act_14-1923_eng_main.pdf

This converts akn_mu_act_1923-10-13_act_14-1923_eng_main.pdf to test.xml.
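When several PDFs need converting, the same command can be driven from a small script. The following is a minimal sketch, not part of this repository: `build_command` and `convert_folder` are hypothetical helper names, and it assumes that `python` on your PATH is the interpreter the tool was installed for and that `lc_pdf2txt.py` is in the current directory. Only the flags shown above (`-B`, `-t xml`, `-o`) are used.

```python
import subprocess
from pathlib import Path


def build_command(pdf_path, xml_path, brief=True,
                  python="python", script="lc_pdf2txt.py"):
    """Return the argument list for one lc_pdf2txt.py conversion.

    `python` and `script` are assumptions about the local setup;
    adjust them to match where the tool is installed.
    """
    cmd = [python, script]
    if brief:
        cmd.append("-B")  # leave out character-level font glyphs
    cmd += ["-t", "xml", "-o", str(xml_path), str(pdf_path)]
    return cmd


def convert_folder(folder, **kwargs):
    """Convert every *.pdf in `folder` to an .xml file next to it."""
    for pdf in sorted(Path(folder).glob("*.pdf")):
        xml = pdf.with_suffix(".xml")
        # check=True raises CalledProcessError if the converter fails
        subprocess.run(build_command(pdf, xml, **kwargs), check=True)
```

Calling `convert_folder("./acts")` would then run one conversion per PDF found in a folder named `acts` (the folder name is only an example).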

We typically don't need character-level font glyphs, which is why the example passes -B.
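To see what a given output file actually contains, for example to compare a run with -B against one without, it can help to count the element names in the XML. The sketch below uses only the Python standard library and makes no assumption about the tag names the converter emits; `tag_counts` is a hypothetical helper, and the sample string is made up for illustration rather than taken from real output.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def tag_counts(xml_text):
    """Count how often each element name occurs in an XML document."""
    root = ET.fromstring(xml_text)
    # root.iter() walks the root element and all of its descendants
    return Counter(elem.tag for elem in root.iter())


# Invented sample; the real tag names in test.xml may differ.
SAMPLE = "<pages><page><line><word>a</word><word>b</word></line></page></pages>"

if __name__ == "__main__":
    for tag, count in tag_counts(SAMPLE).most_common():
        print(tag, count)
```

For a real file, read it first, for instance `tag_counts(open("test.xml", encoding="utf-8").read())`, assuming the output is UTF-8 encoded.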

