ocr-table

This project aims to extract tables from scanned (image-based) PDFs using Optical Character Recognition (OCR).

Install Requirements

Tesseract OCR
```
sudo apt-get install tesseract-ocr
```
ImageMagick
```
sudo apt-get install imagemagick
```
PDF Utilities
```
sudo apt-get install poppler-utils
```
Python packages
```
sudo pip install -r requirements.txt
```
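
Before running anything, it can help to confirm that the command-line tools are actually on the PATH. The following is a minimal sketch, assuming the packages above provide the `tesseract`, `convert`, and `pdftoppm` executables; the `missing_tools` helper is hypothetical and not part of this repository:

```python
import shutil

# Executables assumed to be provided by the apt packages listed above.
REQUIRED_TOOLS = ["tesseract", "convert", "pdftoppm"]


def missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools that cannot be found on the PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All required tools were found.")
```

Newer ImageMagick releases may install the executable as `magick` rather than `convert`, so adjust the list if the check reports a false alarm.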

Usage

Clear the pdf/ folder and copy all the PDF files you want to scan into it.
Run the OCR:
```
python3 shellocr.py
```
The extracted text files will be available in the txt/ folder once the process completes.
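
The internals of shellocr.py are not described here. As a rough sketch of how such a batch run could be put together with the tools installed above, the code below renders each page of every PDF in pdf/ to a PNG with `pdftoppm` and passes it to `tesseract`. The function names, the 300 DPI resolution, and the one-text-file-per-page layout are assumptions for illustration, not the script's actual behavior:

```python
import subprocess
import tempfile
from pathlib import Path


def pdftoppm_command(pdf_path, image_prefix, dpi=300):
    """Command that renders every page of a PDF to <image_prefix>-<n>.png."""
    return ["pdftoppm", "-r", str(dpi), "-png", str(pdf_path), str(image_prefix)]


def tesseract_command(image_path, output_base):
    """Command that OCRs one image; tesseract writes <output_base>.txt."""
    return ["tesseract", str(image_path), str(output_base)]


def ocr_folder(pdf_dir="pdf", txt_dir="txt"):
    """OCR every PDF in pdf_dir, writing one text file per page to txt_dir."""
    Path(txt_dir).mkdir(exist_ok=True)
    for pdf_path in sorted(Path(pdf_dir).glob("*.pdf")):
        # Page images are temporary; only the text output is kept.
        with tempfile.TemporaryDirectory() as tmp:
            prefix = Path(tmp) / pdf_path.stem
            subprocess.run(pdftoppm_command(pdf_path, prefix), check=True)
            for image in sorted(Path(tmp).glob("*.png")):
                output_base = Path(txt_dir) / image.stem
                subprocess.run(tesseract_command(image, output_base), check=True)
```

Calling `ocr_folder()` from the repository root would process pdf/ and write to txt/. Because `check=True` is set, a failing `pdftoppm` or `tesseract` call stops the run with an exception instead of silently skipping a file.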

Alternate

If the method above does not work for you, try the alternate method.
Save your file as input.pdf in the repository's root directory.
Run:
```
python3 pdf_miner.py
```
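
What pdf_miner.py does internally is not documented here. Going by the file name, one plausible approach is direct text extraction with the pdfminer.six library, which only works when the PDF carries an embedded text layer. A minimal sketch under that assumption; the `pdf_to_text` helper and the output.txt destination are illustrative choices:

```python
from pathlib import Path


def pdf_to_text(pdf_path="input.pdf", txt_path="output.txt", extractor=None):
    """Extract text from a PDF and write it to a text file.

    extractor is any callable mapping a PDF path to a string; by default
    pdfminer.six's extract_text is used (requires `pip install pdfminer.six`).
    """
    if extractor is None:
        # Imported lazily so this module loads even without pdfminer.six.
        from pdfminer.high_level import extract_text as extractor
    text = extractor(str(pdf_path))
    Path(txt_path).write_text(text, encoding="utf-8")
    return text
```

Passing the extractor as a parameter keeps the file handling separate from the library call, so a different extraction backend can be swapped in without changing the rest of the function.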
