Table Extraction from the PDF

Introduction

The objective of this project is to extract tables and their cells from a PDF using the Python library Camelot.

Note: Camelot works better when the boundaries of each cell are clearly defined, that is, when any two adjacent cells are separated by a solid line.

Detection of Outer Boundary

Table extraction from a PDF can be done with a Camelot method called Lattice. It takes the following steps to identify the table region:

The PDF is converted into an image using Ghostscript
Image processing is applied to obtain the horizontal and vertical lines
Line segments are detected
Table boundaries are computed by overlapping the detected line segments, “or”ing their pixel intensities
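
The last step above can be illustrated with a small, self-contained sketch. It is not Camelot's own code (Camelot works on the rendered page image, not on Python lists). The function names `or_masks` and `bounding_box` and the toy 6 x 8 page are assumptions made purely for illustration. Line pixels are 1 and background pixels are 0.

```python
def or_masks(horizontal, vertical):
    """Combine two binary masks of equal shape with a pixel-wise OR."""
    return [
        [h | v for h, v in zip(h_row, v_row)]
        for h_row, v_row in zip(horizontal, vertical)
    ]


def bounding_box(mask):
    """Return (left, top, right, bottom) of all non-zero pixels, or None."""
    points = [
        (x, y)
        for y, row in enumerate(mask)
        for x, value in enumerate(row)
        if value
    ]
    if not points:
        return None
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    return min(xs), min(ys), max(xs), max(ys)


# Toy page (hypothetical data): 6 rows x 8 columns, with a table
# outlined between columns 1..6 and rows 1..4.
WIDTH, HEIGHT = 8, 6
horizontal = [[0] * WIDTH for _ in range(HEIGHT)]
vertical = [[0] * WIDTH for _ in range(HEIGHT)]
for x in range(1, 7):
    horizontal[1][x] = 1  # top edge
    horizontal[4][x] = 1  # bottom edge
for y in range(1, 5):
    vertical[y][1] = 1  # left edge
    vertical[y][6] = 1  # right edge

combined = or_masks(horizontal, vertical)
table_bbox = bounding_box(combined)
print(table_bbox)  # (1, 1, 6, 4)
```

The OR keeps every pixel that belongs to either kind of line, so the outline of the table survives as one connected shape whose extent gives the table region.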

The image below shows the detected outer lines of a table:

Detection of Cell Boundaries

Intersection points of the horizontal and vertical lines are identified by image processing techniques, and these points give the coordinates of each cell in the table. However, all of these coordinates are in Camelot space, because the library reduces the size of the PDF before processing it. Hence, it is necessary to transform these coordinates from Camelot space to the original PDF space.
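
One way to picture the intersection step is a pixel-wise AND of the horizontal and vertical line masks: a pixel is a joint only if it lies on both kinds of line. The sketch below is an illustration under that assumption, not the library's implementation. The names `find_joints` and `cells_from_joints` are hypothetical, and the cell construction assumes a complete grid with no merged cells.

```python
def find_joints(horizontal, vertical):
    """Return (x, y) pixels that lie on both a horizontal and a vertical line."""
    return [
        (x, y)
        for y, (h_row, v_row) in enumerate(zip(horizontal, vertical))
        for x, (h, v) in enumerate(zip(h_row, v_row))
        if h and v
    ]


def cells_from_joints(joints):
    """Build (left, top, right, bottom) boxes, assuming a full grid of joints."""
    xs = sorted({x for x, _ in joints})
    ys = sorted({y for _, y in joints})
    return [
        (x1, y1, x2, y2)
        for y1, y2 in zip(ys, ys[1:])
        for x1, x2 in zip(xs, xs[1:])
    ]


# Toy 2 x 2 table (hypothetical data): horizontal lines on rows 0, 2, 4
# and vertical lines on columns 0, 3, 6 of a 5 x 7 pixel grid.
WIDTH, HEIGHT = 7, 5
horizontal = [[1 if y in (0, 2, 4) else 0] * WIDTH for y in range(HEIGHT)]
vertical = [[1 if x in (0, 3, 6) else 0 for x in range(WIDTH)] for _ in range(HEIGHT)]

joints = find_joints(horizontal, vertical)
cells = cells_from_joints(joints)
print(len(joints))  # 9
print(cells)
```

The resulting boxes are in the coordinate system of the reduced image, which is why they still need to be mapped back to PDF space.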

This transformation can be done by shifting and rescaling the axes (Cartesian coordinate system) of Camelot space. If the top-left coordinate of the table is taken as the origin in both spaces, the following approach can be used:

Shift the top-left coordinate of table_c (the table in Camelot space) to the top-left coordinate of table_p (the table in PDF space)
Calculate the rescaling factors for width and height. These are the ratios of the widths and heights of the two tables, PDF space over Camelot space (ratio > 1)
For each cell in Camelot space, multiply the height and width of the cell by their respective scaling factors
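
The three steps above can be sketched as a single function. This is a minimal illustration, not the notebook's code. The function name `camelot_to_pdf`, the `(left, top, right, bottom)` box layout, and the sample numbers are assumptions, and the sketch assumes that y increases in the same direction in both spaces.

```python
def camelot_to_pdf(cell, table_c, table_p):
    """Map a cell box from Camelot space to PDF space.

    All boxes are (left, top, right, bottom) tuples. table_c and table_p
    are the boxes of the same table in Camelot space and in PDF space.
    """
    c_left, c_top, c_right, c_bottom = table_c
    p_left, p_top, p_right, p_bottom = table_p

    # Rescaling factors: ratio of table widths and heights (PDF / Camelot).
    scale_x = (p_right - p_left) / (c_right - c_left)
    scale_y = (p_bottom - p_top) / (c_bottom - c_top)

    left, top, right, bottom = cell
    # Measure each edge from the table's top-left corner in Camelot space,
    # rescale the offset, then shift it to the top-left corner in PDF space.
    return (
        p_left + (left - c_left) * scale_x,
        p_top + (top - c_top) * scale_y,
        p_left + (right - c_left) * scale_x,
        p_top + (bottom - c_top) * scale_y,
    )


# Hypothetical boxes: the PDF-space table is 3x wider and 2x taller.
table_c = (10.0, 20.0, 110.0, 70.0)
table_p = (50.0, 100.0, 350.0, 200.0)
cell_c = (60.0, 20.0, 110.0, 45.0)  # top-right cell of a 2 x 2 table

cell_p = camelot_to_pdf(cell_c, table_c, table_p)
print(cell_p)  # (200.0, 100.0, 350.0, 150.0)
```

A quick sanity check is that passing `table_c` itself as the cell returns `table_p`, since the table's own corners must land on the corners of the table in PDF space.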

For example:

red: table in PDF space
purple: table in Camelot space

Transformation equations for the x and y coordinates:
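
In text form, the shift-and-rescale approach described in this section can be written as below. This is a reconstruction from that description and may not match the notation of the original figure. The symbols are assumptions: $(x_c^{0}, y_c^{0})$ and $(x_p^{0}, y_p^{0})$ are the top-left corners of the table in Camelot space and PDF space, $W$ and $H$ are the table widths and heights in each space, and $(x_c, y_c)$ is any cell corner in Camelot space.

```latex
\begin{aligned}
s_w &= \frac{W_p}{W_c}, \qquad s_h = \frac{H_p}{H_c} \\
x_p &= x_p^{0} + s_w \,(x_c - x_c^{0}) \\
y_p &= y_p^{0} + s_h \,(y_c - y_c^{0})
\end{aligned}
```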

The image below shows the table after transformation from Camelot space to PDF space.

Usage

Install requirements

pip install -r requirements.txt

Install Ghostscript from here
The implementation is in a Jupyter notebook, which can be found here
