Comparison with other PDF Table Extraction libraries and tools

Vinayak Mehta edited this page Jul 4, 2019 · 1 revision

This wiki page qualitatively compares Camelot's output with that of other open-source libraries and tools. Chances are that you have already used one of the libraries or tools mentioned below, had trouble getting the desired output, and are here to see whether Camelot can extract tables from your PDFs better.

We believe that Camelot works better than the other open-source alternatives out there. To avoid bias and stay fair and accurate, we also list the advantages other tools might have over Camelot, along with the configuration parameters Camelot can use to make up for them.

We would like your help to keep this document up-to-date. If you notice any inconsistency, please let us know by opening an issue.

Tabula

The names of the parsing methods inside Camelot (Lattice and Stream) were inspired by Tabula. Lattice is used to parse tables that have demarcated lines between cells, while Stream is used to parse tables that use whitespace between cells to simulate a table structure.
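
For readers who use Camelot's Python API rather than the command line, the choice between the two methods can be sketched as follows. This is a minimal illustration, not Camelot's own logic: `choose_flavor` and `extract_tables` are hypothetical helper names, and the sketch assumes that `camelot.read_pdf` accepts a `flavor` keyword taking `"lattice"` or `"stream"`.

```python
def choose_flavor(has_ruling_lines: bool) -> str:
    """Pick a Camelot parsing method after looking at the table.

    Lattice suits tables with drawn lines between cells; Stream suits
    tables whose columns are separated only by whitespace.
    """
    return "lattice" if has_ruling_lines else "stream"


def extract_tables(pdf_path: str, has_ruling_lines: bool, pages: str = "1"):
    """Parse the tables in pdf_path with the method chosen above."""
    # Imported lazily so that choose_flavor works without Camelot installed.
    import camelot

    return camelot.read_pdf(
        pdf_path, pages=pages, flavor=choose_flavor(has_ruling_lines)
    )
```

With Camelot installed, `extract_tables("agstat.pdf", has_ruling_lines=True)` should return a list-like collection of tables, each exposing its data as a pandas DataFrame through `.df`.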

We took 10 PDFs of each type (tables with lines for Lattice, and tables with whitespace between cells for Stream) and passed them through Tabula's web interface and Camelot's command-line interface. The CSV outputs were pushed to this repo as is. We found that Camelot's output was as good as or better than Tabula's in all Lattice cases. Tabula does better table detection for Stream cases, but it often fails to give good parsing output, a problem Camelot addresses with its configuration parameters.

Note: We have better table detection for Stream cases in the works. #102

We put a ✔️ in the "Table detected correctly?" column if the table was detected accurately and ❌ if it was not (providing an image of the detected table in both cases). The reasoning behind which output is better is provided in the "Comments" column.

Lattice

| n | PDF | Notes | Table detected correctly? (Tabula / Camelot) | Extra configuration used? (Tabula / Camelot) | Result (Tabula / Camelot) | Which has better output? | Comments |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | agstat.pdf | Header text is vertical, columns span multiple cells. | image / ✔️ image | NA / No | csv / csv | Camelot | Tabula doesn't output all the header text. Camelot gets all the headers in the correct cells, albeit in reverse order in some cases. |
| 2 | background_lines_1.pdf | The lines are in background. | image / ✔️ image | NA / -back | csv / csv | Both | |
| 3 | background_lines_2.pdf | The lines are in background. | ✔️ image / ✔️ image | NA / -scale 40 -back | csv / csv | Camelot | Tabula shifts some of the data points towards the left. Camelot gets the table as is. |
| 4 | column_span_1.pdf | Columns span multiple cells. | ✔️ image / ✔️ image | NA / No | csv / csv | Camelot | Tabula moves some headers on the top-right to the left. Camelot gets them in the correct cells. |
| 5 | column_span_2.pdf | Columns span multiple cells. | ✔️ image / ✔️ image | NA / -scale 40 | csv / csv | Camelot | Tabula shifts some of the data points towards the left. Camelot gets the table as is. (For example, the number 1728.) |
| 6 | electoral_roll.pdf | Very unusual table. | ✔️ (almost) image / ✔️ image | NA / -scale 40 -I 1 | csv / csv | Camelot | Tabula doesn't give an output. Camelot is able to get all text out while preserving the table structure, which is usable after some cleaning with pattern matching. |
| 7 | rotated.pdf | The table is rotated counter-clockwise. | image / ✔️ image | NA / No | csv / csv | Camelot | Tabula output is unusable; Camelot gets the table out as is. |
| 8 | row_span_1.pdf | Rows span multiple cells. | ✔️ image / ✔️ image | NA / -scale 40 -block 99 -const -20 | csv / csv | Camelot | Tabula shifts some of the data points towards the left. Camelot gets the table as is. Check out the totals near the bottom-right. |
| 9 | twotables_1.pdf | There are two tables on a single page. | ✔️ (almost) image / ✔️ image | NA / No | csv | Camelot | Tabula output is unusable; Camelot gets the tables out as they are. |
| 10 | twotables_2.pdf | There are two tables on a single page. | ✔️ image / ✔️ image | No | | Both | |
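
The extra configuration in the table above is given as command-line flags. For readers working from Python, the sketch below translates those flags into keyword arguments. The mapping is an assumption based on Camelot's documented parameter names (`process_background`, `line_scale`, `threshold_blocksize`, `threshold_constant`, `iterations`) and should be checked against the Camelot version in use; `flags_to_kwargs` and `read_lattice` are hypothetical helper names.

```python
# Assumed mapping from the Lattice CLI flags used in the table above to
# keyword arguments of camelot.read_pdf. Verify against your Camelot version.
FLAG_TO_KWARG = {
    "-back": "process_background",
    "-scale": "line_scale",
    "-block": "threshold_blocksize",
    "-const": "threshold_constant",
    "-I": "iterations",
}


def flags_to_kwargs(flags):
    """Convert tokens such as ["-scale", "40", "-back"] into a kwargs dict.

    A flag followed by a value takes that value as an integer; a flag with
    no value is treated as a boolean switch. Unknown flags raise KeyError.
    """
    kwargs = {}
    i = 0
    while i < len(flags):
        name = FLAG_TO_KWARG[flags[i]]
        nxt = flags[i + 1] if i + 1 < len(flags) else None
        if nxt is not None and nxt not in FLAG_TO_KWARG:
            kwargs[name] = int(nxt)  # handles negative values such as "-20"
            i += 2
        else:
            kwargs[name] = True
            i += 1
    return kwargs


def read_lattice(pdf_path, flags=()):
    """Parse pdf_path with Lattice, applying the translated flags."""
    import camelot  # imported lazily; requires Camelot to be installed

    return camelot.read_pdf(pdf_path, flavor="lattice", **flags_to_kwargs(list(flags)))
```

For example, `flags_to_kwargs(["-scale", "40", "-block", "99", "-const", "-20"])` produces the keyword arguments corresponding to row 8 (row_span_1.pdf).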

Stream

| n | PDF | Notes | Table detected correctly? (Tabula / Camelot) | Extra configuration used? (Tabula / Camelot) | Result (Tabula / Camelot) | Which has better output? | Comments |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 1 | 12s0324.pdf | There are two tables on a single page. | ✔️ / NA | NA | | Both | |
| 2 | birdisland.pdf | PDF is encrypted. | ✔️ / NA | NA | csv / csv | Tabula | Camelot detects two tables, and even though the structure is correct, duplicate strings are found in the same cells. Bug filed (#103). |
| 3 | budget.pdf | | ✔️ / NA | NA / No | csv / csv | Camelot | Tabula merges the last two columns into one; Camelot gets them correctly. |
| 4 | district_health.pdf | | ✔️ / NA | NA / No | csv / csv | Camelot | Tabula merges all the columns. Camelot assigns the data points to the correct cells. |
| 5 | health.pdf | | ✔️ / NA | NA / No | csv / csv | Camelot | Same as above. |
| 6 | m27.pdf | The text is very close together, making it difficult to differentiate between columns. | ✔️ / NA | NA / -C 72,95,209,327,442,529,566,606,683 -split | csv / csv | Camelot | Tabula merges some columns. Camelot uses its "-split" feature along with column separators to cut the text strings at those coordinates and put them in the correct cells. |
| 7 | mexican_towns.pdf | | ✔️ / NA | NA / No | csv / csv | Both | |
| 8 | missing_values.pdf | Two columns don't have any values. | ✔️ / NA | NA / No | csv / csv | Camelot | Tabula merges some columns; Camelot gets them correctly. |
| 9 | population_growth.pdf | | ✔️ / NA | NA / No | csv / csv | Both | |
| 10 | superscript.pdf | A number has another number in superscript. (Refer to the 2nd column of the row starting with Kerala.) | ✔️ / NA | NA / -flag | csv / csv | Camelot | Tabula merges the superscript with the number, which doesn't matter in this case because of the decimal point, but could change the number by 10x without it. Camelot uses a configuration parameter to delimit the superscripts with `<s></s>` tags, so that they can be handled during cleaning. |
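
The two Stream configurations above (rows 6 and 10) can be sketched in Python as follows. The sketch assumes that `camelot.read_pdf` accepts `columns`, `split_text` and `flag_size` as the keyword equivalents of `-C`, `-split` and `-flag`. The helper names and the sample cell value used in the usage note are hypothetical and are not taken from superscript.pdf.

```python
import re

# Matches text that has been wrapped in <s></s> tags.
SUPERSCRIPT = re.compile(r"<s>(.*?)</s>")


def strip_superscripts(cell: str) -> str:
    """Drop every <s>...</s> span, keeping only the base text."""
    return SUPERSCRIPT.sub("", cell)


def split_superscripts(cell: str):
    """Return (base_text, list_of_superscripts) for one cell."""
    return SUPERSCRIPT.sub("", cell), SUPERSCRIPT.findall(cell)


def read_with_column_separators(pdf_path="m27.pdf"):
    """Stream parsing with explicit column x-coordinates and text splitting."""
    import camelot  # imported lazily; requires Camelot to be installed

    return camelot.read_pdf(
        pdf_path,
        flavor="stream",
        columns=["72,95,209,327,442,529,566,606,683"],
        split_text=True,
    )


def read_with_flagged_superscripts(pdf_path="superscript.pdf"):
    """Stream parsing with differently sized text wrapped in <s></s> tags."""
    import camelot

    return camelot.read_pdf(pdf_path, flavor="stream", flag_size=True)
```

During cleaning, a hypothetical flagged cell such as `"24.9<s>3</s>"` can be reduced to its base value with `strip_superscripts`, or separated into base text and superscripts with `split_superscripts`.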

pdfplumber

Five PDFs of each type for which Camelot required no extra configuration were selected from the tables above. Their tables were parsed using this script (which uses pdfplumber) and Camelot's command-line interface.
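
The sketch below shows roughly what a pdfplumber-based extraction script might look like. It is not the script linked above: the function names are hypothetical, and it assumes pdfplumber's `open` and `Page.extract_tables` calls, which return each table as a list of rows whose empty cells may be `None`.

```python
import csv


def rows_to_csv(rows, csv_path):
    """Write one table (a list of rows) to CSV, turning None cells into ''."""
    with open(csv_path, "w", newline="", encoding="utf-8") as fh:
        writer = csv.writer(fh)
        for row in rows:
            writer.writerow(["" if cell is None else cell for cell in row])


def extract_with_pdfplumber(pdf_path, page_index=0):
    """Return all tables pdfplumber finds on one page, using default settings."""
    import pdfplumber  # imported lazily; requires pdfplumber to be installed

    with pdfplumber.open(pdf_path) as pdf:
        return pdf.pages[page_index].extract_tables()
```

Each table returned by `extract_with_pdfplumber` can then be passed to `rows_to_csv` to produce a CSV file for comparison.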

The reasoning behind which output is better is provided in the "Comments" column.

| n | PDF | Notes | Result (pdfplumber / Camelot) | Which has better output? | Comments |
| --- | --- | --- | --- | --- | --- |
| 1 | agstat.pdf | Header text is vertical, columns span multiple cells. | csv / csv | Camelot | pdfplumber messes up header text. |
| 2 | column_span_1.pdf | Columns span multiple cells. | csv / csv | Both | |
| 3 | rotated.pdf | The table is rotated counter-clockwise. | csv / csv | Camelot | pdfplumber output unusable. |
| 4 | twotables_1.pdf | There are two tables on a single page. | csv | Camelot | pdfplumber doesn't identify two tables and output is unusable. |
| 5 | twotables_2.pdf | There are two tables on a single page. | csv | Camelot | pdfplumber doesn't identify two tables and output is unusable. |
| 6 | budget.pdf | | errored / csv | Camelot | |
| 7 | district_health.pdf | | csv / csv | Camelot | pdfplumber output unusable, merged columns. |
| 8 | health.pdf | | csv / csv | Camelot | pdfplumber output unusable, merged columns. |
| 9 | mexican_towns.pdf | | errored / csv | Camelot | |
| 10 | missing_values.pdf | Two columns don't have any values. | csv / csv | Camelot | pdfplumber output unusable, merged columns. |

pdftables

Open-source development of pdftables stopped in September 2013, when it became a closed-source paid tool.

Again, five PDFs of each type for which Camelot required no extra configuration were selected from the tables above. Their tables were parsed using this script (which uses pdftables) and Camelot's command-line interface.

Again, the reasoning behind which output is better is provided in the "Comments" column.

| n | PDF | Notes | Result (pdftables / Camelot) | Which has better output? | Comments |
| --- | --- | --- | --- | --- | --- |
| 1 | agstat.pdf | Header text is vertical, columns span multiple cells. | csv / csv | Camelot | pdftables output unusable, merged columns. |
| 2 | column_span_1.pdf | Columns span multiple cells. | csv / csv | Camelot | pdftables output unusable, merged columns. |
| 3 | rotated.pdf | The table is rotated counter-clockwise. | csv / csv | Camelot | pdftables output unusable. |
| 4 | twotables_1.pdf | There are two tables on a single page. | csv | Camelot | pdftables doesn't combine multi-line rows. |
| 5 | twotables_2.pdf | There are two tables on a single page. | csv | Camelot | pdftables output unusable, merged columns. |
| 6 | budget.pdf | | csv / csv | Camelot | pdftables output unusable, merged columns. |
| 7 | district_health.pdf | | csv / csv | Camelot | pdftables output unusable, merged columns. |
| 8 | health.pdf | | csv / csv | Camelot | pdftables output unusable, merged columns. |
| 9 | mexican_towns.pdf | | csv / csv | Both | |
| 10 | missing_values.pdf | Two columns don't have any values. | csv / csv | Camelot | pdftables output unusable, merged columns. |

pdf-table-extract

Five PDFs of each type for which Camelot required no extra configuration were selected from the tables above. Their tables were parsed using this script (which uses pdf-table-extract) and Camelot's command-line interface.

The reasoning behind which output is better is provided in the "Comments" column.

| n | PDF | Notes | Result (pdf-table-extract (pte) / Camelot) | Which has better output? | Comments |
| --- | --- | --- | --- | --- | --- |
| 1 | agstat.pdf | Header text is vertical, columns span multiple cells. | csv / csv | Both | Camelot puts vertical headers in reverse order. Bug filed (#105). |
| 2 | column_span_1.pdf | Columns span multiple cells. | csv / csv | Camelot | pte gives extra columns. |
| 3 | rotated.pdf | The table is rotated counter-clockwise. | csv / csv | Camelot | pte doesn't account for table rotation. |
| 4 | twotables_1.pdf | There are two tables on a single page. | csv | Camelot | pte output unusable. |
| 5 | twotables_2.pdf | There are two tables on a single page. | csv | Camelot | pte detects one table and merges first row with header. |
| 6 | budget.pdf | | csv / csv | Camelot | pte output unusable. |
| 7 | district_health.pdf | | csv / csv | Camelot | pte output unusable. |
| 8 | health.pdf | | csv / csv | Camelot | pte output unusable. |
| 9 | mexican_towns.pdf | | csv / csv | Camelot | pte output unusable. |
| 10 | missing_values.pdf | Two columns don't have any values. | csv / csv | Camelot | pte output unusable. |