[parsers.stream] - Fix IndexError when extracting more tables than there are columns #112

JosePVB · 2020-02-01T01:35:55Z

The changes in this PR are based on the conversation on atlanhq/camelot#357, but does not address the enhancement on the open issue that came out of the previous issue; #50

What this PR addresses is the challenge in using the Stream parser to extract tables out of a PDF with known, consistent table structures of interest to the caller, but that may be variable in height, starting position, or number, especially within an automated or programmatic context.

The Stream class would raise an IndexError when the 'columns' argument was specified and the number of tables identified was larger than the number of items in the 'columns' argument. This IndexError makes extracting tables from a PDF comprised mainly of known, consistent table structures of interest to the caller, but that may be variable in height, starting position, or number, rather cumbersome with the Stream parser. This is especially true within an automated or programmatic context. Either the caller must call 'camelot.read_pdf' once per page, or manipulate the 'columns' argument so as to avoid the IndexError. The former isn't guaranteed to work, as a single page can contain multiple tables, and therefore, in such a situation, the caller must resort to the latter even if extracting tables from a single page. The Stream class continues to function exactly the same when the 'table_areas' argument is provided; this commit only changes the behavior of the Stream parser when 'table_areas' is not provided. This commit allows all tables to be easily extracted by specifying 'pages=all' and providing the appropriate 'columns' argument value to 'camelot.read_pdf'. Extracting all tables from such a PDF is already possible with the Lattice parser, this commit makes this possible with the Stream parser as well. Callers are responsible for filtering out any extraneous tables.

JosePVB · 2020-02-01T02:02:11Z

@vinayak-mehta Please provide guidance on what I should do about the failed checks. For DeepSource, two of the issues are not created by this PR and the other two are due to my decision to keep internal consistency within the files that I added to.

With regards to the failed Travis CI build, existing tests on the most recent upstream/master branch were failing for me locally prior to any modifications. Here is my development environment.

>>> import platform; print(platform.platform())
Linux-4.15.0-74-generic-x86_64-with-Ubuntu-18.04-bionic
>>> import sys; print('Python', sys.version)
Python 3.6.9 (default, Nov  7 2019, 10:44:02) 
[GCC 8.3.0]
>>> import numpy; print('NumPy', numpy.__version__)
NumPy 1.18.1
>>> import cv2; print('OpenCV', cv2.__version__)
OpenCV 4.1.2
>>> import camelot; print('Camelot', camelot.__version__)
Camelot 0.7.3

The test added in this PR passed in all the Travis CI build minus the Python 2.7 version. I do not see how the failed tests on the other Python builds could have failed as a result of the changes of this PR. Therefore, I highly appreciate your guidance on these matters.

Besides those two pending issues, I feel that the changes are ready for review.

vinayak-mehta · 2020-03-21T13:30:24Z

Thanks for the PR! I'll review this tomorrow.

MartinThoma · 2024-02-25T11:16:29Z

Hey!

As camelot is dead, we try to build a maintained fork at pypdf_table_extraction.

Do you want to open the PR against that branch so that we can merge your improvement?

bmax · 2024-02-28T06:57:27Z

This issue fixed would be huge. I will move to pypdf_table_extraction if this works out.

JosePVB added 2 commits January 31, 2020 20:29

Add test for extracting all tables with flavor=stream

0962b8f

JosePVB requested a review from vinayak-mehta February 1, 2020 02:02

vinayak-mehta added this to Backlog in TODO! Jul 9, 2020

vinayak-mehta moved this from Backlog to To do in TODO! Jul 9, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[parsers.stream] - Fix IndexError when extracting more tables than there are columns #112

[parsers.stream] - Fix IndexError when extracting more tables than there are columns #112

JosePVB commented Feb 1, 2020

JosePVB commented Feb 1, 2020

vinayak-mehta commented Mar 21, 2020

MartinThoma commented Feb 25, 2024

bmax commented Feb 28, 2024

[parsers.stream] - Fix IndexError when extracting more tables than there are columns #112

Are you sure you want to change the base?

[parsers.stream] - Fix IndexError when extracting more tables than there are columns #112

Conversation

JosePVB commented Feb 1, 2020

JosePVB commented Feb 1, 2020

vinayak-mehta commented Mar 21, 2020

MartinThoma commented Feb 25, 2024

bmax commented Feb 28, 2024