[MRG] Add saturation threshold option for low contrast tables #203

Open
wants to merge 2 commits into base: master

NoReflex

I ran into this problem while trying to parse a table with a low-contrast background color. The -back option didn't work for low-contrast areas such as the last row, so I've added a new option, -color (--process_color_background), which increases the contrast so that these tables are parsed accurately.

Here's the result with camelot (master):
[image: example_camelot_master]

Here's my branch with the -color option enabled:
[image: example_camelot_branch]

As you can see, the new option adds another step, which is basically a binary threshold separating low saturation from no saturation.
The borders are now much more pronounced, and camelot has no trouble detecting all the rows.
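
For readers unfamiliar with the idea, here is a minimal sketch of a binary saturation threshold. It is not the code from this PR: the helper name `saturation_mask`, the default threshold, and the use of the standard-library `colorsys` module (the PR itself works with OpenCV and numpy) are assumptions made only for illustration.

```python
import colorsys


def saturation_mask(image, threshold=0.0):
    """Return a binary mask: 255 where saturation > threshold, else 0.

    `image` is a list of rows of (r, g, b) tuples with 0-255 channels.
    Hypothetical helper for illustration; not part of camelot.
    """
    mask = []
    for row in image:
        mask_row = []
        for r, g, b in row:
            # colorsys expects floats in [0, 1] and returns (h, s, v).
            _, s, _ = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
            mask_row.append(255 if s > threshold else 0)
        mask.append(mask_row)
    return mask


# White page, a faint yellow cell, and a black line.
image = [
    [(255, 255, 255), (255, 255, 240), (0, 0, 0)],
    [(255, 255, 255), (255, 255, 240), (0, 0, 0)],
]
mask = saturation_mask(image)
```

In this toy image the faint yellow column becomes fully "on" while the white and black pixels stay "off", so the edge between a pale colored cell and the white page turns into a hard boundary that a line detector can pick up.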

NoReflex changed the title from "Add saturation threshold option for low contrast tables" to "[MRG] Add saturation threshold option for low contrast tables" on Oct 23, 2020
@codecov-io

Codecov Report

Merging #203 into master will decrease coverage by 0.60%.
The diff coverage is 21.42%.


@@            Coverage Diff             @@
##           master     #203      +/-   ##
==========================================
- Coverage   88.26%   87.65%   -0.61%     
==========================================
  Files          14       14              
  Lines        1542     1555      +13     
  Branches      350      351       +1     
==========================================
+ Hits         1361     1363       +2     
- Misses        127      137      +10     
- Partials       54       55       +1     
| Impacted Files | Coverage Δ |
|---|---|
| camelot/io.py | 100.00% <ø> (ø) |
| camelot/utils.py | 81.26% <ø> (ø) |
| camelot/image_processing.py | 82.14% <8.33%> (-12.38%) ⬇️ |
| camelot/cli.py | 86.77% <100.00%> (+0.11%) ⬆️ |
| camelot/parsers/lattice.py | 94.14% <100.00%> (+0.03%) ⬆️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d17dc43...9161ef3.

@vinayak-mehta
Member

@NoReflex Thanks for the PR! The results look great! I don't have much image processing or OpenCV background, so I'll have to read up on the cv2.COLOR_BGR2HSV option. I also have a question: do you think this new code could handle the other background line cases too? That way we could add it as an enhancement to the existing option instead of creating a new one.

vinayak-mehta added this to To do in TODO! on Oct 25, 2020
@NoReflex
Author

NoReflex commented Oct 26, 2020

@vinayak-mehta cv2.COLOR_BGR2HSV is just a colorspace transformation from BGR (the channel order OpenCV uses for color images) to HSV (Hue, Saturation, Value).
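
To make the conversion concrete, here is a small sketch of what a BGR-to-HSV transformation does to a single pixel. It uses the standard-library `colorsys` module rather than OpenCV, and the helper name `bgr_to_hsv` is hypothetical. The scaling (hue in 0-179, saturation and value in 0-255) is meant to mirror OpenCV's convention for 8-bit images, though rounding in OpenCV itself may differ slightly.

```python
import colorsys


def bgr_to_hsv(pixel):
    """Convert one 8-bit (b, g, r) pixel to an 8-bit-style (h, s, v) tuple.

    Hypothetical helper for illustration; not an OpenCV or camelot function.
    """
    b, g, r = pixel  # BGR order: blue comes first
    h, s, v = colorsys.rgb_to_hsv(r / 255.0, g / 255.0, b / 255.0)
    # Scale to 8-bit ranges: hue is halved so that 0-360 degrees fits in a byte.
    return (round(h * 180) % 180, round(s * 255), round(v * 255))


blue = bgr_to_hsv((255, 0, 0))      # pure blue in BGR order
red = bgr_to_hsv((0, 0, 255))       # pure red in BGR order
gray = bgr_to_hsv((128, 128, 128))  # gray has zero saturation
```

Note that the same triple `(255, 0, 0)` would be red if read as RGB, which is why the channel order matters before converting.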
As for the question, this would fail if the table's cell colors are gray/colorless, since those have no saturation to threshold on. That's why it's an option.
Technically it's still an enhancement, because the -color flag can only be used together with -back, but I get your point, and I think it's better as a separate option for handling edge cases.

EDIT: By failing, I just mean that the result will be worse than with the plain option. The code itself is pretty bulletproof; it's just some simple numpy array transformations.
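
The gray-cell limitation can be illustrated with a few sample colors. This is only a sketch using the standard-library `colorsys` module, with made-up color values and a hypothetical `saturation` helper: any gray shade has exactly zero saturation, just like white, so a saturation threshold cannot separate a gray cell from the page background.

```python
import colorsys


def saturation(rgb):
    """Return the HSV saturation (0.0-1.0) of an 8-bit (r, g, b) color."""
    r, g, b = (c / 255.0 for c in rgb)
    return colorsys.rgb_to_hsv(r, g, b)[1]


# Illustrative colors, not taken from the PR's test PDF.
white = (255, 255, 255)
light_gray = (230, 230, 230)
pale_blue = (230, 240, 255)

for name, color in [("white", white), ("light_gray", light_gray), ("pale_blue", pale_blue)]:
    print(name, saturation(color))
```

White and light gray both report a saturation of zero, while the pale blue is slightly above zero and would survive the threshold.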

@MartinThoma
Contributor

Hey!

As camelot is dead, we're trying to build a maintained fork at pypdf_table_extraction.

Do you want to open the PR against that branch so that we can merge your improvement?
