[WIP] Add support for parsing PDF pages in parallel #237

phoewass · 2021-05-01T14:58:39Z

Closes #20

Parse pages in parallel using the `multiprocessing` library, leveraging all available CPUs.

Checklist:

Process in parallel using the library
Tests to process with and without parallel option
Process in parallel using the CLI
Update documentation
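
A minimal sketch of the idea described above, not the code in this PR: pages are handed to a `multiprocessing.Pool` sized to the available CPUs. The names `parse_page` and `parse_pages` and the `worker` parameter are hypothetical stand-ins for illustration; only the `parallel` flag is mentioned in this thread.

```python
import multiprocessing as mp
from typing import Callable, List, Sequence


def parse_page(page_number: int) -> str:
    """Hypothetical stand-in for the real per-page table extraction."""
    return f"tables from page {page_number}"


def parse_pages(
    pages: Sequence[int],
    parallel: bool = False,
    worker: Callable[[int], str] = parse_page,
) -> List[str]:
    """Parse pages one by one, or spread them over the available CPUs."""
    if not parallel or len(pages) < 2:
        # Sequential path: nothing to gain from starting extra processes.
        return [worker(page) for page in pages]

    # Never start more processes than there are pages to parse.
    cpu_count = min(mp.cpu_count(), len(pages))
    with mp.Pool(processes=cpu_count) as pool:
        # Pool.map preserves input order, so results line up with `pages`.
        return pool.map(worker, pages)
```

On platforms where processes are started with the spawn method (Windows, macOS), a call with `parallel=True` should sit under an `if __name__ == "__main__":` guard, and `worker` must be a picklable top-level function.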

codecov-commenter · 2021-05-01T15:09:37Z

Codecov Report

Merging #237 (63161fe) into master (7709e58) will increase coverage by 0.07%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master     #237      +/-   ##
==========================================
+ Coverage   88.35%   88.42%   +0.07%     
==========================================
  Files          14       14              
  Lines        1571     1581      +10     
  Branches      358      359       +1     
==========================================
+ Hits         1388     1398      +10     
  Misses        128      128              
  Partials       55       55

| Impacted Files | Coverage Δ | |
|---|---|---|
| camelot/io.py | `100.00% <ø> (ø)` | |
| camelot/handlers.py | `91.66% <100.00%> (+0.96%)` | ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 7709e58...63161fe.

vinayak-mehta · 2021-06-14T20:27:42Z

Why do we need to pass `parallel` as an argument to each test?

phoewass · 2021-06-14T20:41:03Z

I did this to avoid duplicating the tests and to make sure that the new `parallel` argument does not break the API.

Commit message for "[MRG] Utils: optimise get_page_layout":

Since the existing code overwrites `layout` and `dim` in each iteration, it is much more efficient to simply return the `layout` and `dim` of the first page. I have tested the difference with a 455-page PDF, and the optimisation reduces the time spent from 50 to 5 seconds.

Signed-off-by: Karl Bonde Torp <k.torp@samsung.com>
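
The optimisation described in the commit message above can be sketched roughly as follows. This is an illustration under assumptions, not the actual `get_page_layout` code: `analyse`, `layout_overwriting`, `layout_first_page`, and the `calls` counter are hypothetical names, and a page is modelled as a plain `(layout, dim)` tuple.

```python
from typing import Dict, Iterable, Optional, Tuple

Layout = str                # hypothetical stand-in for a layout object
Dim = Tuple[float, float]   # (width, height)
Page = Tuple[Layout, Dim]   # hypothetical stand-in for a parsed page

# Counts how many pages were actually analysed.
calls: Dict[str, int] = {"analysed": 0}


def analyse(page: Page) -> Tuple[Layout, Dim]:
    """Hypothetical stand-in for the expensive per-page layout analysis."""
    calls["analysed"] += 1
    layout, dim = page
    return layout, dim


def layout_overwriting(
    pages: Iterable[Page],
) -> Tuple[Optional[Layout], Optional[Dim]]:
    """Old pattern: analyse every page, keeping only the last result."""
    layout, dim = None, None
    for page in pages:
        layout, dim = analyse(page)
    return layout, dim


def layout_first_page(
    pages: Iterable[Page],
) -> Tuple[Optional[Layout], Optional[Dim]]:
    """New pattern: return as soon as the first page has been analysed."""
    for page in pages:
        return analyse(page)
    return None, None
```

Note that the two patterns return different pages for multi-page input (last versus first), so the early return is only a drop-in replacement when the caller wants the first page's layout or passes single-page input.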

maxdd · 2022-05-06T08:15:37Z

Will this ever be merged, or does camelot already use multiprocessing?

jgcmarins · 2022-09-30T21:33:28Z

What is missing for this to be merged?

hashangayasri · 2022-10-11T03:25:47Z

Is this still WIP? What's preventing this from being merged?

MartinThoma · 2023-03-11T12:41:42Z

Currently, there are merge conflicts that first need to be resolved.

#353 might also get merged, which makes me uncertain how to proceed with all the others.

MartinThoma · 2024-02-25T11:14:36Z

Hey!

As camelot is dead, we are trying to build a maintained fork at pypdf_table_extraction.

Do you want to open the PR against that branch so that we can merge your improvement?

phoewass · 2024-03-29T03:54:14Z

Moved to py-pdf#17

foarsitter and others added 4 commits February 28, 2024 08:38

FLIT_PASSWORD as repo secret

b68c489

Merge pull request camelot-dev#5 from karlowich/optimise-get_page_layout

706ea1a

[MRG] Utils: optimise get_page_layout

Poppler backend: search for pdftopng in current environment

1b5a621

Fix situation where pdftopng is not found if executing python directly from an un-activated environment.

Merge pull request camelot-dev#7 from py-pdf/update_lockfile

567520b

Fix safety issues by update lockfile

phoewass force-pushed the feature/parallel branch 2 times, most recently from 428cb18 to a06796b, March 29, 2024 03:38

phoewass added 4 commits March 29, 2024 04:42

Add support for parsing PDFs in parallel

cf3b809

Parse in parallel using multiprocessing library using available CPUs

Add support for parallel processing in CLI

95efc2a

Add tests

5aa4d27

Update docs

e3cd4d9

phoewass force-pushed the feature/parallel branch from a06796b to e3cd4d9, March 29, 2024 03:44

phoewass closed this Mar 29, 2024
