Use multiprocessing to parallely process PDF pages #20

vinayak-mehta · 2019-07-05T22:08:04Z

>>> camelot.read_pdf('filename.pdf', pages='all', parallel=True)

We could try and use all cores present on the machine using multiprocessing. More ideas are welcome.

satheeshkatipomu · 2019-09-20T09:37:13Z

Hi @vinayak-mehta,

I also thought of implementing this. My suggestions for asynchronous processing of pages are dramatiq or celery.

jontis · 2019-09-21T11:36:40Z

I'm doing this with dask, but I only chose it out of habit.

selcukusta · 2019-11-07T11:33:13Z

Has there been any improvement here? I have a file with only one page. The page has one table (25 rows x 13 columns). The read_pdf function takes 10 seconds; after that, to_excel takes only 100-150 ms. I think 10 seconds is too long. Am I wrong?

NixBiks · 2019-11-11T12:03:18Z

@selcukusta I think this is more about running multiple pages at the same time rather than speeding up the extraction of a single page.

But does anyone have a solution for multiple pages in parallel?

vinayak-mehta · 2019-11-11T15:29:41Z

> @selcukusta I think this is more about running multiple pages at the same time rather than speeding up the extraction of a single page.

Yes!

> But does anyone have a solution for multiple pages in parallel?

Using multiprocessing, we should be able to distribute multiple pages on all cores, processing them in parallel.
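The idea of distributing pages across cores can be sketched roughly as follows. This is only an illustration, not part of camelot: `chunk_pages`, `extract_chunk`, and `read_pdf_parallel` are hypothetical helper names, the page count is assumed to be supplied by the caller, and the `parallel=True` flag from the first comment is a proposal rather than an existing option. The only camelot calls used are `camelot.read_pdf(path, pages=...)` and the `.df` attribute of each returned table.

```python
import os
from concurrent.futures import ProcessPoolExecutor


def chunk_pages(n_pages, n_chunks):
    """Split pages 1..n_pages into contiguous page strings such as '1-4'."""
    if n_pages <= 0:
        return []
    n_chunks = max(1, min(n_chunks, n_pages))
    base, extra = divmod(n_pages, n_chunks)
    chunks, start = [], 1
    for i in range(n_chunks):
        size = base + (1 if i < extra else 0)  # first `extra` chunks get one more page
        end = start + size - 1
        chunks.append(str(start) if size == 1 else f"{start}-{end}")
        start = end + 1
    return chunks


def extract_chunk(job):
    """Worker (hypothetical helper): run camelot on one page range."""
    path, pages = job
    import camelot  # imported inside the worker process

    tables = camelot.read_pdf(path, pages=pages)
    # Return plain DataFrames, which can be pickled back to the parent.
    return [table.df for table in tables]


def read_pdf_parallel(path, n_pages, workers=None):
    """Hypothetical wrapper: one page range per worker process."""
    workers = workers or os.cpu_count() or 1
    jobs = [(path, pages) for pages in chunk_pages(n_pages, workers)]
    frames = []
    with ProcessPoolExecutor(max_workers=workers) as pool:
        # map() yields results in job order, so page order is preserved.
        for chunk_frames in pool.map(extract_chunk, jobs):
            frames.extend(chunk_frames)
    return frames
```

A call such as `read_pdf_parallel('filename.pdf', n_pages=100)` would then return a list of DataFrames in page order. Contiguous ranges are used so that each worker opens the file once; whether this actually beats a single process depends on per-page cost and on process start-up overhead.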

NixBiks · 2019-11-11T15:32:10Z

I get this though

objc[53475]: +[__NSPlaceholderDate initialize] may have been in progress in another thread when fork() was called. We cannot safely call it or ignore it in the fork() child process. Crashing instead. Set a breakpoint on objc_initializeAfterForkError to debug.

Oh; what is the difference between https://github.com/atlanhq/camelot and https://github.com/camelot-dev/camelot ? Didn't notice two repos before now...
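The objc message above is macOS's fork-safety check firing in a child created with fork(). One common way to avoid forking altogether is to start workers with the "spawn" start method. The sketch below assumes that this is the cause here, which the thread does not confirm; `SPAWN` and `make_pool` are hypothetical names used for illustration.

```python
import multiprocessing as mp
from concurrent.futures import ProcessPoolExecutor

# A context whose child processes are started fresh instead of via fork().
SPAWN = mp.get_context("spawn")


def make_pool(max_workers=None):
    """Return a process pool whose workers use the 'spawn' start method."""
    return ProcessPoolExecutor(max_workers=max_workers, mp_context=SPAWN)
```

With "spawn", each worker re-imports the main module, so the code that submits work to the pool should sit under an `if __name__ == "__main__":` guard, and the worker function must be defined at module level so it can be pickled.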

selcukusta · 2019-11-12T10:42:53Z

> @selcukusta I think this is more about running multiple pages at the same time rather than speeding up the extraction of a single page.
>
> But does anyone have a solution for multiple pages in parallel?

Yeah, I know. My question is actually about single-page speed, but the issue for that was closed with a reference to this one.

rawsh-bt · 2020-06-26T17:08:22Z

Does anyone have an update? I've tried subclassing PageHandler to process pages across multiple threads or cores, and also processing multiple PDFs in separate threads, but I'm running into a Ghostscript error (it seems Ghostscript is not thread-safe?).

phoewass · 2020-09-02T00:46:39Z

I implemented a multi-threading layer on top of camelot.read_pdf using the multiprocessing library.
I ran into a couple of pitfalls doing it, so I can help with this if I may.

vinayak-mehta · 2020-10-12T15:52:18Z

@phoewass That would be awesome if you're still interested!

RickyGunawan09 · 2021-04-28T04:06:28Z

Can anyone tell me how to use multiprocessing in camelot? Or is this issue still in progress?

phoewass · 2021-05-01T15:02:49Z

Hi all. Sorry it took me a while to publish the PR even though the code was already available.
Now that the PR is ready for review, I'm looking forward to your feedback.

vinayak-mehta · 2021-06-14T20:27:13Z

👀

Siddharth1India · 2023-05-23T05:52:18Z

Any update on this? My PDFs are hundreds of pages long and I could really use this feature.

mlbrothers · 2023-06-19T09:52:49Z

@phoewass @vinayak-mehta Is this feature part of the library now? If not, is there any way I can use multiprocessing to read a multipage PDF?

deepakagrawal · 2024-03-21T18:18:47Z

Any updates on this feature?

bosd · 2024-03-22T06:52:31Z

Hey!

As camelot is dead, we are trying to build a maintained fork at pypdf_table_extraction.

There is a discussion about this in:

py-pdf#8 (reply in thread)

vinayak-mehta added the enhancement (New feature or request) label Jul 5, 2019

This was referenced Jul 5, 2019:

Using multithreading to extract tables from a large PDF atlanhq/camelot#347 (Closed)

Use of cores in camelot atlanhq/camelot#269 (Closed)

Performance improvement for fixed table Pdfs atlanhq/camelot#301 (Closed)

vinayak-mehta added this to Backlog in TODO! Jul 9, 2020

vinayak-mehta moved this from Backlog to To do in TODO! Jul 9, 2020

mohankumargx mentioned this issue Feb 23, 2021:

camelot.ext.ghostscript._gsprint.GhostscriptError: -100 #123 (Open)

phoewass mentioned this issue May 1, 2021:

[WIP] Add support for parsing PDF pages in parallel #237 (Closed, 4 tasks)
